/write-skill-test¶
Workflow Diagram¶
RED-GREEN-REFACTOR implementation for skill testing. Establishes baseline agent behavior without the skill (RED), writes a minimal skill addressing observed failures (GREEN), then closes loopholes by adding counters for new rationalizations (REFACTOR).
flowchart TD
Start([Start]) --> RED[RED Phase]
RED --> DesignScenarios[Design 3+ pressure\nscenarios]
DesignScenarios --> SpawnBaseline[Spawn subagents\nWITHOUT skill]
SpawnBaseline --> DocVerbatim[Document verbatim\nrationalizations]
DocVerbatim --> IdentifyPatterns[Identify patterns\nacross runs]
IdentifyPatterns --> SaveBaseline[Save baseline docs]
SaveBaseline --> GREEN[GREEN Phase]
GREEN --> CreateSkill[Create SKILL.md\nfrom schema]
CreateSkill --> AddressFailures[Address specific\nbaseline failures]
AddressFailures --> SchemaCheck{Schema\ncompliant?}
SchemaCheck -- No --> FixSchema[Fix schema issues]
FixSchema --> SchemaCheck
SchemaCheck -- Yes --> RerunScenarios[Rerun scenarios\nWITH skill]
RerunScenarios --> AgentComplies{Agent\ncomplies?}
AgentComplies -- No --> ReviseSkill[Revise skill]
ReviseSkill --> RerunScenarios
AgentComplies -- Yes --> REFACTOR[REFACTOR Phase]
REFACTOR --> ReviewResults[Review GREEN results\nfor new rationalizations]
ReviewResults --> NewRats{New\nrationalizations?}
NewRats -- Yes --> AddCounters[Add explicit counters]
AddCounters --> BuildTable[Build rationalization\ntable]
BuildTable --> CreateRedFlags[Create red flags list]
CreateRedFlags --> ReTest[Re-test all scenarios]
ReTest --> NewRats
NewRats -- No --> QualityChecks{Quality checks\npass?}
QualityChecks -- No --> FixQuality[Fix quality issues]
FixQuality --> QualityChecks
QualityChecks -- Yes --> Deploy[Deploy: commit + push]
Deploy --> Done([Done])
style Start fill:#4CAF50,color:#fff
style Done fill:#4CAF50,color:#fff
style RED fill:#f44336,color:#fff
style GREEN fill:#4CAF50,color:#fff
style REFACTOR fill:#2196F3,color:#fff
style SchemaCheck fill:#f44336,color:#fff
style AgentComplies fill:#f44336,color:#fff
style NewRats fill:#FF9800,color:#fff
style QualityChecks fill:#f44336,color:#fff
style DesignScenarios fill:#2196F3,color:#fff
style SpawnBaseline fill:#2196F3,color:#fff
style DocVerbatim fill:#2196F3,color:#fff
style IdentifyPatterns fill:#2196F3,color:#fff
style SaveBaseline fill:#2196F3,color:#fff
style CreateSkill fill:#2196F3,color:#fff
style AddressFailures fill:#2196F3,color:#fff
style FixSchema fill:#2196F3,color:#fff
style RerunScenarios fill:#2196F3,color:#fff
style ReviseSkill fill:#2196F3,color:#fff
style ReviewResults fill:#2196F3,color:#fff
style AddCounters fill:#2196F3,color:#fff
style BuildTable fill:#2196F3,color:#fff
style CreateRedFlags fill:#2196F3,color:#fff
style ReTest fill:#2196F3,color:#fff
style FixQuality fill:#2196F3,color:#fff
style Deploy fill:#2196F3,color:#fff
Legend¶
| Color | Meaning |
|---|---|
| Green (#4CAF50) | Skill invocation |
| Blue (#2196F3) | Command/action |
| Orange (#FF9800) | Decision point |
| Red (#f44336) | Quality gate |
Command Content¶
# RED-GREEN-REFACTOR Skill Testing
## Invariant Principles
1. **No skill without a failing test first** - Writing a skill before observing baseline agent behavior is a violation; delete and start over
2. **Pressure scenarios must combine multiple pressures** - Single-pressure tests do not reveal rationalization patterns; combine time pressure, ambiguity, and temptation
3. **Verbatim evidence, not summaries** - Document exact agent quotes and choices during baseline testing; paraphrasing obscures the failure modes the skill must address
<ROLE>
Skill Tester + TDD Practitioner. Your job is to rigorously test, write, and bulletproof skills using the RED-GREEN-REFACTOR cycle. A skill that agents skip or rationalize around is a failure, regardless of how well-written it appears.
</ROLE>
## Iron Law
```
NO SKILL WITHOUT FAILING TEST FIRST
```
Applies to NEW skills AND EDITS. Write skill before testing? Delete it. Start over. Edit skill without testing? Same violation.
## Phase Sequence
### RED: Write Failing Test (Baseline)
Run pressure scenarios with a subagent WITHOUT the target skill loaded. Observe natural behavior before writing anything.
1. Design 3+ scenarios combining multiple pressures:
| Pressure Combo | Example |
|----------------|---------|
| Time + complexity | "implement this quickly, it's blocking production" |
| Ambiguity + defaults | "the spec is unclear, use your best judgment" |
| Conflicting constraints | "make it fast AND thorough" |
| Social pressure | "the team is waiting, just get something working" |
2. Spawn one subagent per scenario WITHOUT the skill; capture verbatim: exact rationalization quotes, decision points where agent deviated, which pressures triggered violations, patterns across scenarios
3. Save baseline documentation for GREEN phase comparison
**Fractal exploration (optional):** For complex multi-phase skills, invoke fractal-thinking with intensity `pulse` and seed: "What scenarios would tempt an agent to skip [skill-name]?" Use the synthesis to expand the pressure scenario list.
### GREEN: Write Minimal Skill
Address ONLY the specific rationalizations observed in RED. No hypothetical content.
Create `SKILL.md` per schema:
- [ ] Name uses only letters, numbers, hyphens
- [ ] YAML frontmatter: `name` and `description` (<1024 chars)
- [ ] Description starts "Use when..." — triggers only, NO workflow; third person
- [ ] Keywords throughout (errors, symptoms, tools)
- [ ] Overview with core principle; When to Use section with symptoms; Quick Reference table; Common Mistakes section
- [ ] Address specific baseline failures from RED only
- [ ] One excellent example (not multi-language)
Run the SAME scenarios WITH the skill loaded. Agent must comply before proceeding to REFACTOR. If not compliant, revise the skill and re-run scenarios before proceeding.
### REFACTOR: Close Loopholes
Agent found new rationalization? Add explicit counter. Re-test until bulletproof.
1. For each new rationalization: add explicit counter; document in rationalization table
2. Build red flags list from all test iterations
3. Re-run all pressure scenarios
4. Repeat until no new rationalizations appear
5. Final verification: agent complies under ALL pressure combinations
## Bulletproofing Discipline Skills
| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Skill is obviously clear" | Clear to you does not equal clear to other agents. Test it. |
| "It's just a reference" | References can have gaps. Test retrieval. |
| "Testing is overkill" | Untested skills have issues. Always. |
| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |
| "No time to test" | Deploying untested wastes more time fixing later. |
**Red flags (agents self-check):**
- Code before test
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "This is different because..."
**All of these mean: Delete code. Start over with TDD.**
<FORBIDDEN>
- Writing a skill before running baseline scenarios
- Running single-pressure tests (must combine multiple pressures)
- Paraphrasing agent behavior instead of capturing verbatim quotes
- Adding hypothetical content in GREEN not observed in RED
- Skipping re-test after each REFACTOR iteration
- Declaring bulletproof before agent complies under ALL pressure combinations
</FORBIDDEN>
## Skill Creation Checklist
**Use TodoWrite to create todos for EACH item.**
**RED:**
- [ ] Create 3+ combined-pressure scenarios
- [ ] Run WITHOUT skill — document baseline verbatim
- [ ] Identify rationalization patterns
**GREEN:**
- [ ] Name: letters, numbers, hyphens only
- [ ] YAML frontmatter with name + description (<1024 chars)
- [ ] Description: "Use when..." — triggers only, NO workflow, third person
- [ ] Keywords throughout (errors, symptoms, tools)
- [ ] Overview with core principle; When to Use; Quick Reference table; Common Mistakes
- [ ] Address specific RED failures only
- [ ] One excellent example (not multi-language)
- [ ] Run scenarios WITH skill — verify compliance
**REFACTOR:**
- [ ] Identify new rationalizations from testing
- [ ] Add explicit counters; document in rationalization table
- [ ] Build red flags list
- [ ] Re-test until bulletproof
**Quality Checks:**
- [ ] Quick reference table for scanning
- [ ] Common mistakes section
- [ ] Small flowchart only if decision non-obvious
- [ ] No narrative storytelling
- [ ] Supporting files only for tools or heavy reference
**Deploy:**
- [ ] Commit skill to git
- [ ] Push to fork if configured
- [ ] Consider PR if broadly useful
<FINAL_EMPHASIS>
A skill written before baseline testing has already failed. The Iron Law is not a suggestion — it is the entire point. No rationalization justifies skipping RED phase. Delete. Start over. Test first.
</FINAL_EMPHASIS>