test-driven-development

Red-green-refactor TDD workflow that writes failing tests first, implements the minimum code to pass, then refactors. Enforces strict test-first discipline where every test must be observed failing before implementation begins. Invocable with /test-driven-development or triggered automatically when you say "TDD" or "write tests first".

Auto-invocation: Your coding assistant will automatically invoke this skill when it detects a matching trigger.

Use when user explicitly requests test-driven development. Triggers: 'TDD', 'write tests first', 'red green refactor', 'test-first', 'start with the test'. Also invoked by develop and executing-plans for implementation tasks. NOT for: full feature work (use develop, which includes TDD internally).

Origin

This skill originated from obra/superpowers.

Workflow Diagram

Red-Green-Refactor cycle enforcing strict test-first discipline. Tests must fail before implementation, implementation must be minimal, and refactoring must preserve green state. All assertions must be Level 4+ on the Assertion Strength Ladder with named mutations.

flowchart TD
    Start([Start]) --> Analyze[Analyze: What Behavior?]
    Analyze --> IronLawCheck{Code Exists Already?}
    IronLawCheck -->|Yes| Delete[Delete Code, Start Over]
    Delete --> RED
    IronLawCheck -->|No| RED

    RED[RED: Write Failing Test]
    RED --> RedDesc[One Behavior, Clear Name]
    RedDesc --> RunRed[Run Test Suite]
    RunRed --> RedGate{Test Failed?}
    RedGate -->|Passes Immediately| RedFail[Fix Test: Wrong Behavior]
    RedFail --> RED
    RedGate -->|Error, Not Failure| RedError[Fix Error, Re-Run]
    RedError --> RunRed
    RedGate -->|Failed as Expected| GREEN

    GREEN[GREEN: Write Minimal Code]
    GREEN --> GreenDesc[Simplest Code to Pass]
    GreenDesc --> RunGreen[Run Test Suite]
    RunGreen --> GreenGate{All Tests Pass?}
    GreenGate -->|No, New Test Fails| FixCode[Fix Code, Not Test]
    FixCode --> RunGreen
    GreenGate -->|No, Other Tests Fail| FixOther[Fix Regressions Now]
    FixOther --> RunGreen
    GreenGate -->|Yes, All Pass| REFACTOR

    REFACTOR[REFACTOR: Clean Up]
    REFACTOR --> RefactorDesc[Remove Duplication, Improve Names]
    RefactorDesc --> RunRefactor[Run Test Suite]
    RunRefactor --> RefactorGate{Still All Green?}
    RefactorGate -->|No| UndoRefactor[Undo Refactoring]
    UndoRefactor --> REFACTOR
    RefactorGate -->|Yes| MoreBehavior{More Behavior Needed?}

    MoreBehavior -->|Yes| RED
    MoreBehavior -->|No| SelfCheck[Self-Check Checklist]
    SelfCheck --> SelfGate{All Items Checked?}
    SelfGate -->|No| StartOver[Skipped TDD: Start Over]
    StartOver --> RED
    SelfGate -->|Yes| End([End])

    style Start fill:#4CAF50,color:#fff
    style End fill:#4CAF50,color:#fff
    style RED fill:#2196F3,color:#fff
    style RedDesc fill:#2196F3,color:#fff
    style RunRed fill:#2196F3,color:#fff
    style RedFail fill:#2196F3,color:#fff
    style RedError fill:#2196F3,color:#fff
    style GREEN fill:#2196F3,color:#fff
    style GreenDesc fill:#2196F3,color:#fff
    style RunGreen fill:#2196F3,color:#fff
    style FixCode fill:#2196F3,color:#fff
    style FixOther fill:#2196F3,color:#fff
    style REFACTOR fill:#2196F3,color:#fff
    style RefactorDesc fill:#2196F3,color:#fff
    style RunRefactor fill:#2196F3,color:#fff
    style UndoRefactor fill:#2196F3,color:#fff
    style Analyze fill:#2196F3,color:#fff
    style Delete fill:#2196F3,color:#fff
    style SelfCheck fill:#2196F3,color:#fff
    style StartOver fill:#2196F3,color:#fff
    style IronLawCheck fill:#FF9800,color:#fff
    style MoreBehavior fill:#FF9800,color:#fff
    style RedGate fill:#f44336,color:#fff
    style GreenGate fill:#f44336,color:#fff
    style RefactorGate fill:#f44336,color:#fff
    style SelfGate fill:#f44336,color:#fff

Legend

| Color | Meaning |
|-------|---------|
| Green (#4CAF50) | Skill invocation |
| Blue (#2196F3) | Command/action |
| Orange (#FF9800) | Decision point |
| Red (#f44336) | Quality gate |

Cross-Reference

| Node | Source Reference |
|------|------------------|
| Analyze: What Behavior? | Reasoning Schema (lines 62-67) |
| Code Exists Already? | The Iron Law (lines 52-58): "Code before test? Delete. Start over." |
| Delete Code, Start Over | The Iron Law (line 58) |
| RED: Write Failing Test | RED phase (lines 78-80) |
| One Behavior, Clear Name | RED guidance (line 80) |
| Run Test Suite | Verify RED (lines 116-129) |
| Test Failed? | Gate at lines 123-129: confirm failure, not error |
| Fix Test: Wrong Behavior | Line 128: "Test passes? Testing existing behavior. Fix test." |
| Fix Error, Re-Run | Line 129: "Test errors? Fix error, re-run until it fails correctly." |
| GREEN: Write Minimal Code | GREEN phase (lines 131-133) |
| Simplest Code to Pass | Line 133: "Simplest code to pass. No features, no refactoring." |
| All Tests Pass? | Verify GREEN (lines 167-181) |
| Fix Code, Not Test | Line 180: "Test fails? Fix code, not test." |
| Fix Regressions Now | Line 181: "Other tests fail? Fix now." |
| REFACTOR: Clean Up | REFACTOR phase (lines 183-185) |
| Remove Duplication, Improve Names | Line 185: "Remove duplication, improve names, extract helpers." |
| Still All Green? | REFACTOR gate (line 185): "Keep tests green." |
| More Behavior Needed? | Repeat section (lines 187-189) |
| Self-Check Checklist | Self-Check (lines 371-385) |
| All Items Checked? | Line 385: "If ANY unchecked: Skipped TDD. Start over." |

Skill Content

# Test-Driven Development

<ROLE>
Quality Engineer with zero-defect mindset. Reputation depends on shipping code that works, not code that "should work."
</ROLE>

## Invariant Principles

1. **Failure Proves Testing** - Test passing immediately proves nothing. Only watching failure proves test detects what it claims.
2. **Order Creates Trust** - Tests-first answer "what should this do?" Tests-after answer "what does this do?" Fundamentally different questions.
3. **Minimal Sufficiency** - Write exactly enough code to pass. YAGNI violations compound into untested complexity.
4. **Deletion Over Adaptation** - Code written before tests is contaminated. Keeping "as reference" means testing after. Delete means delete.

**Violating the letter of the rules is violating the spirit of the rules.**

## Inputs

| Input | Required | Description |
|-------|----------|-------------|
| Feature/bugfix description | Yes | What behavior to implement or fix |
| Existing test patterns | No | Project's testing conventions and frameworks |
| API contracts | No | Expected interface signatures |

## Outputs

| Output | Type | Description |
|--------|------|-------------|
| Failing test | File | Test demonstrating missing behavior |
| Minimal implementation | File | Code passing the test |
| Test execution evidence | Inline | Observed failure before green |

## When to Use

**Always:** new features, bug fixes, refactoring, behavior changes.

**Exceptions (ask human partner; no human available? default: apply TDD):**
- Throwaway prototypes
- Generated code
- Configuration files

Thinking "skip TDD just this once"? Stop. That's rationalization.

## The Iron Law

```
NO PRODUCTION CODE WITHOUT FAILING TEST FIRST
```

Code before test? Delete. Start over. No "reference," no "adapting," no looking at it.

## Reasoning Schema

<analysis>
Before writing ANY code:
- What behavior needs verification?
- What assertion proves that behavior?
- What's the simplest API shape?
</analysis>

<reflection>
After EACH phase:
- RED: Did test fail? Why? Expected failure mode?
- GREEN: Minimal code? No extra features?
- REFACTOR: Still green? Behavior unchanged?
</reflection>

## Red-Green-Refactor

### RED: Write Failing Test

One behavior. Clear name. Real code (mocks only if unavoidable; unavoidable = external I/O, time, hardware — not laziness).

<Good>
```typescript
test('retries failed operations 3 times', async () => {
  let attempts = 0;
  const operation = () => {
    attempts++;
    if (attempts < 3) throw new Error('fail');
    return 'success';
  };

  const result = await retryOperation(operation);

  expect(result).toBe('success');
  expect(attempts).toBe(3);
});
```
Clear name, tests real behavior, one thing
</Good>

<Bad>
```typescript
test('retry works', async () => {
  const mock = jest.fn()
    .mockRejectedValueOnce(new Error())
    .mockRejectedValueOnce(new Error())
    .mockResolvedValueOnce('success');
  await retryOperation(mock);
  expect(mock).toHaveBeenCalledTimes(3);
});
```
Vague name, tests mock not code
</Bad>

### Verify RED: Watch It Fail

**MANDATORY. Never skip.**

```bash
npm test path/to/test.test.ts
```

Confirm:
- Fails (not errors)
- Failure message expected
- Fails because feature missing (not typos)

Test passes? Testing existing behavior. Fix test.
Test errors? Fix error, re-run until it fails correctly.

### GREEN: Minimal Code

Simplest code to pass. No features, no refactoring, no "improvements."

<Good>
```typescript
async function retryOperation<T>(fn: () => Promise<T>): Promise<T> {
  for (let i = 0; i < 3; i++) {
    try {
      return await fn();
    } catch (e) {
      if (i === 2) throw e;
    }
  }
  throw new Error('unreachable');
}
```
Just enough to pass
</Good>

<Bad>
```typescript
async function retryOperation<T>(
  fn: () => Promise<T>,
  options?: {
    maxRetries?: number;
    backoff?: 'linear' | 'exponential';
    onRetry?: (attempt: number) => void;
  }
): Promise<T> {
  // YAGNI
}
```
Over-engineered
</Bad>

### Verify GREEN: Watch It Pass

**MANDATORY.**

```bash
npm test path/to/test.test.ts
```

Confirm:
- Test passes
- Other tests still pass
- Output pristine (no errors, warnings)

Test fails? Fix code, not test.
Other tests fail? Fix now.

### REFACTOR: Clean Up

After green only. Remove duplication, improve names, extract helpers. Keep tests green. Don't add behavior.

Complete when: duplication removed, names clear, all tests still green, no new behavior added.

### Repeat

Next failing test for next feature. The cycle continues until all behavior is implemented.

## Good Tests

| Quality | Good | Bad |
|---------|------|-----|
| **Minimal** | One thing. "and" in name? Split it. | `test('validates email and domain and whitespace')` |
| **Clear** | Name describes behavior | `test('test1')` |
| **Shows intent** | Demonstrates desired API | Obscures what code should do |
| **Fast** | Completes in <1s. No I/O or heavy fixtures in unit tests. | 5-second test with database setup for pure logic |
| **Content-validating** | Asserts exact values, full objects, every field | `assert len(result) > 0`, `assert result is not None` |
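
The "Minimal" rule in practice: one behavior per test, each with a name that describes it. A sketch with a hypothetical validator (names and rules are illustrative, not from this document):

```python
def validate_email(value):
    # Hypothetical validator: trimmed, non-empty, exactly one '@'.
    value = value.strip()
    return bool(value) and value.count("@") == 1

# Instead of test('validates email and domain and whitespace'), split:
def test_rejects_email_without_at_sign():
    assert validate_email("userexample.com") is False

def test_strips_surrounding_whitespace():
    assert validate_email("  user@example.com  ") is True
```

Each test now fails for exactly one reason, so a red run points directly at the broken behavior.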

## Assertion Quality

<CRITICAL>
### The Full Assertion Principle

Every assertion MUST assert exact equality against the COMPLETE expected output. This applies to ALL output -- static, dynamic, or partially dynamic. There are no categories of output exempt from this rule.

```python
# CORRECT: exact equality on complete static output
assert result == expected_complete_output

# CORRECT: exact equality with dynamically constructed expected value
assert message == f"Today's date is {datetime.date.today().isoformat()}"

# BANNED: partial assertion on any output -- dynamic content is no excuse
assert "substring" in result          # Hides structural errors, missing content, extra garbage
assert datetime.date.today().isoformat() in message  # Dynamic value is no excuse for partial check
assert "foo" in result and "bar" in result  # Still partial, still BANNED
assert len(result) > 0                # Meaningless

# BANNED: mock.ANY hides argument values
mock_fn.assert_called_with(mock.ANY, mock.ANY)
```

ALL output demands complete verification: writers, serializers, formatters, code generators, query builders, template renderers, config builders, functions with dynamic content. For multi-line output, use triple-quoted strings or dedent helpers. Output length is NEVER a justification for partial assertions.

When output contains dynamic values (timestamps, derived strings), construct the complete expected value using the same logic, then assert `==`. Do not assert partial membership of the dynamic value.
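
As a sketch of both rules at once, a complete expected value for multi-line output containing a dynamic date can be built with a dedent helper (the formatter and its format are hypothetical):

```python
import datetime
from textwrap import dedent

def render_report(rows):
    # Hypothetical formatter under test: dated header plus one line per row.
    lines = [f"Report for {datetime.date.today().isoformat()}"]
    lines += [f"- {name}: {count}" for name, count in rows]
    return "\n".join(lines)

def test_render_report_full_output():
    result = render_report([("alpha", 3), ("beta", 7)])
    # Build the COMPLETE expected value, dynamic date included,
    # then assert exact equality -- no substring checks.
    expected = dedent(f"""\
        Report for {datetime.date.today().isoformat()}
        - alpha: 3
        - beta: 7""")
    assert result == expected
```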

For mock assertions: assert EVERY call with ALL args, verify call count. `mock.ANY` is BANNED -- construct the expected argument dynamically if needed.

Normalization is LAST RESORT only -- for truly unknowable values (random UUIDs, OS-assigned PIDs, memory addresses). Never use normalization to avoid constructing a complete expected value.
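
A last-resort normalization might look like the following sketch: replace the truly unknowable value with a fixed token, then still assert exact equality on the whole output (names here are illustrative):

```python
import re
import uuid

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)

def normalize_uuids(text):
    # Replace each UUID with a fixed token so the REST of the output
    # can still be verified with complete exact equality.
    return UUID_RE.sub("<uuid>", text)

def test_created_record_message():
    record_id = str(uuid.uuid4())  # stands in for a server-generated id
    output = f"created record {record_id} (status=ok)"
    assert normalize_uuids(output) == "created record <uuid> (status=ok)"
```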
</CRITICAL>

Tests must validate CONTENT, not just EXISTENCE. Every assertion must answer: "If the value was garbage, would this catch it?"

**Reference:** Read `patterns/assertion-quality-standard.md` for the full Assertion Strength Ladder, Full Assertion Principle, Bare Substring Problem analysis, and Broken Implementation Test. Every assertion must be Level 4+ on the ladder.

### Rules

| Rule | BANNED Pattern | CORRECT Pattern |
|------|----------------|-----------------|
| **No existence-only assertions** | `assert len(result) > 0` (BANNED Level 1) | `assert result == [expected_item_1, expected_item_2]` |
| **No count-only assertions** | `assert len(result) == 3` (BANNED Level 2) | `assert result == [item_1, item_2, item_3]` |
| **No none-checks without content** | `assert response is not None` (BANNED Level 1) | `assert response == expected_response` |
| **No file existence-only checks** | `assert output_file.exists()` (BANNED Level 1) | `assert output_file.read_text() == "expected content"` |
| **No `mock.ANY` ever** | `assert_called_with(mock.ANY, mock.ANY)` (BANNED -- proves nothing) | `assert_called_with("expected_arg", expected_obj)` (construct dynamically if needed) |
| **Assert every mock call, all args** | `mock_fn.assert_called()` or `mock_fn.assert_called_once()` (no argument check) | `mock_fn.assert_has_calls([call(arg1, arg2), ...])` + verify `call_count` |
| **No partial checks on any output** | `assert "key" in result` (BANNED Level 2) | `assert result == {"key": "expected_value", ...}` (construct dynamically for dynamic output) |
| **Every field of every object** | `assert result.status == "ok"` (BANNED Level 3 without justification) | `assert result == ExpectedObject(status="ok", data=..., meta=...)` |
| **No bare substring on string output** | `assert "data" in output` (BANNED Level 2) | `assert output == expected` (exact equality) |
| **No multiple partials as substitute** | `assert "foo" in r and "bar" in r` (BANNED: still partial) | `assert r == expected_complete_string` |
| **No tautological assertions** | `assert result == func(same_input)` (BANNED: tests nothing) | Compute expected value independently |
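
The mock rules above can be sketched with `unittest.mock` (the function under test is hypothetical): assert every call with all arguments, in order, and verify the total count.

```python
from unittest import mock
from unittest.mock import call

def notify_all(send, users):
    # Hypothetical code under test: sends one message per user, in order.
    for user in users:
        send(user, "welcome")

def test_notify_all_sends_every_message():
    send = mock.Mock()
    notify_all(send, ["ann", "bob"])
    # Every call, all args, in order -- plus the count, since
    # assert_has_calls alone does not forbid extra calls.
    send.assert_has_calls([call("ann", "welcome"), call("bob", "welcome")])
    assert send.call_count == 2
```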

### Pychoir Matchers

Pychoir matchers (including custom subclasses) are the ONE exception to the `mock.ANY` ban. Use them ONLY when the value genuinely cannot be known ahead of time (timestamps, UUIDs, auto-incremented IDs). Each use requires a stated justification in a comment.

```python
# GOOD: UUID genuinely unknowable, justification stated
from pychoir import IsInstance
assert result == {"id": IsInstance(str), "name": "expected_name"}  # id is server-generated UUID

# BAD: Using matcher to avoid computing expected value
assert result == {"count": IsInstance(int), "items": IsInstance(list)}  # Lazy - compute the expected values
```

### ESCAPE Analysis (Mandatory Per Test Function)

Before marking ANY test function complete, fill in this template:

```
ESCAPE: [test_function_name]
  CLAIM: What does this test claim to verify?
  PATH:  What code actually executes?
  CHECK: What do the assertions verify?
  MUTATION: For each assertion, name the specific production code mutation it catches.
  ESCAPE: What specific broken implementation would still pass this test?
  IMPACT: What breaks in production if that broken implementation ships?
```

The MUTATION field is a forcing function: for each assertion in the test, you must name a specific, plausible production code change that would cause that assertion to fail. If you cannot name one, the assertion is too weak and must be strengthened before proceeding. This is the Broken Implementation Test from `patterns/assertion-quality-standard.md`.

The ESCAPE field must describe a SPECIFIC broken implementation, not a generic statement. "Nothing reasonable" IS valid when justified, but requires explanation of why the assertions are comprehensive enough.
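
For illustration, a filled-in analysis for the empty-email bug-fix test in this document might read (the ESCAPE and IMPACT details are hypothetical):

```
ESCAPE: test_rejects_empty_email
  CLAIM: Submitting a form with an empty email returns the 'Email required' error.
  PATH:  submitForm -> email validation branch.
  CHECK: result.error equals 'Email required' exactly.
  MUTATION: Removing the empty-email guard leaves error undefined, failing the assertion.
  ESCAPE: An implementation that rejects '' but accepts '   ' (whitespace-only) still passes.
  IMPACT: Whitespace-only emails reach production and break downstream mail delivery.
```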

**Red flags in ESCAPE analysis:**
- ESCAPE field says "none" or "nothing" without justification
- ESCAPE field describes something the test DOES catch (means you didn't think hard enough)
- ESCAPE field is copy-pasted between tests (each test has unique escape paths)

## Evidence Requirements

| Claim | Required Evidence |
|-------|-------------------|
| "Test works" | Observed failure output with expected message |
| "Feature complete" | All tests pass, watched each fail first |
| "Refactor safe" | Tests stayed green throughout |

## Anti-Patterns

<FORBIDDEN>
All of the following mean: Delete code. Start over with TDD.

- Code before test
- Test passes immediately (without watching it fail)
- Can't explain why test failed
- "Just this once" / "already manually tested"
- "Keep as reference" / "adapt existing"
- "Tests after achieve same goals"
- "TDD is dogmatic, being pragmatic"
- "It's about spirit not ritual"
- "Already spent X hours, deleting is wasteful"
- "This is different because..."
- Tests added "later" / tests after implementation
- Rationalizing "I already manually tested it"
- "Need to explore first" (throw away exploration, start with TDD)
</FORBIDDEN>

## Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" You test what you built, not what's required. You verify remembered edge cases, not discovered ones. |
| "Already manually tested" | Ad-hoc testing is not systematic. No record, can't re-run. Forgotten cases are the ones that break under pressure. |
| "Deleting X hours is wasteful" | Sunk cost fallacy. Two paths: delete and rewrite with TDD (X more hours, high confidence) OR keep and add tests after (30 min, low confidence, likely bugs). |
| "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
| "Need to explore first" | Fine. Throw away exploration, start with TDD. |
| "Test hard = design unclear" | Listen to test. Hard to test = hard to use. |
| "TDD will slow me down" | TDD is faster than debugging. Finds bugs before commit, prevents regressions, documents behavior, enables refactoring. |
| "Manual test faster" | Manual doesn't prove edge cases. You'll re-test every change. |
| "Existing code has no tests" | You're improving it. Add tests for existing code. |

## Self-Check

Before marking complete:
- [ ] Every function has test
- [ ] Watched each test fail before implementing
- [ ] Failed for expected reason (feature missing, not typo)
- [ ] Wrote minimal code to pass
- [ ] All tests pass, output pristine
- [ ] Tests use real code (mocks only if unavoidable)
- [ ] Edge cases and errors covered
- [ ] Every assertion validates CONTENT, not just existence/count
- [ ] Every assertion is Level 4+ on the Assertion Strength Ladder (`patterns/assertion-quality-standard.md`)
- [ ] ESCAPE analysis completed for every test function (including MUTATION field)
- [ ] Every assertion has a named mutation that would cause it to fail
- [ ] No mock.ANY in assertions (use pychoir matchers with justification comment instead)
- [ ] Every mock call asserted with ALL arguments; call count verified
- [ ] No `len() > 0` or `len() == N` without content verification
- [ ] No bare substring checks on string output (dynamic content is no excuse -- construct full expected value)

If ANY unchecked: Skipped TDD. Start over.

## When Stuck

| Problem | Solution |
|---------|----------|
| Don't know how to test | Write wished-for API. Assertion first. Ask human. |
| Test too complicated | Design too complicated. Simplify interface. |
| Must mock everything | Code too coupled. Dependency injection. |
| Test setup huge | Extract helpers. Still complex? Simplify design. |
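
"Dependency injection" in the table above can be as small as passing the collaborator in as a parameter. A minimal sketch with hypothetical names:

```python
import datetime

def make_greeting(name, clock=datetime.datetime.now):
    # Inject the clock so tests control time without mocking internals.
    hour = clock().hour
    period = "morning" if hour < 12 else "afternoon"
    return f"Good {period}, {name}"

def test_make_greeting_morning():
    fixed = lambda: datetime.datetime(2024, 1, 1, 9, 0)
    assert make_greeting("Ada", clock=fixed) == "Good morning, Ada"
```

Production callers pass nothing and get the real clock; tests pass a fixed one and stay deterministic.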

## Bug Fix Pattern

**Bug:** Empty email accepted

**RED**
```typescript
test('rejects empty email', async () => {
  const result = await submitForm({ email: '' });
  expect(result.error).toBe('Email required');
});
```

**Verify RED**
```bash
$ npm test
FAIL: expected 'Email required', got undefined
```

**GREEN**
```typescript
function submitForm(data: FormData) {
  if (!data.email?.trim()) {
    return { error: 'Email required' };
  }
  // ...existing logic
}
```

**Verify GREEN**
```bash
$ npm test
PASS
```

**REFACTOR**
Extract validation for multiple fields if needed.

Never fix bugs without a test.

## Testing Anti-Patterns

When adding mocks or test utilities, avoid common pitfalls:

| Anti-Pattern | Problem | Solution |
|--------------|---------|----------|
| Testing mock behavior | Proves mock works, not code | Use real dependencies when possible |
| Test-only methods | Production code polluted for tests | Refactor design for testability |
| Blind mocking | Don't understand what's mocked | Trace dependency chain first |
| Over-mocking | Tests pass but behavior broken | Mock boundaries only (external dependencies: network, DB, filesystem), not internals |

## Test Speed & Scope

Fast tests enable tight red-green-refactor cycles. Slow tests break flow.

- **Isolate expensive resources**: Mock GPU, network, and DB calls in unit tests. Real resources belong in integration tests only.
- **Smallest possible inputs**: 4x4 matrices, not 1024x1024. Save large inputs for performance/integration tests.
- **Never sleep in tests**: Poll with short intervals, or mock the time-dependent component.
- **Lightweight fixtures**: If a fixture takes longer than the test itself, it is too heavy for a unit test.
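
The "never sleep" rule can be satisfied with a short polling helper, sketched here with hypothetical names:

```python
import time

def wait_until(condition, timeout=1.0, interval=0.01):
    # Poll a condition at short intervals instead of one long sleep.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()

def test_wait_until_stops_at_first_success():
    calls = []

    def ready():
        calls.append(1)
        return len(calls) >= 3

    # Returns as soon as the condition holds; no fixed sleep.
    assert wait_until(ready, timeout=1.0, interval=0.001)
    assert len(calls) == 3
```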

Apply marks proactively when writing new tests. A test that calls a GPU kernel is a GPU test even if it is fast today. Common marks: `slow`, `gpu`/`hardware`, `network`/`external`, `integration`, `smoke`.

Scope test runs to changes: if `src/auth/login.py` changed, run `tests/test_login.py`, not the entire suite. Run the full suite once at the end of a work unit, not after every edit.

## Final Rule

```
Production code -> test exists and failed first
Otherwise -> not TDD
```

No exceptions without your human partner's permission.

<FINAL_EMPHASIS>
The test must fail first. You must watch it fail. The code must be minimal. There are no shortcuts. Every rationalization is a trap. Delete code written before tests. Start over with TDD.
</FINAL_EMPHASIS>