
analyzing-skill-usage

Auto-invocation: Your coding assistant will automatically invoke this skill when it detects a matching trigger.

Use when evaluating skill effectiveness or comparing skill versions. Triggers: 'how are skills performing', 'skill metrics', 'which skills fire correctly', 'skill invocation analysis', 'compare skill versions', 'analyze skill usage'. Also invoked by skill improvement workflows.

Workflow Diagram

Workflow for analyzing skill invocation patterns across session transcripts. Supports two analysis modes: identifying weak skills and A/B testing skill versions.

```mermaid
flowchart TD
    Start([Start]) --> LoadSessions[Load Sessions]
    LoadSessions --> DetectInvocations[Detect Skill Invocations]
    DetectInvocations --> IdentifyBoundaries[Identify Invocation Boundaries]
    IdentifyBoundaries --> ScoreInvocations[Score Each Invocation]
    ScoreInvocations --> DetectCorrections[Detect Correction Patterns]
    DetectCorrections --> AggregateMetrics[Aggregate Metrics Per Skill]
    AggregateMetrics --> ModeDecision{Analysis Mode?}
    ModeDecision -->|Weak Skills| RankByFailure[Rank By Failure Score]
    ModeDecision -->|A/B Testing| VersionDetected{Versions Detected?}
    VersionDetected -->|Yes| SampleCheck{N >= 5 per variant?}
    VersionDetected -->|No| NoComparison[Report: No Versions Found]
    SampleCheck -->|Yes| CompareVersions[Compare Version Metrics]
    SampleCheck -->|No| InsufficientData[Report: Insufficient Data]
    RankByFailure --> GenerateReport[Generate Weak Skills Report]
    CompareVersions --> StatSignificance{Statistically Significant?}
    StatSignificance -->|Yes| Recommendation[Generate Recommendation]
    StatSignificance -->|No| CaveatReport[Report With Caveats]
    Recommendation --> GenerateReport
    CaveatReport --> GenerateReport
    InsufficientData --> GenerateReport
    NoComparison --> GenerateReport
    GenerateReport --> SelfCheck{Self-Check Passed?}
    SelfCheck -->|Yes| End([End])
    SelfCheck -->|No| FixGaps[Fix Gaps In Analysis]
    FixGaps --> SelfCheck

    style Start fill:#4CAF50,color:#fff
    style End fill:#4CAF50,color:#fff
    style LoadSessions fill:#2196F3,color:#fff
    style DetectInvocations fill:#2196F3,color:#fff
    style IdentifyBoundaries fill:#2196F3,color:#fff
    style ScoreInvocations fill:#2196F3,color:#fff
    style DetectCorrections fill:#2196F3,color:#fff
    style AggregateMetrics fill:#2196F3,color:#fff
    style RankByFailure fill:#2196F3,color:#fff
    style CompareVersions fill:#2196F3,color:#fff
    style Recommendation fill:#2196F3,color:#fff
    style CaveatReport fill:#2196F3,color:#fff
    style InsufficientData fill:#2196F3,color:#fff
    style NoComparison fill:#2196F3,color:#fff
    style GenerateReport fill:#2196F3,color:#fff
    style FixGaps fill:#2196F3,color:#fff
    style ModeDecision fill:#FF9800,color:#fff
    style VersionDetected fill:#FF9800,color:#fff
    style SampleCheck fill:#FF9800,color:#fff
    style StatSignificance fill:#FF9800,color:#fff
    style SelfCheck fill:#f44336,color:#fff
```

Legend

| Color | Meaning |
|-------|---------|
| Green (#4CAF50) | Skill invocation |
| Blue (#2196F3) | Command/action |
| Orange (#FF9800) | Decision point |
| Red (#f44336) | Quality gate |

Cross-Reference

| Node | Source Reference |
|------|------------------|
| Load Sessions | Extraction Protocol, Step 1: Load Sessions |
| Detect Skill Invocations | Extraction Protocol, Step 2: Detect Skill Invocations |
| Identify Invocation Boundaries | Step 2: End Event detection |
| Score Each Invocation | Extraction Protocol, Step 3: Score Each Invocation |
| Detect Correction Patterns | Step 3: Correction Detection Patterns |
| Aggregate Metrics Per Skill | Extraction Protocol, Step 4: Aggregate Metrics |
| Analysis Mode? | Analysis Modes: Mode 1 vs Mode 2 |
| Rank By Failure Score | Mode 1: Identify Weak Skills |
| Versions Detected? | Mode 2: A/B Testing Versions |
| N >= 5 per variant? | Version Detection: Minimum 5 invocations per variant |
| Compare Version Metrics | Mode 2: A/B Comparison table |
| Statistically Significant? | Mode 2: Significant column (p<0.05) |
| Self-Check Passed? | Self-Check checklist |

Skill Content

# Analyzing Skill Usage

<ROLE>Skill Performance Analyst. You parse session transcripts, extract skill usage events, score each invocation, and produce comparative metrics. Your analysis drives skill improvement decisions. Scores derive from observable events — never speculation.</ROLE>

<analysis>Before analysis: clarify session scope, skills of interest, and comparison criteria.</analysis>
<reflection>After analysis: summarize patterns observed, statistical confidence, and actionable findings.</reflection>

## Invariant Principles

1. **Evidence Over Intuition**: Scores derive from observable session events, not speculation
2. **Context Matters**: Correction after skill completion differs from mid-workflow abandonment
3. **Version Awareness**: Track skill variants for A/B comparison when version markers present
4. **Statistical Humility**: Small sample sizes warrant tentative conclusions

## Inputs / Outputs

| Input | Required | Description |
|-------|----------|-------------|
| `session_paths` | No | Specific sessions (defaults to recent project sessions) |
| `skills` | No | Filter to specific skills (defaults to all) |
| `compare_versions` | No | If true, group by version markers for A/B analysis |

| Output | Description |
|--------|-------------|
| `skill_report` | Per-skill metrics: invocations, completion rate, correction rate, avg tokens |
| `weak_skills` | Skills ranked by failure indicators |
| `version_comparison` | A/B results when versions detected |

---

## Extraction Protocol

### 1. Load Sessions

```python
from spellbook.sessions.parser import load_jsonl, list_sessions_with_samples
from spellbook.extractors.message_utils import get_tool_calls, get_content, get_role
```

Sessions at: `~/.claude/projects/<project-encoded>/*.jsonl`
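Where `spellbook` is unavailable, loading can be approximated with the stdlib. A minimal sketch, assuming only the directory layout above and one JSON message per line (the function name is hypothetical, not spellbook's API):

```python
import json
from pathlib import Path

def load_recent_sessions(project_dir: str, limit: int = 20) -> list[list[dict]]:
    """Load the most recently modified session transcripts as lists of messages."""
    paths = sorted(
        Path(project_dir).expanduser().glob("*.jsonl"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )[:limit]
    sessions = []
    for path in paths:
        messages = []
        with path.open() as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                try:
                    messages.append(json.loads(line))
                except json.JSONDecodeError:
                    continue  # tolerate truncated/corrupt lines
        sessions.append(messages)
    return sessions
```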

### 2. Detect Skill Invocation Boundaries

**Start Event**: Tool call where `name == "Skill"`
```python
for msg in messages:
    for call in get_tool_calls(msg):
        if call.get("name") == "Skill":
            skill_name = call.get("input", {}).get("skill")
            # Record: skill, timestamp, message index
```

**End Event** (first match): another Skill tool call (superseded), session end, or compact boundary (`type == "system"`, `subtype == "compact_boundary"`)
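The start/end rules above can be sketched as a single pass over the message list. This assumes the message shapes shown in this section; `get_tool_calls` is given a minimal local stand-in so the sketch runs standalone:

```python
def get_tool_calls(msg: dict) -> list[dict]:
    """Minimal stand-in for spellbook's helper: tool_use blocks on a message."""
    content = msg.get("content", [])
    return [b for b in content if isinstance(b, dict) and b.get("type") == "tool_use"]

def find_skill_windows(messages: list[dict]) -> list[dict]:
    """Return one record per Skill invocation with [start, end) message indices."""
    windows = []
    for i, msg in enumerate(messages):
        # Compact boundary: truncates the active window.
        if msg.get("type") == "system" and msg.get("subtype") == "compact_boundary":
            if windows and windows[-1]["end"] is None:
                windows[-1]["end"] = i
            continue
        for call in get_tool_calls(msg):
            if call.get("name") == "Skill":
                # Another Skill call supersedes the previous open window.
                if windows and windows[-1]["end"] is None:
                    windows[-1]["end"] = i
                    windows[-1]["superseded"] = True
                windows.append({
                    "skill": call.get("input", {}).get("skill"),
                    "start": i, "end": None, "superseded": False,
                })
    # Session end closes the last open window.
    if windows and windows[-1]["end"] is None:
        windows[-1]["end"] = len(messages)
    return windows
```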

### 3. Score Each Invocation

**Success Signals** (+1 each):
- No user correction in skill window
- Skill ran to natural completion (not superseded)
- Artifact produced (Write/Edit tool after skill)
- User continued to new topic

**Failure Signals** (-1 each):
- User correction detected
- Same skill re-invoked within 5 messages (retry)
- Different skill invoked for apparent same task
- Skill abandoned mid-workflow (superseded without output)
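The tally above can be sketched as a small function. The boolean field names are assumptions about what boundary detection and signal detection recorded per window, not a fixed schema:

```python
def score_window(w: dict) -> int:
    """Sum the +1/-1 signals for one invocation window."""
    score = 0
    score += 0 if w.get("corrected") else 1   # no user correction in window
    score += 1 if w.get("completed") else 0   # ran to natural completion
    score += 1 if w.get("artifact") else 0    # Write/Edit tool after skill
    score += 1 if w.get("new_topic") else 0   # user continued to new topic
    score -= 1 if w.get("corrected") else 0   # user correction detected
    score -= 1 if w.get("retried") else 0     # same skill re-invoked within 5 msgs
    score -= 1 if w.get("switched_skill") else 0  # different skill, same task
    score -= 1 if w.get("abandoned") else 0   # superseded without output
    return score
```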

**Correction Detection Patterns**:
```python
CORRECTION_PATTERNS = [
    r"\bno\b(?!t)",           # "no" but not "not"
    r"\bstop\b",
    r"\bwrong\b",
    r"\bactually\b",
    r"\bdon'?t\b",
    r"\binstead\b",
    r"\bthat'?s not\b",
]
```
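In use, the patterns compile into one case-insensitive regex applied to each user message in the window; the list is repeated here so the sketch runs standalone:

```python
import re

CORRECTION_PATTERNS = [
    r"\bno\b(?!t)", r"\bstop\b", r"\bwrong\b", r"\bactually\b",
    r"\bdon'?t\b", r"\binstead\b", r"\bthat'?s not\b",
]
CORRECTION_RE = re.compile("|".join(CORRECTION_PATTERNS), re.IGNORECASE)

def is_correction(text: str) -> bool:
    """True if the user message matches any correction pattern."""
    return CORRECTION_RE.search(text) is not None
```

Note these patterns are deliberately loose ("actually" often appears in benign messages), which is why context checking is an invariant principle rather than an optimization.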

### 4. Aggregate Metrics

Per skill, produce:
```python
{
    "skill": "develop",
    "version": "v1" | None,      # If version marker detected
    "invocations": 15,
    "completions": 12,           # Ran to end without supersede
    "corrections": 3,            # User corrected during
    "retries": 1,                # Same skill re-invoked
    "avg_tokens": 4500,          # Tokens in skill window
    "completion_rate": 0.80,
    "correction_rate": 0.20,
    "score": 0.60,               # Composite score
}
```
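A sketch of the roll-up from scored windows to the record above. The composite `score` here is simply completion rate minus correction rate; the source does not define the composite, so treat the weighting as an assumption:

```python
from collections import defaultdict

def aggregate(scored_windows: list[dict]) -> list[dict]:
    """Roll scored invocation windows up into per-(skill, version) metrics."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for w in scored_windows:
        groups[(w["skill"], w.get("version"))].append(w)
    report = []
    for (skill, version), ws in sorted(groups.items(), key=lambda kv: str(kv[0])):
        n = len(ws)
        completions = sum(1 for w in ws if w.get("completed"))
        corrections = sum(1 for w in ws if w.get("corrected"))
        report.append({
            "skill": skill,
            "version": version,
            "invocations": n,
            "completions": completions,
            "corrections": corrections,
            "retries": sum(1 for w in ws if w.get("retried")),
            "avg_tokens": sum(w.get("tokens", 0) for w in ws) / n,
            "completion_rate": completions / n,
            "correction_rate": corrections / n,
            # Composite: one simple choice; the real weighting is a judgment call.
            "score": completions / n - corrections / n,
        })
    return report
```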

---

## Analysis Modes

### Mode 1: Identify Weak Skills

Rank all skills by composite failure score:

```
failure_score = (corrections + retries + abandonments) / invocations
```
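The same formula as a function, with a ranking helper. Field names follow the aggregate record; `abandonments` is an assumed extra counter for superseded-without-output windows:

```python
def failure_score(m: dict) -> float:
    """Fraction of invocations showing any failure signal."""
    if m["invocations"] == 0:
        return 0.0
    return (m["corrections"] + m["retries"] + m["abandonments"]) / m["invocations"]

def rank_weak_skills(metrics: list[dict]) -> list[dict]:
    """Highest failure score first; ties broken by invocation count (more evidence first)."""
    return sorted(metrics, key=lambda m: (-failure_score(m), -m["invocations"]))
```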

Output format:
```markdown
## Weak Skills Report
| Rank | Skill | Invocations | Failure Rate | Top Failure Mode |
|------|-------|-------------|--------------|------------------|
| 1 | gathering-requirements | 8 | 0.50 | User corrections |
```

### Mode 2: A/B Testing Versions

When version markers detected (e.g., `skill:v2` or tagged in args):
```markdown
## A/B Comparison: develop
| Metric | v1 (n=10) | v2 (n=8) | Delta | Significant |
|--------|-----------|----------|-------|-------------|
| Completion Rate | 0.70 | 0.88 | +0.18 | Yes (p<0.05) |
| Correction Rate | 0.30 | 0.12 | -0.18 | Yes |
| Avg Tokens | 5200 | 4100 | -1100 | Yes |

**Recommendation**: v2 outperforms v1 across all metrics.
```
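The table marks rows significant at p<0.05 without naming a test; for rate metrics, a pooled two-proportion z-test is one reasonable choice. A stdlib-only sketch (in practice a stats library would be the usual tool):

```python
import math

def two_proportion_p(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in two proportions (pooled z-test)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both proportions identical at 0 or 1
    z = (p_a - p_b) / se
    # P(|Z| > |z|) for a standard normal, via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))
```

At the n=8-10 sample sizes shown above, even large deltas often fail to clear p<0.05, which is exactly what the Statistical Humility principle anticipates.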

---

## Execution Steps

1. Enumerate sessions in target scope
2. Parse each session, extracting skill events
3. Score each invocation using signal detection
4. Aggregate by skill (and version if A/B)
5. Rank and report based on analysis mode
6. Surface actionable insights for skill improvement

---

## Version Detection

Look for version markers: skill name suffix (`develop:v2`), args containing version (`"--version v2"`, `"[v2]"`), or session date ranges.

<CRITICAL>
When comparing versions, require:
- Minimum 5 invocations per variant
- Similar task complexity (manual review recommended)
- Same time period when possible (avoid confounds)
</CRITICAL>

---

<FORBIDDEN>
- Drawing conclusions from <5 invocations
- Ignoring context (correction after success ≠ failure)
- Conflating skill issues with user errors
- Reporting without confidence intervals on small samples
</FORBIDDEN>

## Self-Check

- [ ] Sessions loaded and parsed successfully
- [ ] Skill invocation boundaries correctly identified
- [ ] Correction patterns detected in user messages
- [ ] Metrics aggregated per skill (and version if A/B)
- [ ] Statistical caveats noted for small samples
- [ ] Actionable recommendations provided

<FINAL_EMPHASIS>Skills improve through measurement. Extract events, score honestly, compare rigorously, recommend confidently.</FINAL_EMPHASIS>