Agent Regression Guard
v0.3.0
Description

```yaml
name: agent-regression-guard
description: >-
  Prevent quality regressions after agent changes. Run targeted before/after
  checks for prompt, model, config, and tool updates; return pass rate, failure
  clusters, risk level, and Go/Conditional-Go/Rollback verdict with prioritized
  fixes. Use when users ask "did this update break anything", "run regression
  checks", "compare before vs after", or "is this safe to deploy".
```
Agent Regression Guard
Objective
Catch breakage early when an agent is updated, and make release decisions with evidence instead of intuition.
Quickstart (5 minutes)
```shell
python3 scripts/regression_score.py \
  --before references/baseline.sample.json \
  --after references/after.sample.json \
  --risk medium \
  --strict \
  --out /tmp/regression_report.json
```
Expected output: a deterministic report with deltas, failure clusters, gate reasons, and a verdict.
Use Cases
- Prompt revision before production rollout
- Model switch (A -> B)
- Toolchain/config changes
- Post-incident verification after hotfixes
Typical trigger phrases:
- "run regression checks"
- "did this update break anything?"
- "compare before vs after"
- "is this safe to deploy?"
Inputs
Collect or infer:
- Change summary (what changed)
- Baseline outputs or baseline metrics
- 10-30 representative test cases
- Risk level (low/medium/high)
- Success thresholds (pass rate, critical-case tolerance)
If test cases are missing, generate a minimal set and label it as synthetic.
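A test-case record covering these inputs might look like the following sketch. The field names are illustrative assumptions, not the actual schema consumed by `scripts/regression_score.py`:

```python
# Hypothetical test-case record; field names are assumptions, not the
# real schema expected by scripts/regression_score.py.
def make_case(case_id, prompt, expected, task_type, risk, synthetic=False):
    """Build one regression test case, tagged by task type and risk."""
    return {
        "id": case_id,
        "prompt": prompt,
        "expected": expected,
        "task_type": task_type,  # e.g. "qa", "tool_call", "summarize"
        "risk": risk,            # "low" | "medium" | "high"
        "synthetic": synthetic,  # True when generated, not drawn from real traffic
    }

# A minimal generated suite, explicitly labeled as synthetic:
suite = [
    make_case("c1", "What is the refund window?", "30 days", "qa", "high",
              synthetic=True),
    make_case("c2", "Summarize the incident report", "short summary",
              "summarize", "low", synthetic=True),
]
```

Labeling generated cases with an explicit `synthetic` flag is what lets the synthetic-policy gate below distinguish real baseline evidence from stand-in data.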
Workflow
- Build regression suite
  - Include high-frequency and high-risk cases first.
  - Tag each case by task type and risk.
- Run before/after comparison (deterministic)
  - Score quality on correctness, relevance, and actionability.
  - Record tool reliability failures separately.
  - Use `scripts/regression_score.py --before <baseline.json> --after <updated.json> --strict`.
- Detect failure clusters
  - Factual regressions
  - Instruction-following drift
  - Tool-call breakage
  - Safety/risk regressions
- Produce decision
  - Go
  - Conditional Go (with required fixes)
  - Rollback
- Define remediation loop
  - Top fixes by impact/effort
  - Re-test set for each fix
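The comparison step above reduces to a deterministic pass-rate delta. The sketch below is a simplified stand-in for what `scripts/regression_score.py` computes, not its actual implementation:

```python
def pass_rate(results):
    """Fraction of cases marked passed; results is a list of result dicts."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)

def compare(before, after):
    """Return pass-rate deltas overall and for critical (high-risk) cases."""
    crit_before = [r for r in before if r["risk"] == "high"]
    crit_after = [r for r in after if r["risk"] == "high"]
    return {
        "pass_rate_delta": pass_rate(after) - pass_rate(before),
        "critical_delta": pass_rate(crit_after) - pass_rate(crit_before),
    }

# Example: the update breaks the one high-risk case.
before = [{"passed": True, "risk": "high"}, {"passed": True, "risk": "low"}]
after = [{"passed": False, "risk": "high"}, {"passed": True, "risk": "low"}]
delta = compare(before, after)
```

Here the overall delta is -0.5 but the critical delta is -1.0, which is why the quality rules weight critical-case failures above average scores.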
Deterministic Gates
- Hard gate: if critical-case pass rate < threshold => no Go.
- Hard gate: if tool reliability < threshold => no Go.
- Hard gate: if any high-risk severe regression => Rollback.
- Synthetic policy: in high-risk mode without real baseline evidence => no Go.
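These gates can be expressed as a small decision function. The sketch below is illustrative only; the threshold values and metric field names are assumptions, not the scorer's real interface:

```python
def release_verdict(metrics, risk="medium",
                    critical_threshold=0.95, tool_threshold=0.98):
    """Apply the hard gates and return "Go", "Conditional Go", or "Rollback".

    Assumed metrics keys: critical_pass_rate, tool_pass_rate,
    severe_high_risk_regression (bool), has_real_baseline (bool).
    """
    # Hard gate: any high-risk severe regression forces rollback.
    if metrics.get("severe_high_risk_regression"):
        return "Rollback"
    # Hard gates: below-threshold critical or tool pass rates block a clean Go.
    if metrics["critical_pass_rate"] < critical_threshold:
        return "Conditional Go"
    if metrics["tool_pass_rate"] < tool_threshold:
        return "Conditional Go"
    # Synthetic policy: high-risk changes need real baseline evidence for a Go.
    if risk == "high" and not metrics.get("has_real_baseline", False):
        return "Conditional Go"
    return "Go"
```

Keeping the gates in one pure function makes the verdict reproducible: the same metrics always yield the same decision, which is the point of deciding with evidence instead of intuition.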
Required Output Structure
- Change Summary
  - What changed
  - Expected benefit
  - Risk level
- Regression Scorecard
  - Overall pass rate
  - Critical-case pass rate
  - Tool reliability pass rate
  - Breakdown by task type
- Failure Clusters
  - Cluster name
  - Frequency
  - User impact
  - Probable cause
- Top 5 Fixes
  - Fix action
  - Impact estimate
  - Effort (S/M/L)
  - Owner
  - Validation case IDs
  - Exact fix command(s)
- Release Verdict
  - Go / Conditional Go / Rollback
  - Conditions
  - Next checkpoint date
  - Deterministic scorer output snapshot
- Before/After Delta
  - Pass-rate delta
  - Critical delta
  - By-task delta
  - Cluster delta
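One way to hold all of the required sections in code is a plain dictionary skeleton. The keys below mirror the structure above but are illustrative; the on-disk report format is defined by the scorer, not by this sketch:

```python
def empty_report():
    """Skeleton mirroring the required output structure (keys are assumptions)."""
    return {
        "change_summary": {"what_changed": "", "expected_benefit": "",
                           "risk_level": ""},
        "scorecard": {"overall_pass_rate": None, "critical_pass_rate": None,
                      "tool_pass_rate": None, "by_task_type": {}},
        # Each cluster: name, frequency, user_impact, probable_cause.
        "failure_clusters": [],
        # Each fix: action, impact, effort, owner, case_ids, command.
        "top_fixes": [],
        "verdict": {"decision": None, "conditions": [],
                    "next_checkpoint": None, "scorer_snapshot": None},
        "delta": {"pass_rate": None, "critical": None,
                  "by_task": {}, "clusters": {}},
    }
```

Filling every section, even with empty values, keeps reports structurally comparable across releases.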
Quality Rules
- Do not claim improvement without before/after evidence.
- Critical-case failures outweigh average scores.
- High-risk workflows require explicit human approval.
- Prefer rollback over uncertain production exposure.
- Always include deterministic scorer evidence when available.
Reference
- Read `references/regression-templates.md` for reusable test suite templates and release gates.
- Read `references/ops-report-template.md` for consistent release report formatting.