Agent Regression Guard

v0.3.0


name: agent-regression-guard
description: >-
  Prevent quality regressions after agent changes. Run targeted before/after
  checks for prompt, model, config, and tool updates; return pass rate,
  failure clusters, risk level, and Go/Conditional-Go/Rollback verdict with
  prioritized fixes. Use when users ask "did this update break anything",
  "run regression checks", "compare before vs after", or "is this safe to
  deploy".

Objective

Catch breakage early when an agent is updated, and make release decisions with evidence instead of intuition.

Quickstart (5 minutes)

python3 scripts/regression_score.py \
  --before references/baseline.sample.json \
  --after references/after.sample.json \
  --risk medium \
  --strict \
  --out /tmp/regression_report.json

Expected output: a deterministic report written to /tmp/regression_report.json, containing deltas, failure clusters, gate reasons, and a verdict.

Use Cases

  • Prompt revision before production rollout
  • Model switch (A -> B)
  • Toolchain/config changes
  • Post-incident verification after hotfixes

Typical trigger phrases:

  • "run regression checks"
  • "did this update break anything?"
  • "compare before vs after"
  • "is this safe to deploy?"

Inputs

Collect or infer:

  • Change summary (what changed)
  • Baseline outputs or baseline metrics
  • 10-30 representative test cases
  • Risk level (low/medium/high)
  • Success thresholds (pass rate, critical-case tolerance)

If test cases are missing, generate a minimal set and label it as synthetic.
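
The inputs above can be captured as a small typed record so that every case carries its tags and a synthetic flag. This is a minimal sketch, not the schema that scripts/regression_score.py reads: the names `RegressionCase`, `VALID_RISKS`, and `has_real_evidence`, and every field name, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Risk levels named in the Inputs list.
VALID_RISKS = ("low", "medium", "high")


@dataclass(frozen=True)
class RegressionCase:
    """One regression test case (hypothetical schema, for illustration only)."""
    case_id: str
    task_type: str
    risk: str
    prompt: str
    critical: bool = False
    synthetic: bool = False  # True when the case was generated, not collected

    def __post_init__(self):
        # Reject typos such as "hgih" early instead of at scoring time.
        if self.risk not in VALID_RISKS:
            raise ValueError(f"unknown risk level: {self.risk!r}")


def has_real_evidence(cases):
    """True if at least one case is not synthetic; an empty suite has none."""
    return any(not case.synthetic for case in cases)
```

Keeping the synthetic flag on each case, rather than on the suite as a whole, makes it possible to report how much of a mixed suite is backed by real examples.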

Workflow

  1. Build regression suite
     • Include high-frequency and high-risk cases first.
     • Tag each case by task type and risk.
  2. Run before/after comparison (deterministic)
     • Score quality on correctness, relevance, actionability.
     • Record tool reliability failures separately.
     • Use scripts/regression_score.py --before <baseline.json> --after <updated.json> --strict.
  3. Detect failure clusters
     • Factual regressions
     • Instruction-following drift
     • Tool-call breakage
     • Safety/risk regressions
  4. Produce decision
     • Go
     • Conditional Go (with required fixes)
     • Rollback
  5. Define remediation loop
     • Top fixes by impact/effort
     • Re-test set for each fix
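
The comparison and clustering steps of the workflow can be sketched as follows. This is an illustration under stated assumptions and not the logic of scripts/regression_score.py: it assumes each result is a dict with the hypothetical keys `case_id`, `passed`, `critical`, `tool_ok`, and an optional `cluster` label assigned during review.

```python
from collections import Counter


def pass_rate(results, key="passed", only=None):
    """Share of results where `key` is truthy, optionally restricted to
    results where the `only` flag is truthy. Returns None if nothing matches."""
    selected = [r for r in results if only is None or r.get(only)]
    if not selected:
        return None
    return sum(1 for r in selected if r[key]) / len(selected)


def compare(before, after):
    """Summarize an updated run against its baseline."""
    before_by_id = {r["case_id"]: r for r in before}
    # A regression is a case that passed before the change and fails after it.
    regressed = [
        r for r in after
        if not r["passed"] and before_by_id.get(r["case_id"], {}).get("passed")
    ]
    clusters = Counter(r.get("cluster", "unlabeled") for r in regressed)
    return {
        "pass_rate_before": pass_rate(before),
        "pass_rate_after": pass_rate(after),
        "critical_pass_rate_after": pass_rate(after, only="critical"),
        # Tool reliability is tracked separately from answer quality.
        "tool_reliability_after": pass_rate(after, key="tool_ok"),
        "regressed_case_ids": sorted(r["case_id"] for r in regressed),
        "clusters": dict(clusters),
    }
```

Only cases that flip from pass to fail count as regressions here; cases that failed in both runs are left out of the clusters so that pre-existing problems are not attributed to the change.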

Deterministic Gates

  • Hard gate: if the critical-case pass rate is below its threshold => no Go.
  • Hard gate: if tool reliability is below its threshold => no Go.
  • Hard gate: if there is any high-risk severe regression => Rollback.
  • Synthetic policy: in high-risk mode, a run without real baseline evidence => no Go.
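
The gates above can be expressed as a small pure function. This is a hedged sketch rather than the scorer's actual implementation: the function name `release_verdict`, its parameters, and the metric keys are hypothetical, and mapping "no Go" to "Conditional Go" is an assumption (the source only says that Go is ruled out).

```python
def release_verdict(metrics, thresholds, risk, has_real_baseline,
                    severe_high_risk_regressions):
    """Return (verdict, reasons) from the four gates.

    Assumption: "no Go" is reported as "Conditional Go"; a stricter policy
    could map it to "Rollback" instead.
    """
    # A severe regression in a high-risk case overrides every other signal.
    if severe_high_risk_regressions > 0:
        return "Rollback", ["high-risk severe regression present"]

    reasons = []
    if metrics["critical_pass_rate"] < thresholds["critical_pass_rate"]:
        reasons.append("critical-case pass rate below threshold")
    if metrics["tool_reliability"] < thresholds["tool_reliability"]:
        reasons.append("tool reliability below threshold")
    if risk == "high" and not has_real_baseline:
        reasons.append("high-risk mode without real baseline evidence")

    if reasons:
        return "Conditional Go", reasons
    return "Go", []
```

Returning the reasons alongside the verdict keeps the decision auditable: each blocked release names the gate that blocked it.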

Required Output Structure

  1. Change Summary
     • What changed
     • Expected benefit
     • Risk level
  2. Regression Scorecard
     • Overall pass rate
     • Critical-case pass rate
     • Tool reliability pass rate
     • By task-type breakdown
  3. Failure Clusters
     • Cluster name
     • Frequency
     • User impact
     • Probable cause
  4. Top 5 Fixes
     • Fix action
     • Impact estimate
     • Effort (S/M/L)
     • Owner
     • Validation case IDs
     • Exact fix command(s)
  5. Release Verdict
     • Go / Conditional Go / Rollback
     • Conditions
     • Next checkpoint date
     • Deterministic scorer output snapshot
  6. Before/After Delta
     • pass-rate delta
     • critical delta
     • by-task delta
     • cluster delta
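
As an illustration of the by-task breakdown and by-task delta, the sketch below groups results by task type and subtracts the baseline pass rate from the updated one. It assumes each result is a dict with the hypothetical keys `task_type` and `passed`; the real report format produced by the scorer may differ.

```python
from collections import defaultdict


def pass_rate_by_task(results):
    """Map each task type to its pass rate."""
    totals = defaultdict(lambda: [0, 0])  # task_type -> [passed, total]
    for r in results:
        bucket = totals[r["task_type"]]
        bucket[0] += 1 if r["passed"] else 0
        bucket[1] += 1
    return {task: passed / total for task, (passed, total) in totals.items()}


def by_task_delta(before, after):
    """After-minus-before pass rate for task types present in both runs."""
    b = pass_rate_by_task(before)
    a = pass_rate_by_task(after)
    return {task: round(a[task] - b[task], 4) for task in sorted(b.keys() & a.keys())}
```

A negative delta for one task type can be hidden by gains elsewhere in the overall pass rate, which is why the breakdown is reported alongside the aggregate.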

Quality Rules

  • Do not claim improvement without before/after evidence.
  • Critical-case failures outweigh average scores.
  • High-risk workflows require explicit human approval.
  • Prefer rollback over uncertain production exposure.
  • Always include deterministic scorer evidence when available.

Reference

  • Read references/regression-templates.md for reusable test suite templates and release gates.
  • Read references/ops-report-template.md for consistent release report formatting.
