Agent Regression Guard
v0.3.0
Description

```yaml
name: agent-regression-guard
description: >-
  Prevent quality regressions after agent changes. Run targeted before/after
  checks for prompt, model, config, and tool updates; return pass rate, failure
  clusters, risk level, and Go/Conditional-Go/Rollback verdict with prioritized
  fixes. Use when users ask "did this update break anything", "run regression
  checks", "compare before vs after", or "is this safe to deploy".
```
Agent Regression Guard
Objective
Catch breakage early when an agent is updated, and make release decisions with evidence instead of intuition.
Quickstart (5 minutes)
```shell
python3 scripts/regression_score.py \
  --before references/baseline.sample.json \
  --after references/after.sample.json \
  --risk medium \
  --strict \
  --out /tmp/regression_report.json
```
Expected output: a deterministic report with deltas, failure clusters, gate reasons, and a verdict.
Use Cases
- Prompt revision before production rollout
- Model switch (A -> B)
- Toolchain/config changes
- Post-incident verification after hotfixes
Typical trigger phrases:
- "run regression checks"
- "did this update break anything?"
- "compare before vs after"
- "is this safe to deploy?"
Inputs
Collect or infer:
- Change summary (what changed)
- Baseline outputs or baseline metrics
- 10-30 representative test cases
- Risk level (low/medium/high)
- Success thresholds (pass rate, critical-case tolerance)
If test cases are missing, generate a minimal set and label it as synthetic.
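A test-case record covering these inputs might look like the following sketch. The field names are illustrative assumptions, not the actual schema consumed by `scripts/regression_score.py`:

```python
# Hypothetical test-case record; field names are assumptions, not the
# real schema expected by scripts/regression_score.py.
def make_case(case_id, prompt, expected, task_type, risk, synthetic=False):
    """Build one regression test case, tagged by task type and risk."""
    return {
        "id": case_id,
        "prompt": prompt,
        "expected": expected,
        "task_type": task_type,  # e.g. "qa", "tool_call", "summarize"
        "risk": risk,            # "low" | "medium" | "high"
        "synthetic": synthetic,  # True when generated, not drawn from real traffic
    }

# A minimal generated suite, explicitly labeled as synthetic:
suite = [
    make_case("c1", "What is the refund window?", "30 days", "qa", "high",
              synthetic=True),
    make_case("c2", "Summarize the incident report", "short summary",
              "summarize", "low", synthetic=True),
]
```

Labeling generated cases with an explicit `synthetic` flag is what lets the synthetic-policy gate below distinguish real baseline evidence from stand-in data.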
Workflow
- Build regression suite
  - Include high-frequency and high-risk cases first.
  - Tag each case by task type and risk.
- Run before/after comparison (deterministic)
  - Score quality on correctness, relevance, and actionability.
  - Record tool reliability failures separately.
  - Use `scripts/regression_score.py --before <baseline.json> --after <updated.json> --strict`.
- Detect failure clusters
  - Factual regressions
  - Instruction-following drift
  - Tool-call breakage
  - Safety/risk regressions
- Produce decision
  - Go
  - Conditional Go (with required fixes)
  - Rollback
- Define remediation loop
  - Top fixes by impact/effort
  - Re-test set for each fix
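The comparison step above reduces to a deterministic pass-rate delta. The sketch below is a simplified stand-in for what `scripts/regression_score.py` computes, not its actual implementation:

```python
def pass_rate(results):
    """Fraction of cases marked passed; results is a list of result dicts."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)

def compare(before, after):
    """Return pass-rate deltas overall and for critical (high-risk) cases."""
    crit_before = [r for r in before if r["risk"] == "high"]
    crit_after = [r for r in after if r["risk"] == "high"]
    return {
        "pass_rate_delta": pass_rate(after) - pass_rate(before),
        "critical_delta": pass_rate(crit_after) - pass_rate(crit_before),
    }

# Example: the update breaks the one high-risk case.
before = [{"passed": True, "risk": "high"}, {"passed": True, "risk": "low"}]
after = [{"passed": False, "risk": "high"}, {"passed": True, "risk": "low"}]
delta = compare(before, after)
```

Here the overall delta is -0.5 but the critical delta is -1.0, which is why the quality rules weight critical-case failures above average scores.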
Deterministic Gates
- Hard gate: if critical-case pass rate < threshold => no Go.
- Hard gate: if tool reliability < threshold => no Go.
- Hard gate: if any high-risk severe regression => Rollback.
- Synthetic policy: in high-risk mode without real baseline evidence => no Go.
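These gates can be expressed as a small decision function. The sketch below is illustrative only; the threshold values and metric field names are assumptions, not the scorer's real interface:

```python
def release_verdict(metrics, risk="medium",
                    critical_threshold=0.95, tool_threshold=0.98):
    """Apply the hard gates and return "Go", "Conditional Go", or "Rollback".

    Assumed metrics keys: critical_pass_rate, tool_pass_rate,
    severe_high_risk_regression (bool), has_real_baseline (bool).
    """
    # Hard gate: any high-risk severe regression forces rollback.
    if metrics.get("severe_high_risk_regression"):
        return "Rollback"
    # Hard gates: below-threshold critical or tool pass rates block a clean Go.
    if metrics["critical_pass_rate"] < critical_threshold:
        return "Conditional Go"
    if metrics["tool_pass_rate"] < tool_threshold:
        return "Conditional Go"
    # Synthetic policy: high-risk changes need real baseline evidence for a Go.
    if risk == "high" and not metrics.get("has_real_baseline", False):
        return "Conditional Go"
    return "Go"
```

Keeping the gates in one pure function makes the verdict reproducible: the same metrics always yield the same decision, which is the point of deciding with evidence instead of intuition.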
Required Output Structure
- Change Summary
  - What changed
  - Expected benefit
  - Risk level
- Regression Scorecard
  - Overall pass rate
  - Critical-case pass rate
  - Tool reliability pass rate
  - Breakdown by task type
- Failure Clusters
  - Cluster name
  - Frequency
  - User impact
  - Probable cause
- Top 5 Fixes
  - Fix action
  - Impact estimate
  - Effort (S/M/L)
  - Owner
  - Validation case IDs
  - Exact fix command(s)
- Release Verdict
  - Go / Conditional Go / Rollback
  - Conditions
  - Next checkpoint date
  - Deterministic scorer output snapshot
- Before/After Delta
  - Pass-rate delta
  - Critical delta
  - By-task delta
  - Cluster delta
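One way to hold all of the required sections in code is a plain dictionary skeleton. The keys below mirror the structure above but are illustrative; the on-disk report format is defined by the scorer, not by this sketch:

```python
def empty_report():
    """Skeleton mirroring the required output structure (keys are assumptions)."""
    return {
        "change_summary": {"what_changed": "", "expected_benefit": "",
                           "risk_level": ""},
        "scorecard": {"overall_pass_rate": None, "critical_pass_rate": None,
                      "tool_pass_rate": None, "by_task_type": {}},
        # Each cluster: name, frequency, user_impact, probable_cause.
        "failure_clusters": [],
        # Each fix: action, impact, effort, owner, case_ids, command.
        "top_fixes": [],
        "verdict": {"decision": None, "conditions": [],
                    "next_checkpoint": None, "scorer_snapshot": None},
        "delta": {"pass_rate": None, "critical": None,
                  "by_task": {}, "clusters": {}},
    }
```

Filling every section, even with empty values, keeps reports structurally comparable across releases.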
Quality Rules
- Do not claim improvement without before/after evidence.
- Critical-case failures outweigh average scores.
- High-risk workflows require explicit human approval.
- Prefer rollback over uncertain production exposure.
- Always include deterministic scorer evidence when available.
Reference
- Read `references/regression-templates.md` for reusable test suite templates and release gates.
- Read `references/ops-report-template.md` for consistent release report formatting.