---
name: agent-evals-lab
version: 0.3.0
description: >-
  Evaluate agent quality and reliability with practical scorecards: accuracy,
  relevance, actionability, risk flags, tool-call failures, regression checks,
  and prioritized fix plans. Use when users ask to audit agent quality, compare
  prompt/config/model changes, investigate failures, or validate performance
  after updates.
---
Agent Evals Lab
Objective
Turn subjective “agent feels better/worse” into measurable quality signals and actionable fixes.
Quickstart (5 minutes)
```
python3 scripts/eval_score.py \
  --input references/eval-cases.sample.json \
  --risk medium \
  --strict \
  --out /tmp/evals_report.json
```
Expected output: a deterministic scorecard with a Go/Conditional Go/No-Go verdict, gate reasons, and by-task deltas.
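As a rough illustration, the report could look like the sketch below. The field names and values here are assumptions for illustration, not the actual output schema of `eval_score.py`:

```python
import json

# Hypothetical report structure; the real eval_score.py fields may differ.
report = {
    "verdict": "Conditional Go",  # Go | Conditional Go | No-Go
    "gate_reasons": ["tool_reliability below threshold"],
    "dimension_averages": {"correctness": 4.2, "tool_reliability": 3.1},
    "by_task_delta": {"summarize": 0.4, "extract": -0.2},
}
print(json.dumps(report, indent=2))
```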
Use Cases
- Audit current agent quality before production rollout
- Compare baseline vs changed prompt/config/toolchain
- Catch regressions after updates
- Prioritize highest-impact fixes for next sprint
Typical trigger phrases:
- "evaluate this agent" / "audit agent quality"
- "did the last prompt change improve results?"
- "compare model A vs model B"
- "why is this workflow failing?"
- "run regression checks after update"
- "is this ready for production?"
Inputs
Collect or infer:
- Agent purpose and target tasks
- 10-30 representative test cases (prompt + expected outcome)
- Constraints (latency/cost/risk tolerance)
- Environment notes (models/tools/channels)
If test cases are missing, generate a minimal starter set and label it as synthetic.
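A minimal case could be sketched as below. The field names are assumptions, not the skill's required schema; the `synthetic` flag implements the labeling rule above:

```python
# Hypothetical test-case shape: a prompt paired with an expected outcome,
# tagged by task type and risk level. Field names are illustrative only.
case = {
    "id": "case-001",
    "task_type": "extraction",
    "risk": "medium",  # low | medium | high
    "prompt": "List the action items from this meeting transcript.",
    "expected": "Three action items, each with an owner and a due date.",
    "synthetic": True,  # generated cases must be labeled explicitly
}
print(case["id"], case["risk"])
```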
Evaluation Dimensions (required)
Score each case on:
- Correctness
- Relevance
- Actionability
- Risk flags (safety, compliance, irreversible-action risk)
- Tool reliability (wrong tool, failed execution, silent fallback)
Use a 1-5 scale plus a short evidence note per dimension.
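A per-case scorecard on this scale could be sketched as follows; the class and field names are illustrative, only the five dimension names come from the list above:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class CaseScore:
    """One evaluated case: 1-5 score plus evidence note per dimension."""
    case_id: str
    scores: dict    # dimension -> integer score, 1-5
    evidence: dict  # dimension -> short evidence note

    def average(self) -> float:
        return mean(self.scores.values())

s = CaseScore(
    case_id="case-001",
    scores={"correctness": 4, "relevance": 5, "actionability": 3,
            "risk_flags": 5, "tool_reliability": 4},
    evidence={"actionability": "answer lacked concrete next steps"},
)
print(round(s.average(), 2))  # 4.2
```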
Execution Workflow
1. Build the evaluation set
   - Use real cases first; fill gaps with synthetic cases.
   - Tag each case by task type and risk level.
2. Run a baseline evaluation (deterministic)
   - Capture outputs and tool behavior.
   - Score all required dimensions.
   - Run `scripts/eval_score.py --input <cases.json> --risk <low|medium|high> --strict`.
3. Identify failure clusters
   - Factual errors
   - Reasoning gaps
   - Tool-call failure patterns
   - Over- or under-asking for clarification
   - Hallucinated confidence
4. Propose fixes
   - Prompt, process, or tool changes
   - Rank by expected impact vs. effort
5. Re-run a focused regression set
   - Validate top fixes on high-risk and high-frequency cases.
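The roll-up step in the workflow above, from per-case scores to the by-task and by-risk breakdowns, can be sketched like this. The input data shape is an assumption; the values are made up:

```python
from collections import defaultdict
from statistics import mean

# Illustrative aggregation: average per-case scores, grouped by a tag.
cases = [
    {"task": "extract", "risk": "high",
     "scores": {"correctness": 3, "tool_reliability": 2}},
    {"task": "extract", "risk": "high",
     "scores": {"correctness": 4, "tool_reliability": 3}},
    {"task": "summarize", "risk": "low",
     "scores": {"correctness": 5, "tool_reliability": 5}},
]

def breakdown(cases, key):
    """Group cases by the given tag and average their per-case means."""
    groups = defaultdict(list)
    for c in cases:
        groups[c[key]].append(mean(c["scores"].values()))
    return {k: round(mean(v), 2) for k, v in groups.items()}

print(breakdown(cases, "task"))  # {'extract': 3.0, 'summarize': 5.0}
```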
Deterministic Gates
- Hard gate: high-risk workflows cannot be Go if the critical minimum score is below threshold.
- Hard gate: a tool reliability average below threshold means No-Go.
- Hard gate: synthetic-only evidence in high-risk mode means No-Go.
- Strict mode applies deterministic thresholds before the final recommendation.
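The three hard gates above can be sketched as a pure function. The threshold values are placeholders, not the scorer's actual defaults, and this sketch only returns Go or No-Go (the real verdict also includes Conditional Go):

```python
# Sketch of the deterministic hard gates; thresholds are placeholder values.
def verdict(risk, critical_min, tool_avg, synthetic_only,
            crit_threshold=3, tool_threshold=3.5):
    reasons = []
    if risk == "high" and critical_min < crit_threshold:
        reasons.append("critical minimum score below threshold")
    if tool_avg < tool_threshold:
        reasons.append("tool reliability average below threshold")
    if risk == "high" and synthetic_only:
        reasons.append("synthetic-only evidence in high-risk mode")
    return ("No-Go", reasons) if reasons else ("Go", [])

print(verdict("high", critical_min=4, tool_avg=3.0, synthetic_only=False))
# ('No-Go', ['tool reliability average below threshold'])
```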
Required Output Format
- Executive Summary
  - Overall score snapshot
  - Top strengths
  - Top failure modes
- Scorecard
  - Dimension averages
  - Breakdown by task type
  - Breakdown by risk level
  - Deterministic scorer output snapshot
- Failure Map
  - Cluster name
  - Frequency
  - User impact
  - Root-cause hypothesis
- Top 5 Fixes (prioritized)
  - Change
  - Expected impact
  - Effort (S/M/L)
  - Owner
  - Validation test
  - Exact implementation command(s) where applicable
- Regression Plan (1-2 weeks)
  - Cases to rerun
  - Success thresholds
  - Rollback trigger
- Go/No-Go Recommendation
  - Go / Conditional Go / No-Go
  - Conditions and next checkpoint date
- Before/After Delta
  - Overall delta
  - Critical delta
  - Tool reliability delta
  - By-task delta
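The Before/After Delta section is a straightforward metric-by-metric difference between a baseline and a candidate run. A minimal sketch, with made-up metric values:

```python
# Illustrative delta computation; metric names mirror the list above,
# values are invented for the example.
baseline = {"overall": 3.8, "critical": 3.2, "tool_reliability": 3.5}
candidate = {"overall": 4.1, "critical": 3.6, "tool_reliability": 3.4}

# Positive delta = candidate improved on baseline; negative = regression.
delta = {k: round(candidate[k] - baseline[k], 2) for k in baseline}
print(delta)  # {'overall': 0.3, 'critical': 0.4, 'tool_reliability': -0.1}
```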
Quality Rules
- Prefer measured evidence over intuition.
- Separate facts, inferences, and recommendations.
- Never claim improvement without before/after evidence.
- For high-risk workflows, require explicit human-in-the-loop checkpoints.
- Include deterministic aggregate evidence before final Go/No-Go when case data is available.
Reference
- Read `references/eval-templates.md` for reusable case templates and scoring rubrics.
- Read `references/ops-report-template.md` for the release memo format.