Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycl...
Validate OpenClaw skills during authoring. Use when creating, revising, or preparing a skill for release and you need to scaffold `evals/` files, check readi...
Autonomous engine that systematically evaluates and ranks agent skills across models using rubric grading, error taxonomy, and improvement feedback loops.
AI Agent Skill unit testing framework. A framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills. Use this...
Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building eval...
Evaluate your RAG pipeline quality using Ragas metrics (faithfulness, answer relevancy, context precision).
Shadow-test local Ollama models against a cloud baseline with a multi-judge ensemble. Automatically promotes models when statistically proven equivalent — re...
Evaluate agent quality and reliability with practical scorecards: accuracy, relevance, actionability, risk flags, tool-call failures, regression checks, and...
Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.
Create new skills, modify and improve existing skills, and measure skill performance with eval-driven iteration. Use when users want to create a skill from s...
Create, evaluate, improve, benchmark, and publish OpenClaw skills. Use when building a new skill from scratch, iterating on an existing skill, running evals...
Three-agent pipeline orchestrator (Kalshalyst, Eval, Executor) for automated Kalshi prediction market trading with validation loops and retry logic.
Simplified CLI tools for camoufox anti-detection browser automation. Provides fox-open, fox-scrape, fox-eval, fox-close, and fox-bilibili-stats commands for...
Access and interact with AI group interview simulations: browse jobs, create/join rooms, speak, advance interviews, upload resumes, and view history and eval...
OpenClaw continuity kernel for fail-open llm_input injection, deterministic runtime contracts, and shadow-mode eval receipts.
Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, update or optimize...
Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize a...
Design, test, review, and maintain agent skills for OpenClaw systems using multi-agent iterative refinement. Orchestrates Designer, Reviewer, and Tester suba...
Graph-based reasoning with thought combination and feedback loops. Explores multiple solution paths simultaneously, combines insights, and synthesizes optima...
Analyzes an existing Claude Code skill and designs an optimal rules/ file structure. Covers three operations: (1) compressing SKILL.md by moving verbose cont...
Version tracking for Agent Skills bundles and their associated files across sessions, surfaces, and platforms. Use when creating, editing, versioning, valida...
AI Confidence Engine: 5 bidirectional domains (TECH/OPS/JUDGMENT/COMMS/ORCH). Agent + User scoring. Triggers: puntúa, auto-score, task-complete, idea-val...