Auto-diagnose agent failures, extract reusable recovery patterns, and create local skills to fix recurring blockers while keeping all data private and local.
Post-mortem analysis for AI agent failures. Capture state, reconstruct timelines, identify root causes. When your agent breaks, know what happened, why, and...
Diagnose and fix Kubernetes pod failures with kubectl: CrashLoopBackOff, Pending, DNS, networking, storage, and rollout issues.
Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycl...
Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience t...
test-writer-fixer: Use this agent when code changes have been made and you need to write new tests, run existing tests, analyze failures, and fix them while maintaining test int...
cron-health-check (Cron Health Check | OpenClaw Skill, version 1.0.0): Monitors OpenClaw cron job health, identifies failures, timeouts, and delivery issues.
Apply engineering judgment across systems, constraints, trade-offs, failure modes, and verification before acting.
Diagnose and triage cron job failures. Checks job states, identifies error patterns, prioritizes by criticality, generates health reports. Triggers on: cron...
Prevent quality regressions after agent changes. Run targeted before/after checks for prompt, model, config, and tool updates; return pass rate, failure clus...
Evaluate agent quality and reliability with practical scorecards: accuracy, relevance, actionability, risk flags, tool-call failures, regression checks, and...
test-sentinel (user-invocable): Writes and runs tests (unit, integration, E2E), performs linting, and auto-fixes failures. You are a QA engineer responsi...
Design and apply replication, partitioning, consensus, failure recovery, and message ordering patterns for reliable, scalable distributed systems.
Diagnose and fix bugs using runtime execution traces. Use when debugging errors, analyzing failures, or finding root causes in Python, Node.js, or Java appli...
Audit GitHub Actions workflow conclusion volatility to surface unstable pipelines before they become chronic failures.
Audit GitHub merge queue workflow health with failure-rate, queue-latency, and stale-success risk scoring.
Multi-queue task orchestration system. Tasks are routed to queues by model source, with support for task dependencies, context passing, and failure handling....
Use when: hardening OpenClaw cron/background workers (POSIX shells: bash/sh) against brittle quoting, cwd/env drift, and false pipeline failures (SIGPIPE, pi...
Detect branch-level GitHub Actions reliability drift by comparing failure and runtime deltas against a mainline baseline.
Installs a macOS or Linux service that probes the OpenClaw gateway every 2 minutes and auto-recovers it on failure, sending Telegram alerts.
Query and control Databricks jobs via text by checking status, listing recent runs, finding failures, and triggering pipelines using the REST API.
General-purpose self-healing loop that learns from past failures, retries safely, and records reusable fixes.
Audit pull-request and merge-queue GitHub Actions reliability by scoring failure rate, queue latency, and stale-success risk for merge gates.