Post-mortem analysis for AI agent failures. Capture state, reconstruct timelines, identify root causes. When your agent breaks, know what happened, why, and...
Diagnose and fix Kubernetes failures with kubectl: CrashLoopBackOff and Pending pods, DNS, networking, storage, and rollouts.
Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience t...
Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycl...
---
name: cron-health-check
displayName: Cron Health Check | OpenClaw Skill
description: Monitors OpenClaw cron job health, identifies failures, timeouts, and delivery issues.
version: 1.0.0
---
# Cr
Apply engineering judgment across systems, constraints, trade-offs, failure modes, and verification before acting.
Diagnose and triage cron job failures. Checks job states, identifies error patterns, prioritizes by criticality, generates health reports. Triggers on: cron...
Prevent quality regressions after agent changes. Run targeted before/after checks for prompt, model, config, and tool updates; return pass rate, failure clus...
Evaluate agent quality and reliability with practical scorecards: accuracy, relevance, actionability, risk flags, tool-call failures, regression checks, and...
Design and apply replication, partitioning, consensus, failure recovery, and message ordering patterns for reliable, scalable distributed systems.
Audit GitHub Actions workflow conclusion volatility to surface unstable pipelines before they become chronic failures.
Diagnose and fix bugs using runtime execution traces. Use when debugging errors, analyzing failures, or finding root causes in Python, Node.js, or Java appli...
---
name: test-sentinel
description: Writes and runs tests (unit, integration, E2E), performs linting, and auto-fixes failures
user-invocable: true
---
# Test Sentinel
You are a QA engineer responsi
Audit GitHub merge queue workflow health with failure-rate, queue-latency, and stale-success risk scoring.
Installs a macOS or Linux service that probes the OpenClaw gateway every 2 minutes and auto-recovers it on failure, sending Telegram alerts.
Multi-queue task orchestration system. Tasks are routed to queues by model source, with support for task dependencies, context passing, and failure handling....
Detect branch-level GitHub Actions reliability drift by comparing failure and runtime deltas against a mainline baseline.
Query and control Databricks jobs via text by checking status, listing recent runs, finding failures, and triggering pipelines using the REST API.
Use when: hardening OpenClaw cron/background workers (POSIX shells: bash/sh) against brittle quoting, cwd/env drift, and false pipeline failures (SIGPIPE, pi...
Audit pull-request and merge-queue GitHub Actions reliability by scoring failure rate, queue latency, and stale-success risk for merge gates.
General-purpose self-healing loop that learns from past failures, retries safely, and records reusable fixes.
Schedule OpenClaw tasks using natural language with full cron lifecycle, timezone support, failure alerts, and execution logs without needing cron syntax.
Self-healing monitoring system for OpenClaw gateway. Auto-detects failures, fixes crashes, and sends Telegram alerts.