Delx Ops Guardian
Automatically detects, assesses, and safely mitigates incidents in OpenClaw production agents, providing detailed reports and verified recovery.
v1.0.2
Description
name: delx-ops-guardian
summary: Incident handling and operational recovery guardrails for OpenClaw production agents.
owner: davidmosiah
status: active
Delx Ops Guardian
Use this skill when handling incidents, degraded automations, or gateway/memory instability in production.
Aliases
- emergency_recovery
- handle_incident
- cron_guard
- memory_guard
- gateway_guard
Scope (strict)
This skill is runbook-only and must operate under least privilege.
Allowed read sources:
- OpenClaw cron state (openclaw cron list --json)
- Service health/status (systemctl is-active <service>)
- Recent logs for incident window (journalctl -u <service> --since ... --no-pager)
- Workspace incident artifacts (/root/.openclaw/workspace/docs/ops/, /root/.openclaw/workspace/memory/)
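The read sources above can be collected as argument lists before anything runs, which makes the read-only scope easy to review. This is a minimal sketch: the function name `evidence_commands` and its parameters are hypothetical, and the `since` value is left to the operator because the runbook does not fix one.

```python
from typing import List


def evidence_commands(service: str, since: str) -> List[List[str]]:
    """Build the read-only evidence commands as argv lists. Nothing is executed here."""
    # Reject empty names and anything that could be parsed as an option.
    if not service or service.startswith("-"):
        raise ValueError("service must be a plain unit name")
    if not since:
        raise ValueError("since must be supplied for the incident window")
    return [
        ["openclaw", "cron", "list", "--json"],           # cron state
        ["systemctl", "is-active", service],              # service health
        ["journalctl", "-u", service, "--since", since, "--no-pager"],  # recent logs
    ]
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=False)`. `check=False` matters because `systemctl is-active` exits non-zero when the unit is not active, and that result is evidence rather than an error.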
Allowed remediation actions (safe set):
- Retry a failed job once when failure is transient.
- Controlled restart of the impacted service only (openclaw-gateway, openclaw, or explicitly named target from incident evidence).
- Disable/enable only the directly impacted cron job when loop-failing.
- Add/adjust guardrails in runbook/config docs (non-secret, reversible).
Disallowed actions:
- No credential rotation/deletion.
- No firewall/network policy mutations.
- No package installs/upgrades during incident handling.
- No bulk cron rewrites unrelated to the incident.
- No edits to unrelated services/components.
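One way to enforce the allowed and disallowed lists above is a default-deny check: only actions in the safe set pass, and only against components the incident evidence names as impacted. The action labels and the `is_permitted` helper below are hypothetical names chosen for illustration, not part of the skill.

```python
from typing import AbstractSet

# Hypothetical labels for the four safe remediation actions.
SAFE_ACTIONS = frozenset(
    {"retry_job", "restart_service", "toggle_cron_job", "edit_guardrail_doc"}
)
# Actions that must target a component named in the incident evidence.
TARGETED_ACTIONS = frozenset({"retry_job", "restart_service", "toggle_cron_job"})


def is_permitted(action: str, target: str, impacted: AbstractSet[str]) -> bool:
    """Return True only for safe-set actions aimed at an impacted component."""
    if action not in SAFE_ACTIONS:
        # Default deny: credential, firewall, package, and bulk cron changes land here.
        return False
    if action in TARGETED_ACTIONS and target not in impacted:
        # Unrelated services and jobs are out of scope.
        return False
    return True
```

Anything not explicitly listed is refused, so a new action type has to be added to the safe set deliberately rather than slipping through.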
Approval policy (human-in-the-loop)
Require explicit human approval before:
- Restarting any production service more than once.
- Editing cron schedules/timezones.
- Disabling a job for more than one cycle.
- Any action with user-visible impact beyond the failing component.
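The four approval triggers above can be expressed as a single predicate evaluated before each remediation step. This is a sketch under assumptions: `ProposedAction`, its field names, and the `kind` labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    kind: str                      # hypothetical label, e.g. "restart_service"
    restarts_so_far: int = 0       # restarts already performed in this incident
    disabled_cycles: int = 0       # cycles the job would stay disabled
    impact_beyond_failing_component: bool = False


def needs_human_approval(action: ProposedAction) -> bool:
    """Return True when the approval policy requires an explicit human sign-off."""
    if action.kind == "restart_service" and action.restarts_so_far >= 1:
        return True  # this would be a second (or later) restart
    if action.kind == "edit_cron_schedule":
        return True  # schedule and timezone edits always need approval
    if action.kind == "disable_cron_job" and action.disabled_cycles > 1:
        return True  # disabling for more than one cycle
    return action.impact_beyond_failing_component
```

For example, a first restart of the impacted service passes without approval, while the same restart with `restarts_so_far=1` does not.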
Core workflow
- Detect and classify severity (info, degraded, critical).
- Collect evidence first (status, logs, last run, error streak).
- Propose smallest remediation from allowed set.
- Execute only approved/safe remediation.
- Verify stabilization window (at least one successful cycle).
- Publish concise incident report.
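The verification step in the workflow above could be checked as follows. This sketch reads "at least one successful cycle" slightly more strictly than the text requires: the most recent cycles after the fix must be successes, so a success followed by a new failure does not count. The function name and that stricter reading are assumptions.

```python
from typing import Sequence


def is_stabilized(post_fix_runs: Sequence[bool], min_successes: int = 1) -> bool:
    """Check the stabilization window.

    post_fix_runs holds run outcomes after remediation, oldest first
    (True means the cycle succeeded).
    """
    if min_successes < 1:
        raise ValueError("min_successes must be at least 1")
    streak = 0
    # Count consecutive successes, walking back from the newest run.
    for succeeded in reversed(post_fix_runs):
        if not succeeded:
            break
        streak += 1
    return streak >= min_successes
```

With no runs recorded since the fix, the function returns False, so the incident stays unverified until a cycle has actually completed.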
Safety rules
- Never report a persistent failure as a success.
- Never expose secrets/tokens in logs or reports.
- Prefer reversible actions and document rollback path.
- Keep blast radius minimal and explicitly stated.
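For the rule about secrets, log excerpts can be passed through a redaction step before they reach a report. The pattern below is a heuristic sketch with an assumed keyword list. It only catches `key=value` and `key: value` forms, so it complements manual review of excerpts rather than replacing it.

```python
import re

# Assumed keyword list; extend it to match the secrets your services actually log.
_SECRET_PATTERN = re.compile(
    r"(token|secret|password|api[_-]?key)\b(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(line: str) -> str:
    """Replace the value after a secret-looking key with a fixed placeholder."""
    return _SECRET_PATTERN.sub(r"\1\2[REDACTED]", line)
```

For example, `redact("retry failed: token=abc123 status=401")` returns `"retry failed: token=[REDACTED] status=401"`.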
Output contract
Always include:
- Incident id/time window
- Root signal and blast radius
- Actions executed (and approvals)
- Evidence (status, key metric, short log excerpt)
- Final state (resolved, degraded, open)
- Next check time
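The contract above maps naturally onto a small record type that refuses to build an incomplete report. The sketch below is illustrative: the class name, field names, and plain-text layout are assumptions. Only the required items and the three final states come from the list above.

```python
from dataclasses import dataclass
from typing import List

FINAL_STATES = ("resolved", "degraded", "open")


@dataclass
class IncidentReport:
    incident_id: str
    time_window: str
    root_signal: str
    blast_radius: str
    actions: List[str]    # each entry notes the action and any approval given
    evidence: List[str]   # status, key metric, short log excerpt
    final_state: str
    next_check_time: str

    def __post_init__(self) -> None:
        if self.final_state not in FINAL_STATES:
            raise ValueError(f"final_state must be one of {FINAL_STATES}")
        # Every text field is mandatory; an empty one means the report is incomplete.
        empty = [k for k, v in vars(self).items() if isinstance(v, str) and not v]
        if empty:
            raise ValueError(f"missing fields: {empty}")
        if not self.evidence:
            raise ValueError("at least one piece of evidence is required")

    def render(self) -> str:
        """Render the report as plain text, one contract item per line."""
        return "\n".join([
            f"Incident: {self.incident_id} ({self.time_window})",
            f"Root signal: {self.root_signal}",
            f"Blast radius: {self.blast_radius}",
            "Actions: " + ("; ".join(self.actions) or "none"),
            "Evidence: " + "; ".join(self.evidence),
            f"Final state: {self.final_state}",
            f"Next check: {self.next_check_time}",
        ])
```

Validating at construction time means a report with a missing field or an unknown final state fails before it is published.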
Example intents
- "Gateway is flapping, recover safely."
- "Cron timed out, stabilize and prove fix."
- "Memory guard firing repeatedly, root-cause and patch."