
Token Guard

Prevents LLM API 429 errors by estimating tokens before each call, tracking quotas, throttling requests, detecting duplicates, caching responses, and automatically falling back to alternate models.



TokenGuard — LLM API 429 Prevention Engine

Version: 1.5.0
Author: Aoineco & Co.
License: MIT
Tags: rate-limit, 429, token-management, cost-optimization, llm-guard, high-performance

Description

Prevents LLM API 429 (Rate Limit / Resource Exhausted) errors by intercepting requests before they're sent. Designed for users on free/low-cost API plans who need maximum intelligence per dollar.

Core philosophy: "Intelligence is measured not by how much you spend, but by how little you need."

Problem

When using LLM APIs (especially Google Gemini Flash with 1M TPM limit):

  • Large documents (docx, PDFs) can consume the entire minute quota in one request
  • Failed requests still count toward token usage
  • Retry loops after 429 errors waste more tokens → death spiral
  • No built-in way to detect runaway/duplicate requests

Features

  • Pre-flight Token Estimation: estimates the token count before the API call (CJK-aware, no tiktoken dependency; sketched below)
  • Real-time Quota Tracking: tracks per-model, per-minute token usage with a sliding window (a sketch follows the model list)
  • Smart Throttle: automatically waits when quota usage exceeds 80% and blocks above 95% (sketched below)
  • Duplicate Detection: blocks identical requests repeated within a 60-second window; three or more repeats are treated as a runaway loop (sketched below)
  • Response Caching: caches successful responses and serves them for duplicate requests
  • Auto Model Fallback: switches to a cheaper or still-available model when the primary's quota is exhausted
  • 429 Error Parser: extracts the exact retry delay from Google/Anthropic error responses (a sketch follows the Usage section)
  • Batch vs. Mistake Detection: distinguishes intentional bulk processing from error loops
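
The skill's internals aren't reproduced on this page, but the first few rows map to small pieces of Python. Below is a minimal illustrative sketch, not TokenGuard's actual code: the names (estimate_tokens, throttle_action, DuplicateDetector) and the exact CJK heuristic are assumptions; only the 80%/95% thresholds and the 60-second/3-repeat rule come from the table above.

import hashlib
import time
from collections import deque

def estimate_tokens(text: str) -> int:
    # Illustrative heuristic only (not TokenGuard's actual formula):
    # count CJK characters at ~1 token each, everything else at
    # ~4 characters per token. No tiktoken needed.
    cjk = sum(
        1 for ch in text
        if "\u4e00" <= ch <= "\u9fff"   # CJK unified ideographs
        or "\u3040" <= ch <= "\u30ff"   # Japanese kana
        or "\uac00" <= ch <= "\ud7af"   # Korean hangul
    )
    other = len(text) - cjk
    return cjk + (other + 3) // 4

def throttle_action(used: int, limit: int) -> str:
    # The documented thresholds: wait above 80% usage, block above 95%.
    pct = used / limit
    if pct > 0.95:
        return "block"
    if pct > 0.80:
        return "wait"
    return "proceed"

class DuplicateDetector:
    # Flags identical prompts repeated within a 60s window;
    # 3+ repeats are treated as a runaway loop.
    def __init__(self, window: float = 60.0, runaway_at: int = 3):
        self.window = window
        self.runaway_at = runaway_at
        self.seen: dict[str, deque] = {}

    def check(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        now = time.time()
        hits = self.seen.setdefault(key, deque())
        while hits and now - hits[0] > self.window:
            hits.popleft()  # forget sightings older than the window
        hits.append(now)
        if len(hits) >= self.runaway_at:
            return "runaway"
        return "duplicate" if len(hits) > 1 else "new"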

Supported Models

Pre-configured quotas for:

  • gemini-3-flash (1M TPM)
  • gemini-3-pro (2M TPM)
  • claude-haiku (50K TPM)
  • claude-sonnet (200K TPM)
  • claude-opus (200K TPM)
  • gpt-4o (800K TPM)
  • deepseek (1M TPM)

Custom quotas can be added for any model.
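
How the sliding window works isn't shown here; a common approach, sketched below under the assumption of a plain 60-second rolling window, is a deque of (timestamp, tokens) pairs per model. The TPM_LIMITS map is seeded from the list above; QuotaWindow and its methods are illustrative names, not the skill's API.

import time
from collections import deque

# Seeded from the pre-configured quotas listed above.
TPM_LIMITS = {
    "gemini-3-flash": 1_000_000,
    "gemini-3-pro": 2_000_000,
    "claude-haiku": 50_000,
    "claude-sonnet": 200_000,
    "claude-opus": 200_000,
    "gpt-4o": 800_000,
    "deepseek": 1_000_000,
}

class QuotaWindow:
    # Tracks tokens used per model over a rolling 60-second window.
    def __init__(self, limits: dict[str, int] = TPM_LIMITS):
        self.limits = dict(limits)  # custom models: limits["my-model"] = 500_000
        self.events: dict[str, deque] = {}

    def record(self, model: str, tokens: int) -> None:
        self.events.setdefault(model, deque()).append((time.time(), tokens))

    def used(self, model: str) -> int:
        now = time.time()
        q = self.events.get(model, deque())
        while q and now - q[0][0] > 60.0:  # expire events older than a minute
            q.popleft()
        return sum(tokens for _, tokens in q)

    def remaining(self, model: str) -> int:
        return self.limits[model] - self.used(model)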

Usage

import time

from token_guard import TokenGuard

guard = TokenGuard()

# Before every API call:
decision = guard.check(prompt_text, model="gemini-3-flash")

if decision.action == "proceed":
    response = call_your_api(prompt_text)
    guard.record_usage(decision.estimated_tokens, model="gemini-3-flash")
    guard.cache_response(prompt_text, response)

elif decision.action == "wait":
    time.sleep(decision.wait_seconds)
    # retry

elif decision.action == "fallback":
    response = call_your_api(prompt_text, model=decision.fallback_model)

elif decision.action == "block":
    print(f"Blocked: {decision.reason}")

# If you get a 429 error:
guard.record_429("gemini-3-flash", retry_delay=53.0)
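
Where does retry_delay=53.0 come from? The 429 Error Parser extracts it from the provider's error payload. TokenGuard's own parser isn't shown here, but as a hedged sketch: Google-style 429 bodies usually carry a google.rpc.RetryInfo detail with a retryDelay string such as "53s", while Anthropic-style responses send a Retry-After header in seconds. The helper below (parse_retry_delay is a made-up name) illustrates that extraction.

import json
import re

def parse_retry_delay(status_code: int, body: str, headers: dict) -> float | None:
    # Best-effort extraction of a retry delay from a 429 response.
    # Field names follow the common Google/Anthropic error shapes;
    # this is not a documented TokenGuard function.
    if status_code != 429:
        return None
    # Anthropic-style: a Retry-After header, in seconds.
    retry_after = headers.get("retry-after") or headers.get("Retry-After")
    if retry_after:
        try:
            return float(retry_after)
        except ValueError:
            pass
    # Google-style: error.details[] may include RetryInfo with "retryDelay": "53s".
    try:
        details = json.loads(body).get("error", {}).get("details", [])
        for detail in details:
            if detail.get("@type", "").endswith("RetryInfo"):
                match = re.match(r"([\d.]+)s", detail.get("retryDelay", ""))
                if match:
                    return float(match.group(1))
    except (ValueError, AttributeError):
        pass
    return None

The parsed value can then be passed to guard.record_429(model, retry_delay=...), falling back to a conservative default when nothing can be extracted.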

Integration with OpenClaw

Add to your agent's config or use as a middleware:

skills:
  - token-guard

The agent can invoke TokenGuard before any LLM API call to prevent quota exhaustion.
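
One way to do that, shown purely as an assumed integration pattern rather than an OpenClaw requirement, is a small wrapper that runs the full check/record/cache cycle from the Usage section around any client callable:

import time

def guarded_call(guard, call_api, prompt: str, model: str):
    # Runs the TokenGuard decision loop around an arbitrary client
    # callable `call_api(prompt, model)` (a placeholder, not a real API).
    while True:
        decision = guard.check(prompt, model=model)
        if decision.action == "proceed":
            response = call_api(prompt, model)
            guard.record_usage(decision.estimated_tokens, model=model)
            guard.cache_response(prompt, response)
            return response
        if decision.action == "wait":
            time.sleep(decision.wait_seconds)  # quota cooling off; try again
            continue
        if decision.action == "fallback":
            return call_api(prompt, decision.fallback_model)
        raise RuntimeError(f"Blocked by TokenGuard: {decision.reason}")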

File Structure

token-guard/
├── SKILL.md          # This file
└── scripts/
    └── token_guard.py  # Main engine (zero external dependencies)

Status Output Example

{
  "models": {
    "gemini-3-flash": {
      "tpm_limit": 1000000,
      "used_this_minute": 750000,
      "remaining": 250000,
      "usage_pct": "75.0%",
      "status": "🟢 OK"
    }
  },
  "stats": {
    "total_checks": 42,
    "tokens_saved": 128000,
    "blocks": 3,
    "fallbacks": 2
  }
}

Zero Dependencies

Pure Python 3.10+. No pip install needed. No tiktoken, no external API calls. Designed for the $7 Bootstrap Protocol — every byte counts.
