PinchBench


name: pinchbench
description: Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.
metadata:
  author: pinchbench
  version: "1.0.0"
  homepage: https://pinchbench.com
  repository: https://github.com/pinchbench/skill

PinchBench Benchmark Skill

PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Results are published on a public leaderboard at pinchbench.com.

Prerequisites

  • Python 3.10+
  • uv package manager
  • OpenClaw instance (this agent)

Quick Start

cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload

Available Tasks (23)

Task                            Category      Description
task_00_sanity                  Basic         Verify agent works
task_01_calendar                Productivity  Calendar event creation
task_02_stock                   Research      Stock price lookup
task_03_blog                    Writing       Blog post creation
task_04_weather                 Coding        Weather script
task_05_summary                 Analysis      Document summarization
task_06_events                  Research      Conference research
task_07_email                   Writing       Email drafting
task_08_memory                  Memory        Context retrieval
task_09_files                   Files         File structure creation
task_10_workflow                Integration   Multi-step API workflow
task_11_clawdhub                Skills        ClawHub interaction
task_12_skill_search            Skills        Skill discovery
task_13_image_gen               Creative      Image generation
task_14_humanizer               Writing       Text humanization
task_15_daily_summary           Productivity  Daily digest
task_16_email_triage            Email         Inbox triage
task_17_email_search            Email         Email search
task_18_market_research         Research      Market analysis
task_19_spreadsheet_summary     Analysis      Spreadsheet analysis
task_20_eli5_pdf_summary        Analysis      PDF simplification
task_21_openclaw_comprehension  Knowledge     OpenClaw docs comprehension
task_22_second_brain            Memory        Knowledge management

Command Line Options

Option                Description
--model               Model identifier (e.g., anthropic/claude-sonnet-4)
--suite               all, automated-only, or comma-separated task IDs
--output-dir          Results directory (default: results/)
--timeout-multiplier  Scale task timeouts for slower models
--runs                Number of runs per task for averaging
--no-upload           Skip uploading to leaderboard
--register            Request new API token for submissions
--upload FILE         Upload previous results JSON
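
When comparing several models, it can help to assemble the command line programmatically. The following is a minimal sketch in Python that uses only the options listed above. The helper name `build_benchmark_command` and its parameters are hypothetical, and it assumes `--runs` takes an integer and `--timeout-multiplier` takes a number, which the option list does not state explicitly.

```python
import shlex


def build_benchmark_command(model, suite=None, runs=None,
                            timeout_multiplier=None, upload=False):
    """Build the argument list for one benchmark run (hypothetical helper)."""
    cmd = ["uv", "run", "benchmark.py", "--model", model]
    if suite:
        # "all", "automated-only", or comma-separated task IDs
        cmd += ["--suite", suite]
    if runs is not None:
        # Assumption: --runs accepts an integer count
        cmd += ["--runs", str(runs)]
    if timeout_multiplier is not None:
        # Assumption: --timeout-multiplier accepts a numeric factor
        cmd += ["--timeout-multiplier", str(timeout_multiplier)]
    if not upload:
        # Keep local comparison runs off the leaderboard by default
        cmd.append("--no-upload")
    return cmd


if __name__ == "__main__":
    for model in ["anthropic/claude-sonnet-4"]:
        cmd = build_benchmark_command(model, suite="automated-only", runs=3)
        # Print the command; pass cmd to subprocess.run(cmd, check=True)
        # from the skill directory to actually execute it.
        print(shlex.join(cmd))
```

Returning a list rather than a single string avoids shell quoting issues if the list is later handed to `subprocess.run`.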

Token Registration

To submit results to the leaderboard:

# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4

Results

Results are saved as JSON in the output directory:

# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks (mean score below 0.5)
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score (one average per results file)
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json
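
If jq is not available, the same queries can be sketched in Python. This assumes only the JSON shape implied by the jq commands above (a top-level `tasks` array whose entries carry `task_id` and `grading.mean`). The function name `summarize_results` and the `fail_below` parameter are hypothetical.

```python
import json
from pathlib import Path


def summarize_results(path, fail_below=0.5):
    """Summarize one results file: per-task scores, average, failed tasks."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Map each task ID to its mean grading score
    scores = {t["task_id"]: t["grading"]["mean"] for t in data["tasks"]}
    average = sum(scores.values()) / len(scores) if scores else 0.0
    # Same threshold as the jq "failed tasks" query above
    failed = sorted(tid for tid, s in scores.items() if s < fail_below)
    return {"scores": scores, "average": average, "failed": failed}


if __name__ == "__main__":
    # results/ is the default output directory
    for result_file in sorted(Path("results").glob("*.json")):
        summary = summarize_results(result_file)
        print(result_file.name, round(summary["average"], 3), summary["failed"])
```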

Adding Custom Tasks

Create a markdown file in tasks/ following TASK_TEMPLATE.md. Each task needs:

  • YAML frontmatter (id, name, category, grading_type, timeout)
  • Prompt section
  • Expected behavior
  • Grading criteria
  • Automated checks (Python grading function)
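
Before running a new task, a quick check that the frontmatter carries all five required fields can catch typos early. The sketch below is not part of PinchBench. It assumes the frontmatter sits between two `---` lines at the top of the file and that the fields are flat `key: value` pairs. TASK_TEMPLATE.md remains the authority on the real format. The function names and the example values (including `grading_type: automated`) are placeholders.

```python
REQUIRED_FIELDS = ("id", "name", "category", "grading_type", "timeout")

# Placeholder task file; field values are illustrative only
EXAMPLE = """---
id: task_23_example
name: Example task
category: Basic
grading_type: automated
timeout: 120
---
Prompt section goes here.
"""


def read_frontmatter(text):
    """Return top-level key/value pairs between the first two '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Only flat, unindented "key: value" lines are considered
        if ":" in line and not line.startswith((" ", "\t", "#")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as missing frontmatter


def missing_fields(text):
    """List required frontmatter fields that are absent or empty."""
    fields = read_frontmatter(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


if __name__ == "__main__":
    print(missing_fields(EXAMPLE))
```

A deliberately small parser is used here to avoid a YAML dependency. A real YAML library would be the better choice if task frontmatter uses nested or quoted values.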

Leaderboard

View results at pinchbench.com. The leaderboard shows:

  • Model rankings by overall score
  • Per-task breakdowns
  • Historical performance trends
