🧪 Skills

URL Fetcher

Fetch and save web content using only Python stdlib with URL and path validation, basic HTML-to-markdown conversion, and no API keys or external dependencies.

v1.0.0
❤️ 0
⬇️ 1.1k
👁 1
Share

Description


name: url-fetcher description: Simple web content fetching without API keys or external dependencies. Uses Python stdlib (urllib) only. Features: fetch HTML/text from URLs, basic HTML to markdown conversion, path-validated file writes (blocks system paths), URL validation (blocks localhost/internal). Security: File writes use is_safe_path() to prevent malicious writes. Perfect for content aggregation, research collection, and web scraping without API costs or dependencies.

URL Fetcher

Fetch web content without API keys or external dependencies. Uses Python standard library only.

Quick Start

url_fetcher.py fetch <url>
url_fetcher.py fetch --markdown <url> [output_file]

Examples:

# Fetch and preview
url_fetcher.py fetch https://example.com

# Fetch and save HTML
url_fetcher.py fetch https://example.com ~/workspace/page.html

# Fetch and convert to basic markdown
url_fetcher.py fetch --markdown https://example.com ~/workspace/page.md

Features

  • No dependencies - Uses Python stdlib (urllib) only
  • No API keys - Completely free to use
  • URL validation - Blocks localhost/internal networks
  • Basic markdown conversion - Extract content from HTML
  • Path validation - Safe file writes only (workspace, home, /tmp)
  • Error handling - Timeout and network error handling

When to Use

  • Content aggregation - Collect pages for processing
  • Research collection - Save articles/pages locally
  • Simple scraping - Extract text from web pages
  • Markdown conversion - Basic HTML to text/markdown
  • No-API alternatives - When you can't use paid APIs

Limitations

  • Basic markdown - Simple regex-based conversion (not a full parser)
  • No JavaScript - Only fetches static HTML
  • Rate limiting - No built-in rate limiting (add your own if needed)
  • Bot detection - Some sites may block the default User-Agent

Security Features

URL Validation

  • ✅ Allows: http/https URLs
  • ❌ Blocks: file://, data://, javascript: URLs
  • ❌ Blocks: localhost, 127.0.0.1, ::1 (internal networks)

File Path Validation

  • ✅ Allows: workspace, home directory, /tmp
  • ❌ Blocks: system paths (/etc, /usr, /var, etc.)
  • ❌ Blocks: sensitive dotfiles (~/.ssh, ~/.bashrc, etc.)

Error Handling

  • Timeout after 10 seconds
  • HTTP error handling
  • Network error handling
  • Character encoding handling

Usage Patterns

Collecting Research

# Fetch multiple articles
url_fetcher.py fetch https://example.com/article1.md ~/workspace/research/article1.md
url_fetcher.py fetch https://example.com/article2.md ~/workspace/research/article2.md

# Convert to markdown for reading
url_fetcher.py fetch --markdown https://example.com/article.md ~/workspace/research/article.md

Content Aggregation

# Fetch pages for processing
url_fetcher.py fetch https://news.example.com ~/workspace/content/latest.html

# Extract text
url_fetcher.py fetch --markdown https://blog.example.com ~/workspace/content/post.md

Quick Preview

# Just preview content (no file save)
url_fetcher.py fetch https://example.com

Advanced Usage

Batch Fetching

#!/bin/bash
# batch_fetch.sh

URLS=(
    "https://example.com/page1"
    "https://example.com/page2"
    "https://example.com/page3"
)

OUTPUT_DIR="$HOME/workspace/fetched"
mkdir -p "$OUTPUT_DIR"

for url in "${URLS[@]}"; do
    filename=$(echo $url | sed 's|/||g')
    url_fetcher.py fetch --markdown "$url" "$OUTPUT_DIR/$filename.md"
    sleep 1  # Be nice to servers
done

Integration with Other Skills

Combine with research-assistant:

# Fetch article
url_fetcher.py fetch --markdown https://example.com/article.md ~/workspace/article.md

# Extract key points
# Then use research-assistant to organize findings

Combine with task-runner:

# Add task to fetch content
task_runner.py add "Fetch article on topic X" "research"

# Fetch when ready
url_fetcher.py fetch https://example.com/topic-x.md ~/workspace/research/topic-x.md

Troubleshooting

Connection Timeout

Error: Request timeout after 10s

Solution: The server is slow or unreachable. Try again later or check the URL.

HTTP 403/429 Errors

Error: HTTP 403: Forbidden

Solution: The site blocks automated requests. Try:

  • Add delay between requests
  • Use a different User-Agent (modify source)
  • Respect robots.txt
  • Consider using an API if available

Encoding Issues

Error with special characters

Solution: The tool uses UTF-8 with error-ignore. Some characters may be lost.

Markdown Quality

Note: Basic markdown extraction

Solution: This tool uses simple regex for HTML→MD conversion. For better results:

  • Use dedicated markdown parsers
  • Or post-process the output
  • Or use a paid API with better parsing

Best Practices

  1. Be respectful - Add delays between requests (don't hammer servers)
  2. Check robots.txt - Respect site's crawling policies
  3. Rate limit yourself - Don't fetch too fast
  4. Validate URLs - Only fetch from trusted sources
  5. Save safely - Always use path-validated outputs
  6. Preview first - Use preview mode before saving

Integration Examples

Python Integration

from pathlib import Path
import subprocess

def fetch_and_process(url):
    """Fetch URL and process"""
    output = Path.home() / "workspace" / "fetched" / "page.md"
    output.parent.mkdir(parents=True, exist_ok=True)
    
    # Fetch
    subprocess.run([
        "python3",
        "/path/to/url_fetcher.py",
        "fetch",
        "--markdown",
        url,
        str(output)
    ])
    
    # Process content
    content = output.read_text()
    return content

Bash Integration

# Function for fetching
fetch_content() {
    local url="$1"
    local output="$2"
    python3 ~/workspace/skills/url-fetcher/scripts/url_fetcher.py \
        fetch --markdown "$url" "$output"
}

# Usage
fetch_content "https://example.com" ~/workspace/example.md

Alternatives

When You Need More Features

For full-featured scraping:

  • Use requests + beautifulsoup4 (requires pip install)
  • Or use scrapy framework (requires pip install)
  • Or use paid APIs (Firecrawl, Apify)

For better markdown:

  • markdownify library (requires pip install)
  • Or use AI-based parsing (OpenAI, Anthropic APIs)

For complex workflows:

  • Browser automation (OpenClaw browser tool)
  • Headless Chrome (Puppeteer, Playwright)
  • Or use scraping APIs (Zyte, ScraperAPI)

Zero-Cost Advantage

This skill requires:

  • ✅ Python 3 (included with OpenClaw)
  • ✅ No API keys
  • ✅ No external packages
  • ✅ No paid services
  • ✅ No rate limiting (other than what you add)

Perfect for autonomous agents with budget constraints.

Contributing

If you improve this skill, please:

  1. Test with security-checker
  2. Document new features
  3. Publish to ClawHub with credit

License

Use freely in your OpenClaw skills and workflows.

Reviews (0)

Sign in to write a review.

No reviews yet. Be the first to review!

Comments (0)

Sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Compatible Platforms

Pricing

Free

Related Configs