URL Fetcher
Fetch and save web content using only Python stdlib with URL and path validation, basic HTML-to-markdown conversion, and no API keys or external dependencies.
Description
name: url-fetcher
description: Simple web content fetching without API keys or external dependencies. Uses Python stdlib (urllib) only. Features: fetch HTML/text from URLs, basic HTML to markdown conversion, path-validated file writes (blocks system paths), URL validation (blocks localhost/internal). Security: File writes use is_safe_path() to prevent malicious writes. Perfect for content aggregation, research collection, and web scraping without API costs or dependencies.
Fetch web content without API keys or external dependencies. Uses Python standard library only.
Quick Start
url_fetcher.py fetch <url> [output_file]
url_fetcher.py fetch --markdown <url> [output_file]
Examples:
# Fetch and preview
url_fetcher.py fetch https://example.com
# Fetch and save HTML
url_fetcher.py fetch https://example.com ~/workspace/page.html
# Fetch and convert to basic markdown
url_fetcher.py fetch --markdown https://example.com ~/workspace/page.md
Features
- No dependencies - Uses Python stdlib (urllib) only
- No API keys - Completely free to use
- URL validation - Blocks localhost/internal networks
- Basic markdown conversion - Extract content from HTML
- Path validation - Safe file writes only (workspace, home, /tmp)
- Error handling - Timeout and network error handling
When to Use
- Content aggregation - Collect pages for processing
- Research collection - Save articles/pages locally
- Simple scraping - Extract text from web pages
- Markdown conversion - Basic HTML to text/markdown
- No-API alternatives - When you can't use paid APIs
Limitations
- Basic markdown - Simple regex-based conversion (not a full parser)
- No JavaScript - Only fetches static HTML
- Rate limiting - No built-in rate limiting (add your own if needed)
- Bot detection - Some sites may block the default User-Agent
Security Features
URL Validation
- ✅ Allows: http/https URLs
- ❌ Blocks: non-HTTP schemes such as file://, data:, and javascript: URLs
- ❌ Blocks: localhost, 127.0.0.1, ::1 (internal networks)
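The allow/block rules above could be implemented roughly as follows. This is a minimal sketch, not the skill's actual code: the function name `looks_like_safe_url` is hypothetical, and treating private and link-local IP ranges as "internal" is an assumption that goes beyond the three hosts listed above.

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
BLOCKED_HOSTNAMES = {"localhost"}


def looks_like_safe_url(url: str) -> bool:
    """Return True only for http/https URLs that do not name a local host."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False  # malformed URL, e.g. an unterminated IPv6 literal

    if parsed.scheme.lower() not in ALLOWED_SCHEMES:
        return False  # rejects file:, data:, javascript:, and so on

    host = parsed.hostname  # lowercased; IPv6 brackets are stripped
    if not host or host in BLOCKED_HOSTNAMES:
        return False

    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return True  # an ordinary hostname, not an IP literal

    # Assumption: loopback, private, link-local, and unspecified
    # addresses all count as "internal".
    return not (
        addr.is_loopback
        or addr.is_private
        or addr.is_link_local
        or addr.is_unspecified
    )
```

Note that this check only inspects the URL string. A public hostname that resolves to an internal address would still pass, so a stricter validator would also need to check the resolved IP.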
File Path Validation
- ✅ Allows: workspace, home directory, /tmp
- ❌ Blocks: system paths (/etc, /usr, /var, etc.)
- ❌ Blocks: sensitive dotfiles (~/.ssh, ~/.bashrc, etc.)
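The description mentions an `is_safe_path()` check but does not show it. The sketch below is one plausible way to apply the rules above, under stated assumptions: the name `is_safe_path_sketch`, the `home` and `extra_roots` parameters, and the exact set of blocked dotfiles are all hypothetical.

```python
from pathlib import Path

# Assumption: an illustrative set of sensitive entries under the home directory.
BLOCKED_HOME_ENTRIES = {".ssh", ".gnupg", ".bashrc", ".bash_profile", ".profile", ".zshrc"}


def _is_within(child: Path, parent: Path) -> bool:
    """Return True if `child` is `parent` or lies below it."""
    try:
        child.relative_to(parent)
        return True
    except ValueError:
        return False


def is_safe_path_sketch(path, home=None, extra_roots=(Path("/tmp"),)) -> bool:
    """Allow writes under the home directory (minus sensitive dotfiles) and /tmp."""
    home = Path(home or Path.home()).resolve()
    # resolve() collapses ".." segments and follows symlinks, so a path like
    # /tmp/../etc/passwd is judged by where it really points.
    # Note: "~" expands to the real home directory, not the `home` argument.
    resolved = Path(path).expanduser().resolve()

    if _is_within(resolved, home):
        first = resolved.relative_to(home).parts[:1]
        return not (first and first[0] in BLOCKED_HOME_ENTRIES)

    return any(_is_within(resolved, Path(root).resolve()) for root in extra_roots)
```

Resolving the path before comparing it is the key design choice here: a plain string-prefix check would accept traversal paths that start inside an allowed directory and then escape it.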
Error Handling
- Timeout after 10 seconds
- HTTP error handling
- Network error handling
- Character encoding handling
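With only the standard library, the four behaviours listed above might be combined as in the sketch below. The function names `fetch_text` and `decode_body` and the `(text, error)` return shape are assumptions for illustration; the skill's own code may be structured differently.

```python
import socket
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 10  # matches the 10-second timeout listed above


def decode_body(raw: bytes) -> str:
    """Decode as UTF-8, silently dropping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


def fetch_text(url: str, timeout: float = TIMEOUT_SECONDS):
    """Return (text, error); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return decode_body(response.read()), None
    except urllib.error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it must be caught first.
        return None, f"HTTP {exc.code}: {exc.reason}"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, socket.timeout):
            return None, f"Request timeout after {timeout}s"
        return None, f"Network error: {exc.reason}"
    except socket.timeout:
        # Raised when the timeout hits while reading the response body.
        return None, f"Request timeout after {timeout}s"
```

Decoding with `errors="ignore"` never raises, at the cost of quietly losing bytes from pages that are not UTF-8.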
Usage Patterns
Collecting Research
# Fetch multiple articles
url_fetcher.py fetch https://example.com/article1.md ~/workspace/research/article1.md
url_fetcher.py fetch https://example.com/article2.md ~/workspace/research/article2.md
# Convert to markdown for reading
url_fetcher.py fetch --markdown https://example.com/article.md ~/workspace/research/article.md
Content Aggregation
# Fetch pages for processing
url_fetcher.py fetch https://news.example.com ~/workspace/content/latest.html
# Extract text
url_fetcher.py fetch --markdown https://blog.example.com ~/workspace/content/post.md
Quick Preview
# Just preview content (no file save)
url_fetcher.py fetch https://example.com
Advanced Usage
Batch Fetching
#!/bin/bash
# batch_fetch.sh
URLS=(
"https://example.com/page1"
"https://example.com/page2"
"https://example.com/page3"
)
OUTPUT_DIR="$HOME/workspace/fetched"
mkdir -p "$OUTPUT_DIR"
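for url in "${URLS[@]}"; do
    # Drop the scheme, then replace characters that are unsafe in filenames
    filename=$(printf '%s' "$url" | sed -E 's|^https?://||; s|[^A-Za-z0-9._-]|_|g')
    url_fetcher.py fetch --markdown "$url" "$OUTPUT_DIR/$filename.md"
    sleep 1  # Be nice to servers
done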
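For callers who drive batch fetches from Python rather than a shell loop, the URL-to-filename step can be sketched as below. The helper name `url_to_filename` and the choice to replace unsafe characters with underscores are assumptions, not part of the skill.

```python
import re
from urllib.parse import urlparse


def url_to_filename(url: str, suffix: str = ".md") -> str:
    """Build a filesystem-friendly name from a URL's host and path."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Collapse every run of characters outside a conservative safe set.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return (safe or "index") + suffix
```

For example, `url_to_filename("https://example.com/page1")` returns `example.com_page1.md`. Query strings are ignored, so two URLs that differ only in their query would map to the same file.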
Integration with Other Skills
Combine with research-assistant:
# Fetch article
url_fetcher.py fetch --markdown https://example.com/article.md ~/workspace/article.md
# Extract key points
# Then use research-assistant to organize findings
Combine with task-runner:
# Add task to fetch content
task_runner.py add "Fetch article on topic X" "research"
# Fetch when ready
url_fetcher.py fetch https://example.com/topic-x.md ~/workspace/research/topic-x.md
Troubleshooting
Connection Timeout
Error: Request timeout after 10s
Solution: The server is slow or unreachable. Try again later or check the URL.
HTTP 403/429 Errors
Error: HTTP 403: Forbidden
Solution: The site is blocking or rate-limiting automated requests (429 means too many requests). Try:
- Add a delay between requests
- Use a different User-Agent (modify source)
- Respect robots.txt
- Consider using an API if available
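The first two suggestions can be sketched with `urllib.request` alone. The User-Agent string, the function names, and the one-second default delay below are illustrative assumptions; how the skill's own source sets its User-Agent is not shown here.

```python
import time
import urllib.request

# Assumption: an example identifier only; describe your client honestly.
CUSTOM_USER_AGENT = "my-research-bot/0.1 (+https://example.com/contact)"


def build_request(url: str, user_agent: str = CUSTOM_USER_AGENT) -> urllib.request.Request:
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def polite_fetch_all(urls, delay_seconds: float = 1.0, timeout: float = 10):
    """Yield (url, body_bytes) pairs, sleeping between consecutive requests."""
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay_seconds)  # no delay before the first request
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            yield url, response.read()
```

A changed User-Agent does not override a site's wishes. If the site still answers 403 or 429, slowing down or stopping is the appropriate response.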
Encoding Issues
Error with special characters
Solution: The tool decodes responses as UTF-8 and ignores bytes it cannot decode, so some characters may be lost on pages served in another encoding.
Markdown Quality
Note: Basic markdown extraction
Solution: This tool uses simple regular expressions for HTML-to-markdown conversion, so nested or malformed markup may convert poorly. For better results:
- Use a dedicated HTML-to-markdown converter
- Or post-process the output
- Or use a paid API with better parsing
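To show what regex-based conversion handles and where it falls short, here is a small sketch of the approach. It is not the skill's actual converter: the function name `html_to_markdown_sketch` and the specific set of tags are assumptions.

```python
import html
import re

FLAGS = re.IGNORECASE | re.DOTALL


def html_to_markdown_sketch(source: str) -> str:
    """Convert a few common tags to markdown, then strip whatever remains."""
    text = re.sub(r"<(script|style)\b.*?</\1>", "", source, flags=FLAGS)
    # <h1>..<h6> become "#".."######" headings.
    text = re.sub(
        r"<h([1-6])\b[^>]*>(.*?)</h\1>",
        lambda m: "\n" + "#" * int(m.group(1)) + " " + m.group(2).strip() + "\n\n",
        text,
        flags=FLAGS,
    )
    # Only double-quoted href attributes are recognised.
    text = re.sub(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>', r"[\2](\1)", text, flags=FLAGS)
    text = re.sub(r"<(strong|b)\b[^>]*>(.*?)</\1>", r"**\2**", text, flags=FLAGS)
    text = re.sub(r"<(em|i)\b[^>]*>(.*?)</\1>", r"*\2*", text, flags=FLAGS)
    text = re.sub(r"<li\b[^>]*>(.*?)</li>", r"- \1\n", text, flags=FLAGS)
    text = re.sub(r"<br\s*/?>", "\n", text, flags=FLAGS)
    text = re.sub(r"</p\s*>", "\n\n", text, flags=FLAGS)
    text = re.sub(r"<[^>]+>", "", text)  # drop every remaining tag
    text = html.unescape(text)  # &amp; -> &, &lt; -> <, ...
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Tables, nested lists, and unquoted attributes all pass through the final tag-stripping step as plain text, which is why a real HTML parser gives better results.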
Best Practices
- Be respectful - Add delays between requests (don't hammer servers)
- Check robots.txt - Respect site's crawling policies
- Rate limit yourself - Don't fetch too fast
- Validate URLs - Only fetch from trusted sources
- Save safely - Always use path-validated outputs
- Preview first - Use preview mode before saving
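The robots.txt check can also be done with the standard library, via `urllib.robotparser`. The sketch below parses robots.txt text that has already been fetched; the helper name `allowed_by_robots` is hypothetical, and nothing here implies the skill performs this check itself.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

To check a live site instead, `RobotFileParser` can download the file itself: call `set_url()` with the site's `/robots.txt` address, then `read()`, before calling `can_fetch()`.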
Integration Examples
Python Integration
from pathlib import Path
import subprocess

def fetch_and_process(url):
    """Fetch a URL as markdown and return the saved content."""
    output = Path.home() / "workspace" / "fetched" / "page.md"
    output.parent.mkdir(parents=True, exist_ok=True)
    # Fetch; check=True raises CalledProcessError if the fetcher exits non-zero
    subprocess.run(
        [
            "python3",
            "/path/to/url_fetcher.py",
            "fetch",
            "--markdown",
            url,
            str(output),
        ],
        check=True,
    )
    # Process content
    content = output.read_text(encoding="utf-8")
    return content
Bash Integration
# Function for fetching
fetch_content() {
    local url="$1"
    local output="$2"
    python3 ~/workspace/skills/url-fetcher/scripts/url_fetcher.py \
        fetch --markdown "$url" "$output"
}
# Usage
fetch_content "https://example.com" ~/workspace/example.md
Alternatives
When You Need More Features
For full-featured scraping:
- Use requests + beautifulsoup4 (requires pip install)
- Or use the scrapy framework (requires pip install)
- Or use paid APIs (Firecrawl, Apify)
For better markdown:
- Use the markdownify library (requires pip install)
- Or use AI-based parsing (OpenAI, Anthropic APIs)
For complex workflows:
- Browser automation (OpenClaw browser tool)
- Headless Chrome (Puppeteer, Playwright)
- Or use scraping APIs (Zyte, ScraperAPI)
Zero-Cost Advantage
This skill requires:
- ✅ Python 3 (included with OpenClaw)
- ✅ No API keys
- ✅ No external packages
- ✅ No paid services
- ✅ No usage quotas (the only rate limiting is what you add yourself)
Perfect for autonomous agents with budget constraints.
Contributing
If you improve this skill, please:
- Test with security-checker
- Document new features
- Publish to ClawHub with credit
License
Use freely in your OpenClaw skills and workflows.