🧪 Skills
Document Workflow
一键实现学术论文的搜索、下载、分块提取文本及结构化总结,支持按年份和引用数筛选。
v1.0.3
Description
name: document-workflow description: One-click academic paper research workflow. Use when: (1) searching papers by topic, (2) downloading PDFs from arXiv or journals, (3) extracting text in chunks, (4) summarizing papers by structured schema. Supports Tavily (default, ~100% PDF availability) and Semantic Scholar (fallback, richer metadata).
Document Workflow
Execute academic paper research: Search → Download → Extract → Summarize
Quick Start
One-Click Workflow
python -m skills.document-workflow.scripts.research \
--query "world model autonomous driving" \
--output "./papers" \
--max_papers 3 \
--year_from 2024 \
--download \
--extract
Output:
papers/
├── search_results.json
├── paper_1/
│ ├── metadata.json
│ ├── pdf/paper.pdf
│ └── chunk_1_10.json
└── paper_2/
Scripts
search_papers
Search papers with Tavily priority:
python -m skills.document-workflow.scripts.search_papers \
--query "world model" \
--max_results 5 \
--year_from 2024
Parameters:
--query(required): Search keyword--max_results: Number of results (default: 5)--year_from/--year_to: Year filter--use_semantic_scholar: Force Semantic Scholar instead of Tavily
Returns: JSON array with title, authors, pdf_url, citationCount, venue
Search behavior:
- Default: Tavily searches arXiv (~100% PDF availability)
- Fallback: Semantic Scholar (richer metadata, lower PDF availability)
download_paper
Download PDF from URL or search output:
# From search output (pipeline)
python -m skills.document-workflow.scripts.search_papers --query "..." --max_results 1 | \
python -m skills.document-workflow.scripts.download_paper \
--json_input - \
--papers_dir "./papers" \
--theme "World_Models"
# Direct URL
python -m skills.document-workflow.scripts.download_paper \
--url "https://arxiv.org/pdf/2503.20523.pdf" \
--theme "World_Models"
Parameters:
--json_input: Accept search_papers JSON (-for stdin)--url: Direct PDF URL--papers_dir: Base directory (default:C:\Users\Lenovo\Desktop\papers)--theme: Theme subdirectory--timeout: Download timeout seconds (default: 120)
Behavior:
- Save PDF to
{papers_dir}/{theme}/{title}_{year}.pdf - Auto-save
metadata.jsonalongside PDF
pdf_reader
Extract text from PDF in chunks:
python -m skills.document-workflow.scripts.pdf_reader \
"paper.pdf" \
--start 1 --end 10 \
--output "chunk1.json"
Parameters:
pdf_file: PDF file path--start/--end: Page range (1-indexed, default: 1-10)--output: Output JSON file
Why chunked extraction:
- Large papers (50+ pages) exceed context window
- Extract 5-10 pages per chunk
- JSON output is reusable across sessions
summarize
Summarize extracted chunks by schema:
python -m skills.document-workflow.scripts.summarize \
--chunks "chunk_*.json" \
--output "summary.json"
Output schema:
{
"paper_title": "Paper Title",
"authors": ["Author List"],
"source": "Journal/Conference",
"task_definition": {
"domain": "Research Domain",
"task": "Specific Task",
"problem_statement": "Problem Being Solved",
"key_contributions": ["Contribution 1", "Contribution 2"]
},
"experiments": {
"datasets": ["Datasets"],
"baselines": ["Baseline Methods"],
"metrics": [{"name": "Metric", "description": "Description"}],
"results": [{"setting": "Setting", "metric": "Metric", "proposed_method": "Ours", "best_baseline": "Best Baseline"}],
"key_findings": ["Key Findings"]
}
}
research
Execute complete workflow:
python -m skills.document-workflow.scripts.research \
--query "world model" \
--output "./papers" \
--max_papers 3 \
--year_from 2024 \
--download \
--extract
Parameters:
--query(required): Search query--output: Output directory--max_papers: Max papers to process (default: 3)--year_from: Filter from year (default: 2024)--download: Download PDFs--extract: Extract text in chunks--chunk_size: Pages per chunk (default: 10)
Configuration
Environment Variables
export SEMANTIC_SCHOLAR_API_KEY="your-api-key"
If not set, uses built-in key.
Default Paths
- Download directory:
C:\Users\Lenovo\Desktop\papers - Modify via
--papers_dirparameter
Best Practices
- Specify year: Use
--year_from 2024to avoid outdated papers - Chunked extraction: 5-10 pages per chunk avoids context overflow
- Reuse extracted JSON: Extracted chunks are reusable across sessions
- Check citation count: Use Semantic Scholar for high-citation papers
- Save metadata: Download auto-saves
metadata.json
Troubleshooting
Download failed:
- Verify PDF URL is valid
- Increase
--timeoutfor large files
Extraction garbled:
- Ensure UTF-8 encoding
- Windows: run
chcp 65001first
No search results:
- Try broader keywords
- Expand year range
References
- Output schema: See
references/output_schema.jsonfor detailed schema definition - Search comparison: Tavily (~100% PDF) vs Semantic Scholar (rich metadata, ~30% PDF)
Scripts Directory
scripts/
├── research.py # One-click workflow
├── search_papers.py # Tavily + Semantic Scholar search
├── search_tavily.py # Standalone Tavily search
├── download_paper.py # PDF download
├── pdf_reader.py # Chunked text extraction
├── summarize.py # Schema-based summarization
└── auto_research.py # Automated workflow
Reviews (0)
Sign in to write a review.
No reviews yet. Be the first to review!
Comments (0)
No comments yet. Be the first to share your thoughts!