Document Handler
Read, extract text and metadata, and convert documents in formats like PDF, DOCX, XLSX, PPTX, EPUB, RTF, and OpenDocument.
v1.0.0
Description
name: document-handler
description: Read, extract, and convert document files (PDF, DOCX, XLSX, PPTX, EPUB, RTF, ODT, ODS, ODP). Use when working with any document format: extracting text, metadata, converting formats, or processing content. Triggers on mentions of document files, file paths with document extensions, or requests to read/convert documents.
Document Handler
Extract text, metadata, and content from the document formats listed below.
Supported Formats
| Format | Extensions | Text Extract | Metadata | Convert |
|---|---|---|---|---|
| PDF | .pdf | ✅ pdftotext | ✅ pdfinfo | ✅ pdftoppm |
| Word | .docx | ✅ unzip + xml | ✅ | ✅ |
| Excel | .xlsx | ✅ unzip + xml | ✅ | ✅ |
| PowerPoint | .pptx | ✅ unzip + xml | ✅ | ✅ |
| EPUB | .epub | ✅ unzip + html | ✅ | ✅ |
| RTF | .rtf | ✅ textutil | ✅ | ✅ |
| OpenDocument | .odt, .ods, .odp | ✅ unzip + xml | ✅ | ✅ |
Quick Commands
PDF
# Extract text
pdftotext -layout input.pdf output.txt
# Get metadata
pdfinfo input.pdf
# Convert to images (for OCR or viewing)
pdftoppm -png input.pdf output_prefix
# Extract specific pages
pdftotext -f 5 -l 10 -layout input.pdf output.txt
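pdfinfo prints one "Key: value" line per field, so a single value such as the page count can be pulled out with a small filter. The sketch below is only an illustration: pdf_field is a hypothetical helper name, not part of this skill, and it assumes the field name appears at the start of a line.

```shell
# Hypothetical helper (not part of the skill): read pdfinfo output on stdin
# and print the value of the field named in $1, e.g. Pages or Title.
pdf_field() {
  sed -n "s/^$1:[[:space:]]*//p" | head -n 1
}
```

Typical use would be `pdfinfo input.pdf | pdf_field Pages`, for example to choose the `-l` value for a page-range extraction.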
DOCX/XLSX/PPTX (Office Open XML)
# Extract text from DOCX
unzip -p input.docx word/document.xml | sed 's/<[^>]*>//g' | tr -s ' \n'
# Extract text from XLSX (shared strings only; numeric cell values are stored in xl/worksheets/sheet*.xml)
unzip -p input.xlsx xl/sharedStrings.xml | sed 's/<[^>]*>//g' | tr -s '\n'
# Extract text from PPTX
unzip -p input.pptx 'ppt/slides/*.xml' | sed 's/<[^>]*>//g' | tr -s ' \n'
# Get metadata
unzip -p input.docx docProps/core.xml
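Stripping every tag, as the commands above do, also removes paragraph boundaries, and docProps/core.xml is returned as raw XML. The following is a rough sketch of two filters that address this. Both function names are hypothetical, and the approach is plain pattern matching rather than real XML parsing, so entities such as `&amp;` are left undecoded.

```shell
# Hypothetical helper: read word/document.xml on stdin and print one
# paragraph per line by turning each closing </w:p> tag into a newline
# before the remaining tags are stripped.
docx_paragraphs() {
  awk '{ gsub(/<\/w:p>/, "\n"); gsub(/<[^>]*>/, ""); print }' | tr -s '\n'
}

# Hypothetical helper: read docProps/core.xml on stdin and print the text
# of one element, e.g. dc:title, dc:creator, or dcterms:created.
core_field() {
  tr -d '\n' | sed -n "s/.*<$1[^>]*>\([^<]*\)<\/$1>.*/\1/p"
}
```

Example use: `unzip -p input.docx word/document.xml | docx_paragraphs` and `unzip -p input.docx docProps/core.xml | core_field dc:creator`.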
RTF (macOS)
# Convert RTF to plain text
textutil -convert txt input.rtf -output output.txt
# Convert RTF to HTML
textutil -convert html input.rtf -output output.html
EPUB
# Extract and read EPUB content
unzip -l input.epub # List contents
unzip -p input.epub "*.html" | lynx -stdin -dump # Text via lynx
unzip -p input.epub "*.xhtml" | sed 's/<[^>]*>//g' # Raw text
OpenDocument (ODT/ODS/ODP)
# Extract text from ODT
unzip -p input.odt content.xml | sed 's/<[^>]*>//g' | tr -s ' \n'
# Extract from ODS
unzip -p input.ods content.xml | sed 's/<[^>]*>//g'
# Get metadata
unzip -p input.odt meta.xml
Scripts
extract_document.sh
Extracts text and metadata from any supported document format.
~/Dropbox/jarvis/skills/document-handler/scripts/extract_document.sh <file>
Output:
- Text content to stdout
- Metadata as JSON comments
pdf_to_images.sh
Converts PDF pages to images for OCR or visual processing.
~/Dropbox/jarvis/skills/document-handler/scripts/pdf_to_images.sh <pdf> <output_dir> [dpi]
Workflow
- Identify format — Check file extension
- Extract text — Use appropriate tool
- Get metadata — Author, date, pages, etc.
- Process content — Summarize, search, transform
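The first two workflow steps can be sketched as a small dispatcher that picks the extraction command from the file extension. This is an illustration built from the quick commands in this document, not the contents of extract_document.sh; doc_ext and extract_text are hypothetical names, and the RTF branch assumes macOS textutil.

```shell
# Hypothetical helper: print the lowercased extension of a path.
# Fails if the file name has no extension.
doc_ext() {
  case "${1##*/}" in
    *.*) printf '%s\n' "${1##*.}" | tr '[:upper:]' '[:lower:]' ;;
    *)   return 1 ;;
  esac
}

# Hypothetical dispatcher: write the text of a supported document to stdout.
extract_text() {
  case "$(doc_ext "$1")" in
    pdf)         pdftotext -layout "$1" - ;;   # "-" sends the text to stdout
    docx)        unzip -p "$1" word/document.xml | sed 's/<[^>]*>//g' ;;
    xlsx)        unzip -p "$1" xl/sharedStrings.xml | sed 's/<[^>]*>//g' ;;
    pptx)        unzip -p "$1" 'ppt/slides/*.xml' | sed 's/<[^>]*>//g' ;;
    odt|ods|odp) unzip -p "$1" content.xml | sed 's/<[^>]*>//g' ;;
    epub)        unzip -p "$1" '*.html' '*.xhtml' 2>/dev/null | sed 's/<[^>]*>//g' ;;
    rtf)         textutil -convert txt -stdout "$1" ;;   # macOS only
    *)           echo "unsupported format: $1" >&2; return 1 ;;
  esac
}
```

Matching on the extension keeps the sketch short, but it trusts the file name; a renamed file would be sent to the wrong tool.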
Notes
- PDFs with scanned images need OCR (pdftoppm + tesseract)
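- Encrypted PDFs require a password (pdftotext, pdfinfo, and pdftoppm accept -upw for the user password and -opw for the owner password)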
- Complex formatting may be lost in text extraction
- For tables in PDFs, consider tabula or camelot
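For the scanned-PDF case in the notes above, one possible way to chain pdftoppm and tesseract is sketched here. ocr_pdf is a hypothetical function name, the 300 dpi default is an assumption rather than a value from this skill, and tesseract runs with its default language.

```shell
# Hypothetical helper: render each page of a PDF to PNG, OCR the pages in
# order, and write the recognized text to stdout.
# $1 = PDF path, $2 = resolution in dpi (assumed default: 300)
ocr_pdf() {
  ocr_dir="$(mktemp -d)" || return 1
  if ! pdftoppm -r "${2:-300}" -png "$1" "$ocr_dir/page"; then
    rm -rf "$ocr_dir"
    return 1
  fi
  # pdftoppm zero-pads page numbers, so the glob expands in page order.
  for ocr_img in "$ocr_dir"/page-*.png; do
    [ -e "$ocr_img" ] || continue
    tesseract "$ocr_img" stdout 2>/dev/null   # "stdout" prints text instead of writing a file
  done
  rm -rf "$ocr_dir"
}
```

Usage would look like `ocr_pdf scanned.pdf > scanned.txt`. Higher resolutions generally help recognition at the cost of slower rendering.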