Vector Text Fixer

v0.1.0

Description


name: vector-text-fixer
description: Fix garbled text in PDF/SVG vector graphics for final editing in AI. Detect, replace and repair garbled text in vector graphic files while maintaining original formatting and layout.
version: 1.0.0
category: Visual
tags:
  - pdf
  - svg
  - vector
  - text-fix
  - garbled-text
  - document-repair
  - encoding
author: AIPOCH
license: MIT
status: Draft
risk_level: Medium
skill_type: Tool/Script
owner: AIPOCH
reviewer: ''
last_updated: '2026-02-06'

Vector Text Fixer

Fixes garbled text in PDF/SVG vector graphics to make them editable in AI tools.

Features

  • Garbled Text Detection: Automatically identifies garbled text in PDF/SVG files
  • Smart Repair: Infers original text content based on context
  • Batch Processing: Supports batch processing of multiple files in a folder
  • Format Preservation: Repaired files maintain original vector format and layout
  • AI-assisted Editing: Exports an intermediate JSON format that can be imported into AI editors

Supported Scenarios

1. PDF Garbled Text Repair

  • Boxes or question marks caused by font embedding problems
  • Garbled text caused by encoding conversion errors
  • Abnormal characters produced when a missing font is substituted
  • Mixed encodings in multi-language documents
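
To illustrate the encoding conversion case above, here is a minimal sketch of how text that was encoded as UTF-8 but decoded as Latin-1 can be recovered by reversing the faulty step. The function name `repair_mojibake` and the default encoding pair are assumptions made for illustration; they are not part of the tool's documented API.

```python
def repair_mojibake(text: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo one round of mis-decoding; return the input unchanged if that fails."""
    try:
        # Re-create the original bytes, then decode them with the intended codec.
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up, so leave it alone.
        return text


# Simulate the fault: UTF-8 bytes read as Latin-1.
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)                   # cafÃ©
print(repair_mojibake(garbled))  # café
```

Text that was never mis-decoded either fails the round trip and is returned unchanged, or passes through it unaltered (pure ASCII), which keeps the sketch safe to run on clean input.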

2. SVG Garbled Text Repair

  • Text entity encoding errors
  • Special character escaping issues
  • Display abnormalities caused by invalid font references
  • XML encoding declaration errors
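
As an example of the entity escaping issue, the sketch below undoes double-escaped entities (such as `&amp;amp;`) in SVG `<text>` elements using only the standard library. The helper name `unescape_text_nodes` is hypothetical, and the sketch only touches the direct text of each `<text>` element, not nested `<tspan>` children.

```python
import html
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def unescape_text_nodes(svg_source: str) -> str:
    """Remove one extra layer of entity escaping from SVG <text> elements."""
    ET.register_namespace("", SVG_NS)  # serialize with a default xmlns, no prefix
    root = ET.fromstring(svg_source)
    for node in root.iter(f"{{{SVG_NS}}}text"):
        if node.text:
            # The XML parser already removed one layer; html.unescape removes the second.
            node.text = html.unescape(node.text)
    return ET.tostring(root, encoding="unicode")


broken = (
    f'<svg xmlns="{SVG_NS}">'
    '<text x="10" y="20">R&amp;amp;D &amp;lt; 5</text>'
    "</svg>"
)
print(unescape_text_nodes(broken))
```

The serializer re-escapes the repaired text exactly once, so the result stays well-formed XML.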

Usage

Command Line

# Fix a single PDF file
python scripts/main.py --input document.pdf --output fixed.pdf

# Fix a single SVG file
python scripts/main.py --input diagram.svg --output fixed.svg

# Batch process folder
python scripts/main.py --batch ./input_folder --output ./output_folder

# Interactive repair (manually specify replacement content)
python scripts/main.py --input doc.pdf --interactive

# Export as editable format (JSON)
python scripts/main.py --input doc.pdf --export-json editable.json

Python API

from scripts.main import VectorTextFixer

# Create fixer instance
fixer = VectorTextFixer()

# Fix PDF
result = fixer.fix_pdf("input.pdf", "output.pdf")

# Fix SVG
result = fixer.fix_svg("input.svg", "output.svg")

# Batch processing
results = fixer.batch_fix("./input_folder", "./output_folder")

# Get text map (for AI editing)
text_map = fixer.extract_text_map("input.pdf")
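
The internals of `batch_fix` are not documented here. As a rough sketch of what batch processing involves, the hypothetical helper below pairs every PDF or SVG under an input folder with a mirrored path under the output folder. The name `plan_batch`, the recursive search, and the mirrored layout are all assumptions.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".svg"}


def plan_batch(input_dir: str, output_dir: str) -> list[tuple[Path, Path]]:
    """Return (source, destination) pairs for every supported file under input_dir."""
    src_root = Path(input_dir)
    dst_root = Path(output_dir)
    jobs = []
    for src in sorted(src_root.rglob("*")):
        # Match the suffix case-insensitively so "FIGURE.PDF" is also picked up.
        if src.is_file() and src.suffix.lower() in SUPPORTED_SUFFIXES:
            jobs.append((src, dst_root / src.relative_to(src_root)))
    return jobs
```

Each pair could then be handed to `fix_pdf` or `fix_svg` according to the file suffix.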

Input Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| --input | str | Yes* | Input file path (PDF or SVG) |
| --batch | str | No | Batch processing input folder |
| --output | str | Yes* | Output file/folder path |
| --interactive | bool | No | Enable interactive repair mode |
| --export-json | str | No | Export editable JSON format |
| --encoding | str | No | Specify source file encoding (default: auto-detect) |
| --font-substitution | dict | No | Font replacement mapping |
| --repair-level | str | No | Repair level: minimal, standard, aggressive (default: standard) |

*At least one of --input and --batch is required
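
The parameter table can be mirrored with `argparse` roughly as follows. This is a sketch, not the actual contents of `scripts/main.py`: passing `--font-substitution` as a JSON string and using the literal `"auto"` for encoding auto-detection are assumptions.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--input", help="Input file path (PDF or SVG)")
    parser.add_argument("--batch", help="Batch processing input folder")
    parser.add_argument("--output", help="Output file/folder path")
    parser.add_argument("--interactive", action="store_true")
    parser.add_argument("--export-json", metavar="PATH")
    parser.add_argument("--encoding", default="auto")  # assumed spelling of auto-detect
    # Assumption: the mapping arrives as JSON, e.g. '{"SimSun": "Arial"}'.
    parser.add_argument("--font-substitution", type=json.loads, default={})
    parser.add_argument(
        "--repair-level",
        choices=["minimal", "standard", "aggressive"],
        default="standard",
    )
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Enforce the footnote: at least one of --input and --batch must be given.
    if not (args.input or args.batch):
        parser.error("at least one of --input or --batch is required")
    return args


args = parse_args(["--input", "doc.pdf", "--output", "fixed.pdf"])
print(args.repair_level)  # standard
```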

Output Format

Repaired PDF/SVG

  • Maintains original vector format
  • Garbled text replaced with readable content
  • Fonts and layout remain unchanged

JSON Export Format

{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {
          "id": "tb_001",
          "bbox": [100, 200, 300, 220],
          "original_text": "�����",
          "detected_encoding": "UTF-8",
          "confidence": 0.3,
          "suggested_fix": "Sample Text"
        }
      ]
    }
  ],
  "fonts_used": ["Arial", "SimSun"],
  "repair_summary": {
    "total_blocks": 15,
    "fixed_blocks": 12,
    "skipped_blocks": 3
  }
}
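
A consumer of this export might want to list the blocks that still need human attention. The sketch below assumes that a low `confidence` value marks a block as unreliable and uses a hypothetical threshold of 0.5; neither assumption is stated by the format itself.

```python
import json


def blocks_needing_review(export: dict, threshold: float = 0.5) -> list[dict]:
    """Collect text blocks whose confidence is below the threshold."""
    flagged = []
    for page in export.get("pages", []):
        for block in page.get("text_blocks", []):
            if block.get("confidence", 0.0) < threshold:
                flagged.append({"page": page["page_num"], **block})
    return flagged


# A trimmed-down version of the export shown above.
sample = json.loads("""
{
  "file_type": "pdf",
  "pages": [
    {"page_num": 1, "text_blocks": [
      {"id": "tb_001", "confidence": 0.3, "suggested_fix": "Sample Text"},
      {"id": "tb_002", "confidence": 0.9, "suggested_fix": "Title"}
    ]}
  ]
}
""")

for block in blocks_needing_review(sample):
    print(block["page"], block["id"], block["suggested_fix"])  # 1 tb_001 Sample Text
```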

Garbled Text Detection Rules

The tool uses the following rules to detect garbled text:

  1. Replacement Character Detection: Identifies U+FFFD (�) and box characters
  2. Control Character Filtering: Excludes non-printing control characters
  3. Encoding Consistency: Detects anomalies caused by mixed encodings
  4. Font Fallback Detection: Identifies substitution characters generated due to missing fonts
  5. Probability Model: Estimates the probability that text is garbled from its character frequencies
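
Rules 1 and 2 can be approximated with a simple character scan. The sketch below is one possible reading, not the tool's actual detector: it treats U+FFFD, a small hand-picked set of box characters, and control, private-use, or unassigned code points as suspicious, and reports their share of the text. The name `garbled_score` and the chosen box characters are assumptions.

```python
import unicodedata

REPLACEMENT = "\ufffd"
# White square, white vertical rectangle, ballot box: common "missing glyph" stand-ins.
BOX_CHARS = {"\u25a1", "\u25af", "\u2610"}
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}  # control, private use, unassigned
ALLOWED_WHITESPACE = {"\t", "\n", "\r"}


def garbled_score(text: str) -> float:
    """Return the fraction of suspicious characters (0.0 clean, 1.0 fully garbled)."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch == REPLACEMENT or ch in BOX_CHARS:
            suspicious += 1
        elif ch not in ALLOWED_WHITESPACE and unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            suspicious += 1
    return suspicious / len(text)


print(garbled_score("\ufffd" * 5))    # 1.0
print(garbled_score("Sample Text"))  # 0.0
```

A caller could compare the score against a threshold to decide whether a block is worth repairing.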

Repair Strategies

Minimal

  • Only repairs obvious errors (replacement characters, null bytes)
  • Maintains maximum integrity of original text
  • Suitable for minor garbled text issues

Standard

  • Repairs common encoding issues
  • Smart font replacement
  • Balances repair rate and accuracy

Aggressive

  • Comprehensive text re-encoding
  • Uses OCR-assisted recognition
  • Suitable for severely garbled documents
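
As a rough illustration of the Minimal level, the sketch below strips null bytes and counts the replacement characters that remain, since a U+FFFD cannot be recovered without further information. The function name `minimal_repair` and the choice to report rather than delete U+FFFD are assumptions, not the tool's documented behavior.

```python
def minimal_repair(text: str) -> tuple[str, int]:
    """Strip null bytes and count replacement characters left for review."""
    cleaned = text.replace("\x00", "")
    unresolved = cleaned.count("\ufffd")
    return cleaned, unresolved


cleaned, unresolved = minimal_repair("Re\x00port \ufffd\ufffd 2024")
print(repr(cleaned))  # 'Report \ufffd\ufffd 2024'
print(unresolved)     # 2
```

A non-zero count would correspond to a block that still requires manual review.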

Examples

Fix Single Page PDF

Input:

python scripts/main.py --input report.pdf --output fixed_report.pdf

Output:

✓ Processing: report.pdf
✓ Detected 5 garbled text blocks
✓ Fixed 4 blocks automatically
⚠ 1 block requires manual review
✓ Output saved: fixed_report.pdf
✓ Report saved: fixed_report_repair_log.json

Export Editable JSON

Input:

python scripts/main.py --input diagram.svg --export-json editable.json

Output JSON Structure:

{
  "file_type": "svg",
  "svg_info": {
    "width": 800,
    "height": 600,
    "viewBox": "0 0 800 600"
  },
  "text_elements": [
    {
      "id": "text_1",
      "x": 100,
      "y": 200,
      "font_family": "Arial",
      "font_size": 14,
      "original": "�����",
      "user_editable": "",
      "confidence": 0.25
    }
  ]
}
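
Once the `user_editable` fields have been filled in, the edits have to be written back into the SVG. The sketch below shows one way to do that with the standard library, assuming that each JSON `id` matches the `id` attribute of a `<text>` element; the helper name `apply_edits` is hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def apply_edits(svg_source: str, export: dict) -> str:
    """Replace the text of SVG <text> elements with non-empty user_editable values."""
    ET.register_namespace("", SVG_NS)  # keep a default xmlns on output
    root = ET.fromstring(svg_source)
    by_id = {el.get("id"): el for el in root.iter(f"{{{SVG_NS}}}text")}
    for item in export.get("text_elements", []):
        replacement = item.get("user_editable")
        target = by_id.get(item.get("id"))
        # Skip entries the user left empty and ids that are not in the SVG.
        if replacement and target is not None:
            target.text = replacement
    return ET.tostring(root, encoding="unicode")


svg = (
    f'<svg xmlns="{SVG_NS}" width="800" height="600">'
    '<text id="text_1" x="100" y="200">\ufffd\ufffd\ufffd</text>'
    "</svg>"
)
edits = {"text_elements": [{"id": "text_1", "user_editable": "Figure 1"}]}
print(apply_edits(svg, edits))
```

Position and font attributes are left untouched, so the layout is preserved.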

Dependencies

pdfplumber>=0.10.0      # PDF parsing
PyMuPDF>=1.23.0         # PDF processing (fitz)
cairosvg>=2.7.0         # SVG conversion
beautifulsoup4>=4.12.0  # SVG parsing
fonttools>=4.40.0       # Font processing
chardet>=5.0.0          # Encoding detection
Pillow>=10.0.0          # Image processing

Limitations

  • Encrypted PDFs require password unlock before processing
  • Severely damaged vector files may not be fully repairable
  • Some rare fonts may not map correctly
  • Scanned PDFs require OCR recognition first

Version Information

  • Version: 1.0.0
  • Last Updated: 2026-02-06
  • Status: Ready for use

Risk Assessment

| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

Security Checklist

  • No hardcoded credentials or API keys
  • No unauthorized file system access (../)
  • Output does not expose sensitive information
  • Prompt injection protections in place
  • Input file paths validated (no ../ traversal)
  • Output directory restricted to workspace
  • Script execution in sandboxed environment
  • Error messages sanitized (no stack traces exposed)
  • Dependencies audited
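
The path validation and workspace restriction items can be checked with a small guard like the one below. It is a sketch under the assumption that all inputs and outputs must resolve inside a single workspace directory; the helper name `resolve_inside` is hypothetical.

```python
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve user_path against workspace and reject anything that escapes it."""
    root = Path(workspace).resolve()
    # Resolving collapses ".." segments and symlinks before the containment check.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate
```

With this guard, a relative path such as `output/fixed.pdf` resolves normally, while `../secrets.pdf` or an absolute path outside the workspace raises `ValueError`.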

Prerequisites

# Python dependencies
pip install -r requirements.txt

Evaluation Criteria

Success Metrics

  • Successfully executes main functionality
  • Output meets quality standards
  • Handles edge cases gracefully
  • Performance is acceptable

Test Cases

  1. Basic Functionality: Standard input → Expected output
  2. Edge Case: Invalid input → Graceful error handling
  3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

  • Current Stage: Draft
  • Next Review Date: 2026-03-06
  • Known Issues: None
  • Planned Improvements:
    • Performance optimization
    • Additional feature support
