Video Captions
Generate professional captions and subtitles with multi-engine transcription, word-level timing, styling presets, and burn-in.
Description
name: Video Captions
slug: video-captions
version: 1.0.1
homepage: https://clawic.com/skills/video-captions
description: Generate professional captions and subtitles with multi-engine transcription, word-level timing, styling presets, and burn-in.
changelog: Declared optional cloud API env vars in metadata to clarify that cloud engines require user-provided keys
metadata: {"clawdbot":{"emoji":"🎬","requires":{"bins":["ffmpeg","whisper"],"env":{"optional":["ASSEMBLYAI_API_KEY","DEEPGRAM_API_KEY"]}},"os":["linux","darwin"]}}
When to Use
User needs captions or subtitles for video content. Agent handles transcription, timing, formatting, styling, translation, and burn-in across all major formats and platforms.
Quick Reference
| Topic | File |
|---|---|
| Transcription engines | engines.md |
| Output formats | formats.md |
| Styling presets | styling.md |
| Platform requirements | platforms.md |
Core Rules
1. Engine Selection by Context
| Scenario | Engine | Why |
|---|---|---|
| Default (recommended) | Whisper local | 100% offline, no data leaves machine |
| Apple Silicon | MLX Whisper | Native acceleration, still local |
| Word timestamps | whisper-timestamped | DTW alignment, still local |
Default: Whisper local (turbo model). See engines.md for optional cloud alternatives.
2. Format Selection by Platform
| Platform | Format | Notes |
|---|---|---|
| YouTube | VTT or SRT | VTT preferred |
| Netflix/Pro | TTML | Strict timing rules |
| Social (TikTok, IG) | Burn-in (ASS) | Embedded in video |
| General | SRT | Universal compatibility |
| Karaoke/effects | ASS | Advanced styling |
Ask user's target platform if not specified.
3. Professional Timing Standards
Netflix-compliant (default):
- Min duration: 5/6 second (0.833s)
- Max duration: 7 seconds
- Max chars/line: 42
- Max lines: 2
- Gap between subtitles: 2+ frames
Social media:
- Shorter segments (2-4 words)
- More frequent breaks
- Centered or dynamic positioning
4. Segmentation Rules
Break lines:
- After punctuation marks
- Before conjunctions (and, but, or)
- Before prepositions
Never separate:
- Article from noun
- Adjective from noun
- First name from last name
- Verb from subject pronoun
- Auxiliary from verb
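These rules can be approximated by a scoring heuristic over candidate split points. This is an illustrative sketch only: the word lists and weights are assumptions, and production segmenters use part-of-speech tagging rather than fixed lists:

```python
CONJUNCTIONS = {"and", "but", "or", "nor", "so", "yet"}
ARTICLES = {"a", "an", "the"}

def best_split(words: list[str]) -> int:
    """Pick an index i so that words[:i] / words[i:] respects the break rules.

    Prefers splitting after punctuation or before a conjunction, and
    penalizes stranding an article at the end of the first line.
    """
    best, best_score = len(words) // 2, -1.0
    for i in range(1, len(words)):
        score = 0.0
        if words[i - 1][-1] in ".,;:?!":
            score += 3                    # break after punctuation
        if words[i].lower() in CONJUNCTIONS:
            score += 2                    # break before conjunction
        if words[i - 1].lower() in ARTICLES:
            score -= 5                    # never separate article from noun
        score -= abs(i - len(words) / 2) / len(words)  # prefer balanced lines
        if score > best_score:
            best, best_score = i, score
    return best

i = best_split("He paused, and the tall man smiled".split())
# splits after the comma: "He paused," / "and the tall man smiled"
```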
5. Word-Level Timestamps
Use word timestamps for:
- Karaoke-style highlighting
- Precise sync verification
- TikTok/Instagram animated captions
- Quality checking transcript accuracy
Enable with the `--word_timestamps True` flag (the Whisper CLI uses underscores in option names).
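For karaoke-style highlighting, word timings can be turned into ASS `\k` tags, which take durations in centiseconds. A minimal sketch, assuming the word list shape shown below (your transcription engine's JSON output may differ):

```python
def words_to_karaoke(words: list[dict]) -> str:
    """Build an ASS karaoke line from [{'word': str, 'start': s, 'end': s}, ...].

    Each word gets a {\\kNN} tag holding its duration in centiseconds.
    """
    parts = []
    for w in words:
        cs = round((w["end"] - w["start"]) * 100)  # seconds -> centiseconds
        parts.append(r"{\k%d}%s" % (cs, w["word"]))
    return "".join(parts)

line = words_to_karaoke([
    {"word": "Hello ", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 1.0},
])
# → "{\k40}Hello {\k50}world"
```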
6. Speaker Identification
For multi-speaker content:
- Use diarization (pyannote local, or cloud APIs if configured)
- Format: `[Speaker 1]` or `[Name]` if known
- SDH format: `JOHN: What do you think?`
7. Quality Verification
Before delivering:
- Check sync at start, middle, end
- Verify character limits per line
- Confirm speaker labels if multi-speaker
- Test burn-in render quality
Workflow
Basic Transcription
# Auto-detect language, output SRT
whisper video.mp4 --model turbo --output_format srt
# Specify language
whisper video.mp4 --model turbo --language es --output_format srt
# Multiple formats
whisper video.mp4 --model turbo --output_format all
Word-Level Timestamps
# Using whisper-timestamped
whisper_timestamped video.mp4 --model large-v3 --output_format srt
# With VAD pre-processing (reduces hallucinations)
whisper_timestamped video.mp4 --vad silero --accurate
Styled Subtitles (ASS)
# Generate SRT first, then convert with style
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Arial,FontSize=24,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" output.mp4
Burn-In for Social Media
# TikTok/Instagram style (centered, bold)
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Montserrat-Bold,FontSize=32,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=3,Shadow=0,Alignment=10,MarginV=50'" output.mp4
# Netflix style (bottom, clean)
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Netflix Sans,FontSize=48,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" output.mp4
Translation
# Transcribe + translate to English
whisper video.mp4 --model turbo --task translate --output_format srt
Format Conversion
# SRT to VTT
ffmpeg -i video.srt video.vtt
# SRT to ASS (for styling)
ffmpeg -i video.srt video.ass
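If ffmpeg is unavailable, SRT→VTT is simple enough to do by hand, since WebVTT is essentially SRT with a `WEBVTT` header and dot-separated milliseconds. A minimal sketch that ignores positioning and styling cues:

```python
import re

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add header, swap ',' for '.' in timestamps."""
    # 00:00:01,000 --> 00:00:04,000  =>  00:00:01.000 --> 00:00:04.000
    body = re.sub(
        r"(\d{2}:\d{2}:\d{2}),(\d{3})(\s*-->\s*)(\d{2}:\d{2}:\d{2}),(\d{3})",
        r"\1.\2\3\4.\5",
        srt_text,
    )
    return "WEBVTT\n\n" + body

vtt = srt_to_vtt("1\n00:00:01,000 --> 00:00:04,000\nHello world\n")
# vtt starts with "WEBVTT" and uses dot-millisecond timestamps
```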
Caption Traps
- Hallucinations on silence → Use VAD pre-processing or trim silent sections
- Wrong language detection → Specify `--language` explicitly for mixed content
- Timing drift in long videos → Use word timestamps + manual spot-checks
- Character limit violations → Set `--max_line_width 42` for Netflix compliance
- Missing speaker IDs → Enable diarization for multi-speaker content
- Burn-in quality loss → Use a high-bitrate output (`-b:v 8M`)
Common Scenarios
YouTube Video
- Transcribe: `whisper video.mp4 --output_format vtt`
- Upload the .vtt file to YouTube Studio
- Review auto-sync suggestions
TikTok/Instagram Reel
- Transcribe with word timestamps
- Apply bold animated style
- Burn-in: `ffmpeg -i video.mp4 -vf "subtitles=video.ass" -c:a copy output.mp4`
- Export at platform resolution
Netflix/Professional
- Use Whisper large-v3 for best local accuracy
- Export TTML format
- Verify: 42 chars/line, 2 lines max, timing gaps
- Include translator credit as last subtitle
Podcast/Interview
- Enable speaker diarization
- Format as dialogue: `[SPEAKER]: text`
- SDH option: include `[music]`, `[laughter]` descriptions
Foreign Film Translation
- Transcribe in original language
- Translate: `--task translate` for English
- Or use external translation + timing sync
External Endpoints
Default: 100% LOCAL processing. No network calls.
| Endpoint | Data Sent | When Used |
|---|---|---|
| Whisper (local) | None (local) | Default — always |
| api.assemblyai.com | Audio file | Only if user sets ASSEMBLYAI_API_KEY |
| api.deepgram.com | Audio file | Only if user sets DEEPGRAM_API_KEY |
Cloud APIs are documented as alternatives but never used unless user explicitly provides API keys and requests cloud processing. By default, all processing stays on your machine.
Security & Privacy
Default workflow is 100% offline:
- Whisper runs locally on your machine
- Generated subtitle files stay local
- Burned-in videos stay local
- No network calls made
Cloud APIs are OPTIONAL and OPT-IN:
- Only used if you set `ASSEMBLYAI_API_KEY` or `DEEPGRAM_API_KEY`
- Only triggered when you explicitly use cloud engine commands
- If you never set these keys, no audio ever leaves your machine
This skill does NOT:
- Upload anything by default
- Require internet connection for basic use
- Store data externally
Related Skills
Install with clawhub install <slug> if user confirms:
- `ffmpeg` — video/audio processing
- `video` — general video tasks
- `video-edit` — video editing
- `audio` — audio processing
Feedback
- If useful: `clawhub star video-captions`
- Stay updated: `clawhub sync`