name: local-tts
description: Local text-to-speech using Qwen3-TTS with mlx_audio (macOS Apple Silicon) or qwen-tts (Linux/Windows). Privacy-first offline TTS with natural, realistic voice cloning and voice design. Use for local, secure, high-quality multilingual speech synthesis.
license: MIT
Local TTS with Qwen3-TTS
Privacy-First | Offline | High-Quality | Natural Real Voices
Local text-to-speech synthesis using Qwen3-TTS models. Your text never leaves your machine.
Why Local TTS?
Unlike cloud TTS (Google, AWS, Azure), local-tts ensures:
- Zero data transmission - 100% on-device processing
- Works offline - No network required
- No API keys - No external services or accounts
- GDPR/HIPAA friendly - No third-party processor ever receives your text, which simplifies compliance
See privacy & security details.
Platform Overview
| Platform | Backend | Installation | Best For |
|---|---|---|---|
| macOS (Apple Silicon) | mlx_audio | pip install mlx-audio | M1/M2/M3/M4 Macs |
| Linux/Windows | qwen-tts | pip install qwen-tts | CUDA GPUs |
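The platform table can be expressed as a small selection helper. This is a minimal sketch, not part of the skill's scripts: `pick_backend` is a hypothetical name, and sending Intel Macs to qwen-tts is an assumption, since the table only covers Apple Silicon.

```python
import platform


def pick_backend(system=None, machine=None):
    """Return (backend, pip_package) following the platform table.

    Hypothetical helper for illustration; arguments default to the
    values reported by the running interpreter.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        # Apple Silicon Macs use the MLX backend.
        return ("mlx_audio", "mlx-audio")
    # Linux and Windows use qwen-tts. Intel Macs are not listed in the
    # table; falling through to qwen-tts here is an assumption.
    return ("qwen-tts", "qwen-tts")


if __name__ == "__main__":
    backend, package = pick_backend()
    print(f"backend: {backend} (pip install {package})")
```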
Quick Start
macOS
pip install mlx-audio
brew install ffmpeg
# Natural female voice
python -m mlx_audio.tts.generate \
--text "Hello world" \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
--voice Chelsie
Linux/Windows
pip install qwen-tts
# With optimizations (FlashAttention, bfloat16, auto-device)
python scripts/tts_linux.py "Hello world" --female
Key Concepts
--voice vs --instruct (Important)
| Model | --voice | --instruct | Notes |
|---|---|---|---|
| CustomVoice | Select preset voice | Add style/emotion | Can use together - voice + style control |
| VoiceDesign | N/A | Create voice from description | --instruct only |
| Base | N/A | N/A | For voice cloning with --ref_audio |
CustomVoice with style control:
python -m mlx_audio.tts.generate \
--text "Hello there!" \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
--voice Serena \
--instruct "excited and enthusiastic"
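The --voice / --instruct rules can be checked before launching a generation. The sketch below builds the argument list for `python -m mlx_audio.tts.generate` and rejects combinations the table marks as N/A. `build_tts_args` is a hypothetical helper, and treating --instruct as mandatory for VoiceDesign is an assumption drawn from the "--instruct only" note.

```python
def build_tts_args(text, model, voice=None, instruct=None):
    """Build a command line that respects the --voice/--instruct table.

    Hypothetical helper; the model family is inferred from the model ID.
    """
    families = ("CustomVoice", "VoiceDesign", "Base")
    kind = next((k for k in families if k in model), None)
    if kind is None:
        raise ValueError(f"unrecognized model family in {model!r}")
    if kind == "VoiceDesign":
        if voice is not None:
            raise ValueError("VoiceDesign does not take --voice")
        if instruct is None:
            # Assumption: the voice description is required here.
            raise ValueError("VoiceDesign needs --instruct")
    if kind == "Base" and (voice is not None or instruct is not None):
        raise ValueError("Base is for cloning; use --ref_audio instead")

    args = ["python", "-m", "mlx_audio.tts.generate",
            "--text", text, "--model", model]
    if voice is not None:
        args += ["--voice", voice]
    if instruct is not None:
        args += ["--instruct", instruct]
    return args


if __name__ == "__main__":
    print(build_tts_args(
        "Hello there!",
        "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
        voice="Serena",
        instruct="excited and enthusiastic",
    ))
```

Passing the returned list to `subprocess.run(..., check=True)` would run the command without any shell quoting concerns.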
9 Preset Voices (Open Source CustomVoice)
| Voice | Gender | Language | Character |
|---|---|---|---|
| Chelsie | Female | English (American) | Gentle, empathetic |
| Serena | Female | English | Warm, gentle |
| Ono Anna | Female | Japanese | Playful |
| Sohee | Female | Korean | Warm |
| Aiden | Male | English (American) | Sunny |
| Dylan | Male | English | Natural |
| Eric | Male | English | Real |
| Ryan | Male | English | Natural |
| Uncle Fu | Male | Chinese | Youthful Beijing |
Defaults: Female=Serena, Male=Aiden
Usage Examples
CustomVoice (Preset Voices)
# Natural female
python -m mlx_audio.tts.generate \
--text "Your text" --voice Serena --lang_code en \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit
# Real male
python -m mlx_audio.tts.generate \
--text "Your text" --voice Aiden --lang_code en \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit
VoiceDesign (Text-Based)
python -m mlx_audio.tts.generate \
--text "Hello" \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit \
--instruct "A warm female voice, professional and clear"
Long Text Generation
For long text, increase --max_tokens and enable --join_audio (macOS/MLX only):
python -m mlx_audio.tts.generate \
--text "Your very long text here..." \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
--voice Serena \
--max_tokens 4096 \
--join_audio \
--output long_audio.wav
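Long passages are awkward to paste into a shell, so one option is to read them from a file and pass them to the same command from Python. This is a sketch under stated assumptions: `long_text_command` and `synthesize_file` are hypothetical names, and the flags are exactly those in the command above.

```python
import subprocess
from pathlib import Path

MODEL = "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit"


def long_text_command(text, output="long_audio.wav", voice="Serena",
                      max_tokens=4096):
    """Return the argument list for a long-text generation run."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        "python", "-m", "mlx_audio.tts.generate",
        "--text", text,
        "--model", MODEL,
        "--voice", voice,
        "--max_tokens", str(max_tokens),
        "--join_audio",
        "--output", output,
    ]


def synthesize_file(path, **kwargs):
    """Read UTF-8 text from `path` and run the generation command."""
    text = Path(path).read_text(encoding="utf-8").strip()
    # A list argument bypasses the shell, so quotes in the text are safe.
    subprocess.run(long_text_command(text, **kwargs), check=True)
```

For example, `synthesize_file("chapter1.txt", output="chapter1.wav")` would synthesize a whole file in one call (the file names are placeholders).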
Voice Cloning
python -m mlx_audio.tts.generate \
--text "Cloned voice speaking" \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
--ref_audio sample.wav --ref_text "Sample transcript"
Parameters
| Parameter | Description | Values |
|---|---|---|
| --text | Text to speak | Required |
| --model | Model ID | See table below |
| --voice | Preset voice (CustomVoice) | Chelsie, Serena, Aiden, Ryan... |
| --instruct | Voice description (VoiceDesign) or style/emotion (CustomVoice) | e.g., "excited", "calm", "professional" |
| --speed | Speaking rate | 0.5-2.0 (default: 1.0) |
| --pitch | Voice pitch | 0.5-2.0 (default: 1.0) |
| --lang_code | Language | en, cn, ja, ko, de, fr... |
| --ref_audio | Reference audio for cloning (Base) | File path |
| --ref_text | Transcript of the reference audio (Base) | Text |
| --output | Output file | Path (auto-generated if omitted) |
| --max_tokens | Max generation tokens | Integer (default: 2048) - Increase for long text |
| --join_audio | Merge audio segments | true (default) or false - Recommended for long text |
Models
| Model | Size | Purpose |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7B | 9 preset voices + style control |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7B | Text-based voice creation |
| Qwen3-TTS-12Hz-1.7B-Base | 1.7B | Voice cloning |
| Qwen3-TTS-12Hz-0.6B-* | 0.6B | Lightweight versions |
macOS: Add mlx-community/ prefix (e.g., mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit)
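The naming pattern in the models table and the macOS note can be assembled programmatically. In this sketch `model_id` is a hypothetical helper, and the `-8bit` suffix is taken from the example above. It does not verify that a given size and variant combination is actually published.

```python
VARIANTS = ("CustomVoice", "VoiceDesign", "Base")
SIZES = ("1.7B", "0.6B")


def model_id(variant, size="1.7B", macos=False, quant="8bit"):
    """Compose a model ID from the models table.

    Hypothetical helper; only the naming pattern is checked, not
    whether the resulting model exists.
    """
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}")
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}")
    name = f"Qwen3-TTS-12Hz-{size}-{variant}"
    if macos:
        # MLX builds use the mlx-community/ prefix and a quantization
        # suffix, as in the example above.
        return f"mlx-community/{name}-{quant}"
    return name


if __name__ == "__main__":
    print(model_id("Base", macos=True))
```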
Scripts
- scripts/tts_macos.py - macOS wrapper
- scripts/tts_linux.py - Linux/Windows wrapper with optimizations
Optimizations (Linux/Windows)
tts_linux.py automatically enables:
- FlashAttention - Faster, less memory
- bfloat16 - Half the memory of float32 with the same exponent range, so it is more numerically stable than float16
- Auto device - CUDA → CPU fallback
- Mixed precision - Speed + quality
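The source of tts_linux.py is not shown here, so the following is only a guess at what a CUDA → CPU fallback with bfloat16 typically looks like in PyTorch. `pick_device_and_dtype` is a hypothetical name, and using float16 on GPUs without bfloat16 support is an assumption.

```python
def pick_device_and_dtype():
    """Return (device, dtype_name) using a CUDA -> CPU fallback.

    Illustrative sketch only; not the actual logic of tts_linux.py.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate.
        return ("cpu", "float32")
    if torch.cuda.is_available():
        # Prefer bfloat16 where the GPU supports it; falling back to
        # float16 otherwise is an assumption.
        if torch.cuda.is_bf16_supported():
            return ("cuda", "bfloat16")
        return ("cuda", "float16")
    return ("cpu", "float32")


if __name__ == "__main__":
    device, dtype = pick_device_and_dtype()
    print(f"device={device} dtype={dtype}")
```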
Troubleshooting
| Issue | Solution |
|---|---|
| macOS: Model not found | Use mlx-community/ prefix |
| macOS: Audio format | brew install ffmpeg |
| Linux: CUDA OOM | Use 0.6B models |
| Linux: Slow | Check CUDA: torch.cuda.is_available() |
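The checks in the troubleshooting table can be run in one pass. This is a minimal sketch: `check_environment` is a hypothetical helper, and it reports `None` for CUDA when PyTorch is not installed.

```python
import importlib.util
import shutil


def check_environment(model):
    """Run the troubleshooting checks and return a report dict."""
    report = {
        # macOS audio format issues: ffmpeg must be on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # macOS "model not found": MLX models need the prefix.
        "model_has_mlx_prefix": model.startswith("mlx-community/"),
        # Linux slowness: None means PyTorch is not installed.
        "cuda_available": None,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    model = "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit"
    for name, value in check_environment(model).items():
        print(f"{name}: {value}")
```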
Version
1.0.0 - See VERSION and package.json