🧪 Skills
Ai Media
Generate photorealistic images, videos, talking heads, and natural TTS audio using GPU-accelerated AI models and scripts on a remote server.
v1.0.1
Description
ai-media - AI Media Generation
Full-stack AI media generation powered by GPU server (RTX 3090/3080/2070S).
Capabilities
- Image Generation — Photorealistic images via ComfyUI (z-image, Juggernaut XL)
- Video Generation — Video synthesis via ComfyUI (AnimateDiff, LTX-2)
- Talking Heads — Animated talking faces via SadTalker
- Voice Synthesis — Natural TTS via Voxtral (whisper.cpp)
GPU Server
- Host:
${GPU_USER}@${GPU_HOST} - SSH Key:
~/.ssh/id_ed25519_gpu - ComfyUI:
/data/ai-stack/comfyui/ComfyUI/(port 8188) - SadTalker:
/data/ai-stack/sadtalker/ - Voxtral:
/data/ai-stack/whisper/ - Output:
/data/ai-stack/output/
Usage
Generate Image
./scripts/image.sh "lady on beach at sunset" realistic
./scripts/image.sh "cyberpunk cityscape" artistic
Arguments:
$1: Prompt text$2: Style (realistic|artistic) — optional, default: realistic
Output: Path to generated image (e.g., /data/ai-stack/output/image_001.png)
Generate Video
./scripts/video.sh "waves crashing on shore" animatediff 4
./scripts/video.sh "city traffic timelapse" ltx2 8
Arguments:
$1: Prompt text$2: Model (animatediff|ltx2) — optional, default: animatediff$3: Duration in seconds — optional, default: 4
Output: Path to generated video (e.g., /data/ai-stack/output/video_001.mp4)
Generate Talking Head
./scripts/talking-head.sh "Hello, I'm Agent" gentle input.jpg
./scripts/talking-head.sh "Welcome to the future" neutral photo.png
Arguments:
$1: Speech text$2: Voice style (gentle|neutral|energetic) — optional, default: gentle$3: Avatar image path — optional, generates default if not provided
Output: Path to talking head video (e.g., /data/ai-stack/output/talking_001.mp4)
Generate Audio
./scripts/audio.sh "This is a test message" en male
./scripts/audio.sh "Bonjour le monde" fr female
Arguments:
$1: Text to speak$2: Language code (en|fr|es|etc) — optional, default: en$3: Voice gender (male|female) — optional, default: male
Output: Path to audio file (e.g., /data/ai-stack/output/audio_001.wav)
Models Available
Image Models
- z-image — 6B params, S3-DiT, photorealistic (downloading, 43% complete)
- Juggernaut XL v9 — SDXL-based, versatile (7.1GB, ready)
Video Models
- AnimateDiff — SD 1.5 motion module (512x512, working ✅)
- LTX-2 — 19B params, high quality (14GB checkpoint ready, Gemma encoder ready)
Talking Head Models
- SadTalker — Audio-driven head animation (working ✅)
Voice Models
- Voxtral — whisper.cpp-based TTS (installed)
Dependencies
All dependencies are pre-installed on GPU server:
- ComfyUI with custom nodes (AnimateDiff-Evolved, VideoHelperSuite)
- SadTalker with face enhancer
- Voxtral with whisper.cpp
- FFmpeg for video encoding
Error Handling
Scripts will:
- Check SSH connectivity before execution
- Validate GPU server is running
- Return meaningful error messages
- Clean up failed generations automatically
Performance
- Image: ~10-20s for 1024x1024
- Video (AnimateDiff): ~20-30s for 512x512, 16 frames
- Video (LTX-2): ~60-90s for 768x512, 4s @ 24fps
- Talking Head: ~30-40s for 10s video
- Audio: ~2-5s for 30s speech
Future Enhancements
- Batch generation support
- Style transfer capabilities
- Video upscaling (spatial + temporal)
- Multi-language voice cloning
- Real-time preview streaming
Status: Active development Maintainer: Agent GPU Server: ${GPU_USER}@${GPU_HOST}
Reviews (0)
Sign in to write a review.
No reviews yet. Be the first to review!
Comments (0)
No comments yet. Be the first to share your thoughts!