---
name: sogni-gen
version: "1.5.16"
description: Generate images and videos using Sogni AI's decentralized network, with local credential/config files and optional local media inputs. Ask the agent to "draw", "generate", "create an image", or "make a video/animate" from a prompt or reference image.
homepage: https://sogni.ai
metadata:
  clawdbot:
    emoji: "🎨"
    primaryEnv: "SOGNI_API_KEY"
    os: ["darwin", "linux", "win32"]
    requires:
      bins: ["node"]
      anyBins: ["ffmpeg"]
    env:
      - "SOGNI_API_KEY"
      - "SOGNI_USERNAME"
      - "SOGNI_PASSWORD"
      - "SOGNI_CREDENTIALS_PATH"
      - "SOGNI_LAST_RENDER_PATH"
      - "SOGNI_MEDIA_INBOUND_DIR"
      - "OPENCLAW_CONFIG_PATH"
      - "OPENCLAW_PLUGIN_CONFIG"
      - "FFMPEG_PATH"
      - "SOGNI_DOWNLOADS_DIR"
      - "SOGNI_MCP_SAVE_DOWNLOADS"
      - "SOGNI_ALLOWED_DOWNLOAD_HOSTS"
    config:
      - "~/.config/sogni/credentials"
      - "~/.openclaw/openclaw.json"
      - "~/.clawdbot/media/inbound"
      - "~/.config/sogni/last-render.json"
      - "~/Downloads/sogni"
    install:
      - id: npm
        kind: exec
        command: "cd {{skillDir}} && cp skill-package.json package.json && npm i"
        label: "Prepare runtime dependencies"
---
# Sogni Image & Video Generation
Generate images and videos using Sogni AI's decentralized GPU network.
## Setup

- Get Sogni credentials at https://app.sogni.ai/
- Create a credentials file:

  ```bash
  mkdir -p ~/.config/sogni
  cat > ~/.config/sogni/credentials << 'EOF'
  SOGNI_API_KEY=your_api_key
  # or:
  # SOGNI_USERNAME=your_username
  # SOGNI_PASSWORD=your_password
  EOF
  chmod 600 ~/.config/sogni/credentials
  ```

  You can also export `SOGNI_API_KEY` (or `SOGNI_USERNAME` + `SOGNI_PASSWORD`) instead of writing the file.

- Install dependencies (if cloned):

  ```bash
  cd /path/to/sogni-gen
  npm i
  ```

- Or install from npm (no git clone):

  ```bash
  mkdir -p ~/.clawdbot/skills
  cd ~/.clawdbot/skills
  npm i sogni-gen
  ln -sfn node_modules/sogni-gen sogni-gen
  ```

When this skill is distributed via ClawHub, it bootstraps its local runtime dependencies from `skill-package.json` during install; this avoids relying on a root `package.json` being present in the published skill artifact.
## Filesystem Paths and Overrides

Default file paths used by this skill:

- Credentials file (read): `~/.config/sogni/credentials`
- Last render metadata (read/write): `~/.config/sogni/last-render.json`
- OpenClaw config (read): `~/.openclaw/openclaw.json`
- Media listing for `--list-media` (read): `~/.clawdbot/media/inbound`
- MCP local result copies (write): `~/Downloads/sogni`

Path override environment variables:

- `SOGNI_CREDENTIALS_PATH`
- `SOGNI_LAST_RENDER_PATH`
- `SOGNI_MEDIA_INBOUND_DIR`
- `OPENCLAW_CONFIG_PATH`
- `SOGNI_DOWNLOADS_DIR` (MCP)
- `SOGNI_MCP_SAVE_DOWNLOADS=0` to disable MCP local file writes
- `SOGNI_ALLOWED_DOWNLOAD_HOSTS` to override which HTTPS hosts the MCP server may auto-download and save locally
## Usage (Images & Video)

```bash
# Generate and get URL
node sogni-gen.mjs "a cat wearing a hat"

# Save to file
node sogni-gen.mjs -o /tmp/cat.png "a cat wearing a hat"

# JSON output (for scripting)
node sogni-gen.mjs --json "a cat wearing a hat"

# Check token balances (no prompt required)
node sogni-gen.mjs --balance

# Check token balances in JSON
node sogni-gen.mjs --json --balance

# Quiet mode (suppress progress)
node sogni-gen.mjs -q -o /tmp/cat.png "a cat wearing a hat"
```
## Options

| Flag | Description | Default |
|---|---|---|
| `-o, --output <path>` | Save to file | prints URL |
| `-m, --model <id>` | Model ID | `z_image_turbo_bf16` |
| `-w, --width <px>` | Width | 512 |
| `-h, --height <px>` | Height | 512 |
| `-n, --count <num>` | Number of images | 1 |
| `-t, --timeout <sec>` | Timeout seconds | 30 (300 for video) |
| `-s, --seed <num>` | Specific seed | random |
| `--last-seed` | Reuse seed from last render | - |
| `--seed-strategy <s>` | Seed strategy: `random`\|`prompt-hash` | `prompt-hash` |
| `--multi-angle` | Multiple angles LoRA mode (Qwen Image Edit) | - |
| `--angles-360` | Generate 8 azimuths (front -> front-left) | - |
| `--angles-360-video` | Assemble looping 360 mp4 using i2v between angles (requires ffmpeg) | - |
| `--azimuth <key>` | `front`\|`front-right`\|`right`\|`back-right`\|`back`\|`back-left`\|`left`\|`front-left` | `front` |
| `--elevation <key>` | `low-angle`\|`eye-level`\|`elevated`\|`high-angle` | `eye-level` |
| `--distance <key>` | `close-up`\|`medium`\|`wide` | `medium` |
| `--angle-strength <n>` | LoRA strength for multiple_angles | 0.9 |
| `--angle-description <text>` | Optional subject description | - |
| `--steps <num>` | Override steps (model-dependent) | - |
| `--guidance <num>` | Override guidance (model-dependent) | - |
| `--output-format <f>` | Image output format: `png`\|`jpg` | `png` |
| `--sampler <name>` | Sampler (model-dependent) | - |
| `--scheduler <name>` | Scheduler (model-dependent) | - |
| `--lora <id>` | LoRA id (repeatable, edit only) | - |
| `--loras <ids>` | Comma-separated LoRA ids | - |
| `--lora-strength <n>` | LoRA strength (repeatable) | - |
| `--lora-strengths <n>` | Comma-separated LoRA strengths | - |
| `--token-type <type>` | Token type: `spark`\|`sogni` | `spark` |
| `--balance, --balances` | Show SPARK/SOGNI balances and exit | - |
| `-c, --context <path>` | Context image for editing | - |
| `--last-image` | Use last generated image as context/ref | - |
| `--video, -v` | Generate video instead of image | - |
| `--workflow <type>` | Video workflow (`t2v`\|`i2v`\|`s2v`\|`ia2v`\|`a2v`\|`v2v`\|`animate-move`\|`animate-replace`) | inferred |
| `--fps <num>` | Frames per second (video) | 16 |
| `--duration <sec>` | Duration in seconds (video) | 5 |
| `--frames <num>` | Override total frames (video) | - |
| `--auto-resize-assets` | Auto-resize video assets | true |
| `--no-auto-resize-assets` | Disable auto-resize | - |
| `--estimate-video-cost` | Estimate video cost and exit (requires `--steps`) | - |
| `--photobooth` | Face transfer mode (InstantID + SDXL Turbo) | - |
| `--cn-strength <n>` | ControlNet strength (photobooth) | 0.8 |
| `--cn-guidance-end <n>` | ControlNet guidance end point (photobooth) | 0.3 |
| `--ref <path>` | Reference image for video or photobooth face | required for video/photobooth |
| `--ref-end <path>` | End frame for i2v interpolation | - |
| `--ref-audio <path>` | Reference audio for s2v | - |
| `--ref-video <path>` | Reference video for animate/v2v workflows | - |
| `--controlnet-name <name>` | ControlNet type for v2v: `canny`\|`pose`\|`depth`\|`detailer` | - |
| `--controlnet-strength <n>` | ControlNet strength for v2v (0.0-1.0) | 0.8 |
| `--sam2-coordinates <coords>` | SAM2 click coords for animate-replace (`x,y` or `x1,y1;x2,y2`) | - |
| `--trim-end-frame` | Trim last frame for seamless video stitching | - |
| `--first-frame-strength <n>` | Keyframe strength for start frame (0.0-1.0) | - |
| `--last-frame-strength <n>` | Keyframe strength for end frame (0.0-1.0) | - |
| `--last` | Show last render info | - |
| `--json` | JSON output | false |
| `--strict-size` | Do not auto-adjust i2v video size for reference resizing constraints | false |
| `-q, --quiet` | No progress output | false |
| `--extract-last-frame <video> <image>` | Extract last frame from video (safe ffmpeg wrapper) | - |
| `--concat-videos <out> <clips...>` | Concatenate video clips (safe ffmpeg wrapper) | - |
| `--list-media [type]` | List recent inbound media (`images`\|`audio`\|`all`) | `images` |
## OpenClaw Config Defaults

When installed as an OpenClaw plugin, sogni-gen reads defaults from `~/.openclaw/openclaw.json`:

```json
{
  "plugins": {
    "entries": {
      "sogni-gen": {
        "enabled": true,
        "config": {
          "defaultImageModel": "z_image_turbo_bf16",
          "defaultEditModel": "qwen_image_edit_2511_fp8_lightning",
          "defaultPhotoboothModel": "coreml-sogniXLturbo_alpha1_ad",
          "videoModels": {
            "t2v": "wan_v2.2-14b-fp8_t2v_lightx2v",
            "i2v": "wan_v2.2-14b-fp8_i2v_lightx2v",
            "s2v": "wan_v2.2-14b-fp8_s2v_lightx2v",
            "ia2v": "ltx2-19b-fp8_ia2v_distilled",
            "a2v": "ltx2-19b-fp8_a2v_distilled",
            "animate-move": "wan_v2.2-14b-fp8_animate-move_lightx2v",
            "animate-replace": "wan_v2.2-14b-fp8_animate-replace_lightx2v",
            "v2v": "ltx2-19b-fp8_v2v_distilled"
          },
          "defaultVideoWorkflow": "t2v",
          "defaultNetwork": "fast",
          "defaultTokenType": "spark",
          "seedStrategy": "prompt-hash",
          "modelDefaults": {
            "flux1-schnell-fp8": { "steps": 4, "guidance": 3.5 },
            "flux2_dev_fp8": { "steps": 20, "guidance": 7.5 }
          },
          "defaultWidth": 768,
          "defaultHeight": 768,
          "defaultCount": 1,
          "defaultFps": 16,
          "defaultDurationSec": 5,
          "defaultImageTimeoutSec": 30,
          "defaultVideoTimeoutSec": 300,
          "credentialsPath": "~/.config/sogni/credentials",
          "lastRenderPath": "~/.config/sogni/last-render.json",
          "mediaInboundDir": "~/.clawdbot/media/inbound"
        }
      }
    }
  }
}
```
CLI flags always override these defaults. If your OpenClaw config lives elsewhere, set `OPENCLAW_CONFIG_PATH`.

Seed strategies: `prompt-hash` (deterministic) or `random`.
## Image Models

| Model | Speed | Use Case |
|---|---|---|
| `z_image_turbo_bf16` | Fast (~5-10s) | General purpose, default |
| `flux1-schnell-fp8` | Very fast | Quick iterations |
| `flux2_dev_fp8` | Slow (~2min) | High quality |
| `chroma-v.46-flash_fp8` | Medium | Balanced |
| `qwen_image_edit_2511_fp8` | Medium | Image editing with context (up to 3) |
| `qwen_image_edit_2511_fp8_lightning` | Fast | Quick image editing |
| `coreml-sogniXLturbo_alpha1_ad` | Fast | Photobooth face transfer (SDXL Turbo) |
## Video Models

### WAN 2.2 Models

| Model | Speed | Use Case |
|---|---|---|
| `wan_v2.2-14b-fp8_i2v_lightx2v` | Fast | Default video generation |
| `wan_v2.2-14b-fp8_i2v` | Slow | Higher quality video |
| `wan_v2.2-14b-fp8_t2v_lightx2v` | Fast | Text-to-video |
| `wan_v2.2-14b-fp8_s2v_lightx2v` | Fast | Sound-to-video |
| `wan_v2.2-14b-fp8_animate-move_lightx2v` | Fast | Animate-move |
| `wan_v2.2-14b-fp8_animate-replace_lightx2v` | Fast | Animate-replace |
### LTX-2 / LTX-2.3 Models

| Model | Speed | Use Case |
|---|---|---|
| `ltx2-19b-fp8_t2v_distilled` | Fast (~2-3min) | Text-to-video, 8-step |
| `ltx2-19b-fp8_t2v` | Medium (~5min) | Text-to-video, 20-step quality |
| `ltx2-19b-fp8_i2v_distilled` | Fast (~2-3min) | Image-to-video, 8-step |
| `ltx2-19b-fp8_i2v` | Medium (~5min) | Image-to-video, 20-step quality |
| `ltx2-19b-fp8_ia2v_distilled` | Fast (~2-3min) | Image+audio-to-video |
| `ltx2-19b-fp8_a2v_distilled` | Fast (~2-3min) | Audio-to-video |
| `ltx2-19b-fp8_v2v_distilled` | Fast (~3min) | Video-to-video with ControlNet |
| `ltx2-19b-fp8_v2v` | Medium (~5min) | Video-to-video with ControlNet, quality |
| `ltx23-22b-fp8_t2v_distilled` | Fast (~2-3min) | Text-to-video, LTX-2.3 |
## Image Editing with Context

Edit images using reference images (Qwen models support up to 3):

```bash
# Single context image
node sogni-gen.mjs -c photo.jpg "make the background a beach"

# Multiple context images (subject + style)
node sogni-gen.mjs -c subject.jpg -c style.jpg "apply the style to the subject"

# Use last generated image as context
node sogni-gen.mjs --last-image "make it more vibrant"
```

When context images are provided without `-m`, the model defaults to `qwen_image_edit_2511_fp8_lightning`.
## Photobooth (Face Transfer)

Generate stylized portraits from a face photo using InstantID ControlNet. When a user mentions "photobooth", wants a stylized portrait of themselves, or asks to transfer their face into a style, use `--photobooth` with `--ref` pointing to their face image.

```bash
# Basic photobooth
node sogni-gen.mjs --photobooth --ref face.jpg "80s fashion portrait"

# Multiple outputs
node sogni-gen.mjs --photobooth --ref face.jpg -n 4 "LinkedIn professional headshot"

# Custom ControlNet tuning
node sogni-gen.mjs --photobooth --ref face.jpg --cn-strength 0.6 --cn-guidance-end 0.5 "oil painting"
```

Uses SDXL Turbo (`coreml-sogniXLturbo_alpha1_ad`) at 1024x1024 by default. The face image is passed via `--ref` and styled according to the prompt. Photobooth cannot be combined with `--video` or `-c`/`--context`.

Agent usage:

```bash
# Photobooth: stylize a face photo
node {{skillDir}}/sogni-gen.mjs -q --photobooth --ref /path/to/face.jpg -o /tmp/stylized.png "80s fashion portrait"

# Multiple photobooth outputs
node {{skillDir}}/sogni-gen.mjs -q --photobooth --ref /path/to/face.jpg -n 4 -o /tmp/stylized.png "LinkedIn professional headshot"
```
## Multiple Angles (Turnaround)

Generate specific camera angles from a single reference image using the Multiple Angles LoRA:

```bash
# Single angle
node sogni-gen.mjs --multi-angle -c subject.jpg \
  --azimuth front-right --elevation eye-level --distance medium \
  --angle-strength 0.9 \
  "studio portrait, same person"

# 360 sweep (8 azimuths)
node sogni-gen.mjs --angles-360 -c subject.jpg --distance medium --elevation eye-level \
  "studio portrait, same person"

# 360 sweep video (looping mp4, uses i2v between angles; requires ffmpeg)
node sogni-gen.mjs --angles-360 --angles-360-video /tmp/turntable.mp4 \
  -c subject.jpg --distance medium --elevation eye-level \
  "studio portrait, same person"
```

The prompt is auto-built with the required `<sks>` token plus the selected camera-angle keywords.

`--angles-360-video` generates i2v clips between consecutive angles (including last→first) and concatenates them with ffmpeg for a seamless loop.
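The looping structure can be pictured as pairing each of the eight azimuths with its successor, wrapping the last back to the first. A small sketch (the angle names come from the `--azimuth` keys; the pairing helper itself is illustrative, not the skill's source):

```javascript
// The eight azimuth keys, in sweep order.
const ANGLES = [
  "front", "front-right", "right", "back-right",
  "back", "back-left", "left", "front-left",
];

// Pair each angle with the next one, wrapping around so the loop closes.
function loopSegments(angles) {
  return angles.map((a, i) => [a, angles[(i + 1) % angles.length]]);
}
```

This yields eight i2v segments; the final segment runs front-left back to front, which is why the concatenated mp4 loops without a visible seam.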
### 360 Video Best Practices

When a user requests a "360 video", follow this workflow:

1. Default camera parameters (do not ask unless they specify):
   - Elevation: default to `eye-level`
   - Distance: default to `medium`
2. Map user terms to flags:

   | User says | Flag value |
   |---|---|
   | "high" angle | `--elevation high-angle` |
   | "medium" angle | `--elevation eye-level` |
   | "low" angle | `--elevation low-angle` |
   | "close" | `--distance close-up` |
   | "medium" distance | `--distance medium` |
   | "far" | `--distance wide` |

3. Always use first-frame/last-frame stitching: the `--angles-360-video` flag handles this automatically by generating i2v clips between consecutive angles (including last→first) for seamless looping.
4. Example command:

   ```bash
   node sogni-gen.mjs --angles-360 --angles-360-video /tmp/output.mp4 \
     -c /path/to/image.png --elevation eye-level --distance medium \
     "description of subject"
   ```
### Transition Video Rule

For any transition video work, always use the Sogni skill/plugin rather than raw ffmpeg or other shell commands. Use the built-in `--extract-last-frame`, `--concat-videos`, and `--looping` flags for video manipulation.
### Insufficient Funds Handling

When you see "Debit Error: Insufficient funds", reply:

> "Insufficient funds. Claim 50 free daily Spark points at https://app.sogni.ai/"
## Video Generation

Generate videos from a text prompt or a reference image:

```bash
# Text-to-video (t2v)
node sogni-gen.mjs --video "ocean waves at sunset"

# Basic video from image
node sogni-gen.mjs --video --ref cat.jpg -o cat.mp4 "cat walks around"

# Use last generated image as reference
node sogni-gen.mjs --last-image --video "gentle camera pan"

# Custom duration and FPS
node sogni-gen.mjs --video --ref scene.png --duration 10 --fps 24 "zoom out slowly"

# Sound-to-video (s2v)
node sogni-gen.mjs --video --ref face.jpg --ref-audio speech.m4a \
  -m wan_v2.2-14b-fp8_s2v_lightx2v "lip sync talking head"

# Image+audio-to-video (ia2v, LTX)
node sogni-gen.mjs --video --workflow ia2v --ref cover.jpg --ref-audio song.mp3 \
  "music video with synchronized motion"

# Audio-to-video (a2v, LTX)
node sogni-gen.mjs --video --workflow a2v --ref-audio song.mp3 \
  "abstract audio-reactive visualizer"

# LTX-2.3 text-to-video
node sogni-gen.mjs --video -m ltx23-22b-fp8_t2v_distilled --duration 20 \
  "A wide cinematic aerial shot opens over steep tropical cliffs at golden hour, warm sunlight grazing the rock faces while sea mist drifts above the water below. Palm trees bend gently along the ridge as waves roll against the shoreline, leaving bright bands of foam across the dark stone. The camera glides forward in one continuous pass, revealing more of the coastline as sunlight flickers across wet surfaces and distant birds wheel through the haze. The scene holds a calm, upscale travel-film mood with smooth stabilized motion and crisp environmental detail."

# Animate (motion transfer)
node sogni-gen.mjs --video --ref subject.jpg --ref-video motion.mp4 \
  --workflow animate-move "transfer motion"
```
## Video-to-Video (V2V) with ControlNet

Transform an existing video using LTX-2 models with ControlNet guidance:

```bash
# Basic v2v with canny edge detection
node sogni-gen.mjs --video --workflow v2v --ref-video input.mp4 \
  --controlnet-name canny "stylized anime version"

# V2V with pose detection and custom strength
node sogni-gen.mjs --video --workflow v2v --ref-video dance.mp4 \
  --controlnet-name pose --controlnet-strength 0.7 "robot dancing"

# V2V with depth map
node sogni-gen.mjs --video --workflow v2v --ref-video scene.mp4 \
  --controlnet-name depth "watercolor painting style"
```

ControlNet types: `canny` (edge detection), `pose` (body pose), `depth` (depth map), `detailer` (detail enhancement).
## Photo Restoration

Restore damaged vintage photos using Qwen image editing:

```bash
# Basic restoration
sogni-gen -c damaged_photo.jpg -o restored.png \
  "professionally restore this vintage photograph, remove damage and scratches"

# Detailed restoration with preservation hints
sogni-gen -c old_photo.jpg -o restored.png -w 1024 -h 1280 \
  "restore this vintage photo, remove peeling, tears and wear marks, \
  preserve natural features and expression, maintain warm nostalgic color tones"
```

Tips for good restorations:

- Describe the damage: "peeling", "scratches", "tears", "fading"
- Specify what to preserve: "natural features", "eye color", "hair", "expression"
- Mention the era for color tones: "1970s warm tones", "vintage sepia"

Finding received images (Telegram/etc.):

```bash
node {{skillDir}}/sogni-gen.mjs --json --list-media images
```

Do NOT use `ls`, `cp`, or other shell commands to browse user files. Always use `--list-media` to find inbound media.
## IMPORTANT KEYWORD RULE

- If the user message includes the word "photobooth" (case-insensitive), always use `--photobooth` mode with `--ref` set to the user-provided face image.
- Prioritize this rule over generic image-edit flows (`-c`) for that request.
## LTX-2.3 Prompt Rule

Whenever the chosen video model is `ltx23-22b-fp8_t2v_distilled`, do not pass the user's short request through unchanged. Rewrite it into an LTX-2.3-safe prompt before calling sogni-gen.

- Output one single paragraph only. No line breaks, bullet points, section labels, tag lists, or screenplay formatting.
- Use 4-8 flowing present-tense sentences describing one continuous shot. No cuts, montage, or unrelated scene jumps.
- Start with shot scale plus the scene's visual identity, then describe environment, time of day, atmosphere, textures, and specific light sources.
- Keep people, clothing, props, and locations concrete and stable across the whole paragraph.
- Give the scene one main action thread from start to finish. Use connectors like "as", "while", and "then" so motion reads as a continuous filmed moment.
- If the user asks for dialogue, embed the spoken words inline as prose and identify who is speaking and how they deliver the line.
- Express emotion through visible physical cues such as posture, grip, jaw tension, breathing, or pacing. Ambient sound can be woven into the prose naturally.
- Use positive phrasing only. Do not add negative prompts, "no ..." clauses, on-screen text/logo requests, vague filler words like "beautiful" or "nice", or structural markup such as `[DIALOGUE]`.
- Keep action density proportional to duration. For short clips, describe one main beat rather than several separate events.
- Preserve the user's request, but expand it into cinematic prose. Do not invent a different story just to make the prompt longer.
### Duration-Aware Pacing

Match scene density to clip length so prompts stay filmable:

- About 1-4s: describe exactly 1 action or moment.
- About 5-8s: describe about 2 sequential actions.
- About 9-12s: describe about 3 sequential actions.
- Longer clips: add only a small number of additional sequential beats. Do not turn the prompt into a montage or a full story arc unless the duration clearly supports it.
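The pacing rule amounts to a simple step function from clip duration to beat count. A sketch (the first three thresholds come from the list above; the growth rate for longer clips is an assumption, since the rule only says "a small number of additional beats"):

```javascript
// Map clip duration (seconds) to the number of sequential beats to describe.
function beatsForDuration(seconds) {
  if (seconds <= 4) return 1;   // one action or moment
  if (seconds <= 8) return 2;   // about two sequential actions
  if (seconds <= 12) return 3;  // about three sequential actions
  // Longer clips: add beats sparingly (assumed: one extra beat per ~6s).
  return 3 + Math.ceil((seconds - 12) / 6);
}
```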
### Orientation Mapping

When the user explicitly asks for an orientation or aspect ratio, map it to safe LTX dimensions:

- `vertical`, `portrait`, `story`, `reel`, `tiktok` → `-w 1088 -h 1920`
- `landscape`, `horizontal`, `widescreen`, `youtube`, `16:9` → `-w 1920 -h 1088`
- `square`, `1:1` → `-w 1088 -h 1088`
- `4:3 portrait` → `-w 832 -h 1088`
- `4:3 landscape` → `-w 1088 -h 832`
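Expressed as a lookup table (the keywords and dimensions are copied from the mapping above; the helper itself is illustrative, not part of the skill):

```javascript
// Orientation keyword -> [width, height]; every value is divisible by 16.
const LTX_DIMENSIONS = {
  vertical: [1088, 1920], portrait: [1088, 1920], story: [1088, 1920],
  reel: [1088, 1920], tiktok: [1088, 1920],
  landscape: [1920, 1088], horizontal: [1920, 1088], widescreen: [1920, 1088],
  youtube: [1920, 1088], "16:9": [1920, 1088],
  square: [1088, 1088], "1:1": [1088, 1088],
  "4:3 portrait": [832, 1088],
  "4:3 landscape": [1088, 832],
};

// Build the CLI size flags for a recognized keyword, or null if unknown.
function sizeFlags(keyword) {
  const dims = LTX_DIMENSIONS[keyword.toLowerCase()];
  return dims ? `-w ${dims[0]} -h ${dims[1]}` : null;
}
```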
### Camera Language Normalization

When the user uses loose camera language, translate it into concrete motion phrasing inside the prose prompt:

- `zoom in` → "slow push-in"
- `zoom out` → "slow pull-back"
- `pan left` / `pan right` → "smooth pan left" / "smooth pan right"
- `orbit` / `circle around` → "slow arc left" or "slow arc right"
- `follow` → "tracking follow"
Short example:
User ask: "4k video of a woman in a neon alley"
Use this shape instead: "A medium cinematic shot frames a woman in her 30s standing in a rain-soaked neon alley at night, violet and amber signs reflecting across the wet pavement while warm steam drifts from street vents. She wears a dark trench coat with damp strands of black hair clinging near her cheek as light glances across the fabric texture and the brick walls behind her. She turns toward the camera and steps forward with measured focus, one hand tightening around the strap of her bag while rain taps softly on the metal fire escape and a distant train hum rolls through the block. The camera performs a slow push-in as her jaw sets and her breathing steadies, maintaining smooth stabilized motion and a tense urban-thriller mood."
## Agent Usage

When the user asks to generate/draw/create an image:

```bash
# Generate and save locally
node {{skillDir}}/sogni-gen.mjs -q -o /tmp/generated.png "user's prompt"

# Edit an existing image
node {{skillDir}}/sogni-gen.mjs -q -c /path/to/input.jpg -o /tmp/edited.png "make it pop art style"

# Generate video from image
node {{skillDir}}/sogni-gen.mjs -q --video --ref /path/to/image.png -o /tmp/video.mp4 "A medium shot holds on the subject in soft late-afternoon light as fabric edges and background details remain clear and stable. The camera performs a slow push-in while the subject shifts weight subtly and turns slightly toward the lens, keeping the motion gentle and continuous. Leaves rustle softly in the background and the scene maintains smooth cinematic movement with no abrupt action changes."

# Generate text-to-video
node {{skillDir}}/sogni-gen.mjs -q --video -o /tmp/video.mp4 "A wide cinematic shot opens on ocean waves rolling toward a rocky shoreline at sunset, golden light spreading across the water while sea mist drifts through the air. Foam patterns form and recede over the dark sand as the horizon glows orange and pink in the distance. The camera glides forward in one continuous movement, holding smooth stabilized motion and calm environmental detail throughout the scene."

# HD / "4K" text-to-video: prefer LTX-2.3
node {{skillDir}}/sogni-gen.mjs -q --video -m ltx23-22b-fp8_t2v_distilled -w 1920 -h 1088 -o /tmp/video.mp4 "A wide cinematic aerial shot opens over a rugged ocean coastline at golden hour, warm sunlight catching the cliff faces while white surf breaks against dark rock below. Low sea mist hangs over the water and bands of foam trace the shoreline as gulls wheel through the distance. The camera glides forward in one continuous pass, revealing the curve of the coast while wet stone flashes with reflected light and the scene keeps smooth stabilized motion from start to finish. The overall mood feels expansive and polished, with crisp environmental detail and steady travel-film energy."

# HD / "4K" image-to-video: prefer LTX i2v
node {{skillDir}}/sogni-gen.mjs -q --video --ref /path/to/image.png -m ltx2-19b-fp8_i2v_distilled -w 1920 -h 1088 -o /tmp/video.mp4 "A medium cinematic shot holds on the scene with clean subject separation and stable environmental detail as directional light shapes the surfaces and background depth. The camera performs a slow push-in while the main subject makes one subtle continuous movement, keeping posture and identity consistent from start to finish. Ambient motion in the background stays gentle and the overall clip remains smooth, stabilized, and visually coherent."

# Photobooth: stylize a face photo
node {{skillDir}}/sogni-gen.mjs -q --photobooth --ref /path/to/face.jpg -o /tmp/stylized.png "80s fashion portrait"

# Check current SPARK/SOGNI balances (no prompt required)
node {{skillDir}}/sogni-gen.mjs --json --balance

# Find user-sent images/audio
node {{skillDir}}/sogni-gen.mjs --json --list-media images
# Then send via message tool with filePath
```
## High-Res Video Routing

When the user asks for video in "hd", "1080p", "4k", "uhd", or "high-res", do not use the default WAN video models.

- For text-to-video, use `-m ltx23-22b-fp8_t2v_distilled`.
- For image-to-video, use `-m ltx2-19b-fp8_i2v_distilled`.
- Prefer LTX-sized dimensions such as `-w 1920 -h 1088`.
- If the user explicitly asks for `vertical`, `portrait`, `story`, `reel`, `tiktok`, `square`, or `4:3`, apply the matching dimensions from the Orientation Mapping rules instead of defaulting to 16:9.
- Rewrite the user's request using the LTX-2.3 Prompt Rule before invoking the command. Do not send short slogan-style prompts to LTX.
- Treat "4k" as a signal to use the highest practical LTX path exposed by this skill, even if the exact output is not literal 3840x2160.

**Security:** Agents must use the CLI's built-in flags (`--extract-last-frame`, `--concat-videos`, `--list-media`) for all file operations and video manipulation. Never run raw shell commands (`ffmpeg`, `ls`, `cp`, etc.) directly.
## Animate Between Two Images (First-Frame / Last-Frame)

When a user asks to animate between two images, use `--ref` (first frame) and `--ref-end` (last frame) to create a creative interpolation video:

```bash
# Animate from image A to image B
node {{skillDir}}/sogni-gen.mjs -q --video --ref /tmp/imageA.png --ref-end /tmp/imageB.png -o /tmp/transition.mp4 "descriptive prompt of the transition"
```
## Animate a Video to an Image (Scene Continuation)

When a user asks to animate from a video to an image (or "continue" a video into a new scene):

1. Extract the last frame of the existing video using the built-in safe wrapper:

   ```bash
   node {{skillDir}}/sogni-gen.mjs --extract-last-frame /tmp/existing.mp4 /tmp/lastframe.png
   ```

2. Generate a new video using the last frame as `--ref` and the target image as `--ref-end`:

   ```bash
   node {{skillDir}}/sogni-gen.mjs -q --video --ref /tmp/lastframe.png --ref-end /tmp/target.png -o /tmp/continuation.mp4 "scene transition prompt"
   ```

3. Concatenate the videos using the built-in safe wrapper:

   ```bash
   node {{skillDir}}/sogni-gen.mjs --concat-videos /tmp/full_sequence.mp4 /tmp/existing.mp4 /tmp/continuation.mp4
   ```

This ensures visual continuity: the new clip picks up exactly where the previous one ended.

Do NOT run raw ffmpeg commands. Always use `--extract-last-frame` and `--concat-videos` for video manipulation.

Always apply this pattern when:

- User says "animate image A to image B" → use `--ref A --ref-end B`
- User says "animate this video to this image" → extract the last frame, use it as `--ref` and the target image as `--ref-end`, then stitch
- User says "continue this video" with a target image → same as above
## JSON Output

```json
{
  "success": true,
  "prompt": "a cat wearing a hat",
  "model": "z_image_turbo_bf16",
  "width": 512,
  "height": 512,
  "urls": ["https://..."],
  "localPath": "/tmp/cat.png"
}
```

On error (with `--json`), the script returns a single JSON object like:

```json
{
  "success": false,
  "error": "Video width and height must be divisible by 16 (got 500x512).",
  "errorCode": "INVALID_VIDEO_SIZE",
  "hint": "Choose --width/--height divisible by 16. For i2v, also match the reference aspect ratio."
}
```

Balance check example (`--json --balance`):

```json
{
  "success": true,
  "type": "balance",
  "spark": 12.34,
  "sogni": 0.56
}
```
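A wrapper script can branch on the shared `success` flag across all three shapes. A minimal sketch (field names are taken from the examples above; the error-handling policy is the caller's choice):

```javascript
// Parse one JSON result from `sogni-gen --json` and branch on success.
function handleResult(stdout) {
  const res = JSON.parse(stdout);
  if (!res.success) {
    // Surface the structured error fields shown in the error example above.
    throw new Error(`${res.errorCode}: ${res.error} (hint: ${res.hint})`);
  }
  return res; // e.g. res.urls / res.localPath for renders, res.spark for --balance
}
```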
## Cost

Uses Spark tokens from your Sogni account. 512x512 images are the most cost-efficient.
## Troubleshooting

- Auth errors: check `SOGNI_API_KEY` or the credentials in `~/.config/sogni/credentials`.
- i2v sizing gotchas: video sizes are constrained (min 480px, max 1536px, divisible by 16). For i2v, the client wrapper resizes the reference (`fit: inside`) and uses the resized dimensions as the final video size. Because this uses rounding, a requested size can still yield an invalid final size (example: `1024x1536` requested, but the reference becomes `1024x1535`).
- Auto-adjustment: with a local `--ref`, the script auto-adjusts the requested size to avoid resized reference dimensions that are not divisible by 16.
- If the script adjusts your size but you want it to fail instead, pass `--strict-size` and it will print a suggested `--width`/`--height`.
- Timeouts: try a faster model or increase the `-t` timeout.
- No workers: check https://sogni.ai for network status.
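To reason about what a valid dimension looks like, one can clamp to the documented 480-1536 range and round to the nearest multiple of 16. This is an illustrative sketch of the constraint, not the script's actual adjustment algorithm:

```javascript
// Clamp a pixel dimension into [480, 1536] and snap it to a multiple of 16.
function snapDimension(px) {
  const clamped = Math.min(1536, Math.max(480, px));
  return Math.round(clamped / 16) * 16;
}
```

For instance, a 500px request snaps to 496, and the problematic 1535px resized reference from the example above snaps back to 1536.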