🧪 Skills

IMA Studio TTS — seed-tts, DouBao

TTS (text-to-speech) via IMA Open API with seed-tts-2.0. Voice synthesis, speech from text, dubbing, audio content creation. Output: audio URL (mp3/wav). Flo...

v1.0.5
❤️ 0
⬇️ 155
👁 1
Share

Description


name: IMA Studio TTS version: 1.0.0 category: file-generation author: IMA Studio (imastudio.com) keywords: imastudio, tts, text-to-speech, speech synthesis, voice, IMA, seed-tts, seed-tts-2.0 argument-hint: "[text to speak]" description: > TTS (text-to-speech) via IMA Open API with seed-tts-2.0. Voice synthesis, speech from text, dubbing, audio content creation. Output: audio URL (mp3/wav). Flow: query products, create task, poll until done. Requires IMA API key.

IMA TTS (Text-to-Speech)

Overview

Call IMA Open API to create text-to-speech audio. Same flow as other IMA creation skills: query products → create task → poll until done. Task type is text_to_speech. This skill targets seed-tts-2.0 only — seed-tts-1.1 is not supported; the script defaults to seed-tts-2.0 when no model is specified.

⚙️ How This Skill Works

This skill uses a bundled Python script scripts/ima_tts_create.py to call the IMA Open API:

  • Sends text (prompt) to https://api.imastudio.com
  • Uses --user-id only locally for preference storage
  • Returns an audio URL when synthesis is complete
  • Reflection mechanism: on create failure, retries up to 3 times with parameter adjustments

What gets sent to IMA: prompt (text to speak), model selection, parameters (e.g. voice_id, speed). Not sent: API key in prompt body; user_id is local only.

Agent Execution

Use the bundled script:

# List available TTS models (optional; default is seed-tts-2.0)
python3 {baseDir}/scripts/ima_tts_create.py --api-key $IMA_API_KEY --list-models

# Generate speech (default model: seed-tts-2.0; omit --model-id to use default)
python3 {baseDir}/scripts/ima_tts_create.py \
  --api-key $IMA_API_KEY \
  --model-id seed-tts-2.0 \
  --prompt "Text to be spoken here." \
  --user-id {user_id} \
  --output-json

Script outputs JSON; parse it for url and pass to the user via the UX protocol below.


Environment

Base URL: https://api.imastudio.com

Header Required Value
Authorization Bearer ima_your_api_key_here
x-app-source ima_skills
x_app_language recommended en / zh

⚠️ MANDATORY: Always Query Product List First

You MUST call /open/v1/product/list with category=text_to_speech before creating any task. attribute_id is required; if 0 or missing → "Invalid product attribute" and task fails.

GET /open/v1/product/list?app=ima&platform=web&category=text_to_speech

Then traverse the V2 tree: type=2 = model groups, type=3 = versions (leaves). Only type=3 nodes have credit_rules and form_config. Use a leaf’s model_id, id (= model_version), and credit_rules[0].attribute_id / points for create.


Core Flow

1. GET /open/v1/product/list?app=ima&platform=web&category=text_to_speech
   → Get attribute_id, credit, model_version, form_config

2. POST /open/v1/tasks/create
   → task_type: "text_to_speech", parameters[].parameters.prompt = text to speak

3. POST /open/v1/tasks/detail  { "task_id": "..." }
   → Poll every 2–5s until medias[].resource_status == 1 and status != "failed"
   → Read medias[].url (and optional duration_str, format)

Task Detail API — Actual Response Shape

Poll POST /open/v1/tasks/detail until completion. Response uses the same structure as other IMA audio tasks:

Field Type Meaning
resource_status int or null 0=处理中, 1=可用, 2=失败, 3=已删除;null 视为 0
status string "pending" / "processing" / "success" / "failed"
url string Audio URL when resource_status=1 (mp3/wav)
duration_str string Optional, e.g. "30s"
format string Optional, e.g. "mp3", "wav"

Completed success example:

{
  "id": "task_xxx",
  "medias": [{
    "resource_status": 1,
    "status": "success",
    "url": "https://cdn.../output.mp3",
    "duration_str": "12s",
    "format": "mp3"
  }]
}

Rules:

  • Treat resource_status: null as 0 (processing).
  • Success only when all medias have resource_status == 1 and status != "failed".
  • On resource_status == 2 or status == "failed", stop and handle error (e.g. use error_msg / remark).

API 2: Create Task

POST /open/v1/tasks/create

text_to_speech — no image input. src_img_url: [], input_images: [].

{
  "task_type": "text_to_speech",
  "enable_multi_model": false,
  "src_img_url": [],
  "parameters": [{
    "attribute_id":  "<from credit_rules>",
    "model_id":      "<model_id>",
    "model_name":    "<model_name>",
    "model_version": "<version_id>",
    "app":           "ima",
    "platform":      "web",
    "category":      "text_to_speech",
    "credit":        "<points>",
    "parameters": {
      "prompt":       "Text to be spoken.",
      "n":            1,
      "input_images": [],
      "cast":         {"points": "<points>", "attribute_id": "<attribute_id>"}
    }
  }]
}

prompt must be inside parameters[].parameters, not at top level. Extra fields (e.g. voice_id, speed) come from product form_config; include only those present in the product’s credit_rules/form_config.

Response: data.id = task_id for polling.


Supported Task Type & Models

category Capability Input
text_to_speech Text → Speech prompt (text to speak)

Models: This skill supports seed-tts-2.0 only (seed-tts-1.1 is not supported). The script defaults to --model-id seed-tts-2.0 when none is provided. For current attribute_id and credit, the script reads from the product list at runtime.

seed-tts-2.0 — Verified request parameters

The following parameters[].parameters shape has been verified to work for seed-tts-2.0 (attribute_id/credit come from product list and may differ by app/platform):

Parameter Type Required Description
prompt string Text to speak (合成文本).
n int Usually 1.
model string Sub-model: seed-tts-2.0-expressive (default) or seed-tts-2.0-standard.
speaker string optional Speaker ID / 发音人,e.g. zh_male_sophie_uranus_bigtts音色列表 1257544 中原生 voice_type). 注意: 使用原生格式(如 zh_male_*_uranus_bigtts),不支持 BV*_streaming 格式。
audio_params object optional emotion(情感)、speech_rate(语速 [-50,100])、loudness_rate(音量 [-50,100])等,见 1598757 请求 Body.
additions object optional e.g. {"explicit_language": "crosslingual", "context_texts": []}.
cast object {"points": <credit>, "attribute_id": <attribute_id>} from product list.

Script example with extra params:

python3 ima_tts_create.py --api-key $IMA_API_KEY --model-id seed-tts-2.0 \
  --prompt "阳光青年音色测试,你好世界。" \
  --extra-params '{"model":"seed-tts-2.0-expressive","speaker":"zh_male_sophie_uranus_bigtts","audio_params":{"emotion":"neutral"},"additions":{"explicit_language":"crosslingual","context_texts":[]}}' \
  --output-json

Note: The script gets attribute_id and credit from the product list (e.g. app=ima&platform=web → often 2 pts / attribute_id 4419 for seed-tts-2.0). If you have a different app/platform (e.g. webAgent), the product list may return different credit_rules (e.g. 5 pts / attribute_id 8987); the script uses whatever the product list returns for the chosen model.

Speaker / 音色列表(seed-tts-2.0 兼容火山引擎音色): 完整音色 ID 与场景分类见项目内 volcengine_tts_timbre_list.json。该文件来自 火山引擎豆包语音合成音色列表,使用原生 voice_type 格式(如 zh_male_sophie_uranus_bigtts 魅力苏菲、zh_female_vv_uranus_bigtts Vivi)。⚠️ 注意: IMA API 只支持原生格式(*_uranus_bigtts 系列),不支持 BV*_streaming 豆包音色 ID。

与火山引擎 2.0 文档对照: 上述参数与 HTTP Chunked/SSE 单向流式 V3 请求 Body 一致:req_params.text → prompt,req_params.speaker → speaker(必填项),req_params.model → model(expressive/standard),req_params.audio_params(emotion、speech_rate、loudness_rate 等),req_params.additions(如 explicit_language)。2.0 能力说明见 豆包语音合成2.0能力介绍(语音指令、引用上文、语音标签等)。


🎤 当用户说「帮我制作旁白/配音」时如何询问

当用户表达「帮我制作旁白」「做一段配音」「把这段文字读出来」等意图时,必须先收集关键信息再调用脚本,避免缺参或盲目默认。

必问

询问项 对应参数 说明
要朗读的内容 / 旁白文案 prompt 合成文本,必填。若用户只给主题,可请用户提供具体文案或由你生成后让用户确认。

建议问(让用户选择)

询问项 对应参数 选项来源与示例
音色 / 发音人 speaker 从项目内 volcengine_tts_timbre_list.json(或 音色列表 1257544)按场景推荐:通用场景(魅力苏菲 zh_male_sophie_uranus_bigtts、Vivi zh_female_vv_uranus_bigtts、云舟 zh_male_m191_uranus_bigtts)、视频配音(大壹 zh_male_dayi_uranus_bigtts、猴哥 zh_male_sunwukong_uranus_bigtts)、角色扮演(知性灿灿 zh_female_cancan_uranus_bigtts、撒娇学妹 zh_female_sajiaoxuemei_uranus_bigtts)。可简短列出 3–5 个候选让用户选,或问「要男声/女声?偏解说/读书/助手?」再缩小范围。⚠️ 使用原生格式*_uranus_bigtts)。

可选问(按需补充)

询问项 对应参数 说明与取值
情感 / 情绪 audio_params.emotion 部分音色支持,如 neutral、sad、angry;详见 音色列表-多情感音色
语速 audio_params.speech_rate 范围 [-50, 100],0 为正常,100 约 2 倍速。可通过 --extra-params '{"audio_params":{"speech_rate":20}}' 传入。
音量 audio_params.loudness_rate 范围 [-50, 100],0 为正常(mix 音色不支持)。
模型风格 model seed-tts-2.0-expressive(默认,表现力强)或 seed-tts-2.0-standard(更稳定)。

脚本对应: --prompt 必填;--speaker--emotion 直接支持;语速/音量/模型等通过 --extra-params 传入 JSON(见上文 Script example)。


📥 User Input Parsing (Parameter Recognition)

Map user intent to parameters using product form_config (e.g. voice, speed):

User intent / phrasing Parameter (if in form_config) Notes
旁白 / 配音 / 朗读 / 把这段读出来 prompt + speaker(建议问) 先问清内容与音色,再调用;见上方「当用户说制作旁白/配音时如何询问」。
女声 / 女声朗读 / female voice voice_id / voice_type / speaker Use value from form_config or e.g. speaker ID
男声 / 男声朗读 / male voice voice_id / voice_type / speaker Use value from form_config or e.g. speaker ID
发音人 / 音色 / speaker speaker seed-tts-2.0: e.g. zh_male_sophie_uranus_bigtts,见 volcengine_tts_timbre_list.json(原生格式)
情感 / 情绪 / emotion audio_params.emotion e.g. "neutral", "sad";部分音色支持
语速快/慢 / speed up/slow audio_params.speech_rate 范围 [-50, 100],0 为正常
音调 / pitch pitch If supported
大声/小声 / volume audio_params.loudness_rate 范围 [-50, 100]
风格 expressive/standard model seed-tts-2.0: seed-tts-2.0-expressive / seed-tts-2.0-standard

If the user does not specify, use form_config defaults. Do not send parameters not present in the product’s credit_rules/attributes or form_config (reflection will strip them on retry).


🧠 User Preference Memory

Storage: ~/.openclaw/memory/ima_prefs.json

{
  "user_{user_id}": {
    "text_to_speech": {
      "model_id": "...",
      "model_name": "...",
      "credit": 2,
      "last_used": "..."
    }
  }
}
  • Before generation: Load prefs; if user_{user_id}.text_to_speech exists, use that model and optionally mention it.
  • After success: Save used model to user_{user_id}.text_to_speech.
  • On explicit change: e.g. “换成XXX” / “以后都用XXX” → switch and save.

💬 User Experience Protocol (IM / Feishu / Discord)

TTS usually completes in a few seconds to tens of seconds. Do not leave users without feedback.

Step 0 — Initial Acknowledgment (Normal Reply)

First reply with a short acknowledgment (normal reply, not message tool), e.g.:

  • 好的,正在帮你把这段文字转成语音。
  • OK, converting this text to speech.

Step 1 — Pre-Generation Notification (message tool)

Push once:

🔊 开始语音合成,请稍候…
• 模型:[Model Name]
• 预计耗时:[X ~ Y 秒]
• 消耗积分:[N pts]

Step 2 — Progress

Poll every 2–5s. Every 10–15s send a progress update, e.g.:

⏳ 语音合成中… [P]%
已等待 [elapsed]s,预计最长 [max]s

Cap progress at 95% until API returns success.

Step 3 — Success (message tool)

When resource_status == 1 and status != "failed", send the audio and caption:

  • media = medias[0].url
  • caption example:
✅ 语音合成成功!
• 模型:[Model Name]
• 耗时:[actual]s
• 消耗积分:[N pts]
🔗 原始链接:[url]

Use the URL from the API (do not use local file paths).

Step 4 — Failure (message tool)

On failure or API/network error, send a short, user-friendly message and suggestions:

❌ 语音合成失败
• 原因:[自然语言原因]
• 建议:换个模型重试或检查文本长度/内容

需要我帮你用其他模型重试吗?

Error translation (do not expose raw API/technical errors):

Technical ✅ Say (CN) ✅ Say (EN)
401 Unauthorized 密钥无效或未授权,请至 imaclaw.ai 生成新密钥 API key invalid; generate at imaclaw.ai
4008 Insufficient points 积分不足,请至 imaclaw.ai 购买积分 Insufficient points; buy at imaclaw.ai
Invalid product attribute 参数配置异常,请稍后重试 Configuration error, try again later
Error 6006 / 6010 积分或参数不匹配,请换模型或重试 Points/params mismatch, try another model
resource_status == 2 / status failed 合成失败,建议换模型或缩短文本 Synthesis failed, try another model or shorter text
timeout 合成超时,请稍后重试 Timed out, try again later
Network error 网络不稳定,请检查后重试 Network unstable, check and retry

Links: API key — https://www.imaclaw.ai/imaclaw/apikey ;Credits — https://www.imaclaw.ai/imaclaw/subscription

Step 5 — Done

After Step 0–4, no further reply needed. Do not send duplicate confirmations.


Common Mistakes

Mistake Fix
prompt at top level Put prompt inside parameters[].parameters
Wrong or missing attribute_id Always call product list first; use credit_rules[0]
Single poll Poll until all medias have resource_status == 1
Ignoring status when resource_status=1 Check status != "failed"
Sending params not in form_config/credit_rules Use only params from product list; script reflection will strip others on retry

Security & Local Data

  • Network: This skill uses only https://api.imastudio.com (no image upload domain for TTS).
  • Local files: ~/.openclaw/memory/ima_prefs.json (preferences), ~/.openclaw/logs/ima_skills/ (logs, e.g. 7-day retention). No prompts or API keys stored.
  • API key: Set via environment (e.g. IMA_API_KEY) or agent config; never hardcode.

Python Example (Minimal)

import time
import requests

BASE = "https://api.imastudio.com"
HEADERS = {
    "Authorization": "Bearer ima_your_key",
    "Content-Type": "application/json",
    "x-app-source": "ima_skills",
}

# 1. Product list
r = requests.get(
    f"{BASE}/open/v1/product/list",
    headers=HEADERS,
    params={"app": "ima", "platform": "web", "category": "text_to_speech"},
)
tree = r.json()["data"]
# ... find type=3 node, get attribute_id, model_id, model_version, credit ...

# 2. Create task
body = {
    "task_type": "text_to_speech",
    "enable_multi_model": False,
    "src_img_url": [],
    "parameters": [{
        "attribute_id": attribute_id,
        "model_id": model_id,
        "model_name": model_name,
        "model_version": model_version,
        "app": "ima", "platform": "web",
        "category": "text_to_speech",
        "credit": credit,
        "parameters": {
            "prompt": "Hello, world.",
            "n": 1,
            "input_images": [],
            "cast": {"points": credit, "attribute_id": attribute_id},
        },
    }],
}
r = requests.post(f"{BASE}/open/v1/tasks/create", headers=HEADERS, json=body)
task_id = r.json()["data"]["id"]

# 3. Poll
while True:
    r = requests.post(f"{BASE}/open/v1/tasks/detail", headers=HEADERS, json={"task_id": task_id})
    task = r.json()["data"]
    medias = task.get("medias") or []
    if not medias:
        time.sleep(3)
        continue
    rs = medias[0].get("resource_status")
    if rs is None: rs = 0
    if rs == 2 or (medias[0].get("status") or "").lower() == "failed":
        raise RuntimeError(medias[0].get("error_msg") or "failed")
    if rs == 1 and (medias[0].get("url") or medias[0].get("watermark_url")):
        url = medias[0]["url"] or medias[0]["watermark_url"]
        print(url)  # e.g. https://cdn.../output.mp3
        break
    time.sleep(3)

Quick Reference

Item Value
Task type text_to_speech
Product list GET /open/v1/product/list?category=text_to_speech
Create POST /open/v1/tasks/create (prompt inside parameters[].parameters)
Poll POST /open/v1/tasks/detail every 2–5s
Done when All medias: resource_status=1, status≠"failed", url present
Script scripts/ima_tts_create.py (--list-models, --model-id, --prompt, --output-json)

Supported Models & Search Terms

Model: seed-tts-2.0 (also known as: seed tts, seed-tts, ByteDance TTS)

Capabilities: text-to-speech (TTS), speech synthesis, voice synthesis, voice generation, text to speech, dubbing, narration, voiceover, audio generation

Reviews (0)

Sign in to write a review.

No reviews yet. Be the first to review!

Comments (0)

Sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Compatible Platforms

Pricing

Free

Related Configs