name: sensevoice-transcribe
description: Transcribe audio files (WAV/MP3/M4A/FLAC) to timestamped text using SenseVoice-Small + FSMN-VAD. Supports single-file and batch mode with VAD-anchored per-segment timestamps (~15s granularity). Use when the user wants to transcribe speech/audio, run batch transcription on daylog recordings, or re-transcribe specific dates. Replaces the old whisper-transcribe skill.

SenseVoice Transcribe

Transcribe audio to timestamped text using FunASR's iic/SenseVoiceSmall model with fsmn-vad for timestamp anchoring.

Pipeline

1. FSMN-VAD segments the audio into speech regions (~258 segments for a 30-minute file).
2. SenseVoice-Small transcribes the full audio with merge_vad=True.
3. The raw text is split on <|zh|> tags, and each piece is cleaned with rich_transcription_postprocess().
4. Text segments are mapped proportionally onto the VAD timestamps.
5. Output is one "[HH:MM:SS] text" line per segment, at ~15s granularity.
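
The proportional mapping step can be sketched in isolation, without loading any model. This is a minimal illustration, assuming the VAD step yields a list of [start_ms, end_ms] pairs; the function name map_texts_to_vad and the sample data are hypothetical and are not part of FunASR or the bundled script.

```python
from datetime import datetime, timedelta


def map_texts_to_vad(texts, vad_segs, start_dt=None):
    """Give each text segment the start time of a proportionally chosen VAD segment.

    texts    -- cleaned text segments, in spoken order
    vad_segs -- [[start_ms, end_ms], ...] speech regions (assumed shape)
    start_dt -- recording start time, or None to emit plain second offsets
    """
    if not texts:
        return []
    ratio = len(vad_segs) / len(texts)
    lines = []
    for i, text in enumerate(texts):
        if vad_segs:
            idx = min(int(i * ratio), len(vad_segs) - 1)
            start_ms = vad_segs[idx][0]
        else:
            start_ms = 0  # no speech regions found: fall back to offset 0
        if start_dt is not None:
            stamp = (start_dt + timedelta(milliseconds=start_ms)).strftime("%H:%M:%S")
        else:
            stamp = f"{start_ms // 1000}s"
        lines.append(f"[{stamp}] {text}")
    return lines


# Made-up data: six speech regions, three text segments.
vad = [[0, 4000], [6000, 9000], [13000, 19000],
       [20000, 24000], [27000, 30000], [35000, 41000]]
start = datetime(2026, 3, 8, 9, 41, 29)
for line in map_texts_to_vad(["first", "second", "third"], vad, start):
    print(line)
```

With six regions and three text segments the ratio is 2, so the segments land on regions 0, 2 and 4. The timestamps are therefore approximations anchored to VAD boundaries, not exact alignments of each sentence.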

Environment

Venv: ~/.openclaw/venvs/sensevoice/
Python: 3.12
Key packages: funasr==1.3.1, modelscope, onnxruntime
Model cache: ~/.cache/modelscope/hub/models/iic/SenseVoiceSmall
VAD cache: ~/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch

First-time Setup

python3 -m venv ~/.openclaw/venvs/sensevoice
source ~/.openclaw/venvs/sensevoice/bin/activate
pip install funasr==1.3.1 modelscope onnxruntime
# Models auto-download on first run (~234MB SenseVoice + ~4MB VAD)

Usage

Single File

source ~/.openclaw/venvs/sensevoice/bin/activate
python3 -c "
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
from datetime import datetime, timedelta
import re

wav = '<WAV_PATH>'
# Parse start time from filename: TX01_MIC015_20260308_124130_orig.wav
m = re.search(r'(\d{8})_(\d{6})', wav)
start_dt = datetime.strptime(m.group(1)+m.group(2), '%Y%m%d%H%M%S') if m else None

vad_model = AutoModel(model='fsmn-vad', disable_update=True)
model = AutoModel(model='iic/SenseVoiceSmall', vad_model='fsmn-vad',
                  vad_kwargs={'max_single_segment_time': 30000}, device='cpu')

vad_segs = vad_model.generate(input=wav)[0].get('value', [])
res = model.generate(input=wav, cache={}, language='zh', use_itn=True,
                     batch_size_s=60, merge_vad=True)

texts = [rich_transcription_postprocess(s).strip()
         for s in re.split(r'<\|zh\|>', res[0]['text']) if s.strip()]
texts = [s for s in texts if len(s) > 1]

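# Map each text segment onto a VAD segment proportionally (timestamps are approximate)
ratio = len(vad_segs) / len(texts) if texts else 1
for i, t in enumerate(texts):
    # Fall back to offset 0 if VAD detected no speech regions
    start_ms = vad_segs[min(int(i * ratio), len(vad_segs) - 1)][0] if vad_segs else 0
    ts = (start_dt + timedelta(milliseconds=start_ms)).strftime('%H:%M:%S') if start_dt else f'{start_ms // 1000}s'
    print(f'[{ts}] {t}')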
"

Batch Mode (daylog)

The bundled scripts/batch_transcribe.py handles the full daylog pipeline:

source ~/.openclaw/venvs/sensevoice/bin/activate
cd ~/Documents/dec/daylog

# Dry run — see what would be transcribed
python3 scripts/batch_transcribe.py --dry-run

# Transcribe all new files
python3 scripts/batch_transcribe.py

# Re-transcribe specific dates (deletes existing, then re-runs)
python3 scripts/batch_transcribe.py --force-dates 2026-03-07,2026-03-08

# With progress file + Discord webhook
python3 scripts/batch_transcribe.py \
  --progress-file /tmp/daylog-progress.json \
  --discord-webhook https://discord.com/api/webhooks/...

Flags:

Flag	Description
`--dry-run`	Preview without writing
`--engine sensevoice|whisper`	Engine (default: sensevoice)
`--force-dates YYYY-MM-DD,...`	Delete & re-transcribe these dates
`--progress-file PATH`	Write JSON progress for monitoring
`--discord-webhook URL`	Post start/milestone/finish to Discord
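
For orientation, the flags above could be declared with argparse roughly as follows. This is a sketch under assumptions: the actual scripts/batch_transcribe.py may parse its arguments differently, and build_parser and parse_dates are hypothetical names used only for illustration.

```python
import argparse
from datetime import datetime


def parse_dates(value):
    """Split a comma-separated list of YYYY-MM-DD dates and validate each one."""
    dates = [d.strip() for d in value.split(",") if d.strip()]
    for d in dates:
        datetime.strptime(d, "%Y-%m-%d")  # raises ValueError on a malformed date
    return dates


def build_parser():
    p = argparse.ArgumentParser(prog="batch_transcribe.py")
    p.add_argument("--dry-run", action="store_true",
                   help="preview without writing")
    p.add_argument("--engine", choices=["sensevoice", "whisper"],
                   default="sensevoice", help="transcription engine")
    p.add_argument("--force-dates", type=parse_dates, default=[],
                   metavar="YYYY-MM-DD,...",
                   help="delete and re-transcribe these dates")
    p.add_argument("--progress-file", metavar="PATH",
                   help="write JSON progress for monitoring")
    p.add_argument("--discord-webhook", metavar="URL",
                   help="post start/milestone/finish to Discord")
    return p


args = build_parser().parse_args(
    ["--dry-run", "--force-dates", "2026-03-07,2026-03-08"])
print(args.engine, args.dry_run, args.force_dates)
```

Validating the dates at parse time means a typo such as 2026-3-7x is rejected before any existing transcript is deleted.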

Directory layout:

daylog/
├── raw/                          # WAV input (DJI MIC 3, 48kHz/32bit, ~247MB/30min)
│   ├── TX01_MIC009_20260308_094129_orig.wav
│   └── ...
├── transcripts/                  # Output, grouped by date
│   └── 2026-03-08/
│       ├── 000_TX01_MIC009_20260308_094129_orig.txt
│       └── ...
└── notes/                        # Compiled daily notes (separate step)
    └── 2026-03-08.md

Behavior:

Groups WAV files by the date (YYYYMMDD) extracted from the filename
Sorts files by timestamp within each date so output is in chronological order
Skips already-transcribed files unless their date is listed in --force-dates
Prefixes output filenames with an index (000_, 001_, ...) to preserve sort order
Posts a Discord milestone at every 25% of progress
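
The grouping, sorting and indexing rules above can be sketched as a pure function over filenames. This is an illustration only, not the bundled script: plan_outputs is a hypothetical name, the third filename in the example is made up, and skipping files that lack a timestamp is an assumption.

```python
import re
from collections import defaultdict
from pathlib import Path

STAMP_RE = re.compile(r"(\d{8})_(\d{6})")  # YYYYMMDD_HHMMSS in the filename


def plan_outputs(wav_names, transcripts_dir="transcripts"):
    """Return {wav_name: output_path}, grouped by date and indexed chronologically."""
    by_date = defaultdict(list)
    for name in wav_names:
        m = STAMP_RE.search(name)
        if not m:
            continue  # assumption: files without a timestamp are skipped
        ymd, hms = m.groups()
        date = f"{ymd[:4]}-{ymd[4:6]}-{ymd[6:]}"
        by_date[date].append((hms, name))

    plan = {}
    for date, items in sorted(by_date.items()):
        # Sorting by HHMMSS gives chronological order within the day.
        for index, (_, name) in enumerate(sorted(items)):
            out_name = f"{index:03d}_{Path(name).stem}.txt"
            plan[name] = Path(transcripts_dir) / date / out_name
    return plan


names = [
    "TX01_MIC015_20260308_124130_orig.wav",
    "TX01_MIC009_20260308_094129_orig.wav",
    "TX01_MIC001_20260307_180000_orig.wav",  # made-up example file
]
for src, dst in plan_outputs(names).items():
    print(src, "->", dst.as_posix())
```

The skip rule would then amount to dropping every entry whose output path already exists, unless its date was passed via --force-dates.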

Output Format

[录音开始: 09:41:29]
[09:41:35] 到了，我们下车吧。
[09:41:48] 武康大楼，人好多啊。
[09:42:04] 你帮我在这里拍一张。
...
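
A downstream consumer might read this format back with a small parser like the one below. It is a sketch that assumes every transcript line starts with a bracketed HH:MM:SS stamp; the header line (录音开始, "recording started") and any other non-matching line is skipped. parse_transcript is a hypothetical helper name.

```python
import re
from datetime import time

LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$")


def parse_transcript(text):
    """Return a list of (time, text) tuples for every timestamped line."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # header line, blank line, or anything without a stamp
        hh, mm, ss, body = m.groups()
        entries.append((time(int(hh), int(mm), int(ss)), body))
    return entries


sample = """[录音开始: 09:41:29]
[09:41:35] 到了，我们下车吧。
[09:41:48] 武康大楼，人好多啊。
[09:42:04] 你帮我在这里拍一张。
"""
for stamp, body in parse_transcript(sample):
    print(stamp.isoformat(), body)
```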

Performance (Apple M4, 10-core CPU)

Metric	Value
RTF	~0.04 (25x realtime)
CPU	~1.2 cores (12%)
RAM	~1.5GB
30min WAV	~73s transcription + ~4s VAD
Accuracy	92% keyword accuracy (vs Whisper-medium 23%, turbo 38%)
Hallucinations	0 (vs Whisper hundreds per session)
Model size	234MB (vs Whisper-large-v3-turbo 1.5GB)
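
As a check on the RTF row, the real-time factor is processing time divided by audio duration. The short calculation below uses only the 30-minute timings from the table; real_time_factor is a hypothetical helper name.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; lower means faster."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


audio_s = 30 * 60        # 30-minute WAV
asr_s, vad_s = 73, 4     # timings from the table above

rtf_asr = real_time_factor(asr_s, audio_s)
rtf_total = real_time_factor(asr_s + vad_s, audio_s)

print(f"ASR only:  RTF {rtf_asr:.3f}, {1 / rtf_asr:.1f}x realtime")
print(f"ASR + VAD: RTF {rtf_total:.3f}, {1 / rtf_total:.1f}x realtime")
# ASR only:  RTF 0.041, 24.7x realtime
# ASR + VAD: RTF 0.043, 23.4x realtime
```

Both figures round to the quoted ~0.04, so the "25x realtime" claim holds whether or not the separate VAD pass is counted.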

vs Old Whisper Skill

	Whisper (old)	SenseVoice (new)
Model	mlx-whisper-medium	SenseVoice-Small (234MB)
Accuracy	23-38%	92%
Hallucinations	Hundreds per session	0
Timestamp	Per-word (~2-4s)	VAD-anchored (~15s)
Duplicate lines	~23%	<0.2%
Chinese support	Weak	Native (Mandarin-optimized)

Emoji Note

SenseVoice appends emotion tags (😊😔😡😮) to segments. These are model artifacts reflecting detected speech emotion, not literal emoji in the audio. Downstream consumers (note compilation) should ignore or strip them.
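
Stripping the tags can be a one-line regex substitution. The sketch below removes only the four emoji named above; the model may emit others, so the set is an assumption to extend as needed, and strip_emotion_tags is a hypothetical helper name.

```python
import re

# Only the four emotion emoji listed in this note; extend if others appear.
EMOTION_EMOJI = "😊😔😡😮"
EMOTION_RE = re.compile(f"[{EMOTION_EMOJI}]")


def strip_emotion_tags(line: str) -> str:
    """Remove emotion emoji from a transcript line and trim trailing space."""
    return EMOTION_RE.sub("", line).rstrip()


print(strip_emotion_tags("[09:41:48] 武康大楼，人好多啊。😊"))
```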
