🧪 Skills
SenseVoice Transcribe
Transcribe audio files (WAV/MP3/M4A/FLAC) to timestamped text using SenseVoice-Small + FSMN-VAD. Supports single-file and batch mode with VAD-anchored per-se...
v1.0.0
Description
name: sensevoice-transcribe description: Transcribe audio files (WAV/MP3/M4A/FLAC) to timestamped text using SenseVoice-Small + FSMN-VAD. Supports single-file and batch mode with VAD-anchored per-segment timestamps (~15s granularity). Use when the user wants to transcribe speech/audio, run batch transcription on daylog recordings, or re-transcribe specific dates. Replaces the old whisper-transcribe skill.
SenseVoice Transcribe
Transcribe audio to timestamped text using FunASR's iic/SenseVoiceSmall model with fsmn-vad for timestamp anchoring.
Pipeline
- FSMN-VAD segments audio into speech regions (~258 segments for 30min file)
- SenseVoice-Small transcribes full audio with
merge_vad=True - Raw text is split by
<|zh|>tags → cleaned viarich_transcription_postprocess() - Text segments are proportionally mapped to VAD timestamps
- Output:
[HH:MM:SS] textper line, ~15s granularity
Environment
Venv: ~/.openclaw/venvs/sensevoice/
Python: 3.12
Key packages: funasr==1.3.1, modelscope, onnxruntime
Model cache: ~/.cache/modelscope/hub/models/iic/SenseVoiceSmall
VAD cache: ~/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch
First-time Setup
python3 -m venv ~/.openclaw/venvs/sensevoice
source ~/.openclaw/venvs/sensevoice/bin/activate
pip install funasr modelscope onnxruntime
# Models auto-download on first run (~234MB SenseVoice + ~4MB VAD)
Usage
Single File
source ~/.openclaw/venvs/sensevoice/bin/activate
python3 -c "
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
from datetime import datetime, timedelta
import re
wav = '<WAV_PATH>'
# Parse start time from filename: TX01_MIC015_20260308_124130_orig.wav
m = re.search(r'(\d{8})_(\d{6})', wav)
start_dt = datetime.strptime(m.group(1)+m.group(2), '%Y%m%d%H%M%S') if m else None
vad_model = AutoModel(model='fsmn-vad', disable_update=True)
model = AutoModel(model='iic/SenseVoiceSmall', vad_model='fsmn-vad',
vad_kwargs={'max_single_segment_time': 30000}, device='cpu')
vad_segs = vad_model.generate(input=wav)[0].get('value', [])
res = model.generate(input=wav, cache={}, language='zh', use_itn=True,
batch_size_s=60, merge_vad=True)
texts = [rich_transcription_postprocess(s).strip()
for s in re.split(r'<\|zh\|>', res[0]['text']) if s.strip()]
texts = [s for s in texts if len(s) > 1]
ratio = len(vad_segs) / len(texts) if texts else 1
for i, t in enumerate(texts):
vi = min(int(i * ratio), len(vad_segs)-1)
ts = (start_dt + timedelta(milliseconds=vad_segs[vi][0])).strftime('%H:%M:%S') if start_dt else f'{vad_segs[vi][0]//1000:.0f}s'
print(f'[{ts}] {t}')
"
Batch Mode (daylog)
The bundled scripts/batch_transcribe.py handles the full daylog pipeline:
source ~/.openclaw/venvs/sensevoice/bin/activate
cd ~/Documents/dec/daylog
# Dry run — see what would be transcribed
python3 scripts/batch_transcribe.py --dry-run
# Transcribe all new files
python3 scripts/batch_transcribe.py
# Re-transcribe specific dates (deletes existing, then re-runs)
python3 scripts/batch_transcribe.py --force-dates 2026-03-07,2026-03-08
# With progress file + Discord webhook
python3 scripts/batch_transcribe.py \
--progress-file /tmp/daylog-progress.json \
--discord-webhook https://discord.com/api/webhooks/...
Flags:
| Flag | Description |
|---|---|
--dry-run |
Preview without writing |
--engine sensevoice|whisper |
Engine (default: sensevoice) |
--force-dates YYYY-MM-DD,... |
Delete & re-transcribe these dates |
--progress-file PATH |
Write JSON progress for monitoring |
--discord-webhook URL |
Post start/milestone/finish to Discord |
Directory layout:
daylog/
├── raw/ # WAV input (DJI MIC 3, 48kHz/32bit, ~247MB/30min)
│ ├── TX01_MIC009_20260308_094129_orig.wav
│ └── ...
├── transcripts/ # Output, grouped by date
│ └── 2026-03-08/
│ ├── 000_TX01_MIC009_20260308_094129_orig.txt
│ └── ...
└── notes/ # Compiled daily notes (separate step)
└── 2026-03-08.md
Behavior:
- Groups WAV files by date extracted from filename (
YYYYMMDD) - Sorts by timestamp within each date for correct chronological order
- Skips already-transcribed files unless
--force-dates - Indexed output filenames (
000_,001_, ...) for sort order - Discord milestones every 25% progress
Output Format
[录音开始: 09:41:29]
[09:41:35] 到了,我们下车吧。
[09:41:48] 武康大楼,人好多啊。
[09:42:04] 你帮我在这里拍一张。
...
Performance (Apple M4, 10-core CPU)
| Metric | Value |
|---|---|
| RTF | ~0.04 (25x realtime) |
| CPU | ~1.2 cores (12%) |
| RAM | ~1.5GB |
| 30min WAV | ~73s transcription + ~4s VAD |
| Accuracy | 92% keyword accuracy (vs Whisper-medium 23%, turbo 38%) |
| Hallucinations | 0 (vs Whisper hundreds per session) |
| Model size | 234MB (vs Whisper-large-v3-turbo 1.5GB) |
vs Old Whisper Skill
| Whisper (old) | SenseVoice (new) | |
|---|---|---|
| Model | mlx-whisper-medium | SenseVoice-Small (234MB) |
| Accuracy | 23-38% | 92% |
| Hallucinations | Hundreds per session | 0 |
| Timestamp | Per-word (~2-4s) | VAD-anchored (~15s) |
| Duplicate lines | ~23% | <0.2% |
| Chinese support | Weak | Native (Mandarin-optimized) |
Emoji Note
SenseVoice appends emotion tags (😊😔😡😮) to segments. These are model artifacts reflecting detected speech emotion, not literal emoji in the audio. Downstream consumers (note compilation) should ignore or strip them.
Reviews (0)
Sign in to write a review.
No reviews yet. Be the first to review!
Comments (0)
No comments yet. Be the first to share your thoughts!