---
name: mlx-local-inference
description: >
  Full local AI inference stack on Apple Silicon Macs via MLX. Includes:
  LLM chat (Qwen3-14B, Gemma3-12B), speech-to-text ASR (Qwen3-ASR, Whisper),
  text embeddings (Qwen3-Embedding 0.6B/4B), OCR (PaddleOCR-VL), TTS
  (Qwen3-TTS), and an automatic transcription daemon with LLM correction.
  All models run locally via MLX with OpenAI-compatible APIs. Use when the
  user needs local AI capabilities: text generation, speech recognition,
  embeddings/vector search, OCR, text-to-speech, or batch audio
  transcription — without cloud API calls.
metadata: { "openclaw": { "os": ["darwin"], "requires": { "anyBins": ["python3"] } } }
---

# MLX Local Inference Stack

Full local AI inference on Apple Silicon Macs. All services expose OpenAI-compatible APIs.

## Services Overview

| Service | Port | Access | Models |
|---|---|---|---|
| LLM + Whisper + Embedding | 8787 | LAN (0.0.0.0) | qwen3-14b, gemma-3-12b, whisper-large-v3-turbo, qwen3-embedding-0.6b/4b |
| ASR (Qwen3-ASR) | 8788 | localhost only | Qwen3-ASR-1.7B-8bit |
| Transcribe Daemon | n/a | file-based | Uses ASR + LLM |

LaunchAgents: `com.mlx-server` (8787), `com.mlx-audio-server` (8788), `com.mlx-transcribe-daemon`


## 1. LLM — Local Chat Completions

### Models

| Model ID | Params | Best For |
|---|---|---|
| qwen3-14b | 14B, 4-bit | Chinese, deep reasoning (built-in think mode) |
| gemma-3-12b | 12B, 4-bit | English, code generation |

### API

```bash
curl -X POST http://localhost:8787/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "max_tokens": 2048
  }'
```

Add `"stream": true` for streaming (see the Python streaming sketch below).

### Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8787/v1", api_key="unused")
response = client.chat.completions.create(
    model="qwen3-14b",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
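
The streaming variant uses the same client with `stream=True`; a minimal sketch following the standard OpenAI SDK iteration pattern:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8787/v1", api_key="unused")
stream = client.chat.completions.create(
    model="qwen3-14b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue  # some servers send housekeeping chunks with no choices
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```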

### Qwen3 Think Mode

Qwen3 may include `<think>...</think>` chain-of-thought tags in its output. Strip them:

```python
import re

text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
```

### Model Selection Guide

| Scenario | Recommended |
|---|---|
| Chinese text | qwen3-14b |
| Cantonese | qwen3-14b |
| English writing | gemma-3-12b |
| Code generation | Either |
| Deep reasoning | qwen3-14b (think mode) |
| Quick Q&A | gemma-3-12b |

## 2. ASR — Speech-to-Text

### Qwen3-ASR (best for Chinese/Cantonese)

```bash
curl -X POST http://127.0.0.1:8788/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=mlx-community/Qwen3-ASR-1.7B-8bit" \
  -F "language=zh"
```

### Whisper (multilingual, 99 languages)

```bash
curl -X POST http://localhost:8787/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-large-v3-turbo"
```

### ASR Model Comparison

| Capability | Qwen3-ASR (port 8788) | Whisper (port 8787) |
|---|---|---|
| Chinese/Cantonese | Strong | Average |
| Multilingual | No | Yes (99 langs) |
| LAN access | No (localhost) | Yes |
| Loading | On-demand | Always loaded |

### Supported audio formats

`wav`, `mp3`, `m4a`, `flac`, `ogg`, `webm`

### Long audio

Split long recordings into 10-minute chunks first:

```bash
ffmpeg -y -ss 0 -t 600 -i long.wav -ar 16000 -ac 1 chunk_000.wav
```
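
That command extracts only the first chunk. A minimal Python sketch that chunks a whole file and transcribes each piece against the Qwen3-ASR endpoint above (the helper name and the OpenAI-style `"text"` response field are assumptions):

```python
import subprocess

import requests

CHUNK_SECONDS = 600  # 10-minute chunks, matching the ffmpeg example
ASR_URL = "http://127.0.0.1:8788/v1/audio/transcriptions"

def transcribe_long(path: str, total_seconds: int) -> str:
    """Split `path` into 16 kHz mono chunks, transcribe each, join the text."""
    parts = []
    for i, start in enumerate(range(0, total_seconds, CHUNK_SECONDS)):
        chunk = f"chunk_{i:03d}.wav"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(CHUNK_SECONDS),
             "-i", path, "-ar", "16000", "-ac", "1", chunk],
            check=True, capture_output=True,
        )
        with open(chunk, "rb") as f:
            resp = requests.post(
                ASR_URL,
                files={"file": f},
                data={"model": "mlx-community/Qwen3-ASR-1.7B-8bit",
                      "language": "zh"},
            )
        resp.raise_for_status()
        parts.append(resp.json()["text"])  # OpenAI-style {"text": ...}
    return "\n".join(parts)
```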

## 3. Embeddings — Text Vectorization

### Models

| Model ID | Size | Use Case |
|---|---|---|
| qwen3-embedding-0.6b | 0.6B, 4-bit | Fast retrieval, low latency |
| qwen3-embedding-4b | 4B, 4-bit | High-accuracy semantic matching |

### API

```bash
curl -X POST http://localhost:8787/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-embedding-0.6b", "input": "text to embed"}'
```

### Batch

```bash
curl -X POST http://localhost:8787/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-embedding-4b", "input": ["text 1", "text 2"]}'
```

## 4. OCR — Image Text Extraction

### Default Model: PaddleOCR-VL-1.5-6bit

| Item | Value |
|---|---|
| Model ID | paddleocr-vl-6bit |
| Speed | ~185 t/s |
| Memory | ~3.3 GB |
| Prompt | `OCR:` |

### CLI

```bash
~/.mlx-server/venv/bin/python -m mlx_vlm.generate \
  --model mlx-community/PaddleOCR-VL-1.5-6bit \
  --image image.jpg \
  --prompt "OCR:" \
  --max-tokens 512 --temp 0.0
```

### Python

```python
from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("mlx-community/PaddleOCR-VL-1.5-6bit")
config = load_config("mlx-community/PaddleOCR-VL-1.5-6bit")
prompt = apply_chat_template(processor, config, "OCR:", num_images=1)
out = generate(model, processor, prompt, "image.jpg",
               max_tokens=512, temperature=0.0, verbose=False)
print(out.text if hasattr(out, "text") else out)
```

### Notes

- Prompt must be exactly `OCR:` for PaddleOCR-VL
- Use `temperature=0.0` for deterministic output
- RGBA images must be converted to RGB first (see the snippet after this list)
- Venv: `~/.mlx-server/venv`
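
A minimal RGBA-to-RGB conversion sketch with Pillow (assumed to be available in the venv):

```python
from PIL import Image

img = Image.open("image.png")
if img.mode == "RGBA":
    # Flatten the alpha channel onto a white background before OCR.
    background = Image.new("RGB", img.size, (255, 255, 255))
    background.paste(img, mask=img.split()[3])  # use alpha as the paste mask
    img = background
img.save("image_rgb.jpg")
```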

## 5. TTS — Text-to-Speech

### Model: Qwen3-TTS (cached, not auto-served)

| Item | Value |
|---|---|
| Model | Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit |
| Memory | ~2 GB |
| Feature | Custom voice cloning |

### CLI

```bash
~/.mlx-server/venv/bin/mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
  --text "你好,这是一段测试语音"  # "Hello, this is a test voice clip"
```

### As API (via mlx_audio.server on port 8788)

```bash
curl -X POST http://127.0.0.1:8788/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    "input": "你好世界"
  }' --output speech.wav
```
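
The same endpoint works from Python with `requests`; a sketch, assuming the response body is raw audio bytes exactly as in the curl example:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8788/v1/audio/speech",
    json={
        "model": "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
        "input": "你好世界",  # "Hello, world"
    },
)
resp.raise_for_status()
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```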

## 6. Transcribe Daemon — Automatic Batch Transcription

Drop audio files into `~/transcribe/` for automatic processing (a sketch of the flow follows the steps below):

1. Daemon detects the file (polls every 15 s)
2. Phase 1: transcribe via Qwen3-ASR → `filename_raw.md`
3. Phase 2: correct via the Qwen3-14B LLM → `filename_corrected.md`
4. Results are moved to `~/transcribe/done/`
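
The daemon ships with the stack; purely to illustrate the two-phase flow, a minimal polling loop might look like this (the correction prompt and file handling here are assumptions, not the daemon's actual code):

```python
import time
from pathlib import Path

import requests

WATCH = Path.home() / "transcribe"
DONE = WATCH / "done"
AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}

def transcribe(path: Path) -> str:
    with open(path, "rb") as f:
        resp = requests.post(
            "http://127.0.0.1:8788/v1/audio/transcriptions",
            files={"file": f},
            data={"model": "mlx-community/Qwen3-ASR-1.7B-8bit"},
        )
    resp.raise_for_status()
    return resp.json()["text"]

def correct(raw: str) -> str:
    resp = requests.post(
        "http://localhost:8787/v1/chat/completions",
        json={
            "model": "qwen3-14b",
            "messages": [
                {"role": "system",
                 "content": "Fix homophones, keep Cantonese characters, "
                            "add punctuation, remove filler words."},
                {"role": "user", "content": raw},
            ],
        },
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

DONE.mkdir(exist_ok=True)
while True:
    for audio in WATCH.iterdir():
        if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTS:
            continue
        raw = transcribe(audio)                          # Phase 1: ASR
        (DONE / f"{audio.stem}_raw.md").write_text(raw)
        fixed = correct(raw)                             # Phase 2: LLM fix-up
        (DONE / f"{audio.stem}_corrected.md").write_text(fixed)
        audio.rename(DONE / audio.name)
    time.sleep(15)  # the real daemon polls every 15 s
```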

### LLM Correction Rules

- Fix homophone errors (的/得/地, 在/再)
- Preserve Cantonese characters (嘅、唔、咁、喺、冇、佢)
- Add punctuation and paragraph breaks
- Remove filler words

### Supported formats

`wav`, `mp3`, `m4a`, `flac`, `ogg`, `webm`


## Service Management

```bash
# LLM + Whisper + Embedding server (port 8787)
launchctl kickstart -k gui/$(id -u)/com.mlx-server

# ASR server (port 8788)
launchctl kickstart -k gui/$(id -u)/com.mlx-audio-server

# Transcribe daemon
launchctl kickstart gui/$(id -u)/com.mlx-transcribe-daemon

# Logs
tail -f ~/.mlx-server/logs/server.log
tail -f ~/.mlx-server/logs/mlx-audio-server.err.log
tail -f ~/.mlx-server/logs/transcribe-daemon.err.log
```
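
A quick health check is to list the models the 8787 server reports (whether this build exposes the standard OpenAI `/v1/models` route is an assumption):

```python
import requests

try:
    resp = requests.get("http://localhost:8787/v1/models", timeout=5)
    resp.raise_for_status()
    print("mlx-server up:", [m["id"] for m in resp.json()["data"]])
except requests.RequestException as exc:
    print("mlx-server down:", exc)
```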

## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- Python 3.10+ with `mlx`, `mlx-lm`, `mlx-audio`, `mlx-vlm`
- Recommended: 32 GB+ RAM for running multiple models
