WeChat Work Doc Fetcher
Fetch and convert WeChat Work developer docs pages into clean Markdown files for use in Obsidian, handling SPA content and required authentication.
Description
wecom-doc-fetcher
Use this skill when the user wants to save any page from the WeChat Work (企业微信) developer documentation site (developer.work.weixin.qq.com/document/path/*) as a clean Markdown file in their Obsidian vault.
Files in this skill
wecom-doc-fetcher/
├── SKILL.md # this file
└── wx_doc_fetch.py # the fetch & convert script
Setup (one-time)
Run these once before using the skill:
pip install requests playwright
playwright install chromium
playwright install chromium downloads a ~150 MB headless Chromium binary. This is required for automatic doc_id detection.
Python 3.8+ is required.
Usage
Place wx_doc_fetch.py anywhere convenient (e.g. your vault's scripts folder), then run:
# Basic: auto-detect doc_id, print to stdout
python wx_doc_fetch.py <URL>
# Save to file
python wx_doc_fetch.py <URL> output.md
# Skip Playwright, supply doc_id manually
python wx_doc_fetch.py <URL> output.md --doc-id <integer>
# Override cookies at runtime
python wx_doc_fetch.py <URL> output.md --cookies "wwapidoc.sid=xxx; ..."
Example
python wx_doc_fetch.py https://developer.work.weixin.qq.com/document/path/94677 发送消息.md
# [info] path_id=94677 doc_id=31152
# [done] 已写入:发送消息.md
How It Works
The WeChat Work docs site is a Vue SPA — the visible content is not in the initial HTML. It is loaded at runtime via a private POST API:
POST https://developer.work.weixin.qq.com/docFetch/fetchCnt?lang=zh_CN&ajax=1&f=json
Body: doc_id=<integer> (application/x-www-form-urlencoded)
The response includes data.content_md — the page content as a Markdown string. The script fetches this field, cleans it, and writes the result.
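The request described above can be sketched roughly as follows. This is a minimal illustration, not the code in wx_doc_fetch.py: the function names are hypothetical, it assumes the requests library from the setup step, and it assumes the decoded JSON nests the Markdown under data.content_md as stated. Real calls also need valid session cookies.

```python
from typing import Dict, Optional

# Endpoint as given in the text above.
FETCH_URL = (
    "https://developer.work.weixin.qq.com/docFetch/fetchCnt"
    "?lang=zh_CN&ajax=1&f=json"
)


def extract_content_md(payload: dict) -> Optional[str]:
    """Return data.content_md from a decoded fetchCnt response, or None if absent."""
    data = payload.get("data")
    if not isinstance(data, dict):
        return None
    return data.get("content_md")


def fetch_content_md(doc_id: int, cookies: Optional[Dict[str, str]] = None) -> Optional[str]:
    """POST doc_id as a form field and return the Markdown body, if present."""
    # Third-party dependency, imported lazily so extract_content_md has none.
    import requests

    resp = requests.post(
        FETCH_URL,
        data={"doc_id": doc_id},  # a dict is sent as application/x-www-form-urlencoded
        cookies=cookies,
        timeout=15,
    )
    resp.raise_for_status()
    return extract_content_md(resp.json())
```

Keeping the JSON handling in its own function means the field lookup can be checked against a saved response without touching the network.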
Why not WebFetch / defuddle?
The page renders client-side. WebFetch and defuddle only see the pre-JS HTML skeleton — no content. Scraping innerText via browser tools works but produces a very large accessibility tree with poor formatting. The content_md API field is the cleanest, most token-efficient source.
URL path ID ≠ doc_id
The number in the browser URL (e.g. 94677) is a routing slug — not the doc_id the API needs. The actual doc_id (e.g. 31152) is determined at runtime by loading the page with Playwright and intercepting the fetchCnt XHR request.
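One way the interception could look, as a hedged sketch rather than the script's actual implementation. It assumes Playwright's sync API and that the SPA issues the fetchCnt POST while the page loads. The helper names and the 15-second timeout are illustrative choices.

```python
import re
from typing import Optional
from urllib.parse import parse_qs


def path_id_from_url(url: str) -> Optional[str]:
    """Extract the routing slug from a /document/path/<id> URL."""
    match = re.search(r"/document/path/(\d+)", url)
    return match.group(1) if match else None


def doc_id_from_post_body(body: Optional[str]) -> Optional[int]:
    """Read doc_id from a form-urlencoded body such as 'doc_id=31152'."""
    values = parse_qs(body or "").get("doc_id")
    if not values or not values[0].isdigit():
        return None
    return int(values[0])


def detect_doc_id(url: str, timeout_ms: int = 15000) -> Optional[int]:
    """Load the page headlessly and capture doc_id from the first fetchCnt POST."""
    # Third-party dependency, imported lazily so the parsers above work without it.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # expect_request waits (up to timeout_ms) for a matching request
            # triggered by the navigation inside the with-block.
            with page.expect_request(
                lambda req: "fetchCnt" in req.url and req.method == "POST",
                timeout=timeout_ms,
            ) as request_info:
                page.goto(url)
            return doc_id_from_post_body(request_info.value.post_data)
        finally:
            browser.close()
```

If no matching request appears within the timeout, Playwright raises a timeout error, which is the case the manual fallback covers.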
Manual doc_id Fallback
If Playwright is unavailable or times out:
- Open the target URL in Chrome
- DevTools → Network tab → filter by fetchCnt
- Click the request → Payload tab
- Read the doc_id value
- Pass it with --doc-id:
python wx_doc_fetch.py https://developer.work.weixin.qq.com/document/path/94677 发送消息.md --doc-id 31152
Cookie Configuration
The fetchCnt API requires an authenticated session. Playwright's headless browser obtains session cookies automatically when loading the page — no manual cookie setup needed for normal use.
If you see errCode: -30001 in the output, the session is rejected. Fix:
- Open the site in Chrome while logged in
- DevTools → Network → any fetchCnt request → Copy as cURL
- Find the -b '...' cookie string in the copied command
- Either paste it into COOKIES_RAW at the top of wx_doc_fetch.py, or pass it via --cookies "..."
Key cookies and their lifetimes:
| Cookie | Purpose | Lifetime |
|---|---|---|
| wwapidoc.sid | Session identifier | ~24 hours |
| wwapidoc.token_wt | JWT auth token | ~30 minutes |
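A cookie string copied from a cURL command has the form name=value; name2=value2. The sketch below shows one way to turn it into a dict that can be passed to requests as cookies=. The function name is hypothetical, and how wx_doc_fetch.py actually parses COOKIES_RAW is not shown in this document.

```python
from typing import Dict


def parse_cookie_string(raw: str) -> Dict[str, str]:
    """Turn 'name=value; name2=value2' into a {name: value} dict."""
    cookies: Dict[str, str] = {}
    for part in raw.split(";"):
        # Split on the first '=' only, since values (e.g. tokens) may contain '='.
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    print(parse_cookie_string("wwapidoc.sid=xxx; wwapidoc.token_wt=yyy"))
```

Because wwapidoc.token_wt lasts only about 30 minutes, a pasted cookie string is likely to need refreshing more often than the session identifier.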
API Reference
| Item | Detail |
|---|---|
| Endpoint | POST /docFetch/fetchCnt?lang=zh_CN&ajax=1&f=json&random=<rand> |
| Body | doc_id=<integer> (form-urlencoded) |
| Auth | Session cookies |
| Key response field | data.content_md |
| Other response fields | data.content_html, data.content_html_v2, data.content_txt, data.title, data.time |
content_md Cleaning Rules
The content_md field is mostly valid CommonMark but has site-specific issues. The clean_md() function in wx_doc_fetch.py handles all of them:
| # | Problem | Raw example | After cleaning |
|---|---|---|---|
| 1 | [TOC] marker at top | [TOC]\n# 概述 | # 概述 |
| 2 | Heading missing space after # | ##接口定义 | ## 接口定义 |
| 3 | Internal numeric anchor links | [接收事件](#12977) | 接收事件 |
| 3 | Anchors with sub-path | [开启API](#31106/如何开启API) | 开启API |
| 4 | HTML line breaks inside table cells | 说明</br>补充 | 说明 补充 |
| 5 | <b> bold tags | <b>注意</b> | **注意** |
| 6 | <code> inline tags | <code>open_kfid</code> | `open_kfid` |
| 7 | <font> color tags | <font color="red">警告</font> | 警告 |
| 8 | !!#rrggbb text!! site-specific highlight | !!#ff0000 重要!! | 重要 |
| 9 | Leading spaces before table rows | ··\| 参数 \| | \| 参数 \| |
| 10 | No blank line before table (Obsidian won't render) | 文字\n\| col \| | 文字\n\n\| col \| |
| 11 | Excess blank lines | 3+ \n in a row | 2 \n max |
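The rules in the table can be approximated with a chain of regular-expression substitutions. The sketch below is an illustration only: the real clean_md() is not reproduced in this document, so the function name clean_md_sketch and every pattern except the rule 10 expression are assumptions. It also does not protect fenced code blocks, which a production version would probably need to do.

```python
import re


def clean_md_sketch(content: str) -> str:
    """Apply approximations of cleaning rules 1-11 in order."""
    # 1. Drop the [TOC] marker line.
    content = re.sub(r"^\[TOC\][ \t]*\n?", "", content, flags=re.MULTILINE)
    # 2. Insert the missing space after heading hashes.
    content = re.sub(r"^(#{1,6})([^#\s])", r"\1 \2", content, flags=re.MULTILINE)
    # 3. Replace internal numeric anchor links (with or without sub-path) by their text.
    content = re.sub(r"\[([^\]]+)\]\(#\d+[^)]*\)", r"\1", content)
    # 4. Turn <br>, <br/> and the site's </br> into a single space.
    content = re.sub(r"</?br\s*/?>", " ", content, flags=re.IGNORECASE)
    # 5. <b>text</b> becomes **text**.
    content = re.sub(r"<b>(.*?)</b>", r"**\1**", content, flags=re.IGNORECASE | re.DOTALL)
    # 6. <code>text</code> becomes `text`.
    content = re.sub(r"<code>(.*?)</code>", r"`\1`", content, flags=re.IGNORECASE | re.DOTALL)
    # 7. Strip <font ...> wrappers, keeping the inner text.
    content = re.sub(r"<font[^>]*>(.*?)</font>", r"\1", content, flags=re.IGNORECASE | re.DOTALL)
    # 8. Strip the site-specific !!#rrggbb text!! highlight.
    content = re.sub(r"!!#[0-9a-fA-F]{6}\s*(.*?)!!", r"\1", content)
    # 9. Remove leading whitespace before table rows.
    content = re.sub(r"^[ \t]+(\|)", r"\1", content, flags=re.MULTILINE)
    # 10. Ensure a blank line before the first row of a table.
    content = re.sub(r"^([^|\n][^\n]*)\n(\|)", r"\1\n\n\2", content, flags=re.MULTILINE)
    # 11. Collapse runs of three or more newlines down to two.
    content = re.sub(r"\n{3,}", "\n\n", content)
    return content
```

Order matters in this sketch: rule 9 runs before rule 10 so that indented table rows are already flush left when the start-of-line check is applied.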
Rule 10 — critical regex note
The blank-line-before-table rule must match on lines that don't start with |, not just on the trailing character of the previous line:
# CORRECT — matches on start of line, avoids breaking table rows apart
re.sub(r"^([^|\n][^\n]*)\n(\|)", r"\1\n\n\2", content, flags=re.MULTILINE)
# WRONG: keys on the last character of the previous line. [^\n] matches any character,
# including the "|" or trailing space that ends a table row, so blank lines are inserted between every table row
re.sub(r"([^\n])\n(\|)", r"\1\n\n\2", content)
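The difference between the two patterns can be checked on a small sample. In this sketch the sample text and function names are made up for illustration, and only the two regular expressions come from the note above.

```python
import re

# A paragraph followed directly by a table whose rows end with "| " (trailing space).
SAMPLE = "Intro text\n| a | b | \n| - | - | \n| 1 | 2 | "


def add_blank_line_correct(content: str) -> str:
    """Insert a blank line only after a line that does not start with '|'."""
    return re.sub(r"^([^|\n][^\n]*)\n(\|)", r"\1\n\n\2", content, flags=re.MULTILINE)


def add_blank_line_wrong(content: str) -> str:
    """Insert a blank line before every line that starts with '|'."""
    return re.sub(r"([^\n])\n(\|)", r"\1\n\n\2", content)


if __name__ == "__main__":
    print(add_blank_line_correct(SAMPLE))
    print("---")
    print(add_blank_line_wrong(SAMPLE))
```

With the first pattern the table stays contiguous and gains a single blank line above it. With the second, every row is separated from the next, which breaks the table.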