Korean Scraper
Description
name: korean-scraper
description: Korean website specialized scraper with anti-bot protection (Naver, Coupang, Daum, Instagram)
version: 1.0.0
author: 무펭이 🐧
korean-scraper
A scraper specialized for Korean websites. Built on Playwright, it extracts structured data from major Korean sites such as Naver, Coupang, and Daum. Includes features for getting past anti-bot protection.
When to Use
- Collecting Naver Blog search results or extracting the body of a specific blog post
- Scraping popular or recent posts from a Naver Cafe
- Collecting Coupang product information (price, reviews, star rating)
- Extracting article bodies from Naver News or Daum News
- Automated data collection from Korean sites
Installation
cd skills/korean-scraper
npm install
npx playwright install chromium
Quick Start
Naver Blog
# Collect search results (query: "restaurant recommendations")
node scripts/naver-blog.js search "맛집 추천" --limit 10
# Extract the body of a specific blog post
node scripts/naver-blog.js extract "https://blog.naver.com/..."
Naver Cafe
# Collect popular posts
node scripts/naver-cafe.js popular "CAFE_URL" --limit 20
# Collect recent posts
node scripts/naver-cafe.js recent "CAFE_URL" --limit 20
Coupang products
# Extract product information
node scripts/coupang.js product "PRODUCT_URL"
# Collect search results (query: "wireless earphones")
node scripts/coupang.js search "무선이어폰" --limit 20
Naver News
# Collect search results
node scripts/naver-news.js search "AI" --limit 10
# Extract an article body
node scripts/naver-news.js extract "https://n.news.naver.com/..."
Daum News
# Collect search results (query: "economy")
node scripts/daum-news.js search "경제" --limit 10
# Extract an article body
node scripts/daum-news.js extract "https://v.daum.net/..."
Output Format
All scripts return structured JSON:
Naver Blog search
{
  "status": "success",
  "query": "맛집 추천",
  "count": 10,
  "results": [
    {
      "title": "서울 강남 맛집 추천 BEST 5",
      "url": "https://blog.naver.com/...",
      "blogger": "맛집탐험가",
      "date": "2026-02-15",
      "snippet": "강남역 근처 숨은 맛집들을..."
    }
  ]
}
Naver Blog post body
{
  "status": "success",
  "url": "https://blog.naver.com/...",
  "title": "서울 강남 맛집 추천 BEST 5",
  "author": "맛집탐험가",
  "date": "2026-02-15",
  "content": "# 서울 강남 맛집 추천 BEST 5\n\n1. ...",
  "images": ["https://..."],
  "tags": ["맛집", "강남", "서울"]
}
Coupang product
{
  "status": "success",
  "url": "https://www.coupang.com/...",
  "productName": "애플 에어팟 프로 2세대",
  "price": 299000,
  "originalPrice": 359000,
  "discount": "17%",
  "rating": 4.8,
  "reviewCount": 1523,
  "rocketDelivery": true,
  "seller": "쿠팡",
  "images": ["https://..."]
}
Naver Cafe
{
  "status": "success",
  "cafeUrl": "https://cafe.naver.com/...",
  "type": "popular",
  "count": 20,
  "posts": [
    {
      "title": "신입회원 인사드립니다",
      "url": "https://cafe.naver.com/.../12345",
      "author": "닉네임",
      "date": "2026-02-17",
      "views": 523,
      "comments": 12
    }
  ]
}
News article
{
  "status": "success",
  "url": "https://n.news.naver.com/...",
  "title": "AI 시장 규모 급성장 전망",
  "media": "조선일보",
  "author": "홍길동 기자",
  "date": "2026-02-17 09:30",
  "content": "# AI 시장 규모 급성장 전망\n\n...",
  "category": "IT/과학",
  "images": ["https://..."]
}
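Since every sample above carries a `status` field and puts list data under either `results` or `posts`, a caller could normalize the output with a small helper. This is a minimal sketch: `parseScraperOutput` is a hypothetical name, and it assumes any `status` other than `"success"` means failure, which the samples do not confirm.

```javascript
// Hypothetical helper: parse a script's stdout and normalize the item list.
// Assumption: any status other than "success" is treated as a failure.
function parseScraperOutput(stdout) {
  const data = JSON.parse(stdout);
  if (data.status !== 'success') {
    throw new Error(`Scraper reported status: ${data.status}`);
  }
  // Search output uses `results`, cafe listings use `posts`;
  // single-page output (post, product, article) is wrapped in an array.
  const items = data.results ?? data.posts ?? [data];
  return { data, items };
}

// Usage with a shortened version of the blog search sample:
const sample = '{"status":"success","query":"맛집 추천","count":1,' +
  '"results":[{"title":"서울 강남 맛집 추천 BEST 5"}]}';
console.log(parseScraperOutput(sample).items.length);
```

Wrapping single-page output in an array lets downstream code loop over `items` regardless of which script produced it.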
Anti-Bot Features
- Hides navigator.webdriver: evades automation detection
- Uses real User-Agent strings: randomly picks mobile or desktop
- Mimics human behavior: random delays and scrolling
- Stealth plugin: Playwright extra stealth
- Cloudflare bypass: wait time is adjusted automatically
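The first two items in the list above can be illustrated in a few lines. This is only a sketch of the general technique, not the scripts' actual code: `hideWebdriver` and `pickRandom` are hypothetical names, and the real scripts may rely on the stealth plugin instead.

```javascript
// Hypothetical sketch. Meant to run inside the page before any site script.
// `nav` defaults to the browser's global `navigator`; the parameter exists so
// the function can also be exercised against a plain object.
function hideWebdriver(nav = navigator) {
  Object.defineProperty(nav, 'webdriver', {
    get: () => undefined,
    configurable: true,
  });
}

// Pick one entry from a pool, e.g. a list of User-Agent strings.
// `rng` is injectable so the choice can be made deterministic in tests.
function pickRandom(pool, rng = Math.random) {
  return pool[Math.floor(rng() * pool.length)];
}

// With Playwright, the function could be registered on a browser context
// before pages load (assumed wiring, shown as a comment only):
//   await context.addInitScript(hideWebdriver);
```

Overriding the property with a getter that returns `undefined` matches what an ordinary, non-automated browser reports.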
Rate Limiting
By default, all scripts go easy on the target site:
- Random delay of 2-5 seconds per request
- At most 1 request per second to the same domain
- Automatic backoff on a 429 response
The --fast flag reduces the delays (use with caution).
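The delay policy above could be sketched as follows. The function names and the backoff constants (5-second base, 60-second cap, doubling per attempt) are assumptions made for illustration; the source only states that delays are random within 2-5 seconds and that a 429 triggers a backoff.

```javascript
// Hypothetical sketch of the delay policy; not the scripts' actual code.

// Uniform random delay in [minMs, maxMs]; `rng` is injectable for testing.
function randomDelayMs(minMs = 2000, maxMs = 5000, rng = Math.random) {
  return Math.floor(minMs + rng() * (maxMs - minMs + 1));
}

// Exponential backoff after a 429: baseMs * 2^attempt, capped at capMs.
// The base and cap values are assumed, not taken from the scripts.
function backoffDelayMs(attempt, baseMs = 5000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Promise-based sleep helper.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wait before the next request: back off after a 429, otherwise pause 2-5 s.
async function waitBeforeNextRequest(lastStatus, attempt) {
  const ms = lastStatus === 429 ? backoffDelayMs(attempt) : randomDelayMs();
  await sleep(ms);
  return ms;
}
```

Because the normal delay never drops below 2 seconds, sequential requests also stay within the limit of one request per second per domain.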
Error Handling
| Situation | Behavior |
|---|---|
| 404 | Returns the error as JSON and continues |
| 403/blocked | Retries (up to 3 times) |
| Timeout | Extends the wait time, then retries |
| Login required | Prints a warning and returns only the data that is available |
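The table can be read as a small decision function. The sketch below is illustrative only: the outcome labels and action names are hypothetical, and what happens once the three retries are used up is an assumption (here, the error is reported and processing continues).

```javascript
// Hypothetical mapping from an outcome to the behavior listed in the table.
const MAX_RETRIES = 3; // from the table: retry up to 3 times when blocked

function decideAction(outcome, attempt) {
  switch (outcome) {
    case 'not-found': // HTTP 404
      return { action: 'report-error-and-continue' };
    case 'blocked': // HTTP 403 or a block page
      // Assumption: after the retries are exhausted, report and move on.
      return attempt < MAX_RETRIES
        ? { action: 'retry' }
        : { action: 'report-error-and-continue' };
    case 'timeout':
      return { action: 'retry', extendWait: true };
    case 'login-required':
      return { action: 'return-partial', warning: 'Login required' };
    default:
      return { action: 'ok' };
  }
}

console.log(decideAction('blocked', 0).action); // first block: retry
```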
Environment Variables
# Turn off headless mode (for debugging)
HEADLESS=false node scripts/naver-blog.js ...
# Save screenshots
SCREENSHOT=true node scripts/coupang.js ...
# Adjust the wait time (ms)
WAIT_TIME=10000 node scripts/naver-cafe.js ...
# Use a custom User-Agent
USER_AGENT="..." node scripts/naver-news.js ...
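A script might turn these variables into a config object along the following lines. This is a sketch under assumptions: `readConfig` is a hypothetical name, and the defaults (headless on, screenshots off, a 5000 ms wait) are guesses rather than values confirmed by the scripts.

```javascript
// Hypothetical config reader; the default values are assumptions.
function readConfig(env = process.env) {
  const waitTime = Number.parseInt(env.WAIT_TIME ?? '', 10);
  return {
    // Headless unless explicitly disabled with HEADLESS=false.
    headless: env.HEADLESS !== 'false',
    // Screenshots only when explicitly enabled with SCREENSHOT=true.
    screenshot: env.SCREENSHOT === 'true',
    // Fall back to an assumed default when WAIT_TIME is unset or not a number.
    waitTimeMs: Number.isNaN(waitTime) ? 5000 : waitTime,
    // null means "let the script choose its own User-Agent".
    userAgent: env.USER_AGENT || null,
  };
}

console.log(readConfig({ HEADLESS: 'false', WAIT_TIME: '10000' }));
```

Passing the environment in as a parameter keeps the function easy to test without touching `process.env`.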
Integration Examples
OpenClaw Agent integration
// Naver Blog search (query: "AI trends")
const result = await exec({
  command: 'node scripts/naver-blog.js search "AI 트렌드" --limit 5',
  workdir: '/path/to/skills/korean-scraper'
});
const data = JSON.parse(result.stdout);
Batch Processing
# Process multiple URLs in a batch (one URL per line in urls.txt)
while IFS= read -r url; do
  node scripts/naver-blog.js extract "$url" >> results.jsonl
done < urls.txt
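The resulting results.jsonl can then be read back in Node. The sketch below assumes each script invocation prints its JSON on a single line; if the scripts pretty-print across several lines, the output would need compacting before this works. `parseJsonl` is a hypothetical helper name.

```javascript
// Hypothetical JSONL reader: one JSON document per non-empty line.
// Lines that fail to parse are collected instead of aborting the whole batch.
function parseJsonl(text) {
  const records = [];
  const errors = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === '') return; // skip blank lines
    try {
      records.push(JSON.parse(line));
    } catch (err) {
      errors.push({ line: index + 1, message: err.message });
    }
  });
  return { records, errors };
}

// Reading the batch file (shown as a comment so the block runs without it):
//   const fs = require('node:fs');
//   const { records, errors } = parseJsonl(fs.readFileSync('results.jsonl', 'utf8'));
```

Collecting parse errors with their line numbers makes it easy to see which URL in the batch produced unusable output.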
Limitations
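- Login-required content: scraping currently works only in a logged-out state (this affects some Coupang reviews, for example)
- Dynamic loading: infinite scroll loads only the first 10 items by default (extendable with the --scroll flag)
- CAPTCHA: must be solved manually (cannot be automated)
- IP blocking: excessive requests may lead to a temporary block (keep to the rate limits)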
Compliance & Ethics
- ✓ Collect only publicly available information
- ✓ Respect robots.txt (default)
- ✓ Minimize server load through rate limiting
- ✗ Do not collect personal information
- ✗ Do not access login-required content without authorization
- ✗ Do not use for copyright infringement
Troubleshooting
Problem: 403 Forbidden
Solutions:
- Try a different User-Agent
- Increase the wait time (WAIT_TIME=15000)
- Turn off headless mode (HEADLESS=false)
Problem: Empty results
Solutions:
- Check the URL format
- The site structure may have changed (selectors may need updating)
- Check whether login is required
Problem: Timeout
Solutions:
- Increase WAIT_TIME
- Check the internet connection
- Check whether the site is reachable (for example, whether a VPN is needed)
Maintenance
Korean sites change their UI frequently, so the selectors may need updating.
Selector location: the SELECTORS object at the top of each file in scripts/
const SELECTORS = {
  blogTitle: '.se-title-text',
  blogContent: '.se-main-container',
  // ...
};
Future Improvements
- Instagram post scraping
- Naver Shopping price comparison
- Metadata for Korean YouTube channels
- Batch processing optimization (parallel execution)
- Cookie/session management (persistent login)
- Proxy support