Clawtext Ingest

Multi-source memory ingestion with Discord support, automatic deduplication, and agent-ready patterns

v1.0.1

Description


name: ClawText Ingest
description: Multi-source memory ingestion with Discord support, automatic deduplication, and agent-ready patterns
keywords: discord, memory, ingestion, rag, agents, deduplication, cli

ClawText Ingest — Production-Ready Memory Ingestion

Version: 1.3.0 | License: MIT | Status: Production ✅
Author: ragesaq | Category: Memory & Knowledge Management
GitHub: https://github.com/ragesaq/clawtext-ingest


🎯 What It Does

ClawText Ingest transforms external data (Discord forums, files, URLs, JSON, text) into structured, deduplicated memories for AI agents.

The Problem It Solves

  • Manual ingestion — Tedious, error-prone, no metadata
  • Duplicate memories — Same data ingested multiple times
  • Unstructured data — No hierarchy, no context preservation
  • One-time imports — No recurring/scheduled ingestion
  • Discord-specific gaps — Can't preserve forum post↔reply structure

The Solution

  • One command: imports from Discord, files, URLs, or JSON
  • 100% idempotent: run 1000x, zero duplicates
  • Automatic metadata: YAML frontmatter with date, project, type, entities
  • 6 agent patterns: autonomous workflows documented and ready
  • Discord-native: forum hierarchy preserved, progress bars, auto-batch mode
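
To make the "automatic metadata" point concrete, here is a minimal sketch of what a memory with YAML frontmatter could look like. The helper name toMemory and the exact layout are assumptions for illustration; only the four field names (date, project, type, entities) come from the list above, and values are written unquoted, so this simplified version does not escape special YAML characters.

```javascript
// Hypothetical helper: renders a memory as YAML frontmatter plus body.
// Field names mirror the list above; the layout ClawText Ingest actually
// writes is an assumption.
function toMemory(content, { date, project, type, entities = [] }) {
  const lines = [
    '---',
    `date: ${date}`,
    `project: ${project}`,
    `type: ${type}`,
    // JSON.stringify yields a flow sequence, which is also valid YAML.
    `entities: ${JSON.stringify(entities)}`,
    '---',
    '',
    content.trim(),
  ];
  return lines.join('\n') + '\n';
}

const memory = toMemory('Use SHA1 hashes for dedup.', {
  date: '2024-01-15',
  project: 'docs',
  type: 'fact',
  entities: ['SHA1'],
});
console.log(memory);
```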


✨ Key Features

🎯 Discord Integration (New in v1.3.0)

  • Forum + Channel + Thread support
  • Hierarchy preservation — Post↔reply structure in metadata
  • Real-time progress — Live feedback for large ingestions
  • Auto-batch mode — <500 posts: full, ≥500 posts: streaming
  • One-command setup — 5-minute bot creation
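
The auto-batch threshold and the hierarchy metadata described above can be sketched roughly as follows. This is an illustration under assumptions, not the tool's implementation: the function names, the input shape (a post with an id, content, and a replies array), and the metadata field names are all hypothetical. Only the 500-post threshold and the full/streaming split come from the list.

```javascript
// Threshold taken from the feature list: under 500 posts is ingested in
// full, 500 or more is streamed. The mode names are illustrative.
const STREAMING_THRESHOLD = 500;

function chooseMode(postCount) {
  return postCount < STREAMING_THRESHOLD ? 'full' : 'streaming';
}

// Hypothetical shape: flattens one forum post and its replies into records
// whose metadata keeps the link between the post and each reply.
function flattenPost(post) {
  const root = {
    content: post.content,
    metadata: { postId: post.id, role: 'post', replyCount: post.replies.length },
  };
  const replies = post.replies.map((reply, index) => ({
    content: reply.content,
    metadata: { postId: post.id, replyId: reply.id, role: 'reply', position: index + 1 },
  }));
  return [root, ...replies];
}

const records = flattenPost({
  id: 'p1',
  content: 'How do we rotate tokens?',
  replies: [{ id: 'r1', content: 'Use the admin panel.' }],
});
console.log(chooseMode(120), records.length);
```

Because every reply record carries the postId of its parent, the thread can be reassembled later from the metadata alone.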

📁 Multi-Source Ingestion

  • Files — Glob patterns (Markdown, text, etc.)
  • URLs — Single or bulk URL ingestion
  • JSON — Chat exports, API responses
  • Raw text — Quick knowledge capture
  • Batch operations — Unified ingestion from multiple sources

🔄 Deduplication & Safety

  • SHA1-based — Cryptographic hash matching
  • 100% idempotent — Safe for repeated runs
  • Configurable: set checkDedupe to true or false per operation
  • Zero data loss — Failed items tracked, fallback per-item ingestion
  • Hash persistence: .ingest_hashes.json for cross-session tracking
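
As a rough illustration of hash-based deduplication, the sketch below keeps a set of SHA-1 digests and skips content it has already seen. The HashStore class, its method names, and the serialized layout are assumptions made for this example; the package's real internals and the actual format of .ingest_hashes.json are not shown in this document. The checkDedupe option name is taken from the list above.

```javascript
// CommonJS form; in an ES module use: import { createHash } from 'node:crypto';
const { createHash } = require('node:crypto');

// Hypothetical in-memory store; the hash file layout is an assumption.
class HashStore {
  constructor(hashes = []) {
    this.hashes = new Set(hashes);
  }

  static hash(content) {
    return createHash('sha1').update(content, 'utf8').digest('hex');
  }

  // Returns true if the content is accepted (and records it),
  // false if it is a duplicate and checkDedupe is on.
  addIfNew(content, { checkDedupe = true } = {}) {
    const digest = HashStore.hash(content);
    if (checkDedupe && this.hashes.has(digest)) return false;
    this.hashes.add(digest);
    return true;
  }

  // Serialized form that could be written to a file such as .ingest_hashes.json.
  toJSON() {
    return { hashes: [...this.hashes] };
  }
}

const store = new HashStore();
const accepted = ['alpha', 'beta', 'alpha'].filter((item) => store.addIfNew(item));
console.log(accepted); // [ 'alpha', 'beta' ]
```

Persisting the serialized set between runs is what makes repeated ingestion idempotent across sessions: a second run reloads the digests and rejects everything it has already stored.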

🤖 Agent-Ready

  • 6 documented patterns — Direct API, Discord Agent, CLI, Cron, Batch, Thread
  • Working code examples — Copy-paste ready
  • Real-world patterns — GitHub sync, Discord monitoring, team decisions
  • Error handling — Comprehensive error recovery
  • Progress callbacks — Track ingestion in real-time
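
The error-recovery and progress-callback ideas above might combine along these lines. This is a sketch under assumptions: ingestWithFallback, ingestBatch, and ingestOne are hypothetical names supplied by the caller, and the code is synchronous for brevity where real ingestion would be asynchronous. The percent field matches the one used by onProgress in the examples later in this document.

```javascript
// Sketch only: ingestBatch and ingestOne are caller-supplied stand-ins.
function ingestWithFallback(items, ingestBatch, ingestOne, onProgress = () => {}) {
  try {
    ingestBatch(items);
    onProgress({ done: items.length, total: items.length, percent: 100 });
    return { ingested: items.length, failed: [] };
  } catch (batchError) {
    // Batch failed: retry item by item so one bad item cannot sink the rest.
    const failed = [];
    let ingested = 0;
    items.forEach((item, index) => {
      try {
        ingestOne(item);
        ingested += 1;
      } catch (error) {
        // Track the failure instead of dropping the item silently.
        failed.push({ item, reason: error.message });
      }
      const done = index + 1;
      onProgress({ done, total: items.length, percent: Math.round((done / items.length) * 100) });
    });
    return { ingested, failed };
  }
}

const result = ingestWithFallback(
  ['a', 'bad', 'c'],
  () => { throw new Error('batch rejected'); },
  (item) => { if (item === 'bad') throw new Error('malformed item'); },
  (p) => console.log(`${p.percent}% complete...`)
);
console.log(result);
```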

🛠️ Developer-Friendly

  • CLI tool: clawtext-ingest + clawtext-ingest-discord commands
  • Node.js API — Simple imports for programmatic use
  • TypeScript-ready — Clear method signatures
  • Extensible — Custom transforms, field mapping
  • Well-documented — 11 guides, 20+ examples

🔗 ClawText Integration

  • Automatic cluster indexing — New memories indexed after rebuild
  • RAG injection — Relevant context injected into agent prompts
  • Project routing — Organize memories by project/source
  • Entity linking — Auto-extract and link related entities

🚀 Quick Start

Installation

# Via npm
npm install clawtext-ingest

# Via OpenClaw
openclaw install clawtext-ingest

Discord Ingestion (5 minutes)

# 1. Set up Discord bot (see DISCORD_BOT_SETUP.md)
# 2. Get bot token, set DISCORD_TOKEN env var

# 3. Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose

# 4. Ingest with progress
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord --forum-id FORUM_ID

# 5. Rebuild ClawText clusters
clawtext-ingest rebuild

File Ingestion

clawtext-ingest ingest-files --input="docs/*.md" --project="docs"

Node.js API

import { ClawTextIngest } from 'clawtext-ingest';

const ingest = new ClawTextIngest();

// Ingest files
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs', type: 'fact' });

// Ingest JSON
await ingest.fromJSON(chatArray, { project: 'team' }, {
  keyMap: { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' }
});

// Rebuild clusters for RAG injection
await ingest.rebuildClusters();
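
The keyMap option in the fromJSON call above names which fields of each JSON item hold the content, date, and author. A minimal sketch of that mapping is shown below; applyKeyMap is a hypothetical helper written for illustration, and fromJSON's real behavior may differ.

```javascript
// Hypothetical illustration of what a keyMap could do.
function applyKeyMap(item, keyMap) {
  return {
    content: item[keyMap.contentKey],
    date: item[keyMap.dateKey],
    author: item[keyMap.authorKey],
  };
}

// Sample data invented for this example.
const chatArray = [
  { message: 'Ship on Friday', timestamp: '2024-03-01T10:00:00Z', user: 'ana' },
];
const keyMap = { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' };
console.log(chatArray.map((item) => applyKeyMap(item, keyMap)));
```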

🤖 Agent Integration (6 Patterns)

Pattern 1: Direct API

For: In-agent code
Use when: Agents need to ingest as part of workflow

const ingest = new ClawTextIngest();
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs' });

Pattern 2: Discord Agent

For: Autonomous Discord ingestion
Use when: Agents need to fetch Discord forums

const runner = new DiscordIngestionRunner(ingest);
await runner.ingestForumAutonomous({
  forumId, mode: 'batch', token: process.env.DISCORD_TOKEN
});

Pattern 3: CLI Subprocess

For: Agents executing commands
Use when: Simpler CLI-based execution needed

await execAsync('clawtext-ingest-discord fetch-discord --forum-id ID');
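
The one-liner above leaves execAsync undefined and builds the command as a single string. One possible way to assemble the arguments safely is sketched here; buildFetchArgs is a hypothetical helper, and it only uses flags that appear elsewhere in this document (--forum-id, --mode, --batch-size, --verbose).

```javascript
// Hypothetical helper: builds the argument list for
// `clawtext-ingest-discord fetch-discord`.
function buildFetchArgs(forumId, { mode, batchSize, verbose = false } = {}) {
  const args = ['fetch-discord', '--forum-id', String(forumId)];
  if (mode) args.push('--mode', mode);
  if (batchSize !== undefined) args.push('--batch-size', String(batchSize));
  if (verbose) args.push('--verbose');
  return args;
}

console.log(buildFetchArgs('FORUM_ID', { mode: 'batch', batchSize: 100, verbose: true }));
```

The resulting array can be passed to Node's child_process.execFile along with the binary name clawtext-ingest-discord. Unlike exec, execFile does not spawn a shell by default, so the forum ID is never interpolated into a shell command string.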

Pattern 4: Cron/Scheduled

For: Recurring tasks
Use when: Daily/hourly ingestion needed

cron.schedule('0 * * * *', () => agentIngest());

Pattern 5: Batch Multi-Source

For: Unified ingestion
Use when: Multiple sources in one operation

await ingest.ingestAll([
  { type: 'files', data: ['docs/**/*.md'], metadata: {...} },
  { type: 'json', data: chatExport, metadata: {...} }
]);
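
Before handing a mixed list of sources to ingestAll, an agent may want to check the descriptors first. The sketch below is a hypothetical pre-flight check, not part of the package; it only knows the three source types that appear in this document's ingestAll examples (files, urls, json).

```javascript
// Source types that appear in this document's ingestAll examples.
const KNOWN_TYPES = new Set(['files', 'urls', 'json']);

// Hypothetical pre-flight check; ingestAll's own validation is not
// documented here. Returns a list of human-readable problems.
function validateSources(sources) {
  const problems = [];
  sources.forEach((source, index) => {
    if (!KNOWN_TYPES.has(source.type)) {
      problems.push(`source ${index}: unknown type "${source.type}"`);
    }
    if (source.data === undefined || source.data === null) {
      problems.push(`source ${index}: missing data`);
    }
  });
  return problems;
}

console.log(validateSources([
  { type: 'files', data: ['docs/**/*.md'], metadata: { project: 'docs' } },
  { type: 'rss', metadata: {} },
]));
```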

Pattern 6: Discord Thread

For: Thread-specific ingestion
Use when: Single thread fetch needed

await runner.ingestThread(threadId);

→ See AGENT_GUIDE.md for complete examples


📊 Real-World Examples

Example 1: Daily Documentation Sync

async function syncDocsDaily() {
  const ingest = new ClawTextIngest();
  const result = await ingest.ingestAll([
    { type: 'files', data: ['docs/**/*.md'], metadata: { project: 'docs' } },
    { type: 'urls', data: ['https://docs.example.com/api'], metadata: { project: 'api-docs' } }
  ]);
  await ingest.rebuildClusters();
  return result;
}

Example 2: Discord Forum Monitoring

async function monitorDiscordForum(forumId) {
  const ingest = new ClawTextIngest();
  const runner = new DiscordIngestionRunner(ingest);
  
  const result = await runner.ingestForumAutonomous({
    forumId,
    mode: 'batch',
    token: process.env.DISCORD_TOKEN,
    onProgress: (p) => console.log(`${p.percent}% complete...`)
  });
  
  return result;
}

Example 3: Team Decisions Ingestion

async function ingestTeamDecisions() {
  const ingest = new ClawTextIngest();
  
  const result = await ingest.ingestAll([
    { type: 'files', data: ['decisions/adr/**/*.md'], metadata: { type: 'adr' } },
    { type: 'json', data: slackThread, metadata: { type: 'decision', source: 'slack' } }
  ]);
  
  await ingest.rebuildClusters();
  return result;
}

🛒 CLI Commands

clawtext-ingest — File/URL/JSON/Text Ingestion

clawtext-ingest ingest-files --input="docs/*.md" --project="docs" --verbose
clawtext-ingest ingest-urls --input="https://example.com" --project="research"
clawtext-ingest ingest-json --input=messages.json --source="slack"
clawtext-ingest ingest-text --input="Finding: X is better than Y" --project="findings"
clawtext-ingest batch --config=sources.json
clawtext-ingest rebuild
clawtext-ingest status

clawtext-ingest-discord — Discord Integration

# Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose

# Fetch & ingest
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord \
  --forum-id FORUM_ID \
  --mode batch \
  --batch-size 100 \
  --verbose

📚 Documentation

Document | Purpose | Read Time
README.md | Overview + quick start | 5 min
QUICKSTART.md | 5-minute setup | 5 min
AGENT_GUIDE.md | 6 autonomous patterns | 10 min
API_REFERENCE.md | Complete API docs | 15 min
PHASE2_CLI_GUIDE.md | CLI commands | 10 min
DISCORD_BOT_SETUP.md | Bot creation | 5 min
CLAYHUB_GUIDE.md | Publication | 5 min
INDEX.md | Documentation index | 2 min

🎯 Who Should Use This

  • AI/Agent developers — Building knowledge-aware agents
  • RAG engineers — Populating memory for context injection
  • Teams using Discord — Leveraging Discord as knowledge base
  • DevOps/MLOps — Automated knowledge ingestion pipelines
  • Researchers — Structuring unstructured data sources

⚡ Performance

Operation | Speed | Notes
Ingest 100 files | ~5 sec | With SHA1 dedup check
Ingest 1000 JSON items | ~15 sec | Batch processing
Small forum (<100 msgs) | ~10 sec | Full mode
Large forum (1000+ msgs) | ~2 min | Auto-batch, streaming
Rebuild clusters | ~5-30 sec | Depends on total memories

✅ Quality Metrics

Metric | Value
Tests | 22/22 passing ✅
Code | 1,254 production lines
Documentation | 92 KB across 11 guides
Examples | 20+ working examples
Coverage | 100% critical paths

🔗 Integration with ClawText

  1. Ingest data → Creates memories with YAML metadata
  2. Rebuild clusters → ClawText indexes new memories
  3. RAG layer → Relevant context injected on next prompt
  4. Agent response → Enhanced with contextual information

# Complete workflow
clawtext-ingest-discord fetch-discord --forum-id ID  # Step 1
clawtext-ingest rebuild                               # Step 2
# Step 3-4 automatic (ClawText + Agent)

🆘 Support


📦 Installation & Requirements

Requirements:

  • Node.js ≥ 18.0.0
  • OpenClaw (for agent patterns)
  • ClawText ≥ 1.2.0 (for RAG integration)

Installation:

npm install clawtext-ingest
# or
openclaw install clawtext-ingest

Binaries:

  • clawtext-ingest — File/URL/JSON ingestion
  • clawtext-ingest-discord — Discord integration

🚀 Why This Over Alternatives

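ClawText-Ingest is compared with three alternatives: manual ingestion, a generic importer, and an API tool. The comparison covers six features:

  • Discord native
  • Deduplication
  • Agent patterns
  • Metadata auto
  • ClawText integration
  • Idempotent

ClawText-Ingest provides all six (see Key Features). Among the alternatives, the only support noted is partial, for deduplication, automatic metadata, and idempotency.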

📄 License

MIT — Use freely, open source, community supported


🙌 Contributing

Contributions welcome! See GitHub issues for current priorities.


Ready to ingest? Start with QUICKSTART.md (5 min) or AGENT_GUIDE.md if you're building agents.

Pricing

Free
