Ingest Pipeline

The ingest pipeline is backed by a persistent queue with a deterministic state machine. When you say “ingest everything”, the system scans raw/, deduplicates against what's already in the wiki, and queues every new file. Each item progresses through defined stages.

You can process files in batches, interrupt and resume across sessions — the queue picks up where you left off. State is persisted in ingest_queue.json.
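The exact schema of ingest_queue.json isn't documented here, but as a minimal sketch, a resumable queue with a deterministic state machine might look something like this (the stage names and JSON shape below are assumptions, not the tool's actual format):

```python
# Hypothetical sketch: the real ingest_queue.json schema may differ.
import json
from pathlib import Path

QUEUE_FILE = Path("ingest_queue.json")
STAGES = [
    "tag_assignment", "domain_validation", "page_scaffolding",
    "content_extraction", "discovery_review", "post_validation", "done",
]

def load_queue() -> dict:
    """Load the persisted queue, or start empty on a first run."""
    if QUEUE_FILE.exists():
        return json.loads(QUEUE_FILE.read_text())
    return {"items": []}

def save_queue(queue: dict) -> None:
    """Persist state so an interrupted session can resume where it left off."""
    QUEUE_FILE.write_text(json.dumps(queue, indent=2))

def next_pending(queue: dict):
    """Return the first queued file that has not reached the final stage."""
    for item in queue["items"]:
        if item["stage"] != "done":
            return item
    return None

def advance(item: dict) -> None:
    """Move an item to the next stage in the deterministic state machine."""
    item["stage"] = STAGES[STAGES.index(item["stage"]) + 1]
```

Because every state transition is written back to disk, interrupting a batch mid-way loses nothing: the next session reloads the file and continues from the first item that isn't done.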

Pipeline stages

01 · Tag assignment

The agent asks for a project tag before any processing begins. No document enters the wiki without a domain.

02 · Domain validation

A blind review verifies that the document belongs to the declared project; semantic and quantitative evaluations run independently.

03 · Page scaffolding

The agent creates or updates source, entity, and concept pages with proper frontmatter and wikilinks (see the sketch after this list).

04 · Content extraction

The agent reads the source document and writes structured summaries, extracting key information into wiki pages.

05 · Discovery review

The agent identifies entities and concepts blind; a deterministic tool then cross-checks for missed high-frequency terms.

06 · Post-validation

A final check ensures all expected wikilinks are present and no existing wiki pages were overlooked.
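To make stage 03 concrete, here is a hypothetical sketch of how a source page might be scaffolded. The frontmatter fields, directory layout, and function name are illustrative assumptions, not the wiki's actual schema:

```python
# Hypothetical sketch of stage 03; the real frontmatter schema may differ.
from pathlib import Path

def scaffold_source_page(title: str, project: str, entities: list[str]) -> Path:
    """Create a source page with frontmatter and wikilinks to related entities."""
    links = "\n".join(f"- [[{name}]]" for name in entities)
    page = (
        "---\n"
        f"title: {title}\n"
        f"project: {project}\n"
        "type: source\n"
        "---\n\n"
        "## Related entities\n"
        f"{links}\n"
    )
    path = Path("wiki") / f"{title}.md"
    path.parent.mkdir(exist_ok=True)
    path.write_text(page)
    return path
```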

Discovery review

During ingest, the agent reads the source document and identifies entities and concepts to link — blind, without any tool assistance. Then a deterministic tool (extract_terms.py) extracts every word from the original document with its frequency count, along with the list of terms already linked via [[wikilinks]].

The agent compares the two: high-frequency terms that aren't linked yet may be entities or concepts it missed. This catches the things a human reader would notice — “this document mentions 'governance' six times but there's no link to the Data Governance page”.

try it

Ingest everything in raw/ and tag it as project-alpha

No filtering, no stopwords, no language-specific logic. The tool provides raw data; the agent decides what's noise and what's relevant.
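The cross-check itself is simple to picture. A sketch of the kind of comparison extract_terms.py enables, where the function names and thresholds are illustrative rather than the tool's actual interface:

```python
# Illustrative sketch; extract_terms.py's real output format may differ.
import re
from collections import Counter

def term_frequencies(text: str) -> Counter:
    """Count every word in the source document, with no stopword filtering."""
    return Counter(re.findall(r"[\w'-]+", text.lower()))

def linked_terms(wiki_page: str) -> set[str]:
    """Collect terms already linked via [[wikilinks]] on the wiki page."""
    return {m.lower() for m in re.findall(r"\[\[([^\]|]+)", wiki_page)}

def unlinked_candidates(source: str, wiki_page: str, min_count: int = 3):
    """High-frequency terms with no wikilink yet; the agent decides which matter."""
    freqs = term_frequencies(source)
    linked = linked_terms(wiki_page)
    return [(term, n) for term, n in freqs.most_common()
            if n >= min_count and term not in linked]
```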

Missing link post-validation

After the agent writes a wiki page, the validation step checks for a specific blind spot: existing wiki pages whose names appear in the source document but weren't linked with [[wikilinks]] in the wiki page. If the source mentions “OpenAI” four times and there's an OpenAI.md entity page but no [[OpenAI]] link, the validator flags it.

These are non-blocking warnings — the validation still passes, but the agent reviews each warning and adds links where appropriate. It's a safety net that catches the links a thorough reader would expect to see.
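A minimal sketch of that check, assuming wiki pages live in a flat directory of .md files; the function name and warning format are illustrative:

```python
# Illustrative sketch of the missing-link check; the real validator may differ.
import re
from pathlib import Path

def missing_link_warnings(source_text: str, wiki_page_text: str, wiki_dir: Path):
    """Warn about existing wiki pages mentioned in the source but never wikilinked."""
    already_linked = {m.lower() for m in re.findall(r"\[\[([^\]|]+)", wiki_page_text)}
    warnings = []
    for page in wiki_dir.glob("*.md"):
        name = page.stem  # e.g. "OpenAI" for OpenAI.md
        mentions = len(re.findall(re.escape(name), source_text, re.IGNORECASE))
        if mentions and name.lower() not in already_linked:
            warnings.append(f"{name}: mentioned {mentions}x but [[{name}]] is missing")
    return warnings  # non-blocking: validation still passes with warnings
```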

File conversion

The wiki works with Markdown. If your sources are in other formats, the agent handles the conversion automatically during ingest (or on demand with /convert). It writes a conversion script and a validation script in .staging/, runs both, and loops until validation passes. Every converted file is verified before it enters the pipeline — no silent failures.

The converted Markdown is what enters the pipeline; the original binaries are archived in processed/.
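A sketch of the convert-then-validate loop, assuming hypothetical script names under .staging/ and a bounded number of attempts; the agent's actual scripts are generated per file:

```python
# Hypothetical sketch of the conversion loop; script names are illustrative.
import subprocess
from pathlib import Path

STAGING = Path(".staging")
MAX_ATTEMPTS = 3

def convert_with_validation(source: Path) -> bool:
    """Run the generated conversion and validation scripts until validation passes."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        subprocess.run(["python", STAGING / "convert.py", str(source)], check=True)
        result = subprocess.run(["python", STAGING / "validate.py", str(source)])
        if result.returncode == 0:
            return True  # verified Markdown enters the pipeline
        # On failure the agent revises the conversion script and tries again.
    return False  # never ingest a file that failed validation
```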