A unified, open-source ingestion tool for Open Knowledge Format (OKF) bundles — read any OKF bundle, validate its conformance (permissively, per the spec), build the concept graph, and load it into a portable, queryable DuckDB catalog. One catalog format, two idiomatic bindings: R and Python.
Point it at your
[[wikilink]]vault. As of 0.6, okf-ingest resolves both markdown](path)links and[[wikilink]]references (Obsidian / Logseq / Foam) — by name (id / alias / title), so links survive file renames. Your existing notes become a deterministic, queryable, renderable knowledge graph with no rewriting. See[[wikilinks]]& aliases.
The image is
okf graphrun on okf-ingest's own docs — the project dogfoods OKF: that folder is a conformant bundle you caningest,html/graph, anddoctorwith the tool itself.
OKF (Google Cloud, v0.1) is a directory of markdown files with YAML frontmatter — one concept per file, markdown links as a graph. Validators and parsers already exist (Node, a web tool, a pure-Rust crate). What no other tool does — and what this one is for — is load a bundle into a SQL-queryable DuckDB catalog with built-in semantic search (RAG), and do it from R or Python (there was no R or Python OKF tooling at all). See Related tools.
okf-ingest is pure, deterministic machinery: the same bundle in always produces the same catalog, the same graph, the same clusters, and the same rendered HTML out — byte-for-byte, on any machine, with no network and no API key. There are no LLM agents anywhere in it. It never asks a model to summarize a page, infer a "layer," guess a relationship, or decide anything. It reads exactly the structure the author wrote — the frontmatter, the markdown links — and surfaces that.
This is a deliberate line. A wave of tools will read your knowledge base by turning agents loose to summarize and "understand" it; their output is non-reproducible, costs tokens, ships your corpus to a model, and quietly invents structure. okf-ingest does the opposite — it's the boring, auditable substrate underneath:
- Reproducible — deterministic enough to assert on in CI; a parity test locks R and Python to byte-identical catalogs. No "re-ran it and got different edges."
- Free & offline — no tokens, no keys, no calls. Parsing, validation, the link graph, community clustering (deterministic label propagation), backlinks, impact, and HTML/graph rendering are all plain code.
- Private — your content never leaves the machine. Nothing is sent anywhere.
- Composable with agents, not replaced by them — when you do want an LLM,
okf hands it the curated graph to reason over (
okf context) rather than pretending to be the reasoner. You bring the model; okf brings the ground truth.
The two honest exceptions, both opt-in and explicit: the embed/rag layer
calls a local, pluggable embedding model (default Ollama — swap in your own) to
add vector search; and ingested_at is a wall-clock metadata field you can
override (the conformance suite does, which is how it stays byte-stable). The
knowledge representation itself — concepts, graph, clusters, render — is 100%
deterministic and model-free.
| Wiki / docs site | Vector DB | Agent "understand my wiki" | okf-ingest | |
|---|---|---|---|---|
| Reproducible (same in → same out) | ✅ | ❌ non-deterministic | ✅ byte-locked | |
| Offline, no API key / tokens | ✅ | ❌ | ✅ | |
| Content stays local / private | ✅ | ❌ sent to a model | ✅ | |
| SQL / programmatic access | ❌ | ✅ DuckDB | ||
| Explicit concept graph | ❌ | ✅ inferred | ✅ author-written | |
Your [[wikilink]] notes → queryable graph |
❌ ignores links | ✅ + rename-safe | ||
| Renders to HTML + interactive graph | ❌ | ✅ | ✅ | |
| Semantic search | ❌ | ✅ | ✅ | ✅ (opt-in) |
| Invents structure with an LLM | ❌ | ❌ | ✅ | ❌ by design |
graph LR
B["OKF bundle<br/>(dir · git · tar/zip)"] --> I[ingest]
I --> C[(DuckDB catalog<br/>concepts · links · validation · chunks)]
C --> CX["context<br/>(LLM-wiki blob)"]
C --> H["html<br/>(site / single)"]
C --> G["graph / export<br/>(interactive · JSON · Mermaid)"]
C --> D["doctor<br/>(health · --fix)"]
C --> R["embed / rag<br/>(opt-in, local model)"]
Often you don't — by design. OKF bundles are meant to be read directly by an
agent: load index.md, follow the curated links, pull the few relevant concept
files into context. For a small, well-linked bundle (dozens to low-hundreds of
concepts), that index-first traversal is the intended pattern — cf. Karpathy's
"LLM wiki"; Google's own framing is that OKF complements RAG, it doesn't
require it. No catalog, no embeddings, no okf-ingest — just let the agent
navigate the markdown.
Reach for okf-ingest when direct reading isn't enough:
- Programmatic / SQL access (any size) — query concepts, the link graph, or conformance findings from code or CI, in R or Python. → the DuckDB catalog.
- Large bundles — thousands of concepts, where loading the index or the whole bundle into context isn't practical.
- Semantic / cross-corpus retrieval — feeding OKF into a wider RAG pipeline,
or similarity search over a big/heterogeneous knowledge base. → the optional
embed/raglayer.
For small curated bundles, skip embed/rag — the explicit graph the author
wrote beats fuzzy vector matches, and following links costs nothing. If you want
tooling for that wiki pattern (rather than against it), use
okf context: it assembles
the index-first, link-following slice for an agent to read directly — no
embeddings involved.
One tool, two bindings — use whichever you live in.
R — from R-universe:
install.packages("okf", repos = c(travisjakel = "https://travisjakel.r-universe.dev",
CRAN = "https://cloud.r-project.org"))
library(okf)
res <- okf_ingest("my-bundle", db_path = "kb.duckdb") # dir, git URL, or tar/zip
okf_embed(res$con) # local Ollama nomic-embed-text
okf_rag(res$con, "how is revenue computed?", k = 3)[, c("path", "title", "score")]
#> path title score
#> 1 metrics/revenue.md Revenue 0.709
#> 2 orders.md Orders 0.642Python / CLI — from PyPI:
pip install okf-ingest # or: uv pip install okf-ingest · uv add okf-ingest
okf ingest ./my-bundle --db kb.duckdb # dir, git URL, or tar/zip
okf embed kb.duckdb
okf rag kb.duckdb --query "how is revenue computed?" -k 5
# [0.71] metrics/revenue.md#1 — Revenue
# [0.64] orders.md#1 — OrdersThe catalog is plain DuckDB — query it with SQL, R, Python, or the bare duckdb
CLI. Ingest/embed in one language, query from the other.
okf is deterministic and composes with your agent — give the agent the graph,
let it reason. Drop this into your agent's instructions
(AGENTS.md / CLAUDE.md / Cursor rules) so it drives okf instead of grepping:
The docs at <PATH> are an OKF bundle. Use `okf` to navigate them:
- `okf context <PATH> --start <concept>.md --depth 1` → index.md + that concept +
its linked neighbours as one markdown blob. Read that; don't grep files ad hoc.
- `okf query <PATH> --search "<term>"` (substring) or `--sql "<SELECT…>"` for lookups.
Start from index.md to see the map.
That's it — one paste and the agent reads the bundle index-first, the way OKF is
meant to be consumed (okf runs locally; no data leaves the machine). For
semantic instead of substring lookup, ingest once (okf ingest <PATH> --db kb.duckdb),
okf embed kb.duckdb, then point the agent at okf rag kb.duckdb --query "…". See
context.
The interoperability core is a contract, not compiled code:
schema/catalog.sql— the DuckDB catalog schema. Both bindings write matching catalogs (same rows, types, links, validation, andcontent_hash— a parity-locked conformance test enforces this); thefrontmatterJSON column is semantically equal but not byte-for-byte identical across languages. You can query the catalog with the bareduckdbCLI, no library at all.conformance/— language-agnostic golden bundles + expected outputs that every binding must reproduce.
The bindings (r/okf, py/okf) are thin, native, ~300-line packages kept in lockstep by that shared corpus. This matches OKF's own ethos ("no required tooling — if you can cat a file you can read OKF") far better than a heavyweight FFI core would.
Per OKF §6, a bundle is conformant iff every non-reserved .md has parseable YAML frontmatter with a non-empty type (a free string — no enum). Everything else is permissive: missing recommended fields, unknown types/keys, broken links, and missing index.md produce findings, never rejection. See docs/SPEC_NOTES.md.
# Python — installs the `okf` command + importable package
pip install ./py # (or: pip install okf-ingest once published)
# R — from R-universe (binaries; pulls deps automatically)
options(repos = c(travisjakel = "https://travisjakel.r-universe.dev",
CRAN = "https://cloud.r-project.org"))
install.packages("okf")
# …or from a clone:
R CMD INSTALL r/okf # (or: remotes::install_local("r/okf"))Optional extras: pip install okf-ingest[html] (or R install.packages("commonmark"))
adds the markdown engine for okf html; embeddings/rag use a local Ollama
server (no extra Python dep; R uses the httr2 Suggests).
Both bindings can also be used without installing (dev mode): source("r/okf/R/okf.R")
in R, or PYTHONPATH=py python -m okf … for the Python CLI.
R
source("r/okf/R/okf.R")
res <- okf_ingest("path/to/bundle", db_path = "catalog.duckdb")
res$summary # n_concepts, conformant, errors, links_broken, ...
okf_search(res$con, "revenue") # full-text-ish lookup over bodies
okf_findings(res$con) # conformance findingsPython
import okf.okf as okf
con, summary = okf.ingest("path/to/bundle", db_path="catalog.duckdb")
okf.search(con, "revenue")Both produce the same okf_bundle / okf_concept / okf_link / okf_validation tables.
Optional, and overkill for small curated bundles — see Do you actually need RAG?. It pays off for large or cross-corpus knowledge bases, not a hand-linked folder of a few dozen concepts.
embed chunks concept bodies (paragraph-merged to ~600 chars), embeds each via
a pluggable embedder (default: local Ollama nomic-embed-text, 768-dim;
swap in any texts -> list[vector] callable), and stores vectors in okf_chunk.
rag embeds a query and ranks chunks by cosine similarity using DuckDB's native
list_cosine_similarity — no vector-DB extension required. Embeddings are
part of the shared catalog, so you can embed with one binding and query with the
other.
The DuckDB catalog is the queryable materialization; it isn't on the critical path for parsing, linting, or graphing a bundle. If you don't want to touch DuckDB at all, three functions hand you the whole model as plain data structures (R: data frames / lists; Python: dicts / dataclasses) and never go through the catalog:
| R | Python | Gives you | |
|---|---|---|---|
| Parse | okf_read() |
read_bundle() |
concepts (frontmatter + body) |
| Lint / health | okf_validate() |
validate() |
the same findings doctor reports — broken links, orphans, missing fields, non-ISO timestamps |
| Graph | okf_links() |
links() |
resolved + broken edges (markdown and [[wikilinks]]) |
rd <- okf_read("my-bundle") # no DuckDB
val <- okf_validate(rd) # data.frame of findings (broken_link, orphan, …)
lk <- okf_links(rd) # src_path / dst_raw / dst_path / resolved
subset(lk, !resolved) # every dangling link, as data — your call what to dob = okf.read_bundle("my-bundle") # no DuckDB
val = okf.validate(b) # list of findings
lk = okf.links(b) # the edge graph
[e for e in lk if not e["resolved"]]Reach for the catalog (okf_ingest) when you want what a SQL engine is for:
SQL query, the rendered html/graph, context blobs, or vector rag. So
DuckDB is effectively opt-in by usage — build it when you need it, ignore it when
you don't.
Identical subcommands in both languages — after install just okf … (Python
console script); in R via Rscript r/okf/bin/okf.R … (uses the installed
package, or falls back to dev source):
okf validate <bundle> [--strict] [--json] # lint; exit 1 on errors (or warnings w/ --strict)
okf ingest <source> --db catalog.duckdb [--subdir <p>] [--branch <b>] [--incremental] [--json]
okf query catalog.duckdb [--sql "…"] [--search <term>] [--concepts|--links|--findings] [--json]
okf context <bundle|catalog> [--start <concept>] [--depth N] [--max-tokens N] # LLM-wiki context blob
okf html <bundle|catalog> --out <dir> | --single <file.html> [--title T] # render for viewing
okf graph <bundle|catalog> --out <file.html> [--title T] # interactive force-directed graph
okf export <bundle|catalog> [--json] # portable {nodes, edges} graph JSON
okf impact <bundle|catalog> <concept> [--json] # inbound / outbound / transitive ripple
okf embed catalog.duckdb [--model nomic-embed-text] [--incremental] # chunk + embed bodies for search
okf rag catalog.duckdb --query "…" [-k 5] [--model …] # top-k semantic matchescontext is the faithful OKF / "LLM wiki" consume operation: hand an agent
index.md plus a concept and its link-neighborhood, assembled into one
markdown blob to read directly. It walks the concept graph you already built —
no embeddings, no vector search — and is capped to a token budget. This is
the on-concept alternative to rag for curated bundles:
okf context ./my-bundle --start orders.md --depth 1 --max-tokens 8000 > ctx.md
# emits index.md + orders.md + everything one link away, ready to paste into a promptIt accepts a bundle directly (dir/git/tar/zip) or an ingested .duckdb catalog.
html is a thin "render for viewing" layer: turn a bundle into browsable HTML
with no build step, no JavaScript, inline CSS — copy the output anywhere and
open it. Two modes:
okf html ./my-bundle --out site/ # navigable site: one .html per concept + index.html
okf html ./my-bundle --single bundle.html # one self-contained file (concepts become anchored sections)Internal .md links are rewritten to page-relative .html (site) or
in-page #anchors (single), so the result works straight off the filesystem
(file://) however the source wrote its links. Each page gets a metadata bar
(type / status / timestamp / tags), a "Linked from" backlinks line, and a
footer badge that surfaces broken or orphan links from validate. Bodies render
via a thin markdown engine (R commonmark, a Suggests dep; Python markdown via
the okf-ingest[html] extra). Like context, it accepts a bundle
(dir/git/tar/zip) or a .duckdb catalog.
The catalog already holds the link graph; these expose it (all deterministic, no LLM):
okf graph ./my-bundle --out graph.html # interactive force-directed page (vanilla JS, no CDN)
okf export ./my-bundle > graph.json # portable {nodes, edges} for any external visualizer
okf impact ./my-bundle signals/x.md # outbound / inbound / transitive ripple of a conceptgraph is a single self-contained HTML page — pan/zoom/drag, type-to-search,
nodes coloured by OKF type with community clustering as the fallback (a
deterministic label-propagation, [okf_clusters]). Click a node to open its
rendered .html, so dropping graph.html into a html --out site root turns it
into a live map. export emits the same node/edge model as JSON (nodes carry
id/type/title/tags/cluster/href), extending the "core is a contract"
idea beyond the DuckDB catalog. impact answers "what does changing this ripple
to" from the resolved-link graph.
ingest --incremental diffs each concept's content_hash against a prior ingest
into the same --db, rewriting only changed/added concepts (and dropping removed
ones); the JSON summary reports changed/added/removed/cached. embed --incremental re-embeds only concepts whose content changed, skipping the
expensive embedder calls for the rest — the right default for large, often-edited
wikis.
Knowledge bases drift — links break when files move, timestamps go stale,
concepts orphan. okf doctor is a deterministic one-shot health scan with a
score and CI exit codes:
okf doctor ./my-bundle # health: 92/100 (broken links, orphans, stale ts, dup titles…)
okf doctor ./my-bundle --strict # exit 1 on any warning — drop into CI / a hook
okf doctor ./my-bundle --stale-days 365 # also flag timestamps older than a year
okf doctor ./my-bundle --fix # apply ONLY safe repairs, report each--fix is conservative on purpose — it only normalizes a parseable non-ISO
timestamp, and re-points a broken link when exactly one basename matches.
Anything ambiguous is reported, never guessed (no LLM). Ready-made
examples/pre-commit and
examples/github-action.yml wire it into your
workflow so a bundle can't drift broken.
Alongside markdown ](path.md) links (resolved by path, unchanged), okf-ingest
resolves [[wikilink]] references — [[Concept Name]] and [[name|display]] —
by name, trying id → alias → title → filename-stem (ambiguous names
resolve to nothing, never a guess). Add aliases: [Alt Name] and an optional
id: to a concept's frontmatter to give it stable handles. This makes
okf-ingest work on Obsidian / Logseq / Foam-style vaults out of the box, and —
because links target a name, not a path — they survive file renames. Existing
markdown-only bundles are byte-identical; wikilinks are purely additive.
A <source> is a local directory, a git URL (github/gitlab/bitbucket, .git,
or git@), or a tar/zip archive (local path or http(s) URL). Remote sources
are fetched to a temp dir and cleaned up automatically; --subdir selects a
bundle within a repo/archive and --branch picks a git ref:
okf ingest https://github.com/org/repo.git --subdir docs/okf --db kb.duckdb
okf ingest https://example.com/bundle.tar.gz --db kb.duckdbvalidate is CI-friendly (non-zero exit = non-conformant). The catalog is
portable across bindings — ingest with R, query with Python, or vice-versa:
Rscript r/okf/bin/okf.R ingest ./bundle --db cat.duckdb # R writes
okf query cat.duckdb --search revenue # Python readsRscript conformance/check_r.R # R binding vs expected/*.json
python conformance/check_py.py # Python binding vs expected/*.jsonschema/catalog.sql core: the catalog schema (interop contract)
conformance/ core: golden bundles + expected outputs + per-lang checks
docs/ ARCHITECTURE.md, SPEC_NOTES.md
r/okf/ R binding
py/okf/ Python binding
Stable · lightly maintained. The whole consume side is implemented,
CLI-wrapped, and conformance-tested in both languages over one portable DuckDB
catalog: validate → ingest → query → context → render (html / graph /
export --mermaid) → impact → doctor→ embed → rag, with --incremental
ingest/embed and dir/git/tar/zip sources. Packaged to
PyPI and
R-universe.
The feature surface is complete and the conformance contract is locked, so the package is stable — safe to depend on. It is lightly maintained: expect fixes for bugs, conformance regressions, and OKF-spec updates, but not a fast cadence of new features. Issues and PRs are welcome (see CONTRIBUTING.md); response times are best-effort.
- Authoring (
new/add) — scaffold conformant concepts. Deliberately not done yet; okf-ingest is consume-first (for authoring today, seeokf-knowledge). watch— re-ingest/render on file change for live editing.- HTML polish — optional sidebar nav and theme palettes (the page stays no-JS, inline-CSS).
- More
doctor --fixclasses, as long as they stay unambiguously safe.
Is anything sent to an LLM? No — not by the core. The only model calls are
the opt-in embed/rag layer, and that's a local embedder (Ollama by
default) you can swap. Parsing, the graph, clusters, rendering, and doctor are
plain deterministic code. See Deterministic by design.
R or Python — which catalog do I get? The same one. Both write byte-identical DuckDB catalogs (a parity test enforces it); ingest in one, query from the other.
Do I need embeddings/RAG? Usually not for small curated bundles — the graph
the author wrote beats fuzzy matches, and okf context costs nothing. See
Do you actually need RAG?.
How do I keep a bundle healthy over time? okf doctor (+ the
examples/ pre-commit hook / GitHub Action) gates drift in CI;
--fix repairs the unambiguously-safe issues.
Does it author/edit my knowledge? No. It reads what you wrote. doctor --fix
makes only mechanical, reported repairs (ISO timestamps, unique-match moved
links) — never content.
What's "conformant"? Parseable frontmatter with a non-empty type on every
non-reserved file. Everything else is a finding, never a rejection.
Contributing? See CONTRIBUTING.md — new bindings just need to reproduce the conformance corpus.
The OKF tooling ecosystem appeared within weeks of the v0.1 spec. okf-ingest is deliberately positioned where the others aren't — a queryable catalog + RAG, in R and Python:
| Tool | Lang | Validate | Parse/graph | [[wikilinks]] |
Queryable store | Embeddings / RAG |
|---|---|---|---|---|---|---|
GoogleCloudPlatform/knowledge-catalog |
Py/TS | — | producer + HTML viz | — | — | — |
W4G1/okf |
Rust | ✓ | ✓ | — | — | — |
sniperunder123/okf-knowledge |
Python (Claude Code skill) | ✓ | ✓ + authoring & graph viz | — | — | — |
| WitsCode / okf.site | Node/web | ✓ | partial | — | — | — |
| okf-skills / okf-skill | agent skills | ✓ | ✓ | — | — | — |
| okf-ingest (this) | R + Python | ✓ | ✓ | ✓ rename-safe | DuckDB catalog | ✓ |
okf-ingest sits on the consume side of the OKF lifecycle. For the produce
side — authoring, maintaining, and visualizing bundles (especially inside Claude
Code) — okf-knowledge is a
nice complement: curate a bundle there, then okf ingest it into a queryable
DuckDB + RAG catalog here. If you only need to lint a bundle, the Rust/Node
validators are great.
Apache-2.0 (matching the OKF reference implementation). See LICENSE.
