xbrain diff — compare two snapshots, surface unexpected drift #18

@VGonPa

After a re-enrichment, it is currently impossible to tell how much changed. Did 2% of items get reassigned, or 40%? Did the topic-page overviews shift drastically? Today: no diff, no answer.

Depends on: snapshot system (#C).

What drift means

Two distinct phenomena show up in a diff:

  1. Expected drift from corpus growth — a topic with +20 items (from 50 to 70) should change its overview. Healthy signal.
  2. Unexpected drift from prompt/model change — same items reassigned to different topics, or an overview rewritten when only 2 items changed. Noise.

The diff surfaces both; the user (or eval, #8) judges which is which.
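
A minimal sketch of how the two cases might be told apart mechanically, assuming each topic exposes its item ids and an overview embedding per snapshot. The function names and both thresholds (`min_similarity`, `max_churn`) are hypothetical placeholders, not part of any existing xbrain API. The flag only points at candidates; the judgment still belongs to the user or the eval.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def item_churn(old_items, new_items):
    """Fraction of the union of item ids that was added or removed."""
    old_items, new_items = set(old_items), set(new_items)
    union = old_items | new_items
    if not union:
        return 0.0
    return len(old_items ^ new_items) / len(union)


def flag_overview_drift(old_items, new_items, old_vec, new_vec,
                        min_similarity=0.8, max_churn=0.1):
    """Flag a topic whose overview moved a lot while its items barely changed.

    Both thresholds are illustrative assumptions and would need tuning.
    """
    churn = item_churn(old_items, new_items)
    similarity = cosine_similarity(old_vec, new_vec)
    return {
        "churn": churn,
        "similarity": similarity,
        "flagged": similarity < min_similarity and churn <= max_churn,
    }
```

With these defaults, a topic growing from 50 to 70 items has a churn of about 0.29 and is never flagged, however much its overview changes. A topic that gains 2 items and gets a very different overview is flagged.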

Mini-spec

`xbrain diff <snapshot-a> [<snapshot-b>]` (default `snapshot-b` = current live state).

Output sections:

  • Items reassigned: count + % of items whose `primary_topic` changed. List the top N most-frequent transitions (e.g. `ai-coding → software-engineering: 12 items`).
  • Topic-level changes: for each topic, items added / removed / unchanged. Flag topics with >10% growth or >10% shrinkage.
  • Overview drift: for each topic, similarity between old and new overview (cosine similarity of embeddings, or LLM-judged similarity if WS3, the enrichment evaluation harness (#8), is available). Flag overviews that changed sharply on small corpus changes.
  • Vocab changes: which slugs were added, removed, or renamed.
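
The mechanical sections above could be computed roughly as follows. This is a sketch under assumptions: each snapshot is modeled as a plain dict mapping item id to `primary_topic`, the reassignment percentage is taken over items present in both snapshots, and `diff_assignments` with its result keys is a hypothetical name. Renames are left out because set differences alone cannot tell a rename from a remove plus an add.

```python
from collections import Counter


def diff_assignments(old, new, top_n=5, threshold=0.10):
    """Compare two {item_id: primary_topic} mappings (assumed snapshot shape)."""
    # Reassignments only make sense for items present in both snapshots.
    shared = old.keys() & new.keys()
    transitions = Counter(
        (old[i], new[i]) for i in shared if old[i] != new[i]
    )
    reassigned = sum(transitions.values())
    pct = 100.0 * reassigned / len(shared) if shared else 0.0

    topics = {}
    for topic in sorted(set(old.values()) | set(new.values())):
        before = {i for i, t in old.items() if t == topic}
        after = {i for i, t in new.items() if t == topic}
        # Relative size change; undefined for a topic that did not exist before.
        change = (len(after) - len(before)) / len(before) if before else None
        topics[topic] = {
            "added": len(after - before),
            "removed": len(before - after),
            "unchanged": len(before & after),
            # Brand-new topics are flagged too (an assumption of this sketch).
            "flagged": change is None or abs(change) > threshold,
        }

    old_vocab, new_vocab = set(old.values()), set(new.values())
    return {
        "reassigned_count": reassigned,
        "reassigned_pct": pct,
        "top_transitions": transitions.most_common(top_n),
        "topics": topics,
        "vocab_added": sorted(new_vocab - old_vocab),
        "vocab_removed": sorted(old_vocab - new_vocab),
    }
```

For example, moving one of two `ai-coding` items to a new `software-engineering` topic yields a single `(ai-coding, software-engineering)` transition, flags `ai-coding` for shrinkage, and lists `software-engineering` under added vocab.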

Optionally `--format json` for machine consumption (CI / WS3 eval can ingest).
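
One possible shape for the JSON output, sketched with the standard `json` module. The field names (`items_reassigned`, `top_transitions`, `vocab`) and the input report layout are assumptions for illustration; the actual schema is still to be decided.

```python
import json


def to_json(report):
    """Serialize a diff report (hypothetical layout) into stable, sorted JSON."""
    payload = {
        "items_reassigned": {
            "count": report["reassigned_count"],
            "percent": round(report["reassigned_pct"], 2),
            # Tuple keys are not valid JSON, so transitions become objects.
            "top_transitions": [
                {"from": src, "to": dst, "items": n}
                for (src, dst), n in report["top_transitions"]
            ],
        },
        "topics": report["topics"],
        "vocab": {
            "added": report["vocab_added"],
            "removed": report["vocab_removed"],
        },
    }
    # sort_keys keeps the output deterministic, which helps when CI diffs it.
    return json.dumps(payload, indent=2, sort_keys=True)


# Example report, hand-written for illustration only.
EXAMPLE_REPORT = {
    "reassigned_count": 12,
    "reassigned_pct": 4.0,
    "top_transitions": [(("ai-coding", "software-engineering"), 12)],
    "topics": {"ai-coding": {"added": 0, "removed": 12, "unchanged": 88}},
    "vocab_added": [],
    "vocab_removed": [],
}
```

Calling `to_json(EXAMPLE_REPORT)` produces a document that a CI job or the WS3 eval could load back with `json.loads`.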

Acceptance

  • Works without WS3 / LLM judge: minimum viable diff is mechanical (counts, set differences, embedding similarity via a small offline model).
  • Adds an optional `--judge` flag that uses the WS3 judge if WS3 (enrichment evaluation harness, #8) is built.
  • Tests cover: snapshot diff with fixture pairs; transitions correctly counted; vocab changes detected.

Why this is not part of WS3

WS3 compares against a fixed gold standard ("is your enrichment good in absolute terms?"). `xbrain diff` compares two runs of your system ("is your enrichment stable across changes?"). Different question, different tool, shared infrastructure.
