
[content-hash 1/5] refactor: record passage_id_scheme in meta.json#330

Open
raoabinav wants to merge 3 commits into StarTrail-org:main from raoabinav:refactor/passage-id-scheme-field

Conversation

@raoabinav
Contributor

@raoabinav raoabinav commented May 20, 2026

Sub-PR 1 of 5 from #329.

Purely additive. Writes a new passage_id_scheme: "sequential" field into the .meta.json produced by both build_index and build_index_from_arrays. Existing index loaders ignore the field, so this changes nothing for any caller.

Also bumps meta_data["version"] from "1.0" to "1.1". No code currently reads version, so the bump is safe; it's documentation of the schema evolution for future migration logic.

Two module-level constants (PASSAGE_ID_SCHEME_SEQUENTIAL, PASSAGE_ID_SCHEME_CONTENT_HASH) document the value space. The content-hash scheme itself lands in sub-PR 2.
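A minimal sketch of what the additive field could look like, assuming the constants hold the string values quoted in this thread. The helper names `write_meta` and `load_meta_legacy` and the `backend_name` key are hypothetical and only illustrate why a loader that reads known keys is unaffected; they are not LEANN's actual functions.

```python
import json
from pathlib import Path
from typing import Optional

# Constant names are from the PR description. The "content-hash" value is an
# assumption based on the CLI flag value mentioned later in the thread.
PASSAGE_ID_SCHEME_SEQUENTIAL = "sequential"
PASSAGE_ID_SCHEME_CONTENT_HASH = "content-hash"


def write_meta(path: Path, extra: Optional[dict] = None) -> dict:
    """Hypothetical helper: write a minimal .meta.json carrying the new field."""
    meta = {
        "version": "1.1",  # bumped from "1.0" per the PR description
        "passage_id_scheme": PASSAGE_ID_SCHEME_SEQUENTIAL,
    }
    meta.update(extra or {})
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta


def load_meta_legacy(path: Path) -> dict:
    """Stand-in for an older loader that reads only the keys it knows about."""
    meta = json.loads(path.read_text(encoding="utf-8"))
    # Unknown keys such as passage_id_scheme are simply never looked at.
    return {"backend_name": meta.get("backend_name")}
```

Because the legacy-style loader never touches the new key, adding it does not change what that loader returns.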

Content-hash passage IDs train (#329)

@ASuresh0524
Collaborator

@raoabinav thanks for the PR! Please fix the CI error on all of the content-hash PRs so I can merge them.

@raoabinav raoabinav force-pushed the refactor/passage-id-scheme-field branch from a922ff4 to 4b0b883 Compare May 25, 2026 18:29
raoabinav added a commit to raoabinav/LEANN that referenced this pull request Jun 1, 2026
Sub-PR 2 of 5 from StarTrail-org#329. Builds on StarTrail-org#330 (which added the meta.json field).

New behavior:
- `LeannBuilder(..., passage_id_scheme="content-hash")` makes add_text() key
  passages by sha256(text)[:16] instead of insertion index. Stable across file
  moves, reorderings, and re-runs of the same corpus.
- `leann build --id-scheme content-hash` exposes it at the CLI.
- Default unchanged ("sequential"). Existing indexes continue to work
  identically; no migration triggered.

Identical-text chunks collide (same hash). For this sub-PR the second
occurrence overwrites the first in the offset map — that's the dedup
behavior I'd want by default. A `--preserve-duplicates` escape hatch can
land later if needed (see the open question in StarTrail-org#329).
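The ID derivation and the overwrite-on-collision behavior described above can be sketched as follows. This assumes `sha256(text)[:16]` means the first 16 hex characters of the digest of the UTF-8 encoded text; the function names and the newline-delimited offset layout are hypothetical, not the actual LEANN code.

```python
import hashlib
from typing import Dict, Iterable


def content_hash_id(text: str) -> str:
    # Assumption: UTF-8 encoding, hex digest truncated to 16 characters.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def build_offset_map(chunks: Iterable[str]) -> Dict[str, int]:
    """Map each passage ID to the byte offset where its chunk starts.

    Chunks are treated as if written one per line, so each one occupies
    its UTF-8 length plus one byte for the newline (illustrative layout).
    """
    offsets: Dict[str, int] = {}
    position = 0
    for chunk in chunks:
        # A repeated chunk has the same ID, so this assignment overwrites
        # the earlier entry: the later occurrence wins.
        offsets[content_hash_id(chunk)] = position
        position += len(chunk.encode("utf-8")) + 1
    return offsets
```

With the chunks `["alpha", "beta", "alpha"]` the map ends up with two entries, and the entry for `"alpha"` points at the second occurrence. The ID depends only on the text, which is why it survives reordering and re-runs.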
@raoabinav raoabinav force-pushed the refactor/passage-id-scheme-field branch from 4b0b883 to 2a308f2 Compare June 1, 2026 18:18
@raoabinav
Contributor Author

Superseded by #347, which consolidates the content-hash implementation on current main and adds migration coverage. This split PR can be closed after #347 is accepted.

raoabinav added 2 commits June 2, 2026 19:43
Sub-PR 1 of 5 from the plan in StarTrail-org#329. Purely additive — no behavior change
for any caller, existing index loaders ignore the field.

Writes a new `passage_id_scheme: "sequential"` field into the .meta.json
produced by both build_index and build_index_from_arrays. Bumps version
to "1.1" for human-inspectable schema tracking (no code reads version today,
so the bump is safe).

Module-level constants PASSAGE_ID_SCHEME_SEQUENTIAL / _CONTENT_HASH document
the value space; the content-hash scheme itself ships in sub-PR 2.
@raoabinav raoabinav force-pushed the refactor/passage-id-scheme-field branch from 3d5ed48 to b7e7c5c Compare June 3, 2026 02:45
raoabinav added a commit to raoabinav/LEANN that referenced this pull request Jun 3, 2026
Sub-PR 2 of 5 from StarTrail-org#329. Builds on StarTrail-org#330 (which added the meta.json field).

New behavior:
- `LeannBuilder(..., passage_id_scheme="content-hash")` makes add_text() key
  passages by sha256(text)[:16] instead of insertion index. Stable across file
  moves, reorderings, and re-runs of the same corpus.
- `leann build --id-scheme content-hash` exposes it at the CLI.
- Default unchanged ("sequential"). Existing indexes continue to work
  identically; no migration triggered.

Identical-text chunks collide (same hash). For this sub-PR the second
occurrence overwrites the first in the offset map — that's the dedup
behavior I'd want by default. A `--preserve-duplicates` escape hatch can
land later if needed (see the open question in StarTrail-org#329).
raoabinav added a commit to raoabinav/LEANN that referenced this pull request Jun 3, 2026
Sub-PR 3 of 5 from StarTrail-org#329. Builds on StarTrail-org#330 / StarTrail-org#331.

Two changes:
1. `LeannCLI._make_incremental_builder` now reads the existing index's
   `passage_id_scheme` from meta.json and uses that, ignoring any conflicting
   `--id-scheme` on the args (with a note printed). Otherwise an update
   command on a content-hash index would mix sequential IDs into a hash-keyed
   passages.jsonl and break lookups.
2. `LeannSearcher` exposes `self.passage_id_scheme` so consumers can
   introspect; defaults to "sequential" for older indexes that don't record
   it (pre-StarTrail-org#330).

No behavior change for fresh builds — the CLI's --id-scheme still controls
which scheme a brand-new index gets.
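A rough sketch of the resolution rule in this commit message: on an update, the scheme recorded in meta.json wins over a conflicting `--id-scheme`, and an index without the field is treated as "sequential". The function names and the wording of the printed note are hypothetical; only the field name and the default come from the thread.

```python
import json
from pathlib import Path
from typing import Optional

DEFAULT_SCHEME = "sequential"  # fallback for indexes built before the field existed


def read_scheme(meta_path: Path) -> str:
    """Return the recorded scheme, defaulting when the field is absent."""
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return meta.get("passage_id_scheme", DEFAULT_SCHEME)


def resolve_scheme_for_update(meta_path: Path, requested: Optional[str]) -> str:
    """Pick the scheme for an incremental update of an existing index."""
    existing = read_scheme(meta_path)
    if requested is not None and requested != existing:
        # Mixing schemes would put sequential IDs into a hash-keyed
        # passages file, so the existing index's scheme takes priority.
        print(f"Note: index uses '{existing}'; ignoring --id-scheme {requested}")
    return existing
```

Keeping the fallback in one reader function means the update path and any searcher-style consumer agree on how older indexes are interpreted.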