test: strengthen primitive invariants and search artifacts by jamon8888 · Pull Request #1 · jamon8888/anonymize

jamon8888 · 2026-06-26T12:27:26Z

Port of PR stella#217 from stella/anonymize

test: strengthen primitive invariants
test: gate native fixture parity
chore: pin text-search overlap artifact fix
test: strengthen search artifact properties
chore: pin text-search artifact identity fix

Summary by CodeRabbit

New Features
- Expanded anonymization support with improved detection for dates, money amounts, legal forms, names, addresses, coreferences, hotwords, and false-positive filtering.
- Added redaction/deanonymization improvements, placeholder generation, boundary cleanup, and optional diagnostics for deeper review.
- Enabled packaged native artifacts and broader Rust/Python version syncing for more consistent builds.
Bug Fixes
- Improved entity matching accuracy and overlap handling, reducing missed or duplicated redactions.
- Added additional validation around search artifacts and packaged outputs.

sourcery-ai

Sorry, we are unable to review this pull request

The GitHub API does not allow us to fetch diffs exceeding 20000 lines

coderabbitai · 2026-06-26T12:27:44Z

Warning

Review limit reached

@jamon8888, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 4 minutes and 52 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3bd1d037-00e6-43bb-89cf-f9d99234d29f

📥 Commits

Reviewing files that changed from the base of the PR and between fdff5b3 and 6ed68bc.

📒 Files selected for processing (16)

.github/tools/sync-runtime-version.mjs
.github/workflows/ci.yml
.github/workflows/dependency-review.yml
crates/anonymize-adapter-contract/src/lib.rs
crates/anonymize-core/src/legal_forms.rs
crates/anonymize-core/src/name_corpus.rs
crates/anonymize-core/src/normalize.rs
crates/anonymize-core/src/prepared.rs
crates/anonymize-core/src/processors.rs
crates/anonymize-core/src/redact.rs
crates/anonymize-core/src/resolution/boundary.rs
crates/anonymize-core/src/resolution/sanitize.rs
crates/anonymize-core/src/resolution/types.rs
crates/anonymize-core/src/search.rs
docs/superpowers/specs/2026-06-26-gliner2-pii-integration-design.md
docs/superpowers/specs/2026-06-26-gliner2-pii-integration-rust-design.md

📝 Walkthrough

Updated workspace, CI, and dependency-policy configuration. Added core anonymization pipeline modules for search, normalization, entity processing, detectors, cleanup, prepared-search orchestration, and adapter conversion.

Changes

Build and release configuration

Layer / File(s)	Summary
Workspace manifests and lint policy `Cargo.toml`, `.cargo/config.toml`, `clippy.toml`, `.gitignore`, `crates/anonymize-core/Cargo.toml`, `crates/anonymize-adapter-contract/Cargo.toml`	Workspace membership, shared package metadata, lint policies, build profiles, crate manifests, and ignore rules were updated.
CI, version sync, and packlist `.github/tools/{check-packlist.mjs, sync-runtime-version.mjs}`, `.github/workflows/{ci.yml, dependency-review.yml}`	Version synchronization, dependency-review policy, packlist checks, and CI workflow steps were updated.

Anonymization pipeline

Layer / File(s)	Summary
Shared types and search foundation `crates/anonymize-core/src/{lib.rs, resolution/types.rs, byte_offsets.rs, artifact_bytes.rs, search.rs, normalize.rs, placeholders.rs, anchored.rs}`, `crates/anonymize-core/data/identifier-cues.txt`	Shared pipeline types, search/normalization/serialization primitives, placeholder reuse, anchored extraction scaffolding, and identifier-cue data were added.
Match processors `crates/anonymize-core/src/{processors.rs, false_positives.rs}`	Regex, deny-list, gazetteer, country, and false-positive processing were added for pipeline entities.
Anchored detectors `crates/anonymize-core/src/{dates.rs, money.rs, hotwords.rs}`	Anchored date, monetary amount, and hotword detectors were added.
Address and legal-form detectors `crates/anonymize-core/src/{address_context.rs, address_seeds.rs, legal_forms.rs}`	Address context, address seed, and legal-form detection from configured text and match heuristics were added.
Coreference and name corpus `crates/anonymize-core/src/{coreference.rs, name_corpus.rs}`	Coreference propagation and supplemental name detection from configured term lists and span similarity rules were added.
Boundary cleanup and redaction `crates/anonymize-core/src/{resolution/common.rs, resolution/boundary.rs, resolution/merge.rs, resolution/sanitize.rs, resolution/mod.rs, redact.rs}`, `crates/anonymize-core/data/{address-final-abbrevs.txt, legal-period-suffixes.txt}`	Boundary normalization, entity merge/dedup, sanitization, and redaction logic were added with supporting suffix data.
Prepared search bridge `crates/anonymize-core/src/prepared.rs`, `crates/anonymize-adapter-contract/{src/lib.rs, examples/native_adapter_perf.rs}`	Prepared-search orchestration and adapter conversion code were added.

Sequence Diagram(s)

sequenceDiagram
  participant native_adapter_perf
  participant PreparedSearch
  participant SearchIndex
  participant merge_and_dedup
  participant redact_text
  native_adapter_perf->>PreparedSearch: redact_static_entities(full_text, operators)
  PreparedSearch->>SearchIndex: find_matches(full_text)
  PreparedSearch->>merge_and_dedup: merge_and_dedup(entities)
  PreparedSearch->>redact_text: redact_text(full_text, entities, config)
  PreparedSearch-->>native_adapter_perf: RedactionResult

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I hopped through lints with a carrot grin,
through search maps, redactions, and version sync spin.
New burrows of dates and names now gleam,
the cozy code garden now hums in a stream.
Hop-hop—safe trails for every entity dream.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 6.30% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title is concise and names real parts of the change set, especially search artifacts and test/invariant hardening.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch pr-217

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands.

github-actions · 2026-06-26T12:38:49Z

Dependency Review

The following issues were found:

✅ 0 vulnerable package(s)
✅ 0 package(s) with incompatible licenses
✅ 0 package(s) with invalid SPDX license definitions
⚠️ 32 package(s) with unknown licenses.
⚠️ 3 packages with OpenSSF Scorecard issues.

View full job summary

coderabbitai

Actionable comments posted: 18

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/tools/sync-runtime-version.mjs:
- Around line 96-100: The version-sync regexes in syncTextVersion and the
related Cargo.toml/Cargo.lock patterns only match LF newlines, so CRLF checkouts
fail to sync correctly. Update the regexes used in sync-runtime-version.mjs to
tolerate Windows line endings, either by matching \r?\n in the relevant capture
patterns or by normalizing file text before applying them, so the existing sync
logic works across environments.

In @.github/workflows/ci.yml:
- Around line 84-99: The two parity steps in the CI workflow are hardcoding a
fetch of origin main, which can compare PRs against the wrong baseline on
non-main targets. Update the Migration fixture parity and Native fixture parity
steps to use the PR base branch dynamically via github.base_ref, with a fallback
to main, so the git fetch in these workflow jobs always matches the actual PR
target branch.

In @.github/workflows/dependency-review.yml:
- Around line 45-48: The allow-dependencies-licenses entries are currently broad
package waivers, so scope each Cargo package in dependency-review-action by
adding the specific license qualifier via the dependency-review workflow
configuration. Update the allow list in dependency-review.yml for
stella-aho-corasick-core, stella-fuzzy-search-core, and stella-regex-set-core to
use the appropriate ?license=<identifier> form, or remove them entirely if the
existing Apache-2.0 WITH LLVM-exception allowance already covers the intended
cases.

In `@crates/anonymize-adapter-contract/src/lib.rs`:
- Around line 741-743: prepared_search_package_digest currently returns the
header-declared digest from prepared_search_package_parts without validating it
against the payload, so update it to verify the package before returning the
digest. Reuse the existing verification flow used by
prepared_search_package_from_bytes, specifically
verify_prepared_search_package_digest, and ensure the function only returns the
digest after the payload integrity check succeeds. Keep the change localized to
prepared_search_package_digest and related helpers in this module.

In `@crates/anonymize-core/src/legal_forms.rs`:
- Around line 1358-1363: The uppercase check in the prefix-trimming logic is
being applied to the lowercased buffer, so it can never reliably detect the
original capitalization. Update the trimming path in the legal form parsing
logic to inspect the corresponding character(s) from the original text buffer
instead of the lowercased one, keeping the existing whitespace handling in
place. Use the same `leading_ws_len`/`after` flow, but base the
`char::is_uppercase` guard on the original source string so direct-prefix
trimming only succeeds when the original text actually has an uppercase
continuation.
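
A minimal sketch of the legal_forms.rs guard described above, under stated assumptions: `can_trim_direct_prefix` is a hypothetical helper, not the crate's function, and it assumes ASCII lowercasing so that byte offsets in the lowercased buffer line up with the original text. The point it illustrates is that matching happens on the lowercased copy while the `char::is_uppercase` check reads the original string.

```rust
/// Hypothetical helper (not from the crate): reports whether a legal-form
/// prefix may be trimmed from `original`. Matching uses a lowercased copy,
/// but the capitalization guard inspects the original text.
fn can_trim_direct_prefix(original: &str, prefix_lower: &str) -> bool {
    // Assumption: ASCII lowercasing keeps byte offsets aligned between the
    // original and the lowercased buffer.
    let lowered = original.to_ascii_lowercase();
    if !lowered.starts_with(prefix_lower) {
        return false;
    }
    // Slice the ORIGINAL text after the prefix, not the lowercased copy.
    let after = &original[prefix_lower.len()..];
    let leading_ws_len = after.len() - after.trim_start().len();
    after[leading_ws_len..]
        .chars()
        .next()
        .is_some_and(char::is_uppercase)
}
```

With this shape, `can_trim_direct_prefix("GmbH Acme", "gmbh")` succeeds while `can_trim_direct_prefix("GmbH acme", "gmbh")` does not. Running the same guard on `lowered` would reject both.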

In `@crates/anonymize-core/src/name_corpus.rs`:
- Around line 557-565: The low-confidence person-name branch in name scoring is
being lost because the final `(score >=
HIGH_CONFIDENCE_NAME_SCORE).then_some(score)` gate filters it out. Update the
scoring flow in the `name_corpus` logic so the `LOW_CONFIDENCE_NAME_SCORE` path
is returned/emitted instead of being discarded, while keeping the existing
high-confidence behavior for stronger matches.
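
The gating problem in the name_corpus.rs item can be illustrated with a small sketch. The threshold values and the `name_score` function below are assumptions made for illustration. Only the constant names come from the review comment.

```rust
// Hypothetical values; the real constants live in name_corpus.rs.
const HIGH_CONFIDENCE_NAME_SCORE: f64 = 0.9;
const LOW_CONFIDENCE_NAME_SCORE: f64 = 0.5;

/// Scores a candidate span and keeps the low-confidence branch alive.
fn name_score(strong_signal: bool, weak_signal: bool) -> Option<f64> {
    let score = if strong_signal {
        HIGH_CONFIDENCE_NAME_SCORE
    } else if weak_signal {
        LOW_CONFIDENCE_NAME_SCORE
    } else {
        return None;
    };
    // Gating on the LOW threshold lets both branches through. Gating on
    // HIGH_CONFIDENCE_NAME_SCORE here would silently drop the weak branch.
    (score >= LOW_CONFIDENCE_NAME_SCORE).then_some(score)
}
```

Callers that need to distinguish the two tiers can compare the returned score against `HIGH_CONFIDENCE_NAME_SCORE` themselves.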

In `@crates/anonymize-core/src/normalize.rs`:
- Around line 296-304: The `find_ethereum_address` matcher is too permissive
because it accepts any 42-character slice after `0x`, even when the address is
embedded in a longer hex token. Update `find_ethereum_address` in `normalize.rs`
to use token-boundary checks like the Bech32/Base58 matching paths, so it only
returns an Ethereum address when the `0x`-prefixed 40-hex sequence is standalone
and not followed by additional hex characters. Ensure the logic still validates
the hex payload while rejecting longer embedded tokens.
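
One possible shape for the token-boundary check requested for normalize.rs, as a sketch only. It reuses the name `find_ethereum_address` from the comment, but the signature and the byte-range return value are assumptions, and it treats any adjacent ASCII alphanumeric character as part of a longer token.

```rust
/// Finds the first standalone `0x` + 40 hex digit token in `text` and
/// returns its byte range. Sketch only; not the crate's implementation.
fn find_ethereum_address(text: &str) -> Option<(usize, usize)> {
    let bytes = text.as_bytes();
    let mut i = 0;
    while i + 42 <= bytes.len() {
        let is_prefix = bytes[i] == b'0' && (bytes[i + 1] == b'x' || bytes[i + 1] == b'X');
        if is_prefix && bytes[i + 2..i + 42].iter().all(u8::is_ascii_hexdigit) {
            // Left boundary: `0x` must not continue an alphanumeric token.
            let left_ok = i == 0 || !bytes[i - 1].is_ascii_alphanumeric();
            // Right boundary: no further hex or other alphanumeric characters.
            let right_ok = bytes
                .get(i + 42)
                .map_or(true, |b| !b.is_ascii_alphanumeric());
            if left_ok && right_ok {
                return Some((i, i + 42));
            }
        }
        i += 1;
    }
    None
}
```

A 42-character address surrounded by spaces or punctuation is returned. The same characters followed by extra hex digits are rejected, because the right-boundary check fails.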

In `@crates/anonymize-core/src/prepared.rs`:
- Around line 1767-1793: Validate the compact literal slice bounds against the
artifact-backed literal index even when literal_patterns is empty. Update the
guard in prepared.rs around the validate_slice_bounds calls so the checks for
slices.deny_list, slices.street_types, slices.gazetteer, slices.countries, and
slices.hotwords still run whenever literal artifacts are enabled, using the
actual literals.len() or equivalent artifact index length rather than relying
only on config.literal_patterns.len().
- Around line 1590-1601: split_regex_patterns is routing any unclaimed pattern
into regex, which can let indexes outside slices.regex or overlapping other
slices slip through and desync regex_meta from prepared artifacts. Update
split_regex_patterns to validate each pattern_index against the declared slices
before pushing it, using the existing pattern_index, slices.legal_forms, and
slices.triggers checks plus an explicit slices.regex membership/range check, and
reject or error on any pattern that does not belong to exactly one declared
slice.
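
The "exactly one declared slice" rule for split_regex_patterns could look roughly like the sketch below. The `Slices` layout, the `Bucket` enum, and the `classify` function are hypothetical. Only the slice names (`regex`, `legal_forms`, `triggers`) and `pattern_index` come from the comment.

```rust
use std::ops::Range;

/// Hypothetical slice layout; field names follow the review comment.
struct Slices {
    regex: Range<usize>,
    legal_forms: Range<usize>,
    triggers: Range<usize>,
}

#[derive(Debug, PartialEq)]
enum Bucket {
    Regex,
    LegalForms,
    Triggers,
}

/// Assigns `pattern_index` to the single slice that contains it, and errors
/// when it falls in no slice or in more than one.
fn classify(pattern_index: usize, slices: &Slices) -> Result<Bucket, String> {
    let candidates = [
        (Bucket::Regex, &slices.regex),
        (Bucket::LegalForms, &slices.legal_forms),
        (Bucket::Triggers, &slices.triggers),
    ];
    let mut owner = None;
    for (bucket, range) in candidates {
        if range.contains(&pattern_index) {
            if owner.is_some() {
                return Err(format!("pattern {pattern_index} is claimed by overlapping slices"));
            }
            owner = Some(bucket);
        }
    }
    owner.ok_or_else(|| format!("pattern {pattern_index} is outside every declared slice"))
}
```

Routing a pattern into the regex bucket only after `classify` returns `Ok(Bucket::Regex)` keeps unclaimed or doubly claimed indexes from reaching `regex_meta`.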

In `@crates/anonymize-core/src/processors.rs`:
- Around line 1842-1857: The gazetteer span expansion in the processor logic is
too permissive because it extends any exact hit followed by a short
space-delimited token, so tighten the check before mutating end in the relevant
span-extension path. Update the logic around offsets.slice,
next_space_offset_after_initial, and the GazetteerExtension return to require
artifact-backed extension metadata or a stricter validation of the suffix token
before returning the expanded span. Keep the existing exact-hit handling, but
only allow extension when the extra token is explicitly supported by the
gazetteer artifact metadata or a stronger suffix rule.
- Around line 107-113: `StringGroups` currently derives `Deserialize`, which
bypasses the bounds checks performed by `from_table_indices` and can let invalid
group indexes slip through. Replace the derived deserialization on
`StringGroups` with a manual `Deserialize` implementation that first reads
`table` and `groups`, then reconstructs the value through `from_table_indices`
(or the same validation logic) so any out-of-range indices fail immediately.
Keep `StringGroup::iter` unchanged and use `StringGroups` as the validation
entry point.
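
For the `StringGroups` item, the validation that a manual `Deserialize` implementation would delegate to might look like the following. This is a std-only sketch: the field types are assumptions, the serde wiring is deliberately left out, and only the names `StringGroups`, `table`, `groups`, and `from_table_indices` come from the comment.

```rust
use std::convert::TryFrom;

/// Hypothetical shape: a string table plus groups of indices into it.
#[derive(Debug, PartialEq)]
struct StringGroups {
    table: Vec<String>,
    groups: Vec<Vec<u32>>,
}

impl StringGroups {
    /// Validating constructor. A manual `Deserialize` impl could read the raw
    /// `table` and `groups` first and then delegate here.
    fn from_table_indices(table: Vec<String>, groups: Vec<Vec<u32>>) -> Result<Self, String> {
        for (group_idx, group) in groups.iter().enumerate() {
            for &index in group {
                let in_range = usize::try_from(index).map_or(false, |i| i < table.len());
                if !in_range {
                    return Err(format!(
                        "group {group_idx} references index {index}, but the table has {} entries",
                        table.len()
                    ));
                }
            }
        }
        Ok(Self { table, groups })
    }
}
```

Because every construction path goes through the same check, an out-of-range index fails at load time instead of surfacing later during iteration.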

In `@crates/anonymize-core/src/redact.rs`:
- Around line 95-97: The deanonymise logic in redact.rs is doing repeated global
replacements on the growing output, which can recursively expand
placeholder-like text inside restored values. Update the loop in deanonymise to
base replacements on the original redacted_text and apply each redaction_map
entry in a single pass so previously restored content is not rewritten by later
iterations; use the existing deanonymise/redaction_map flow and keep the
placeholder-to-original mapping intact.
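
A single-pass restore, as suggested for redact.rs, can be sketched as below. The function name `deanonymise` and the `redaction_map` parameter follow the comment. The `HashMap<String, String>` type and the bracketed placeholder format in the usage note are assumptions.

```rust
use std::collections::HashMap;

/// Restores placeholders in one left-to-right pass over the ORIGINAL
/// redacted text, so restored values are never rescanned.
fn deanonymise(redacted_text: &str, redaction_map: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(redacted_text.len());
    let mut rest = redacted_text;
    while !rest.is_empty() {
        // Pick the longest placeholder that starts at the current position.
        let hit = redaction_map
            .iter()
            .filter(|(placeholder, _)| {
                !placeholder.is_empty() && rest.starts_with(placeholder.as_str())
            })
            .max_by_key(|(placeholder, _)| placeholder.len());
        match hit {
            Some((placeholder, original)) => {
                out.push_str(original);
                rest = &rest[placeholder.len()..];
            }
            None => {
                // Copy one whole character to stay on UTF-8 boundaries.
                let ch = rest.chars().next().expect("rest is non-empty");
                out.push(ch);
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}
```

If a restored value happens to contain text that looks like another placeholder (for example an original of `Alice [ORG_1]`), that text is copied verbatim and not expanded again. The scan is quadratic in the worst case. A production version would more likely tokenize placeholders once.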

In `@crates/anonymize-core/src/resolution/boundary.rs`:
- Around line 81-149: `resolve_cross_label_overlaps` mutates entity boundaries
in place but continues to assume `sorted` stays ordered by `start`, which can
break later overlap checks in chained/3-way overlap cases. After changing
`right_mut.start` or `left_mut.end`, re-establish ordering for the affected
portion (or restart the sweep from the modified region) before continuing, so
subsequent comparisons in `resolve_cross_label_overlaps` only evaluate truly
overlapping spans and do not truncate unrelated entities.
- Around line 177-226: The locked-entity branch in boundary resolution is not
updating the same-label tracking, so a later span can still merge past a locked
entity and cause it to be dropped by remove_nested_same_label. In the main loop
in boundary merging logic, make the has_locked_boundary(entity) path also
refresh last_by_label for entity.label to the newly appended index so locked
spans become merge barriers for subsequent same-label entities. Keep the fix
localized to the sorted-entity iteration and the merge decision flow that uses
last_by_label, gap_occupied, and merge_into_previous.

In `@crates/anonymize-core/src/resolution/merge.rs`:
- Line 74: The source-less sanitization path in merge resolution should not
adjust byte offsets because it may rewrite entities based on entity.text instead
of the original source slice. Update the merge flow around
resolve_same_span_label_conflicts to keep this pass text-only, and move any
offset-changing cleanup to sanitize_entities_with_source when full_text is
available. Make sure the entities coming from boundary handling remain aligned
with their original start/end bytes before any redaction step.

In `@crates/anonymize-core/src/resolution/types.rs`:
- Around line 27-35: Add a Serde rename rule to SourceDetail so its serialized
form matches the adapter contract’s kebab-case mapping. Update the enum
definition for SourceDetail to use #[serde(rename_all = "kebab-case")] alongside
the existing derives, ensuring variants like CustomDenyList serialize
consistently as kebab-case in all paths such as diagnostics and logs.

In `@crates/anonymize-core/src/search.rs`:
- Around line 227-254: The empty-pattern branch in new_with_artifacts is
incorrectly reusing any non-empty artifacts.slots as an all-literal search,
which can restore stale detector state instead of rejecting it. Update
SearchIndex::new_with_artifacts to only take the all-literal path when the
artifacts are explicitly known to be literal-compatible, and otherwise fail the
parity check when patterns is empty but persisted artifacts remain. Use the
existing new_with_artifacts, new_all_literal_with_artifacts, and
SearchIndexArtifacts/slots flow to locate and tighten this decision.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 55181e53-7cb0-4ab9-b422-356991e3bdfc

📥 Commits

Reviewing files that changed from the base of the PR and between 4924c74 and fdff5b3.

⛔ Files ignored due to path filters (2)

Cargo.lock is excluded by !**/*.lock
bun.lock is excluded by !**/*.lock

📒 Files selected for processing (149)

.cargo/config.toml
.github/tools/check-packlist.mjs
.github/tools/sync-runtime-version.mjs
.github/workflows/ci.yml
.github/workflows/dependency-review.yml
.gitignore
Cargo.toml
clippy.toml
crates/anonymize-adapter-contract/Cargo.toml
crates/anonymize-adapter-contract/examples/native_adapter_perf.rs
crates/anonymize-adapter-contract/src/lib.rs
crates/anonymize-core/Cargo.toml
crates/anonymize-core/data/address-final-abbrevs.txt
crates/anonymize-core/data/identifier-cues.txt
crates/anonymize-core/data/legal-period-suffixes.txt
crates/anonymize-core/src/address_context.rs
crates/anonymize-core/src/address_seeds.rs
crates/anonymize-core/src/anchored.rs
crates/anonymize-core/src/artifact_bytes.rs
crates/anonymize-core/src/byte_offsets.rs
crates/anonymize-core/src/coreference.rs
crates/anonymize-core/src/dates.rs
crates/anonymize-core/src/diagnostics.rs
crates/anonymize-core/src/false_positives.rs
crates/anonymize-core/src/hotwords.rs
crates/anonymize-core/src/legal_forms.rs
crates/anonymize-core/src/lib.rs
crates/anonymize-core/src/money.rs
crates/anonymize-core/src/name_corpus.rs
crates/anonymize-core/src/normalize.rs
crates/anonymize-core/src/placeholders.rs
crates/anonymize-core/src/prepared.rs
crates/anonymize-core/src/processors.rs
crates/anonymize-core/src/redact.rs
crates/anonymize-core/src/resolution/boundary.rs
crates/anonymize-core/src/resolution/common.rs
crates/anonymize-core/src/resolution/merge.rs
crates/anonymize-core/src/resolution/mod.rs
crates/anonymize-core/src/resolution/sanitize.rs
crates/anonymize-core/src/resolution/types.rs
crates/anonymize-core/src/search.rs
crates/anonymize-core/src/signatures.rs
crates/anonymize-core/src/triggers.rs
crates/anonymize-core/src/types.rs
crates/anonymize-core/src/validators.rs
crates/anonymize-core/src/zones.rs
crates/anonymize-core/tests/address_seed_parity.rs
crates/anonymize-core/tests/false_positive_parity.rs
crates/anonymize-core/tests/normalize.rs
crates/anonymize-core/tests/prepared.rs
crates/anonymize-core/tests/primitives_properties.rs
crates/anonymize-core/tests/processors.rs
crates/anonymize-core/tests/redaction.rs
crates/anonymize-core/tests/resolution.rs
crates/anonymize-core/tests/search.rs
crates/anonymize-core/tests/trigger_parity.rs
crates/anonymize-napi/Cargo.toml
crates/anonymize-napi/build.rs
crates/anonymize-napi/src/lib.rs
crates/anonymize-py/Cargo.toml
crates/anonymize-py/build.rs
crates/anonymize-py/pyproject.toml
crates/anonymize-py/src/lib.rs
package.json
packages/anonymize/.gitignore
packages/anonymize/README.md
packages/anonymize/index.cjs
packages/anonymize/package.json
packages/anonymize/scripts/build-native-node.mjs
packages/anonymize/scripts/build-native-pipeline-package.mjs
packages/anonymize/scripts/dist-smoke.mjs
packages/anonymize/scripts/migration-fixture-perf.mjs
packages/anonymize/scripts/native-adapter-perf.mjs
packages/anonymize/src/__test__/countries.test.ts
packages/anonymize/src/__test__/dictionary-bundle.test.ts
packages/anonymize/src/__test__/load-dictionaries.ts
packages/anonymize/src/__test__/native-adapter-parity.test.ts
packages/anonymize/src/__test__/native-node.test.ts
packages/anonymize/src/__test__/pipeline-config.test.ts
packages/anonymize/src/build-unified-search.ts
packages/anonymize/src/context.ts
packages/anonymize/src/data/address-boundaries.json
packages/anonymize/src/data/address-context.json
packages/anonymize/src/data/address-jurisdiction-prefixes.json
packages/anonymize/src/data/address-stop-keywords.json
packages/anonymize/src/data/address-unit-abbreviations.json
packages/anonymize/src/data/ambiguous-country-surfaces.json
packages/anonymize/src/data/clause-noun-heads.json
packages/anonymize/src/data/coreference-org-determiners.json
packages/anonymize/src/data/defined-term-heads.json
packages/anonymize/src/data/deny-list-filters.json
packages/anonymize/src/data/false-positive-shapes.json
packages/anonymize/src/data/language-scopes.json
packages/anonymize/src/data/legal-form-rule-words.json
packages/anonymize/src/data/legal-role-heads.cs.json
packages/anonymize/src/data/name-corpus-cjk.json
packages/anonymize/src/data/name-corpus-particles.json
packages/anonymize/src/data/organization-indicators.json
packages/anonymize/src/data/organization-unit-heads.json
packages/anonymize/src/data/person-stopwords.json
packages/anonymize/src/data/signing-clauses.json
packages/anonymize/src/detectors/address-seeds.ts
packages/anonymize/src/detectors/countries.ts
packages/anonymize/src/detectors/deny-list.ts
packages/anonymize/src/detectors/legal-forms.ts
packages/anonymize/src/detectors/regex.ts
packages/anonymize/src/detectors/triggers.ts
packages/anonymize/src/filters/confidence-boost.ts
packages/anonymize/src/filters/false-positives.ts
packages/anonymize/src/filters/hotword-rules.ts
packages/anonymize/src/index-shared.ts
packages/anonymize/src/language-scope.ts
packages/anonymize/src/native-default-config.ts
packages/anonymize/src/native-node.ts
packages/anonymize/src/native-pipeline.ts
packages/anonymize/src/native.ts
packages/anonymize/src/pipeline-cache-key.ts
packages/anonymize/src/pipeline.ts
packages/anonymize/src/types.ts
packages/anonymize/tsdown.config.ts
packages/anonymize/wasm/package.json
packages/cli/package.json
packages/cli/src/dictionary-scope.ts
packages/corpus/package.json
packages/data/config/address-boundaries.json
packages/data/config/address-context.json
packages/data/config/address-jurisdiction-prefixes.json
packages/data/config/address-stop-keywords.json
packages/data/config/address-unit-abbreviations.json
packages/data/config/ambiguous-country-surfaces.json
packages/data/config/clause-noun-heads.json
packages/data/config/coreference-org-determiners.json
packages/data/config/defined-term-heads.json
packages/data/config/deny-list-filters.json
packages/data/config/false-positive-shapes.json
packages/data/config/language-scopes.json
packages/data/config/legal-form-rule-words.json
packages/data/config/legal-role-heads.cs.json
packages/data/config/name-corpus-cjk.json
packages/data/config/name-corpus-particles.json
packages/data/config/organization-indicators.json
packages/data/config/organization-unit-heads.json
packages/data/config/person-stopwords.json
packages/data/config/signing-clauses.json
packages/data/dictionaries/index.ts
packages/data/package.json
rust-toolchain.toml
rustfmt.toml
turbo.json

coderabbitai · 2026-06-26T12:45:59Z

+fn nearest_right_non_address(
+  right_pos: usize,
+  existing_entities: &[PipelineEntity],
+) -> Option<usize> {
+  existing_entities
+    .iter()
+    .filter(|entity| non_address_label(&entity.label))
+    .filter_map(|entity| {
+      let start = usize::try_from(entity.start).ok()?;
+      let offset = start.saturating_sub(right_pos);
+      (offset > 0).then_some(offset)
+    })


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Stop expansion when a non-address entity starts at right_pos.

Line 981 ignores an adjacent non-address entity because offset == 0 is filtered out, allowing address expansion to run into that span.

Proposed fix

-      let offset = start.saturating_sub(right_pos);
-      (offset > 0).then_some(offset)
+      (start >= right_pos).then_some(start.saturating_sub(right_pos))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 fn nearest_right_non_address(
   right_pos: usize,
   existing_entities: &[PipelineEntity],
 ) -> Option<usize> {
   existing_entities
     .iter()
     .filter(|entity| non_address_label(&entity.label))
     .filter_map(|entity| {
       let start = usize::try_from(entity.start).ok()?;
-      let offset = start.saturating_sub(right_pos);
-      (offset > 0).then_some(offset)
+      (start >= right_pos).then_some(start.saturating_sub(right_pos))
     })
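
To make the effect of the suggested change concrete, here is a self-contained sketch. `PipelineEntity` is reduced to two assumed fields, `non_address_label` is a stand-in predicate, and the trailing `.min()` is an assumption about the part of the function that the excerpt cuts off.

```rust
use std::convert::TryFrom;

/// Minimal stand-in for the crate's entity type (assumed fields).
struct PipelineEntity {
    start: u32,
    label: String,
}

/// Assumed predicate: anything not labelled "address" blocks expansion.
fn non_address_label(label: &str) -> bool {
    label != "address"
}

/// Distance from `right_pos` to the nearest non-address entity that starts at
/// or after it. The `.min()` is an assumption about the elided code.
fn nearest_right_non_address(
    right_pos: usize,
    existing_entities: &[PipelineEntity],
) -> Option<usize> {
    existing_entities
        .iter()
        .filter(|entity| non_address_label(&entity.label))
        .filter_map(|entity| {
            let start = usize::try_from(entity.start).ok()?;
            // `start >= right_pos` keeps offset 0 (an adjacent entity) and
            // still skips entities that begin to the left of `right_pos`.
            (start >= right_pos).then_some(start.saturating_sub(right_pos))
        })
        .min()
}
```

With the original `offset > 0` filter, an entity starting exactly at `right_pos` would be dropped. Here it yields `Some(0)`, which tells the caller there is no room to expand.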

coderabbitai · 2026-06-26T12:46:00Z

+    insert_at_or_push(&mut merged, insert_at, entity);
+  }
+
+  resolve_same_span_label_conflicts(&sanitize_entities(&merged))


🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Do not shift byte offsets in the source-less sanitize pass.

This call sanitizes from entity.text instead of the original source slice. That is not a safe invariant here: boundary.rs Lines 46-49 already handle entities whose text differs from full_text[start..end]. If one of those entities reaches this path, sanitize_entities can move start/end away from the real bytes, and later redaction will slice the wrong region. Keep this pass text-only, or defer offset-changing cleanup to sanitize_entities_with_source.

Safer direction

-  resolve_same_span_label_conflicts(&sanitize_entities(&merged))
+  resolve_same_span_label_conflicts(&merged)

Apply sanitize_entities_with_source only once full_text is available.
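
A source-aware cleanup in the spirit of `sanitize_entities_with_source` might look like the sketch below. The `Entity` struct and the `trim_with_source` function are hypothetical. The real function's behaviour is not shown on this page, so this only illustrates why offsets should be derived from `full_text[start..end]` and not from `entity.text`.

```rust
/// Minimal stand-in for a pipeline entity (assumed fields, byte offsets).
#[derive(Debug, Clone, PartialEq)]
struct Entity {
    start: usize,
    end: usize,
    text: String,
}

/// Hypothetical source-aware cleanup: trims surrounding whitespace by looking
/// at `full_text[start..end]`, so the offsets always track real bytes.
fn trim_with_source(entity: &Entity, full_text: &str) -> Option<Entity> {
    // `get` returns None for out-of-range or non-boundary offsets.
    let slice = full_text.get(entity.start..entity.end)?;
    let trimmed_start = slice.trim_start();
    let lead = slice.len() - trimmed_start.len();
    let trimmed = trimmed_start.trim_end();
    if trimmed.is_empty() {
        return None;
    }
    let start = entity.start + lead;
    Some(Entity {
        start,
        end: start + trimmed.len(),
        text: trimmed.to_string(),
    })
}
```

If the same trimming were computed from `entity.text` when that text differs from the source slice, the adjusted `start` and `end` could point at the wrong bytes, which is the failure mode the comment warns about.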

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-  resolve_same_span_label_conflicts(&sanitize_entities(&merged))
+  resolve_same_span_label_conflicts(&merged)

jan-kubica added 30 commits June 24, 2026 08:24

feat: add anonymize core crate

1e9c1c9

fix: satisfy stable clippy

43328b8

feat: add core search layer

b916fa1

chore: allow internal core crate license checks

c9d4936

fix: harden core redaction contracts

c664ed0

docs: clarify core contracts

78996c9

fix: normalize identifier cues

194188e

feat: add core entity resolution

b273830

feat: add core boundary resolution

392ad16

chore: split core resolution modules

fb57f7e

feat: add core match processors

d6d4089

feat: add core search normalization

30d8db7

feat: support literal pattern options

57f7b45

feat: add prepared core search

b523d78

feat: add static core redaction

5e9f987

feat: add core language bindings

fcbb328

feat: share native adapter contract

5bb0c41

fix: tighten native redaction contracts

cdcac7d

test: add migration fixture gate

dc3c4a7

feat: port custom deny-list slice

609e410

fix: tighten cli dictionary scope type

03caa5d

test: report migration fixture runtime coverage

fd14f11

feat: wire native static redaction

f2b322a

perf: skip ts search for native prep

33d5973

fix: clean migration benchmark helper

163ee75

feat: wire native static redaction package

3f7d5fa

feat: add core prepared packages

e31c763

chore: update prepared search core pin

204326a

fix: satisfy fixture perf lint

ee93da3

chore: update dependencies

d81cefe

jan-kubica added 15 commits June 26, 2026 11:58

feat: cache default native pipeline

8b8e92e

fix: align native trigger edge cases

c8e274f

fix: simplify native binding loading

b05a87a

chore: pin rust toolchain

6513d94

fix: tighten native review edge cases

343a82f

fix: restore native build outputs

e65cb02

fix: guard overlapping deny-list names

c02b0f0

test: add rust primitive properties

286a9da

test: strengthen rust primitive properties

47f1068

test: reject stale prepared search artifacts

c72d45a

chore: pin text-search artifact identity fix

ad398cb

test: strengthen search artifact properties

7dc2669

chore: pin text-search overlap artifact fix

f5a8437

test: gate native fixture parity

a8a6c32

test: strengthen primitive invariants

fdff5b3

sourcery-ai Bot reviewed Jun 26, 2026

coderabbitai Bot reviewed Jun 26, 2026

Your Name added 10 commits June 26, 2026 15:11

docs: GLiNER2 PII integration design spec

4890666

fix: tolerate CRLF in version sync regexes, add Gemini code review CI

58ea623

docs: address spec review — canonical labels list, Python discovery, …

d9bb306

…process lifecycle

docs: fix uvicorn import path in spec

8aa3a00

docs: GLiNER2 PII integration Rust sidecar design spec

bd1ccde

docs: fix spec review issues — label mapping resolved to TS-only, rep…

aec95c0

…o ref, env var namespace, port retry

docs: fix reverse map collision — prefer original requested label

611cdaf

fix: address 15 CodeRabbit review items across crate sources

d6eb7e2

fix: cargo fmt and gate gemini CI on secret

d39f1ae

fix: use github.base_ref in parity steps and scope dep-review licenses

6ed68bc

jamon8888 merged commit f53920d into main Jun 26, 2026
6 checks passed
