test: strengthen primitive invariants and search artifacts #1

Merged
jamon8888 merged 105 commits into main from pr-217
Jun 26, 2026

Conversation

@jamon8888 jamon8888 commented Jun 26, 2026

Owner

Port of PR stella#217 from stella/anonymize

  • test: strengthen primitive invariants
  • test: gate native fixture parity
  • chore: pin text-search overlap artifact fix
  • test: strengthen search artifact properties
  • chore: pin text-search artifact identity fix

Summary by CodeRabbit

  • New Features

    • Expanded anonymization support with improved detection for dates, money amounts, legal forms, names, addresses, coreferences, hotwords, and false-positive filtering.
    • Added redaction/deanonymization improvements, placeholder generation, boundary cleanup, and optional diagnostics for deeper review.
    • Enabled packaged native artifacts and broader Rust/Python version syncing for more consistent builds.
  • Bug Fixes

    • Improved entity matching accuracy and overlap handling, reducing missed or duplicated redactions.
    • Added additional validation around search artifacts and packaged outputs.

@sourcery-ai sourcery-ai Bot left a comment

Sorry, we are unable to review this pull request

The GitHub API does not allow us to fetch diffs exceeding 20000 lines

coderabbitai Bot commented Jun 26, 2026

Warning

Review limit reached

@jamon8888, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 4 minutes and 52 seconds. Learn how PR review limits work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3bd1d037-00e6-43bb-89cf-f9d99234d29f

📥 Commits

Reviewing files that changed from the base of the PR and between fdff5b3 and 6ed68bc.

📒 Files selected for processing (16)
  • .github/tools/sync-runtime-version.mjs
  • .github/workflows/ci.yml
  • .github/workflows/dependency-review.yml
  • crates/anonymize-adapter-contract/src/lib.rs
  • crates/anonymize-core/src/legal_forms.rs
  • crates/anonymize-core/src/name_corpus.rs
  • crates/anonymize-core/src/normalize.rs
  • crates/anonymize-core/src/prepared.rs
  • crates/anonymize-core/src/processors.rs
  • crates/anonymize-core/src/redact.rs
  • crates/anonymize-core/src/resolution/boundary.rs
  • crates/anonymize-core/src/resolution/sanitize.rs
  • crates/anonymize-core/src/resolution/types.rs
  • crates/anonymize-core/src/search.rs
  • docs/superpowers/specs/2026-06-26-gliner2-pii-integration-design.md
  • docs/superpowers/specs/2026-06-26-gliner2-pii-integration-rust-design.md
📝 Walkthrough


Updated workspace, CI, and dependency-policy configuration. Added core anonymization pipeline modules for search, normalization, entity processing, detectors, cleanup, prepared-search orchestration, and adapter conversion.

Changes

Build and release configuration

| Layer | File(s) | Summary |
|---|---|---|
| Workspace manifests and lint policy | Cargo.toml, .cargo/config.toml, clippy.toml, .gitignore, crates/anonymize-core/Cargo.toml, crates/anonymize-adapter-contract/Cargo.toml | Workspace membership, shared package metadata, lint policies, build profiles, crate manifests, and ignore rules were updated. |
| CI, version sync, and packlist | .github/tools/{check-packlist.mjs, sync-runtime-version.mjs}, .github/workflows/{ci.yml, dependency-review.yml} | Version synchronization, dependency-review policy, packlist checks, and CI workflow steps were updated. |

Anonymization pipeline

| Layer | File(s) | Summary |
|---|---|---|
| Shared types and search foundation | crates/anonymize-core/src/{lib.rs, resolution/types.rs, byte_offsets.rs, artifact_bytes.rs, search.rs, normalize.rs, placeholders.rs, anchored.rs}, crates/anonymize-core/data/identifier-cues.txt | Shared pipeline types, search/normalization/serialization primitives, placeholder reuse, anchored extraction scaffolding, and identifier-cue data were added. |
| Match processors | crates/anonymize-core/src/{processors.rs, false_positives.rs} | Regex, deny-list, gazetteer, country, and false-positive processing were added for pipeline entities. |
| Anchored detectors | crates/anonymize-core/src/{dates.rs, money.rs, hotwords.rs} | Anchored date, monetary amount, and hotword detectors were added. |
| Address and legal-form detectors | crates/anonymize-core/src/{address_context.rs, address_seeds.rs, legal_forms.rs} | Address context, address seed, and legal-form detection from configured text and match heuristics were added. |
| Coreference and name corpus | crates/anonymize-core/src/{coreference.rs, name_corpus.rs} | Coreference propagation and supplemental name detection from configured term lists and span similarity rules were added. |
| Boundary cleanup and redaction | crates/anonymize-core/src/{resolution/common.rs, resolution/boundary.rs, resolution/merge.rs, resolution/sanitize.rs, resolution/mod.rs, redact.rs}, crates/anonymize-core/data/{address-final-abbrevs.txt, legal-period-suffixes.txt} | Boundary normalization, entity merge/dedup, sanitization, and redaction logic were added with supporting suffix data. |
| Prepared search bridge | crates/anonymize-core/src/prepared.rs, crates/anonymize-adapter-contract/{src/lib.rs, examples/native_adapter_perf.rs} | Prepared-search orchestration and adapter conversion code were added. |

Sequence Diagram(s)

sequenceDiagram
  participant native_adapter_perf
  participant PreparedSearch
  participant SearchIndex
  participant merge_and_dedup
  participant redact_text
  native_adapter_perf->>PreparedSearch: redact_static_entities(full_text, operators)
  PreparedSearch->>SearchIndex: find_matches(full_text)
  PreparedSearch->>merge_and_dedup: merge_and_dedup(entities)
  PreparedSearch->>redact_text: redact_text(full_text, entities, config)
  PreparedSearch-->>native_adapter_perf: RedactionResult

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I hopped through lints with a carrot grin,
through search maps, redactions, and version sync spin.
New burrows of dates and names now gleam,
the cozy code garden now hums in a stream.
Hop-hop—safe trails for every entity dream.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.30% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title is concise and names real parts of the change set, especially search artifacts and test/invariant hardening. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-actions Bot commented Jun 26, 2026

Dependency Review

The following issues were found:

  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 32 package(s) with unknown licenses.
  • ⚠️ 3 packages with OpenSSF Scorecard issues.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 18

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/tools/sync-runtime-version.mjs:
- Around line 96-100: The version-sync regexes in syncTextVersion and the
related Cargo.toml/Cargo.lock patterns only match LF newlines, so CRLF checkouts
fail to sync correctly. Update the regexes used in sync-runtime-version.mjs to
tolerate Windows line endings, either by matching \r?\n in the relevant capture
patterns or by normalizing file text before applying them, so the existing sync
logic works across environments.

In @.github/workflows/ci.yml:
- Around line 84-99: The two parity steps in the CI workflow are hardcoding a
fetch of origin main, which can compare PRs against the wrong baseline on
non-main targets. Update the Migration fixture parity and Native fixture parity
steps to use the PR base branch dynamically via github.base_ref, with a fallback
to main, so the git fetch in these workflow jobs always matches the actual PR
target branch.

In @.github/workflows/dependency-review.yml:
- Around line 45-48: The allow-dependencies-licenses entries are currently broad
package waivers, so scope each Cargo package in dependency-review-action by
adding the specific license qualifier via the dependency-review workflow
configuration. Update the allow list in dependency-review.yml for
stella-aho-corasick-core, stella-fuzzy-search-core, and stella-regex-set-core to
use the appropriate ?license=<identifier> form, or remove them entirely if the
existing Apache-2.0 WITH LLVM-exception allowance already covers the intended
cases.

In `@crates/anonymize-adapter-contract/src/lib.rs`:
- Around line 741-743: prepared_search_package_digest currently returns the
header-declared digest from prepared_search_package_parts without validating it
against the payload, so update it to verify the package before returning the
digest. Reuse the existing verification flow used by
prepared_search_package_from_bytes, specifically
verify_prepared_search_package_digest, and ensure the function only returns the
digest after the payload integrity check succeeds. Keep the change localized to
prepared_search_package_digest and related helpers in this module.

In `@crates/anonymize-core/src/legal_forms.rs`:
- Around line 1358-1363: The uppercase check in the prefix-trimming logic is
being applied to the lowercased buffer, so it can never reliably detect the
original capitalization. Update the trimming path in the legal form parsing
logic to inspect the corresponding character(s) from the original text buffer
instead of the lowercased one, keeping the existing whitespace handling in
place. Use the same `leading_ws_len`/`after` flow, but base the
`char::is_uppercase` guard on the original source string so direct-prefix
trimming only succeeds when the original text actually has an uppercase
continuation.
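
The capitalization point above can be illustrated with a minimal sketch. The helper name `has_uppercase_continuation` is hypothetical, and the sketch assumes the lowercased buffer shares byte offsets with the original (true for ASCII input, not guaranteed for all Unicode text):

```rust
/// Hypothetical helper: returns true when the text at `offset` in the
/// ORIGINAL buffer continues, after optional whitespace, with an uppercase
/// character. Running the same check on a lowercased copy can never succeed.
fn has_uppercase_continuation(original: &str, offset: usize) -> bool {
    // `get` yields None for out-of-range or non-char-boundary offsets.
    let Some(after) = original.get(offset..) else {
        return false;
    };
    after
        .trim_start()
        .chars()
        .next()
        .is_some_and(char::is_uppercase)
}
```

Calling it with the original text and the offset just past a matched prefix reports the real capitalization, whereas the same call on `original.to_lowercase()` always returns false.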

In `@crates/anonymize-core/src/name_corpus.rs`:
- Around line 557-565: The low-confidence person-name branch in name scoring is
being lost because the final `(score >=
HIGH_CONFIDENCE_NAME_SCORE).then_some(score)` gate filters it out. Update the
scoring flow in the `name_corpus` logic so the `LOW_CONFIDENCE_NAME_SCORE` path
is returned/emitted instead of being discarded, while keeping the existing
high-confidence behavior for stronger matches.

In `@crates/anonymize-core/src/normalize.rs`:
- Around line 296-304: The `find_ethereum_address` matcher is too permissive
because it accepts any 42-character slice after `0x`, even when the address is
embedded in a longer hex token. Update `find_ethereum_address` in `normalize.rs`
to use token-boundary checks like the Bech32/Base58 matching paths, so it only
returns an Ethereum address when the `0x`-prefixed 40-hex sequence is standalone
and not followed by additional hex characters. Ensure the logic still validates
the hex payload while rejecting longer embedded tokens.
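
The token-boundary rule described for `find_ethereum_address` might look roughly like the sketch below. The signature and the exact boundary definition (any ASCII alphanumeric neighbour disqualifies the candidate) are assumptions, not the crate's actual implementation:

```rust
use std::ops::Range;

/// Sketch: finds the first standalone `0x` + 40 hex digit token and returns
/// its byte range. A candidate embedded in a longer token is rejected.
fn find_ethereum_address(text: &str) -> Option<Range<usize>> {
    const HEX_LEN: usize = 40;
    let bytes = text.as_bytes();
    let mut i = 0;
    while i + 2 + HEX_LEN <= bytes.len() {
        if bytes[i] == b'0' && bytes[i + 1] == b'x' {
            let end = i + 2 + HEX_LEN;
            let payload_ok = bytes[i + 2..end].iter().all(u8::is_ascii_hexdigit);
            // Left boundary: start of text or a non-alphanumeric byte.
            let left_ok = i == 0 || !bytes[i - 1].is_ascii_alphanumeric();
            // Right boundary: end of text or a non-alphanumeric byte, so a
            // 41st hex digit (or any longer token) rejects the candidate.
            let right_ok = end == bytes.len() || !bytes[end].is_ascii_alphanumeric();
            if payload_ok && left_ok && right_ok {
                return Some(i..end);
            }
        }
        i += 1;
    }
    None
}
```

With this rule, an address surrounded by spaces or punctuation is returned, while the same 42 characters followed by further hex digits yield `None`.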

In `@crates/anonymize-core/src/prepared.rs`:
- Around line 1767-1793: Validate the compact literal slice bounds against the
artifact-backed literal index even when literal_patterns is empty. Update the
guard in prepared.rs around the validate_slice_bounds calls so the checks for
slices.deny_list, slices.street_types, slices.gazetteer, slices.countries, and
slices.hotwords still run whenever literal artifacts are enabled, using the
actual literals.len() or equivalent artifact index length rather than relying
only on config.literal_patterns.len().
- Around line 1590-1601: split_regex_patterns is routing any unclaimed pattern
into regex, which can let indexes outside slices.regex or overlapping other
slices slip through and desync regex_meta from prepared artifacts. Update
split_regex_patterns to validate each pattern_index against the declared slices
before pushing it, using the existing pattern_index, slices.legal_forms, and
slices.triggers checks plus an explicit slices.regex membership/range check, and
reject or error on any pattern that does not belong to exactly one declared
slice.
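
One way to express the "exactly one declared slice" check is sketched here. `PatternSlices`, `PatternKind`, `classify_pattern`, and `collect_regex_indexes` are hypothetical names, and modelling each slice as a half-open index range is an assumption about the artifact layout:

```rust
use std::ops::Range;

/// Hypothetical slice table: each field is a half-open range of pattern indexes.
struct PatternSlices {
    regex: Range<usize>,
    legal_forms: Range<usize>,
    triggers: Range<usize>,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum PatternKind {
    Regex,
    LegalForm,
    Trigger,
}

/// Classifies `pattern_index`, failing unless exactly one slice claims it.
fn classify_pattern(slices: &PatternSlices, pattern_index: usize) -> Result<PatternKind, String> {
    let candidates = [
        (PatternKind::Regex, &slices.regex),
        (PatternKind::LegalForm, &slices.legal_forms),
        (PatternKind::Trigger, &slices.triggers),
    ];
    let mut owners = candidates
        .iter()
        .filter(|(_, range)| range.contains(&pattern_index))
        .map(|(kind, _)| *kind);
    match (owners.next(), owners.next()) {
        (Some(kind), None) => Ok(kind),
        (None, _) => Err(format!("pattern {pattern_index} is outside every declared slice")),
        (Some(_), Some(_)) => Err(format!("pattern {pattern_index} is claimed by overlapping slices")),
    }
}

/// Collects regex pattern indexes, erroring instead of defaulting to regex.
fn collect_regex_indexes(slices: &PatternSlices, pattern_count: usize) -> Result<Vec<usize>, String> {
    let mut regex = Vec::new();
    for pattern_index in 0..pattern_count {
        if classify_pattern(slices, pattern_index)? == PatternKind::Regex {
            regex.push(pattern_index);
        }
    }
    Ok(regex)
}
```

The design choice is to fail loudly: an index that no slice claims, or that two slices claim, produces an error rather than silently landing in the regex bucket.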

In `@crates/anonymize-core/src/processors.rs`:
- Around line 1842-1857: The gazetteer span expansion in the processor logic is
too permissive because it extends any exact hit followed by a short
space-delimited token, so tighten the check before mutating end in the relevant
span-extension path. Update the logic around offsets.slice,
next_space_offset_after_initial, and the GazetteerExtension return to require
artifact-backed extension metadata or a stricter validation of the suffix token
before returning the expanded span. Keep the existing exact-hit handling, but
only allow extension when the extra token is explicitly supported by the
gazetteer artifact metadata or a stronger suffix rule.
- Around line 107-113: `StringGroups` currently derives `Deserialize`, which
bypasses the bounds checks performed by `from_table_indices` and can let invalid
group indexes slip through. Replace the derived deserialization on
`StringGroups` with a manual `Deserialize` implementation that first reads
`table` and `groups`, then reconstructs the value through `from_table_indices`
(or the same validation logic) so any out-of-range indices fail immediately.
Keep `StringGroup::iter` unchanged and use `StringGroups` as the validation
entry point.
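
The validation-on-construction idea for `StringGroups` can be sketched without serde as a `TryFrom` conversion from a raw wire shape. The field types and the `RawStringGroups` struct are assumptions; in a serde-based crate, the container attribute `#[serde(try_from = "RawStringGroups")]` is one way to route deserialization through such a conversion:

```rust
use std::convert::TryFrom;

/// Hypothetical wire shape: what an unchecked deserializer would accept.
struct RawStringGroups {
    table: Vec<String>,
    groups: Vec<Vec<u32>>,
}

/// Validated form: every index in `groups` points into `table`.
#[derive(Debug)]
struct StringGroups {
    table: Vec<String>,
    groups: Vec<Vec<u32>>,
}

impl StringGroups {
    /// Rejects any group index that does not refer to an existing table entry.
    fn from_table_indices(table: Vec<String>, groups: Vec<Vec<u32>>) -> Result<Self, String> {
        for (group_index, group) in groups.iter().enumerate() {
            for &index in group {
                let in_range = usize::try_from(index).map_or(false, |i| i < table.len());
                if !in_range {
                    return Err(format!(
                        "group {group_index} references missing table entry {index}"
                    ));
                }
            }
        }
        Ok(Self { table, groups })
    }
}

impl TryFrom<RawStringGroups> for StringGroups {
    type Error = String;

    /// Funnels untrusted input through the same bounds check as the constructor.
    fn try_from(raw: RawStringGroups) -> Result<Self, Self::Error> {
        Self::from_table_indices(raw.table, raw.groups)
    }
}
```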

In `@crates/anonymize-core/src/redact.rs`:
- Around line 95-97: The deanonymise logic in redact.rs is doing repeated global
replacements on the growing output, which can recursively expand
placeholder-like text inside restored values. Update the loop in deanonymise to
base replacements on the original redacted_text and apply each redaction_map
entry in a single pass so previously restored content is not rewritten by later
iterations; use the existing deanonymise/redaction_map flow and keep the
placeholder-to-original mapping intact.
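
A single-pass restore along the lines suggested above could look like this sketch. The signature (a slice of placeholder/original pairs) is an assumption and may differ from the real `deanonymise`; placeholders are matched literally, longest first:

```rust
/// Sketch: restores placeholders in one left-to-right pass over the redacted
/// text. Restored values are appended to the output and never rescanned, so
/// placeholder-like text inside an original value is left untouched.
fn deanonymise(redacted_text: &str, redaction_map: &[(&str, &str)]) -> String {
    let mut output = String::with_capacity(redacted_text.len());
    let mut rest = redacted_text;
    while let Some(ch) = rest.chars().next() {
        // Pick the longest placeholder that starts at the current position.
        let hit = redaction_map
            .iter()
            .filter(|(placeholder, _)| !placeholder.is_empty() && rest.starts_with(*placeholder))
            .max_by_key(|(placeholder, _)| placeholder.len());
        match hit {
            Some((placeholder, original)) => {
                output.push_str(original);
                rest = &rest[placeholder.len()..];
            }
            None => {
                // No placeholder here: copy one character and advance.
                output.push(ch);
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    output
}
```

For a contrived map where the original value of `[PERSON_1]` is the literal string `[PERSON_2]`, repeated global replacement would rewrite the restored value a second time; the single pass leaves it as restored.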

In `@crates/anonymize-core/src/resolution/boundary.rs`:
- Around line 81-149: `resolve_cross_label_overlaps` mutates entity boundaries
in place but continues to assume `sorted` stays ordered by `start`, which can
break later overlap checks in chained/3-way overlap cases. After changing
`right_mut.start` or `left_mut.end`, re-establish ordering for the affected
portion (or restart the sweep from the modified region) before continuing, so
subsequent comparisons in `resolve_cross_label_overlaps` only evaluate truly
overlapping spans and do not truncate unrelated entities.
- Around line 177-226: The locked-entity branch in boundary resolution is not
updating the same-label tracking, so a later span can still merge past a locked
entity and cause it to be dropped by remove_nested_same_label. In the main loop
in boundary merging logic, make the has_locked_boundary(entity) path also
refresh last_by_label for entity.label to the newly appended index so locked
spans become merge barriers for subsequent same-label entities. Keep the fix
localized to the sorted-entity iteration and the merge decision flow that uses
last_by_label, gap_occupied, and merge_into_previous.

In `@crates/anonymize-core/src/resolution/merge.rs`:
- Line 74: The source-less sanitization path in merge resolution should not
adjust byte offsets because it may rewrite entities based on entity.text instead
of the original source slice. Update the merge flow around
resolve_same_span_label_conflicts to keep this pass text-only, and move any
offset-changing cleanup to sanitize_entities_with_source when full_text is
available. Make sure the entities coming from boundary handling remain aligned
with their original start/end bytes before any redaction step.

In `@crates/anonymize-core/src/resolution/types.rs`:
- Around line 27-35: Add a Serde rename rule to SourceDetail so its serialized
form matches the adapter contract’s kebab-case mapping. Update the enum
definition for SourceDetail to use #[serde(rename_all = "kebab-case")] alongside
the existing derives, ensuring variants like CustomDenyList serialize
consistently as kebab-case in all paths such as diagnostics and logs.

In `@crates/anonymize-core/src/search.rs`:
- Around line 227-254: The empty-pattern branch in new_with_artifacts is
incorrectly reusing any non-empty artifacts.slots as an all-literal search,
which can restore stale detector state instead of rejecting it. Update
SearchIndex::new_with_artifacts to only take the all-literal path when the
artifacts are explicitly known to be literal-compatible, and otherwise fail the
parity check when patterns is empty but persisted artifacts remain. Use the
existing new_with_artifacts, new_all_literal_with_artifacts, and
SearchIndexArtifacts/slots flow to locate and tighten this decision.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 55181e53-7cb0-4ab9-b422-356991e3bdfc

📥 Commits

Reviewing files that changed from the base of the PR and between 4924c74 and fdff5b3.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (149)
  • .cargo/config.toml
  • .github/tools/check-packlist.mjs
  • .github/tools/sync-runtime-version.mjs
  • .github/workflows/ci.yml
  • .github/workflows/dependency-review.yml
  • .gitignore
  • Cargo.toml
  • clippy.toml
  • crates/anonymize-adapter-contract/Cargo.toml
  • crates/anonymize-adapter-contract/examples/native_adapter_perf.rs
  • crates/anonymize-adapter-contract/src/lib.rs
  • crates/anonymize-core/Cargo.toml
  • crates/anonymize-core/data/address-final-abbrevs.txt
  • crates/anonymize-core/data/identifier-cues.txt
  • crates/anonymize-core/data/legal-period-suffixes.txt
  • crates/anonymize-core/src/address_context.rs
  • crates/anonymize-core/src/address_seeds.rs
  • crates/anonymize-core/src/anchored.rs
  • crates/anonymize-core/src/artifact_bytes.rs
  • crates/anonymize-core/src/byte_offsets.rs
  • crates/anonymize-core/src/coreference.rs
  • crates/anonymize-core/src/dates.rs
  • crates/anonymize-core/src/diagnostics.rs
  • crates/anonymize-core/src/false_positives.rs
  • crates/anonymize-core/src/hotwords.rs
  • crates/anonymize-core/src/legal_forms.rs
  • crates/anonymize-core/src/lib.rs
  • crates/anonymize-core/src/money.rs
  • crates/anonymize-core/src/name_corpus.rs
  • crates/anonymize-core/src/normalize.rs
  • crates/anonymize-core/src/placeholders.rs
  • crates/anonymize-core/src/prepared.rs
  • crates/anonymize-core/src/processors.rs
  • crates/anonymize-core/src/redact.rs
  • crates/anonymize-core/src/resolution/boundary.rs
  • crates/anonymize-core/src/resolution/common.rs
  • crates/anonymize-core/src/resolution/merge.rs
  • crates/anonymize-core/src/resolution/mod.rs
  • crates/anonymize-core/src/resolution/sanitize.rs
  • crates/anonymize-core/src/resolution/types.rs
  • crates/anonymize-core/src/search.rs
  • crates/anonymize-core/src/signatures.rs
  • crates/anonymize-core/src/triggers.rs
  • crates/anonymize-core/src/types.rs
  • crates/anonymize-core/src/validators.rs
  • crates/anonymize-core/src/zones.rs
  • crates/anonymize-core/tests/address_seed_parity.rs
  • crates/anonymize-core/tests/false_positive_parity.rs
  • crates/anonymize-core/tests/normalize.rs
  • crates/anonymize-core/tests/prepared.rs
  • crates/anonymize-core/tests/primitives_properties.rs
  • crates/anonymize-core/tests/processors.rs
  • crates/anonymize-core/tests/redaction.rs
  • crates/anonymize-core/tests/resolution.rs
  • crates/anonymize-core/tests/search.rs
  • crates/anonymize-core/tests/trigger_parity.rs
  • crates/anonymize-napi/Cargo.toml
  • crates/anonymize-napi/build.rs
  • crates/anonymize-napi/src/lib.rs
  • crates/anonymize-py/Cargo.toml
  • crates/anonymize-py/build.rs
  • crates/anonymize-py/pyproject.toml
  • crates/anonymize-py/src/lib.rs
  • package.json
  • packages/anonymize/.gitignore
  • packages/anonymize/README.md
  • packages/anonymize/index.cjs
  • packages/anonymize/package.json
  • packages/anonymize/scripts/build-native-node.mjs
  • packages/anonymize/scripts/build-native-pipeline-package.mjs
  • packages/anonymize/scripts/dist-smoke.mjs
  • packages/anonymize/scripts/migration-fixture-perf.mjs
  • packages/anonymize/scripts/native-adapter-perf.mjs
  • packages/anonymize/src/__test__/countries.test.ts
  • packages/anonymize/src/__test__/dictionary-bundle.test.ts
  • packages/anonymize/src/__test__/load-dictionaries.ts
  • packages/anonymize/src/__test__/native-adapter-parity.test.ts
  • packages/anonymize/src/__test__/native-node.test.ts
  • packages/anonymize/src/__test__/pipeline-config.test.ts
  • packages/anonymize/src/build-unified-search.ts
  • packages/anonymize/src/context.ts
  • packages/anonymize/src/data/address-boundaries.json
  • packages/anonymize/src/data/address-context.json
  • packages/anonymize/src/data/address-jurisdiction-prefixes.json
  • packages/anonymize/src/data/address-stop-keywords.json
  • packages/anonymize/src/data/address-unit-abbreviations.json
  • packages/anonymize/src/data/ambiguous-country-surfaces.json
  • packages/anonymize/src/data/clause-noun-heads.json
  • packages/anonymize/src/data/coreference-org-determiners.json
  • packages/anonymize/src/data/defined-term-heads.json
  • packages/anonymize/src/data/deny-list-filters.json
  • packages/anonymize/src/data/false-positive-shapes.json
  • packages/anonymize/src/data/language-scopes.json
  • packages/anonymize/src/data/legal-form-rule-words.json
  • packages/anonymize/src/data/legal-role-heads.cs.json
  • packages/anonymize/src/data/name-corpus-cjk.json
  • packages/anonymize/src/data/name-corpus-particles.json
  • packages/anonymize/src/data/organization-indicators.json
  • packages/anonymize/src/data/organization-unit-heads.json
  • packages/anonymize/src/data/person-stopwords.json
  • packages/anonymize/src/data/signing-clauses.json
  • packages/anonymize/src/detectors/address-seeds.ts
  • packages/anonymize/src/detectors/countries.ts
  • packages/anonymize/src/detectors/deny-list.ts
  • packages/anonymize/src/detectors/legal-forms.ts
  • packages/anonymize/src/detectors/regex.ts
  • packages/anonymize/src/detectors/triggers.ts
  • packages/anonymize/src/filters/confidence-boost.ts
  • packages/anonymize/src/filters/false-positives.ts
  • packages/anonymize/src/filters/hotword-rules.ts
  • packages/anonymize/src/index-shared.ts
  • packages/anonymize/src/language-scope.ts
  • packages/anonymize/src/native-default-config.ts
  • packages/anonymize/src/native-node.ts
  • packages/anonymize/src/native-pipeline.ts
  • packages/anonymize/src/native.ts
  • packages/anonymize/src/pipeline-cache-key.ts
  • packages/anonymize/src/pipeline.ts
  • packages/anonymize/src/types.ts
  • packages/anonymize/tsdown.config.ts
  • packages/anonymize/wasm/package.json
  • packages/cli/package.json
  • packages/cli/src/dictionary-scope.ts
  • packages/corpus/package.json
  • packages/data/config/address-boundaries.json
  • packages/data/config/address-context.json
  • packages/data/config/address-jurisdiction-prefixes.json
  • packages/data/config/address-stop-keywords.json
  • packages/data/config/address-unit-abbreviations.json
  • packages/data/config/ambiguous-country-surfaces.json
  • packages/data/config/clause-noun-heads.json
  • packages/data/config/coreference-org-determiners.json
  • packages/data/config/defined-term-heads.json
  • packages/data/config/deny-list-filters.json
  • packages/data/config/false-positive-shapes.json
  • packages/data/config/language-scopes.json
  • packages/data/config/legal-form-rule-words.json
  • packages/data/config/legal-role-heads.cs.json
  • packages/data/config/name-corpus-cjk.json
  • packages/data/config/name-corpus-particles.json
  • packages/data/config/organization-indicators.json
  • packages/data/config/organization-unit-heads.json
  • packages/data/config/person-stopwords.json
  • packages/data/config/signing-clauses.json
  • packages/data/dictionaries/index.ts
  • packages/data/package.json
  • rust-toolchain.toml
  • rustfmt.toml
  • turbo.json

Comment thread .github/tools/sync-runtime-version.mjs
Comment thread .github/workflows/ci.yml
Comment thread .github/workflows/dependency-review.yml Outdated
Comment thread crates/anonymize-adapter-contract/src/lib.rs
Comment on lines +971 to +982
fn nearest_right_non_address(
  right_pos: usize,
  existing_entities: &[PipelineEntity],
) -> Option<usize> {
  existing_entities
    .iter()
    .filter(|entity| non_address_label(&entity.label))
    .filter_map(|entity| {
      let start = usize::try_from(entity.start).ok()?;
      let offset = start.saturating_sub(right_pos);
      (offset > 0).then_some(offset)
    })

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Stop expansion when a non-address entity starts at right_pos.

Line 981 ignores an adjacent non-address entity because offset == 0 is filtered out, allowing address expansion to run into that span.

Proposed fix
-      let offset = start.saturating_sub(right_pos);
-      (offset > 0).then_some(offset)
+      (start >= right_pos).then_some(start.saturating_sub(right_pos))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 fn nearest_right_non_address(
   right_pos: usize,
   existing_entities: &[PipelineEntity],
 ) -> Option<usize> {
   existing_entities
     .iter()
     .filter(|entity| non_address_label(&entity.label))
     .filter_map(|entity| {
       let start = usize::try_from(entity.start).ok()?;
-      let offset = start.saturating_sub(right_pos);
-      (offset > 0).then_some(offset)
+      (start >= right_pos).then_some(start.saturating_sub(right_pos))
     })
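
To see why the `offset > 0` filter hides an adjacent entity, here is a self-contained sketch of the corrected behaviour. `Entity`, the `"address"` label test, and the trailing `.min()` are assumptions, since the excerpt above is truncated and the real `PipelineEntity` has more fields; `checked_sub` is used as an equivalent of the proposed `start >= right_pos` guard:

```rust
use std::convert::TryFrom;

/// Simplified, hypothetical stand-in for the pipeline entity.
struct Entity {
    label: String,
    start: u32,
}

/// Assumption: any label other than "address" counts as non-address.
fn non_address_label(label: &str) -> bool {
    label != "address"
}

/// Distance from `right_pos` to the closest non-address entity that starts at
/// or after it. An entity starting exactly at `right_pos` yields `Some(0)`.
fn nearest_right_non_address(right_pos: usize, existing_entities: &[Entity]) -> Option<usize> {
    existing_entities
        .iter()
        .filter(|entity| non_address_label(&entity.label))
        .filter_map(|entity| {
            let start = usize::try_from(entity.start).ok()?;
            // None when the entity starts left of `right_pos`, Some(0) when
            // it is adjacent, Some(gap) otherwise.
            start.checked_sub(right_pos)
        })
        .min()
}
```

With `saturating_sub` followed by `offset > 0`, entities to the left and an entity exactly at `right_pos` both collapse to 0 and are dropped together; separating the two cases lets a zero distance stop the expansion.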

Comment thread crates/anonymize-core/src/resolution/boundary.rs
Comment thread crates/anonymize-core/src/resolution/boundary.rs
    insert_at_or_push(&mut merged, insert_at, entity);
  }

  resolve_same_span_label_conflicts(&sanitize_entities(&merged))

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Do not shift byte offsets in the source-less sanitize pass.

This call sanitizes from entity.text instead of the original source slice. That is not a safe invariant here: boundary.rs Lines 46-49 already handle entities whose text differs from full_text[start..end]. If one of those entities reaches this path, sanitize_entities can move start/end away from the real bytes, and later redaction will slice the wrong region. Keep this pass text-only, or defer offset-changing cleanup to sanitize_entities_with_source.

Safer direction
-  resolve_same_span_label_conflicts(&sanitize_entities(&merged))
+  resolve_same_span_label_conflicts(&merged)

Apply sanitize_entities_with_source only once full_text is available.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  resolve_same_span_label_conflicts(&sanitize_entities(&merged))
+  resolve_same_span_label_conflicts(&merged)
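
The distinction between a text-only pass and a source-aware pass can be sketched as follows. `Entity`, `sanitize_text_only`, and `sanitize_with_source` are hypothetical simplifications (whitespace trimming stands in for whatever cleanup the real sanitizers perform), not the crate's `sanitize_entities` or `sanitize_entities_with_source`:

```rust
/// Simplified, hypothetical entity: byte offsets into the source plus a text copy.
struct Entity {
    start: usize,
    end: usize,
    text: String,
}

/// Source-less cleanup: tidies the entity's own text and leaves offsets alone,
/// because `text` is not guaranteed to equal `full_text[start..end]`.
fn sanitize_text_only(entity: &Entity) -> Entity {
    Entity {
        start: entity.start,
        end: entity.end,
        text: entity.text.trim().to_string(),
    }
}

/// Source-aware cleanup: trims against the real slice and moves the offsets by
/// exactly the number of bytes removed from that slice.
fn sanitize_with_source(entity: &Entity, full_text: &str) -> Option<Entity> {
    let slice = full_text.get(entity.start..entity.end)?;
    let without_leading = slice.trim_start();
    let trimmed = without_leading.trim_end();
    let start = entity.start + (slice.len() - without_leading.len());
    Some(Entity {
        start,
        end: start + trimmed.len(),
        text: trimmed.to_string(),
    })
}
```

After the source-aware pass, `full_text[start..end]` equals the cleaned text, which is the alignment a later redaction step relies on.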

Comment thread crates/anonymize-core/src/resolution/types.rs
Comment thread crates/anonymize-core/src/search.rs
@jamon8888 jamon8888 merged commit f53920d into main Jun 26, 2026
6 checks passed

2 participants