Conversation
|
Warning Review limit reached
More reviews will be available in 4 minutes and 52 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (16)
📝 WalkthroughWalkthroughUpdated workspace, CI, and dependency-policy configuration. Added core anonymization pipeline modules for search, normalization, entity processing, detectors, cleanup, prepared-search orchestration, and adapter conversion. ChangesBuild and release configuration
Anonymization pipeline
Sequence Diagram(s)sequenceDiagram
participant native_adapter_perf
participant PreparedSearch
participant SearchIndex
participant merge_and_dedup
participant redact_text
native_adapter_perf->>PreparedSearch: redact_static_entities(full_text, operators)
PreparedSearch->>SearchIndex: find_matches(full_text)
PreparedSearch->>merge_and_dedup: merge_and_dedup(entities)
PreparedSearch->>redact_text: redact_text(full_text, entities, config)
PreparedSearch-->>native_adapter_perf: RedactionResult
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Dependency ReviewThe following issues were found:
|
There was a problem hiding this comment.
Actionable comments posted: 18
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/tools/sync-runtime-version.mjs:
- Around line 96-100: The version-sync regexes in syncTextVersion and the
related Cargo.toml/Cargo.lock patterns only match LF newlines, so CRLF checkouts
fail to sync correctly. Update the regexes used in sync-runtime-version.mjs to
tolerate Windows line endings, either by matching \r?\n in the relevant capture
patterns or by normalizing file text before applying them, so the existing sync
logic works across environments.
In @.github/workflows/ci.yml:
- Around line 84-99: The two parity steps in the CI workflow are hardcoding a
fetch of origin main, which can compare PRs against the wrong baseline on
non-main targets. Update the Migration fixture parity and Native fixture parity
steps to use the PR base branch dynamically via github.base_ref, with a fallback
to main, so the git fetch in these workflow jobs always matches the actual PR
target branch.
In @.github/workflows/dependency-review.yml:
- Around line 45-48: The allow-dependencies-licenses entries are currently broad
package waivers, so scope each Cargo package in dependency-review-action by
adding the specific license qualifier via the dependency-review workflow
configuration. Update the allow list in dependency-review.yml for
stella-aho-corasick-core, stella-fuzzy-search-core, and stella-regex-set-core to
use the appropriate ?license=<identifier> form, or remove them entirely if the
existing Apache-2.0 WITH LLVM-exception allowance already covers the intended
cases.
In `@crates/anonymize-adapter-contract/src/lib.rs`:
- Around line 741-743: prepared_search_package_digest currently returns the
header-declared digest from prepared_search_package_parts without validating it
against the payload, so update it to verify the package before returning the
digest. Reuse the existing verification flow used by
prepared_search_package_from_bytes, specifically
verify_prepared_search_package_digest, and ensure the function only returns the
digest after the payload integrity check succeeds. Keep the change localized to
prepared_search_package_digest and related helpers in this module.
In `@crates/anonymize-core/src/legal_forms.rs`:
- Around line 1358-1363: The uppercase check in the prefix-trimming logic is
being applied to the lowercased buffer, so it can never reliably detect the
original capitalization. Update the trimming path in the legal form parsing
logic to inspect the corresponding character(s) from the original text buffer
instead of the lowercased one, keeping the existing whitespace handling in
place. Use the same `leading_ws_len`/`after` flow, but base the
`char::is_uppercase` guard on the original source string so direct-prefix
trimming only succeeds when the original text באמת has an uppercase
continuation.
In `@crates/anonymize-core/src/name_corpus.rs`:
- Around line 557-565: The low-confidence person-name branch in name scoring is
being lost because the final `(score >=
HIGH_CONFIDENCE_NAME_SCORE).then_some(score)` gate filters it out. Update the
scoring flow in the `name_corpus` logic so the `LOW_CONFIDENCE_NAME_SCORE` path
is returned/emitted instead of being discarded, while keeping the existing
high-confidence behavior for stronger matches.
In `@crates/anonymize-core/src/normalize.rs`:
- Around line 296-304: The `find_ethereum_address` matcher is too permissive
because it accepts any 42-character slice after `0x`, even when the address is
embedded in a longer hex token. Update `find_ethereum_address` in `normalize.rs`
to use token-boundary checks like the Bech32/Base58 matching paths, so it only
returns an Ethereum address when the `0x`-prefixed 40-hex sequence is standalone
and not followed by additional hex characters. Ensure the logic still validates
the hex payload while rejecting longer embedded tokens.
In `@crates/anonymize-core/src/prepared.rs`:
- Around line 1767-1793: Validate the compact literal slice bounds against the
artifact-backed literal index even when literal_patterns is empty. Update the
guard in prepared.rs around the validate_slice_bounds calls so the checks for
slices.deny_list, slices.street_types, slices.gazetteer, slices.countries, and
slices.hotwords still run whenever literal artifacts are enabled, using the
actual literals.len() or equivalent artifact index length rather than relying
only on config.literal_patterns.len().
- Around line 1590-1601: split_regex_patterns is routing any unclaimed pattern
into regex, which can let indexes outside slices.regex or overlapping other
slices slip through and desync regex_meta from prepared artifacts. Update
split_regex_patterns to validate each pattern_index against the declared slices
before pushing it, using the existing pattern_index, slices.legal_forms, and
slices.triggers checks plus an explicit slices.regex membership/range check, and
reject or error on any pattern that does not belong to exactly one declared
slice.
In `@crates/anonymize-core/src/processors.rs`:
- Around line 1842-1857: The gazetteer span expansion in the processor logic is
too permissive because it extends any exact hit followed by a short
space-delimited token, so tighten the check before mutating end in the relevant
span-extension path. Update the logic around offsets.slice,
next_space_offset_after_initial, and the GazetteerExtension return to require
artifact-backed extension metadata or a stricter validation of the suffix token
before returning the expanded span. Keep the existing exact-hit handling, but
only allow extension when the extra token is explicitly supported by the
gazetteer artifact metadata or a stronger suffix rule.
- Around line 107-113: `StringGroups` currently derives `Deserialize`, which
bypasses the bounds checks performed by `from_table_indices` and can let invalid
group indexes slip through. Replace the derived deserialization on
`StringGroups` with a manual `Deserialize` implementation that first reads
`table` and `groups`, then reconstructs the value through `from_table_indices`
(or the same validation logic) so any out-of-range indices fail immediately.
Keep `StringGroup::iter` unchanged and use `StringGroups` as the validation
entry point.
In `@crates/anonymize-core/src/redact.rs`:
- Around line 95-97: The deanonymise logic in redact.rs is doing repeated global
replacements on the growing output, which can recursively expand
placeholder-like text inside restored values. Update the loop in deanonymise to
base replacements on the original redacted_text and apply each redaction_map
entry in a single pass so previously restored content is not rewritten by later
iterations; use the existing deanonymise/redaction_map flow and keep the
placeholder-to-original mapping intact.
In `@crates/anonymize-core/src/resolution/boundary.rs`:
- Around line 81-149: `resolve_cross_label_overlaps` mutates entity boundaries
in place but continues to assume `sorted` stays ordered by `start`, which can
break later overlap checks in chained/3-way overlap cases. After changing
`right_mut.start` or `left_mut.end`, re-establish ordering for the affected
portion (or restart the sweep from the modified region) before continuing, so
subsequent comparisons in `resolve_cross_label_overlaps` only evaluate truly
overlapping spans and do not truncate unrelated entities.
- Around line 177-226: The locked-entity branch in boundary resolution is not
updating the same-label tracking, so a later span can still merge past a locked
entity and cause it to be dropped by remove_nested_same_label. In the main loop
in boundary merging logic, make the has_locked_boundary(entity) path also
refresh last_by_label for entity.label to the newly appended index so locked
spans become merge barriers for subsequent same-label entities. Keep the fix
localized to the sorted-entity iteration and the merge decision flow that uses
last_by_label, gap_occupied, and merge_into_previous.
In `@crates/anonymize-core/src/resolution/merge.rs`:
- Line 74: The source-less sanitization path in merge resolution should not
adjust byte offsets because it may rewrite entities based on entity.text instead
of the original source slice. Update the merge flow around
resolve_same_span_label_conflicts to keep this pass text-only, and move any
offset-changing cleanup to sanitize_entities_with_source when full_text is
available. Make sure the entities coming from boundary handling remain aligned
with their original start/end bytes before any redaction step.
In `@crates/anonymize-core/src/resolution/types.rs`:
- Around line 27-35: Add a Serde rename rule to SourceDetail so its serialized
form matches the adapter contract’s kebab-case mapping. Update the enum
definition for SourceDetail to use #[serde(rename_all = "kebab-case")] alongside
the existing derives, ensuring variants like CustomDenyList serialize
consistently as kebab-case in all paths such as diagnostics and logs.
In `@crates/anonymize-core/src/search.rs`:
- Around line 227-254: The empty-pattern branch in new_with_artifacts is
incorrectly reusing any non-empty artifacts.slots as an all-literal search,
which can restore stale detector state instead of rejecting it. Update
SearchIndex::new_with_artifacts to only take the all-literal path when the
artifacts are explicitly known to be literal-compatible, and otherwise fail the
parity check when patterns is empty but persisted artifacts remain. Use the
existing new_with_artifacts, new_all_literal_with_artifacts, and
SearchIndexArtifacts/slots flow to locate and tighten this decision.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 55181e53-7cb0-4ab9-b422-356991e3bdfc
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.lockbun.lockis excluded by!**/*.lock
📒 Files selected for processing (149)
.cargo/config.toml.github/tools/check-packlist.mjs.github/tools/sync-runtime-version.mjs.github/workflows/ci.yml.github/workflows/dependency-review.yml.gitignoreCargo.tomlclippy.tomlcrates/anonymize-adapter-contract/Cargo.tomlcrates/anonymize-adapter-contract/examples/native_adapter_perf.rscrates/anonymize-adapter-contract/src/lib.rscrates/anonymize-core/Cargo.tomlcrates/anonymize-core/data/address-final-abbrevs.txtcrates/anonymize-core/data/identifier-cues.txtcrates/anonymize-core/data/legal-period-suffixes.txtcrates/anonymize-core/src/address_context.rscrates/anonymize-core/src/address_seeds.rscrates/anonymize-core/src/anchored.rscrates/anonymize-core/src/artifact_bytes.rscrates/anonymize-core/src/byte_offsets.rscrates/anonymize-core/src/coreference.rscrates/anonymize-core/src/dates.rscrates/anonymize-core/src/diagnostics.rscrates/anonymize-core/src/false_positives.rscrates/anonymize-core/src/hotwords.rscrates/anonymize-core/src/legal_forms.rscrates/anonymize-core/src/lib.rscrates/anonymize-core/src/money.rscrates/anonymize-core/src/name_corpus.rscrates/anonymize-core/src/normalize.rscrates/anonymize-core/src/placeholders.rscrates/anonymize-core/src/prepared.rscrates/anonymize-core/src/processors.rscrates/anonymize-core/src/redact.rscrates/anonymize-core/src/resolution/boundary.rscrates/anonymize-core/src/resolution/common.rscrates/anonymize-core/src/resolution/merge.rscrates/anonymize-core/src/resolution/mod.rscrates/anonymize-core/src/resolution/sanitize.rscrates/anonymize-core/src/resolution/types.rscrates/anonymize-core/src/search.rscrates/anonymize-core/src/signatures.rscrates/anonymize-core/src/triggers.rscrates/anonymize-core/src/types.rscrates/anonymize-core/src/validators.rscrates/anonymize-core/src/zones.rscrates/anonymize-core/tests/address_seed_parity.rscrates/anonymize-core/tests/false_positive_parity.rscrates/anonymize-core/tests/normalize.rscrates/anonymize-core/tests/prepared.rscrates/anonymize-core/tests/primitives_properties.rscrates/anonymize-core/tests/processors.rscrates/anonymize-core/tests/redaction.rscrates/anonymize-core/tests/resolution.rscrates/anonymize-core/tests/search.rscrates/anonymize-core/tests/trigger_parity.rscrates/anonymize-napi/Cargo.tomlcrates/anonymize-napi/build.rscrates/anonymize-napi/src/lib.rscrates/anonymize-py/Cargo.tomlcrates/anonymize-py/build.rscrates/anonymize-py/pyproject.tomlcrates/anonymize-py/src/lib.rspackage.jsonpackages/anonymize/.gitignorepackages/anonymize/README.mdpackages/anonymize/index.cjspackages/anonymize/package.jsonpackages/anonymize/scripts/build-native-node.mjspackages/anonymize/scripts/build-native-pipeline-package.mjspackages/anonymize/scripts/dist-smoke.mjspackages/anonymize/scripts/migration-fixture-perf.mjspackages/anonymize/scripts/native-adapter-perf.mjspackages/anonymize/src/__test__/countries.test.tspackages/anonymize/src/__test__/dictionary-bundle.test.tspackages/anonymize/src/__test__/load-dictionaries.tspackages/anonymize/src/__test__/native-adapter-parity.test.tspackages/anonymize/src/__test__/native-node.test.tspackages/anonymize/src/__test__/pipeline-config.test.tspackages/anonymize/src/build-unified-search.tspackages/anonymize/src/context.tspackages/anonymize/src/data/address-boundaries.jsonpackages/anonymize/src/data/address-context.jsonpackages/anonymize/src/data/address-jurisdiction-prefixes.jsonpackages/anonymize/src/data/address-stop-keywords.jsonpackages/anonymize/src/data/address-unit-abbreviations.jsonpackages/anonymize/src/data/ambiguous-country-surfaces.jsonpackages/anonymize/src/data/clause-noun-heads.jsonpackages/anonymize/src/data/coreference-org-determiners.jsonpackages/anonymize/src/data/defined-term-heads.jsonpackages/anonymize/src/data/deny-list-filters.jsonpackages/anonymize/src/data/false-positive-shapes.jsonpackages/anonymize/src/data/language-scopes.jsonpackages/anonymize/src/data/legal-form-rule-words.jsonpackages/anonymize/src/data/legal-role-heads.cs.jsonpackages/anonymize/src/data/name-corpus-cjk.jsonpackages/anonymize/src/data/name-corpus-particles.jsonpackages/anonymize/src/data/organization-indicators.jsonpackages/anonymize/src/data/organization-unit-heads.jsonpackages/anonymize/src/data/person-stopwords.jsonpackages/anonymize/src/data/signing-clauses.jsonpackages/anonymize/src/detectors/address-seeds.tspackages/anonymize/src/detectors/countries.tspackages/anonymize/src/detectors/deny-list.tspackages/anonymize/src/detectors/legal-forms.tspackages/anonymize/src/detectors/regex.tspackages/anonymize/src/detectors/triggers.tspackages/anonymize/src/filters/confidence-boost.tspackages/anonymize/src/filters/false-positives.tspackages/anonymize/src/filters/hotword-rules.tspackages/anonymize/src/index-shared.tspackages/anonymize/src/language-scope.tspackages/anonymize/src/native-default-config.tspackages/anonymize/src/native-node.tspackages/anonymize/src/native-pipeline.tspackages/anonymize/src/native.tspackages/anonymize/src/pipeline-cache-key.tspackages/anonymize/src/pipeline.tspackages/anonymize/src/types.tspackages/anonymize/tsdown.config.tspackages/anonymize/wasm/package.jsonpackages/cli/package.jsonpackages/cli/src/dictionary-scope.tspackages/corpus/package.jsonpackages/data/config/address-boundaries.jsonpackages/data/config/address-context.jsonpackages/data/config/address-jurisdiction-prefixes.jsonpackages/data/config/address-stop-keywords.jsonpackages/data/config/address-unit-abbreviations.jsonpackages/data/config/ambiguous-country-surfaces.jsonpackages/data/config/clause-noun-heads.jsonpackages/data/config/coreference-org-determiners.jsonpackages/data/config/defined-term-heads.jsonpackages/data/config/deny-list-filters.jsonpackages/data/config/false-positive-shapes.jsonpackages/data/config/language-scopes.jsonpackages/data/config/legal-form-rule-words.jsonpackages/data/config/legal-role-heads.cs.jsonpackages/data/config/name-corpus-cjk.jsonpackages/data/config/name-corpus-particles.jsonpackages/data/config/organization-indicators.jsonpackages/data/config/organization-unit-heads.jsonpackages/data/config/person-stopwords.jsonpackages/data/config/signing-clauses.jsonpackages/data/dictionaries/index.tspackages/data/package.jsonrust-toolchain.tomlrustfmt.tomlturbo.json
| fn nearest_right_non_address( | ||
| right_pos: usize, | ||
| existing_entities: &[PipelineEntity], | ||
| ) -> Option<usize> { | ||
| existing_entities | ||
| .iter() | ||
| .filter(|entity| non_address_label(&entity.label)) | ||
| .filter_map(|entity| { | ||
| let start = usize::try_from(entity.start).ok()?; | ||
| let offset = start.saturating_sub(right_pos); | ||
| (offset > 0).then_some(offset) | ||
| }) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Stop expansion when a non-address entity starts at right_pos.
Line 981 ignores an adjacent non-address entity because offset == 0 is filtered out, allowing address expansion to run into that span.
Proposed fix
- let offset = start.saturating_sub(right_pos);
- (offset > 0).then_some(offset)
+ (start >= right_pos).then_some(start.saturating_sub(right_pos))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| fn nearest_right_non_address( | |
| right_pos: usize, | |
| existing_entities: &[PipelineEntity], | |
| ) -> Option<usize> { | |
| existing_entities | |
| .iter() | |
| .filter(|entity| non_address_label(&entity.label)) | |
| .filter_map(|entity| { | |
| let start = usize::try_from(entity.start).ok()?; | |
| let offset = start.saturating_sub(right_pos); | |
| (offset > 0).then_some(offset) | |
| }) | |
| fn nearest_right_non_address( | |
| right_pos: usize, | |
| existing_entities: &[PipelineEntity], | |
| ) -> Option<usize> { | |
| existing_entities | |
| .iter() | |
| .filter(|entity| non_address_label(&entity.label)) | |
| .filter_map(|entity| { | |
| let start = usize::try_from(entity.start).ok()?; | |
| (start >= right_pos).then_some(start.saturating_sub(right_pos)) | |
| }) |
| insert_at_or_push(&mut merged, insert_at, entity); | ||
| } | ||
|
|
||
| resolve_same_span_label_conflicts(&sanitize_entities(&merged)) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Do not shift byte offsets in the source-less sanitize pass.
This call sanitizes from entity.text instead of the original source slice. That is not a safe invariant here: boundary.rs Lines 46-49 already handle entities whose text differs from full_text[start..end]. If one of those entities reaches this path, sanitize_entities can move start/end away from the real bytes, and later redaction will slice the wrong region. Keep this pass text-only, or defer offset-changing cleanup to sanitize_entities_with_source.
Safer direction
- resolve_same_span_label_conflicts(&sanitize_entities(&merged))
+ resolve_same_span_label_conflicts(&merged)Apply sanitize_entities_with_source only once full_text is available.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| resolve_same_span_label_conflicts(&sanitize_entities(&merged)) | |
| resolve_same_span_label_conflicts(&merged) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/anonymize-core/src/resolution/merge.rs` at line 74, The source-less
sanitization path in merge resolution should not adjust byte offsets because it
may rewrite entities based on entity.text instead of the original source slice.
Update the merge flow around resolve_same_span_label_conflicts to keep this pass
text-only, and move any offset-changing cleanup to sanitize_entities_with_source
when full_text is available. Make sure the entities coming from boundary
handling remain aligned with their original start/end bytes before any redaction
step.
…process lifecycle
…o ref, env var namespace, port retry
Port of PR stella#217 from stella/anonymize
Summary by CodeRabbit
New Features
Bug Fixes