feat: add anonymize core crate #217
341eb7f to b916fa1
Dependency Review
The following issues were found:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b916fa1630
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Reviewed commit: c664ed09a6
Reviewed commit: fcbb328f84
Reviewed commit: fd14f11c0c
Reviewed commit: 163ee75fca
Reviewed commit: 4fc897746a
Reviewed commit: 31dc9df699
Reviewed commit: 32ef71ca81
Reviewed commit: 4d6303598b
Reviewed commit: 2fe0ef7984
Reviewed commit: 3b43389136
Reviewed commit: f76f9cbd85
Reviewed commit: eac07ba045
Reviewed commit: 1d494e92f2
Reviewed commit: 5688ee13d1
Reviewed commit: 6435474f06
Reviewed commit: b3641361a0
Reviewed commit: 99460dbb97
Reviewed commit: d343b8c25d
Reviewed commit: f7ec00f449
Reviewed commit: c39163994a
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds Rust workspace and lint configuration, core search and resolution primitives, detection rules for dates, money, legal forms, addresses, names, coreference, and hotwords, plus prepared-search orchestration and adapter-contract serialization.
Changes
Rust anonymization pipeline
Sequence Diagram(s)
sequenceDiagram
participant PreparedSearch
participant SearchIndex
participant enforce_boundary_consistency
participant sanitize_entities
participant filter_entity_false_positives
participant redact_text
PreparedSearch->>SearchIndex: find_matches(full_text)
SearchIndex-->>PreparedSearch: PreparedSearchMatches
PreparedSearch->>enforce_boundary_consistency: normalize PipelineEntity spans
PreparedSearch->>sanitize_entities: clean entity text
PreparedSearch->>filter_entity_false_positives: drop rejected entities
PreparedSearch->>redact_text: build redacted_text
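The stage order in the diagram can be sketched as a small, self-contained pipeline. This is only an illustration: the function names follow the diagram's participants, but the `Entity` type, the byte offsets, the literal search, and every function body are assumptions rather than the crate's actual code.

```rust
// Toy stand-ins: stage names follow the diagram, the bodies are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Entity {
    start: usize, // byte offsets into the input, for simplicity
    end: usize,
    label: &'static str,
}

// Stand-in for SearchIndex::find_matches: plain literal search.
fn find_matches(text: &str, needles: &[(&str, &'static str)]) -> Vec<Entity> {
    let mut found = Vec::new();
    for &(needle, label) in needles {
        for (start, hit) in text.match_indices(needle) {
            found.push(Entity { start, end: start + hit.len(), label });
        }
    }
    found.sort_by_key(|e| e.start);
    found
}

// Grow a span until it no longer cuts a word in half.
fn enforce_boundary_consistency(text: &str, mut e: Entity) -> Entity {
    while let Some(prev) = text[..e.start].chars().next_back() {
        if !prev.is_alphanumeric() {
            break;
        }
        e.start -= prev.len_utf8();
    }
    while let Some(next) = text[e.end..].chars().next() {
        if !next.is_alphanumeric() {
            break;
        }
        e.end += next.len_utf8();
    }
    e
}

// Trim whitespace and stray punctuation from the span edges.
fn sanitize_entities(text: &str, mut e: Entity) -> Entity {
    let junk = |c: char| c.is_whitespace() || c == ',' || c == '.';
    let raw = &text[e.start..e.end];
    let no_prefix = raw.trim_start_matches(junk);
    let cleaned = no_prefix.trim_end_matches(junk);
    e.start += raw.len() - no_prefix.len();
    e.end = e.start + cleaned.len();
    e
}

// Runs the stages in the diagram's order; assumes spans do not overlap.
fn redact_text(text: &str, needles: &[(&str, &'static str)], deny: &[&str]) -> String {
    let mut entities: Vec<Entity> = find_matches(text, needles)
        .into_iter()
        .map(|e| enforce_boundary_consistency(text, e))
        .map(|e| sanitize_entities(text, e))
        // Stand-in for filter_entity_false_positives: drop deny-listed surfaces.
        .filter(|e| !deny.contains(&&text[e.start..e.end]))
        .collect();
    entities.dedup();
    let mut out = text.to_owned();
    // Replace from the back so earlier offsets stay valid.
    for e in entities.iter().rev() {
        out.replace_range(e.start..e.end, &format!("[{}]", e.label));
    }
    out
}
```

Replacing from the last span backwards keeps the remaining offsets valid without any re-indexing, which is why the final loop iterates in reverse.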
Estimated code review effort
🎯 5 (Critical) | ⏱️ ~90+ minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 13
🧹 Nitpick comments (4)
crates/anonymize-adapter-contract/examples/native_adapter_perf.rs (1)
39-49: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win
Move operator JSON parsing out of the timed run loop.
`runMs` currently includes repeated `serde_json` parsing and operator conversion for every iteration/case, so the metric is not isolated to `PreparedSearch::redact_static_entities`.
Proposed refactor
    +    let run_cases = payload
    +        .cases
    +        .iter()
    +        .map(|item| -> Result<_, Box<dyn std::error::Error>> {
    +            let operators = item
    +                .operators_json
    +                .as_deref()
    +                .map(serde_json::from_str::<BindingOperatorConfig>)
    +                .transpose()?;
    +            let operators = operator_config_from_binding(operators)?;
    +            Ok((&item.text, operators))
    +        })
    +        .collect::<Result<Vec<_>, _>>()?;
    +
         let run_start = Instant::now();
         let mut entity_count = 0_usize;
         for _ in 0..payload.iterations {
    -        for item in &payload.cases {
    -            let operators = item
    -                .operators_json
    -                .as_deref()
    -                .map(serde_json::from_str::<BindingOperatorConfig>)
    -                .transpose()?;
    -            let operators = operator_config_from_binding(operators)?;
    -            let result = prepared.redact_static_entities(&item.text, &operators)?;
    +        for (text, operators) in &run_cases {
    +            let result = prepared.redact_static_entities(text, operators)?;
                 entity_count = entity_count.saturating_add(result.redaction.entity_count);
             }
         }

crates/anonymize-core/src/prepared.rs (1)
1923-1956: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Dead `else` branch in `validate_hotword_config`.
Line 1928 returns `Ok(())` when `hotword_data.is_none()`, so the `let Some(data) = &config.hotword_data else { … }` at lines 1932-1936 can never hit its `else` arm. Collapse to a direct bind.
♻️ Simplify
    -    if config.hotword_data.is_none() {
    -        return Ok(());
    -    }
    -
    -    let Some(data) = &config.hotword_data else {
    -        return Err(Error::MissingStaticData {
    -            field: "hotword_data",
    -        });
    -    };
    +    let Some(data) = &config.hotword_data else {
    +        return Ok(());
    +    };

crates/anonymize-core/src/byte_offsets.rs (1)
42-47: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
Tie `slice` to the text owned by `ByteOffsets`.
`start`/`end` are validated against `self.text`, but the range is applied to a separate `full_text`. Remove the extra parameter and slice `self.text` so callers cannot accidentally validate one string and read another.
Proposed refactor
         pub(crate) fn slice(
             &self,
    -        full_text: &str,
             start: u32,
             end: u32,
         ) -> Result<String> {
    @@
             Ok(
    -            full_text
    +            self.text
                     .get(start_byte..end_byte)
                     .ok_or(Error::InvalidSpan { start, end })?
                     .to_owned(),

crates/anonymize-core/src/resolution/boundary.rs (1)
361-397: 🚀 Performance & Scalability | 🔵 Trivial
Avoid rescanning `spans` in `word_start_at`/`word_end_at`.
Both helpers linearly search the whole `spans` slice on every step of a per-character loop, which makes each call quadratic on long inputs. Since `spans` is sorted, `partition_point` or a tracked index would keep the scan linear, and `fix_partial_words` calls both helpers for every entity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/ci.yml:
- Around line 46-49: The Rust setup in the CI workflow is using the moving
stable channel, which can change formatting or lint behavior unexpectedly.
Update the Setup Rust step in the workflow or add a repo-level
rust-toolchain.toml so the toolchain version is explicitly pinned, and make sure
the rustup install/default commands use that fixed version consistently.
In `@crates/anonymize-adapter-contract/src/lib.rs`:
- Line 1816: The diagnostic offset conversion in convert_diagnostic_offsets is
leaving raw byte offsets in UTF-16 binding results when conversion fails, which
breaks the adapter contract. Update the UTF-16 paths that call
convert_diagnostic_offsets in the diagnostic binding flow so offsets are never
mixed, and make the conversion fail explicitly or otherwise ensure every
diagnostic offset is converted before returning result.diagnostics.events. Use
the existing symbols convert_diagnostic_offsets, *_to_utf16_binding, and
result.diagnostics.events to locate and fix all affected call sites.
In `@crates/anonymize-core/src/address_context.rs`:
- Line 51: The bare-house stopword matching in AddressContext is inconsistent
because `bare_house_stopwords` is stored without normalization while the regex
can capture capitalized words. Update the `AddressContext` setup and the
lookup/comparison path so both the configured stopwords and incoming values are
lowercased before being collected and checked, ensuring entries like `may` match
`May 1` consistently.
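One way to keep both sides normalized is to lowercase at construction and again at lookup. The sketch below uses a hypothetical `BareHouseStopwords` wrapper; the actual field layout of `AddressContext` is not shown in this review, so treat the type and method names as assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical wrapper: stopwords are lowercased once when collected.
struct BareHouseStopwords {
    words: HashSet<String>,
}

impl BareHouseStopwords {
    fn new<'a>(words: impl IntoIterator<Item = &'a str>) -> Self {
        Self {
            words: words.into_iter().map(str::to_lowercase).collect(),
        }
    }

    /// Lowercases the captured word before the lookup, so `May` matches `may`.
    fn contains(&self, candidate: &str) -> bool {
        self.words.contains(&candidate.to_lowercase())
    }
}
```

Normalizing in the constructor means a configuration entry written as `March` behaves the same as `march`, and callers cannot forget to lowercase because the lookup does it for them.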
In `@crates/anonymize-core/src/address_seeds.rs`:
- Around line 21-25: The AddressSeedData deserialization currently fails when
any of its list fields are omitted, because the vectors are required instead of
defaulting to empty. Update AddressSeedData so boundary_words, br_cep_cue_words,
and unit_abbreviations deserialize with defaults, matching the optional
static-data behavior used elsewhere in the crate; keep the fix focused on the
AddressSeedData struct and its serde annotations.
In `@crates/anonymize-core/src/anchored.rs`:
- Around line 93-107: The anchored extraction flow is using raw SearchMatch
offsets as if they were byte indices, which causes incorrect slicing in anchored
rules after non-ASCII text. Update AnchorSpan and the anchored path in
anchored.rs so anchor_span converts or carries validated byte indices, and
ensure extract plus the rule extraction path use ByteOffsets or equivalent
byte-safe slicing before any str::get calls. Keep the fix localized around
anchor_span, AnchorSpan, and pub(crate) fn extract so all anchored rule spans
remain aligned with the original text.
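For illustration, a UTF-16 offset can be turned into a validated byte index before slicing. This is a sketch with hypothetical helper names, not the crate's `ByteOffsets` API, and it assumes the incoming offsets count UTF-16 code units.

```rust
/// Converts a UTF-16 code-unit offset into a byte index.
/// Returns None if the offset is past the end or splits a surrogate pair.
fn utf16_to_byte(text: &str, utf16_offset: usize) -> Option<usize> {
    let mut units = 0usize;
    for (byte_idx, ch) in text.char_indices() {
        if units == utf16_offset {
            return Some(byte_idx);
        }
        if units > utf16_offset {
            return None; // offset points inside a surrogate pair
        }
        units += ch.len_utf16();
    }
    if units == utf16_offset {
        Some(text.len())
    } else {
        None
    }
}

/// Slices `text` by UTF-16 offsets without ever indexing by raw offsets.
fn slice_utf16(text: &str, start: usize, end: usize) -> Option<&str> {
    let start_byte = utf16_to_byte(text, start)?;
    let end_byte = utf16_to_byte(text, end)?;
    text.get(start_byte..end_byte)
}
```

Because `str::get` returns `None` for inverted or misaligned ranges, a bad span surfaces as a recoverable error instead of a panic.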
In `@crates/anonymize-core/src/dates.rs`:
- Around line 170-185: The date_entity helper is passing byte-based start/end
values into PipelineEntity::detected, which breaks the UTF-16 offset contract.
Update date_entity in dates.rs to convert the local &str byte positions back
into UTF-16 offsets before building the PipelineEntity, using the existing
str_slice/full_text context so detected date spans stay aligned after non-ASCII
text.
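The reverse conversion, from a byte position in the local `&str` back to a UTF-16 offset, can be sketched as below. The helper names are hypothetical and the function returns `usize` for simplicity, while the review's spans use `u32`.

```rust
/// Counts the UTF-16 code units that precede `byte_idx` in `text`.
/// Returns None if `byte_idx` is out of range or not on a char boundary.
fn byte_to_utf16(text: &str, byte_idx: usize) -> Option<usize> {
    let prefix = text.get(..byte_idx)?;
    Some(prefix.encode_utf16().count())
}

/// Converts a byte span into the UTF-16 span an entity would publish.
fn utf16_span(full_text: &str, start_byte: usize, end_byte: usize) -> Option<(usize, usize)> {
    if start_byte > end_byte {
        return None;
    }
    Some((
        byte_to_utf16(full_text, start_byte)?,
        byte_to_utf16(full_text, end_byte)?,
    ))
}
```

This version rescans the prefix on every call; a detector that emits many spans would likely want a precomputed offset table instead.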
In `@crates/anonymize-core/src/false_positives.rs`:
- Around line 619-633: The ambiguous address-component check in
is_only_ambiguous_component is still using the original text, so capitalized
terms like Street can slip through when filters are lowercase. Update the
matching path around find_ambiguous_component_occurrence and the stripped-text
check to compare case-insensitively, consistent with has_address_component’s
lowercasing behavior, and apply the same fix to the other affected branch in the
same function.
In `@crates/anonymize-core/src/legal_forms.rs`:
- Around line 109-145: The legal-form span collection is currently carrying byte
offsets from the candidate-building logic into the published `PipelineEntity`
spans, which violates the UTF-16 redaction contract. Update the span creation
flow in the legal-forms path so `Candidate` offsets are converted to UTF-16
before being emitted, and make sure the final `PipelineEntity` construction uses
those UTF-16 positions rather than raw slice indices. Keep the conversion
centralized near the code that builds and publishes candidates so the byte-based
Rust slicing remains internal while all external offsets stay UTF-16.
In `@crates/anonymize-core/src/money.rs`:
- Around line 553-567: The money_entity helper is returning byte-based start/end
positions instead of UTF-16 offsets, which can misalign redaction spans for
non-ASCII text. Update money_entity to convert the detected byte span into
UTF-16 offsets before calling PipelineEntity::detected, using the existing
full_text/start/end context and keeping the str_slice-derived detected value
unchanged.
In `@crates/anonymize-core/src/normalize.rs`:
- Around line 396-409: The generic identifier normalization in normalize.rs is
letting trailing prose be merged into the key when whitespace is allowed, so
update the token scanning logic around is_generic_identifier and the final
return from last_valid/compact to stop at the last valid token boundary instead
of accepting the fully expanded string. Adjust the loop that uses
is_identifier_separator and the final predicate(&compact) check so
label-specific stopping rules or a boundary check prevent cases like a valid
identifier followed by words from normalizing into one concatenated key.
In `@crates/anonymize-core/src/processors.rs`:
- Around line 1367-1372: The entity span extension logic in ExtendedName/related
offset handling is mixing byte lengths with PipelineEntity offsets, which causes
drift for non-ASCII text. Update the arithmetic in the affected offset-extension
paths to use UTF-16 code-unit counts for offset deltas, and keep byte indices
only for local string slicing via ByteOffsets. Review the ExtendedName
construction and the other offset-adjustment blocks referenced in the comment so
all added/subtracted suffix or district lengths are derived consistently from
UTF-16 length calculations.
In `@crates/anonymize-core/src/resolution/sanitize.rs`:
- Around line 108-117: The span adjustment in sanitize.rs is using UTF-8 byte
lengths for PipelineEntity offsets, which breaks the UTF-16 offset contract.
Update the logic around the display_text, start, and end calculations to measure
the trimmed prefix and cleaned text in UTF-16 units instead of byte_len, while
keeping the existing sanitization flow in sanitize() intact. Make sure the
cloned entity’s start/end fields are derived from UTF-16 code unit counts so
redaction stays aligned for non-ASCII text like José.
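A sketch of the span arithmetic in UTF-16 units, assuming the entity carries a UTF-16 `start` plus its raw text and that sanitization only trims characters from the edges. The `sanitize_span` function and its trim set are hypothetical.

```rust
/// Length of `s` in UTF-16 code units.
fn utf16_len(s: &str) -> u32 {
    s.encode_utf16().count() as u32
}

/// Trims edge junk from `raw` and returns the adjusted UTF-16 span and text.
/// `start` is the UTF-16 offset of `raw` within the full document.
fn sanitize_span(start: u32, raw: &str) -> (u32, u32, String) {
    let junk = |c: char| c.is_whitespace() || matches!(c, ',' | ';' | '“' | '”');
    let no_prefix = raw.trim_start_matches(junk);
    let cleaned = no_prefix.trim_end_matches(junk);
    // Byte lengths are used only to locate the trimmed prefix inside `raw`.
    let prefix = &raw[..raw.len() - no_prefix.len()];
    // Offsets move by UTF-16 units, never by bytes.
    let new_start = start + utf16_len(prefix);
    (new_start, new_start + utf16_len(cleaned), cleaned.to_owned())
}
```

With the input `“José, ` at offset 10, the opening quote is three bytes but one UTF-16 unit, so a byte-based delta would move `start` to 13 instead of 11.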
In `@crates/anonymize-core/src/search.rs`:
- Around line 128-130: `read_slots` in `search.rs` preallocates `slots` directly
from the serialized `count`, which can be attacker-controlled and cause an
oversized allocation. Update the deserialization flow around
`reader.read_usize()` and the `for _ in 0..count` loop to either validate
`count` against the remaining input before allocating or build `slots`
incrementally without `Vec::with_capacity(count)`.
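The validate-before-allocate option could look like the sketch below. The wire layout (a little-endian `u32` count followed by `u32` slots) and the error type are assumptions made for the example; the real `read_slots` format is not shown in this review.

```rust
const SLOT_SIZE: usize = 4; // assumed: each slot is one little-endian u32

/// Reads `count` slots, rejecting counts the remaining input cannot back.
fn read_slots(input: &[u8]) -> Result<Vec<u32>, &'static str> {
    if input.len() < 4 {
        return Err("truncated count");
    }
    let (head, body) = input.split_at(4);
    let count = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;

    // Check the claimed size against the bytes actually present
    // before reserving any memory.
    let needed = count.checked_mul(SLOT_SIZE).ok_or("count overflow")?;
    if needed > body.len() {
        return Err("count exceeds remaining input");
    }

    // Safe to preallocate: `count` is now bounded by the input length.
    let mut slots = Vec::with_capacity(count);
    for chunk in body[..needed].chunks_exact(SLOT_SIZE) {
        slots.push(u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]));
    }
    Ok(slots)
}
```

Bounding `count` by the remaining bytes caps the allocation at roughly the size of the input, so a forged header cannot request gigabytes from a few bytes of data.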
---
Nitpick comments:
In `@crates/anonymize-adapter-contract/examples/native_adapter_perf.rs`:
- Around line 39-49: Move the operator JSON parsing and conversion out of the
timed section in native_adapter_perf.rs so runMs only measures
PreparedSearch::redact_static_entities. Precompute each case’s operators (the
serde_json::from_str::<BindingOperatorConfig> and operator_config_from_binding
work) before run_start is recorded, then reuse the parsed result inside the
payload.iterations loop. Keep the timing around the redact call only, and use
the existing payload.cases, prepared.redact_static_entities, and run_start flow
to locate the refactor.
In `@crates/anonymize-core/src/byte_offsets.rs`:
- Around line 42-47: The ByteOffsets::slice method currently validates offsets
against self.text but slices a separate full_text argument, which can desync
validation from the actual data being read. Remove the extra parameter from
slice, update the implementation to use self.text for the substring extraction,
and adjust any callers so they only pass start/end and cannot mix different
source strings.
In `@crates/anonymize-core/src/prepared.rs`:
- Around line 1923-1956: The `validate_hotword_config` function contains a dead
`else` branch because it already returns `Ok(())` when `config.hotword_data` is
`None`, so the `let Some(data) = &config.hotword_data else { ... }` fallback can
never run. Simplify the control flow by removing the unreachable `else` arm and
binding `data` directly from `config.hotword_data` in `validate_hotword_config`,
keeping the existing rule and hotword validation logic unchanged.
In `@crates/anonymize-core/src/resolution/boundary.rs`:
- Around line 361-397: The helper functions word_start_at and word_end_at are
rescanning the entire sorted spans slice on every loop iteration, making
fix_partial_words much slower on long inputs. Update these helpers to avoid
repeated linear searches by using the ordering of spans with partition_point or
by carrying a moving index as cursor advances, and keep the logic in boundary.rs
centered around the existing word_start_at/word_end_at and fix_partial_words
flow.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 7048aa2c-0513-4cbb-ba50-7a27dbf8ddc6
⛔ Files ignored due to path filters (2)
- `Cargo.lock` is excluded by `!**/*.lock`
- `bun.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (146)
- .cargo/config.toml
- .github/tools/check-packlist.mjs
- .github/tools/sync-runtime-version.mjs
- .github/workflows/ci.yml
- .github/workflows/dependency-review.yml
- .gitignore
- Cargo.toml
- clippy.toml
- crates/anonymize-adapter-contract/Cargo.toml
- crates/anonymize-adapter-contract/examples/native_adapter_perf.rs
- crates/anonymize-adapter-contract/src/lib.rs
- crates/anonymize-core/Cargo.toml
- crates/anonymize-core/data/address-final-abbrevs.txt
- crates/anonymize-core/data/identifier-cues.txt
- crates/anonymize-core/data/legal-period-suffixes.txt
- crates/anonymize-core/src/address_context.rs
- crates/anonymize-core/src/address_seeds.rs
- crates/anonymize-core/src/anchored.rs
- crates/anonymize-core/src/artifact_bytes.rs
- crates/anonymize-core/src/byte_offsets.rs
- crates/anonymize-core/src/coreference.rs
- crates/anonymize-core/src/dates.rs
- crates/anonymize-core/src/diagnostics.rs
- crates/anonymize-core/src/false_positives.rs
- crates/anonymize-core/src/hotwords.rs
- crates/anonymize-core/src/legal_forms.rs
- crates/anonymize-core/src/lib.rs
- crates/anonymize-core/src/money.rs
- crates/anonymize-core/src/name_corpus.rs
- crates/anonymize-core/src/normalize.rs
- crates/anonymize-core/src/placeholders.rs
- crates/anonymize-core/src/prepared.rs
- crates/anonymize-core/src/processors.rs
- crates/anonymize-core/src/redact.rs
- crates/anonymize-core/src/resolution/boundary.rs
- crates/anonymize-core/src/resolution/common.rs
- crates/anonymize-core/src/resolution/merge.rs
- crates/anonymize-core/src/resolution/mod.rs
- crates/anonymize-core/src/resolution/sanitize.rs
- crates/anonymize-core/src/resolution/types.rs
- crates/anonymize-core/src/search.rs
- crates/anonymize-core/src/signatures.rs
- crates/anonymize-core/src/triggers.rs
- crates/anonymize-core/src/types.rs
- crates/anonymize-core/src/validators.rs
- crates/anonymize-core/src/zones.rs
- crates/anonymize-core/tests/address_seed_parity.rs
- crates/anonymize-core/tests/false_positive_parity.rs
- crates/anonymize-core/tests/normalize.rs
- crates/anonymize-core/tests/prepared.rs
- crates/anonymize-core/tests/processors.rs
- crates/anonymize-core/tests/redaction.rs
- crates/anonymize-core/tests/resolution.rs
- crates/anonymize-core/tests/search.rs
- crates/anonymize-core/tests/trigger_parity.rs
- crates/anonymize-napi/Cargo.toml
- crates/anonymize-napi/build.rs
- crates/anonymize-napi/src/lib.rs
- crates/anonymize-py/Cargo.toml
- crates/anonymize-py/build.rs
- crates/anonymize-py/pyproject.toml
- crates/anonymize-py/src/lib.rs
- package.json
- packages/anonymize/.gitignore
- packages/anonymize/README.md
- packages/anonymize/index.cjs
- packages/anonymize/package.json
- packages/anonymize/scripts/build-native-node.mjs
- packages/anonymize/scripts/build-native-pipeline-package.mjs
- packages/anonymize/scripts/dist-smoke.mjs
- packages/anonymize/scripts/migration-fixture-perf.mjs
- packages/anonymize/scripts/native-adapter-perf.mjs
- packages/anonymize/src/__test__/countries.test.ts
- packages/anonymize/src/__test__/dictionary-bundle.test.ts
- packages/anonymize/src/__test__/load-dictionaries.ts
- packages/anonymize/src/__test__/native-adapter-parity.test.ts
- packages/anonymize/src/__test__/native-node.test.ts
- packages/anonymize/src/__test__/pipeline-config.test.ts
- packages/anonymize/src/build-unified-search.ts
- packages/anonymize/src/context.ts
- packages/anonymize/src/data/address-boundaries.json
- packages/anonymize/src/data/address-context.json
- packages/anonymize/src/data/address-jurisdiction-prefixes.json
- packages/anonymize/src/data/address-stop-keywords.json
- packages/anonymize/src/data/address-unit-abbreviations.json
- packages/anonymize/src/data/ambiguous-country-surfaces.json
- packages/anonymize/src/data/clause-noun-heads.json
- packages/anonymize/src/data/coreference-org-determiners.json
- packages/anonymize/src/data/defined-term-heads.json
- packages/anonymize/src/data/deny-list-filters.json
- packages/anonymize/src/data/false-positive-shapes.json
- packages/anonymize/src/data/language-scopes.json
- packages/anonymize/src/data/legal-form-rule-words.json
- packages/anonymize/src/data/legal-role-heads.cs.json
- packages/anonymize/src/data/name-corpus-cjk.json
- packages/anonymize/src/data/name-corpus-particles.json
- packages/anonymize/src/data/organization-indicators.json
- packages/anonymize/src/data/organization-unit-heads.json
- packages/anonymize/src/data/person-stopwords.json
- packages/anonymize/src/data/signing-clauses.json
- packages/anonymize/src/detectors/address-seeds.ts
- packages/anonymize/src/detectors/countries.ts
- packages/anonymize/src/detectors/deny-list.ts
- packages/anonymize/src/detectors/legal-forms.ts
- packages/anonymize/src/detectors/regex.ts
- packages/anonymize/src/detectors/triggers.ts
- packages/anonymize/src/filters/confidence-boost.ts
- packages/anonymize/src/filters/false-positives.ts
- packages/anonymize/src/filters/hotword-rules.ts
- packages/anonymize/src/index-shared.ts
- packages/anonymize/src/language-scope.ts
- packages/anonymize/src/native-default-config.ts
- packages/anonymize/src/native-node.ts
- packages/anonymize/src/native-pipeline.ts
- packages/anonymize/src/native.ts
- packages/anonymize/src/pipeline-cache-key.ts
- packages/anonymize/src/pipeline.ts
- packages/anonymize/src/types.ts
- packages/anonymize/tsdown.config.ts
- packages/anonymize/wasm/package.json
- packages/cli/package.json
- packages/cli/src/dictionary-scope.ts
- packages/corpus/package.json
- packages/data/config/address-boundaries.json
- packages/data/config/address-context.json
- packages/data/config/address-jurisdiction-prefixes.json
- packages/data/config/address-stop-keywords.json
- packages/data/config/address-unit-abbreviations.json
- packages/data/config/ambiguous-country-surfaces.json
- packages/data/config/clause-noun-heads.json
- packages/data/config/coreference-org-determiners.json
- packages/data/config/defined-term-heads.json
- packages/data/config/deny-list-filters.json
- packages/data/config/false-positive-shapes.json
- packages/data/config/language-scopes.json
- packages/data/config/legal-form-rule-words.json
- packages/data/config/legal-role-heads.cs.json
- packages/data/config/name-corpus-cjk.json
- packages/data/config/name-corpus-particles.json
- packages/data/config/organization-indicators.json
- packages/data/config/organization-unit-heads.json
- packages/data/config/person-stopwords.json
- p
ackages/data/config/signing-clauses.jsonpackages/data/dictionaries/index.tspackages/data/package.jsonrustfmt.toml
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/anonymize-core/src/processors.rs`:
- Line 733: The deny-list gap slicing in processors.rs can fail when adjacent
entries in name_hits overlap, because the loop in the gap-building logic calls
offsets.slice(prev.end, next.start)? with prev.end > next.start. Update the
name-hit handling in this section to guard against overlapping spans before
slicing, either by skipping/merging overlapping hits or clamping the gap to a
valid span. Keep the fix localized to the gap construction around offsets.slice
and the name_hits iteration so the deny-list pass no longer aborts on
InvalidSpan.
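One way to apply the suggested guard is to compute the gaps without ever producing an inverted span. The sketch below is illustrative only: `Span` and `gaps_between` are hypothetical names, not the crate's actual types, and it assumes the hits are already sorted by `start`. It tracks the furthest end seen so far, so nested or overlapping hits simply produce no gap instead of an invalid one.

```rust
/// Hypothetical half-open byte span; the real element type of `name_hits`
/// is not shown in the review, so this is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: usize,
    end: usize,
}

/// Returns the gaps between consecutive hits, assuming `hits` is sorted by
/// `start`. Overlapping, touching, or nested hits yield no gap, so every
/// returned span satisfies `start < end`.
fn gaps_between(hits: &[Span]) -> Vec<Span> {
    let mut gaps = Vec::new();
    let mut covered_end = match hits.first() {
        Some(first) => first.end,
        None => return gaps,
    };
    for next in &hits[1..] {
        // Only emit a gap when the next hit starts strictly after
        // everything covered so far.
        if next.start > covered_end {
            gaps.push(Span { start: covered_end, end: next.start });
        }
        // Keep the furthest end so a hit nested inside an earlier one
        // cannot pull the cursor backwards.
        covered_end = covered_end.max(next.end);
    }
    gaps
}
```

With this shape, a later call such as `offsets.slice(gap.start, gap.end)` would only ever see well-ordered bounds; whether skipping or merging is the right policy for the deny-list pass is a decision for the author.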
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 96ee159f-0112-40db-82cb-d980969dcf71
📒 Files selected for processing (21)
- .github/workflows/ci.yml
- crates/anonymize-adapter-contract/examples/native_adapter_perf.rs
- crates/anonymize-adapter-contract/src/lib.rs
- crates/anonymize-core/src/address_context.rs
- crates/anonymize-core/src/address_seeds.rs
- crates/anonymize-core/src/byte_offsets.rs
- crates/anonymize-core/src/coreference.rs
- crates/anonymize-core/src/false_positives.rs
- crates/anonymize-core/src/normalize.rs
- crates/anonymize-core/src/prepared.rs
- crates/anonymize-core/src/processors.rs
- crates/anonymize-core/src/redact.rs
- crates/anonymize-core/src/resolution/boundary.rs
- crates/anonymize-core/src/resolution/sanitize.rs
- crates/anonymize-core/src/resolution/types.rs
- crates/anonymize-core/src/search.rs
- crates/anonymize-core/src/triggers.rs
- crates/anonymize-core/tests/false_positive_parity.rs
- crates/anonymize-core/tests/prepared.rs
- crates/anonymize-core/tests/redaction.rs
- rust-toolchain.toml
🚧 Files skipped from review as they are similar to previous changes (14)
- .github/workflows/ci.yml
- crates/anonymize-adapter-contract/examples/native_adapter_perf.rs
- crates/anonymize-core/src/resolution/types.rs
- crates/anonymize-core/src/redact.rs
- crates/anonymize-core/src/coreference.rs
- crates/anonymize-core/src/resolution/sanitize.rs
- crates/anonymize-core/src/resolution/boundary.rs
- crates/anonymize-core/src/normalize.rs
- crates/anonymize-core/src/address_context.rs
- crates/anonymize-core/src/false_positives.rs
- crates/anonymize-core/src/address_seeds.rs
- crates/anonymize-core/src/prepared.rs
- crates/anonymize-core/src/search.rs
- crates/anonymize-adapter-contract/src/lib.rs
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a06b5289e0
436e78c to 0aca8a7
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 913d09e2c9
    let limit =
        advance_char_boundary(full_text, anchor, MAX_WITNESS_SCAN_BYTES);
Measure the witness scan window in text units
When an `IN WITNESS WHEREOF` preamble contains multibyte prose before the terminating `.` or blank line, this byte cap can stop the native scan before the sentence terminator even though the terminator is still within the TypeScript detector's 600 UTF-16-code-unit slice. In that case the Rust signature detector never calls `try_emit_forward_lines`, so the signer immediately below the preamble is left unredacted in native while the TS path emits it. Compute the 600-unit window in the same text offsets, then map it back to a byte boundary.
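A minimal sketch of measuring the window in UTF-16 code units and mapping it back to a byte offset follows. The helper name `advance_utf16_units` is hypothetical and is not the crate's `advance_char_boundary`. It assumes `anchor` already lies on a char boundary, and it stops before any character that would exceed the budget, which differs slightly from a JavaScript `slice` that can cut through a surrogate pair.

```rust
/// Hypothetical helper: returns the byte offset reached by advancing at most
/// `max_units` UTF-16 code units from `anchor`.
///
/// `anchor` must lie on a char boundary of `text`; otherwise the slice below
/// panics. The result is always a valid char boundary.
fn advance_utf16_units(text: &str, anchor: usize, max_units: usize) -> usize {
    let mut units = 0usize;
    for (offset, ch) in text[anchor..].char_indices() {
        // `len_utf16` is 1 for BMP characters and 2 for astral ones.
        let width = ch.len_utf16();
        if units + width > max_units {
            // Stop before the character that would exceed the budget.
            return anchor + offset;
        }
        units += width;
    }
    // The whole remainder fits inside the budget.
    text.len()
}
```

Under these assumptions, a call like `advance_utf16_units(full_text, anchor, 600)` would give a byte limit that covers the same text as a 600-unit slice, regardless of how many bytes each character occupies in UTF-8.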
    first.is_uppercase()
        && chars.take(30).all(|ch| {
            ch.is_alphabetic()
                || matches!(ch, '\u{0300}'..='\u{036f}' | '.' | '\'' | '-' | '’')
        })
Reject overlong signature name tokens
For signature candidates with a single long capitalized token, `chars.take(30).all(...)` validates only the first 30 trailing characters and ignores any that remain. The TypeScript `CAP_TOKEN` caps each token at one uppercase character plus 30 following characters, so a value like `/s/ Supercalifragilisticexpialidociousxxxx Smith` can be accepted and redacted only by native. Also check that no characters remain after the 30-character tail.
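The length check can be folded into the same pass over the token. The sketch below is an illustration, not the crate's code: `is_cap_token`, `is_tail_char`, and `MAX_TAIL` are hypothetical names, and accepting a token with an empty tail is an assumption carried over from the quoted snippet (where `all` on an empty iterator is true), not a confirmed property of `CAP_TOKEN`.

```rust
/// Maximum number of characters allowed after the leading uppercase letter,
/// taken from the 30-character cap described in the review comment.
const MAX_TAIL: usize = 30;

/// Same character class as the quoted snippet: letters, combining
/// diacritics, and a few name punctuation marks.
fn is_tail_char(ch: char) -> bool {
    ch.is_alphabetic() || matches!(ch, '\u{0300}'..='\u{036f}' | '.' | '\'' | '-' | '’')
}

/// Hypothetical check: one uppercase character followed by at most
/// `MAX_TAIL` allowed characters, with nothing left over.
fn is_cap_token(token: &str) -> bool {
    let mut chars = token.chars();
    let Some(first) = chars.next() else {
        return false;
    };
    if !first.is_uppercase() {
        return false;
    }
    let mut tail_len = 0usize;
    for ch in chars {
        tail_len += 1;
        // Reject as soon as the tail is too long or holds a disallowed char.
        if tail_len > MAX_TAIL || !is_tail_char(ch) {
            return false;
        }
    }
    true
}
```

Iterating the full tail instead of calling `take(30)` means an overlong token is rejected rather than silently truncated for validation purposes.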
Summary
stella-anonymize-core crate
CC on behalf of @sok0
Summary by CodeRabbit