feat: add anonymize core crate #217

Open
jan-kubica wants to merge 130 commits into main from codex/anonymize-core-redaction
@jan-kubica jan-kubica commented Jun 24, 2026

Contributor

Summary

  • add a strict Cargo workspace and internal stella-anonymize-core crate
  • model redaction-domain behavior in Rust modules for placeholders, normalization, UTF-16 spans, and result construction
  • add a typed Rust search index over Stella literal, regex, and fuzzy core crates while preserving the existing UTF-16 offset contract
  • wire Rust format, lint, and test checks into the root scripts and CI without changing published TypeScript package exports
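
The UTF-16 offset contract mentioned above can be illustrated with a minimal sketch. It assumes spans are half-open ranges counted in UTF-16 code units (as JavaScript strings count them) and shows one way to map them onto Rust's byte-indexed `&str`. The function names `utf16_to_byte_offset` and `slice_utf16` are hypothetical and are not taken from the crate.

```rust
/// Hypothetical helper: convert a UTF-16 code-unit offset into a byte offset.
/// Returns `None` if the offset lands inside a surrogate pair or past the end.
fn utf16_to_byte_offset(text: &str, utf16_offset: usize) -> Option<usize> {
  let mut units = 0usize;
  for (byte_idx, ch) in text.char_indices() {
    if units == utf16_offset {
      return Some(byte_idx);
    }
    if units > utf16_offset {
      // The requested offset fell between the two halves of a surrogate pair.
      return None;
    }
    units += ch.len_utf16();
  }
  // The offset may legitimately point at the end of the text.
  if units == utf16_offset {
    Some(text.len())
  } else {
    None
  }
}

/// Hypothetical helper: slice `text` by a half-open UTF-16 span.
fn slice_utf16(text: &str, start: usize, end: usize) -> Option<&str> {
  if start > end {
    return None;
  }
  let start_byte = utf16_to_byte_offset(text, start)?;
  let end_byte = utf16_to_byte_offset(text, end)?;
  text.get(start_byte..end_byte)
}
```

With this sketch, `slice_utf16("a😀b", 1, 3)` selects the emoji, because it occupies two UTF-16 code units but four UTF-8 bytes; an offset of 2 is rejected since it would split the surrogate pair.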

CC on behalf of @sok0

Summary by CodeRabbit

  • New Features
    • Expanded anonymization capabilities for dates, money, names, addresses, legal forms, and coreference-style references.
    • Introduced a prepared static detection/redaction pipeline with richer diagnostics support.
    • Added a new contract-layer adapter crate and a native adapter performance benchmark example.
  • Bug Fixes
    • Improved entity span normalization, boundary consistency, merging/deduplication, placeholder mapping, and false-positive filtering.
    • Added stricter payload/package validation and artifact verification.
  • Chores
    • Enhanced CI with Rust toolchain checks and additional performance runs; updated Rust/Cargo configuration, license/package checks, and version sync behavior.

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@jan-kubica jan-kubica force-pushed the codex/anonymize-core-redaction branch from 341eb7f to b916fa1 Compare June 24, 2026 07:24
@jan-kubica jan-kubica marked this pull request as ready for review June 24, 2026 07:25
@github-actions

github-actions Bot commented Jun 24, 2026

Dependency Review

The following issues were found:

  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 32 package(s) with unknown licenses.
  • ⚠️ 4 packages with OpenSSF Scorecard issues.

View full job summary

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b916fa1630

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/anonymize-core/src/placeholders.rs Outdated
Comment thread crates/anonymize-core/src/normalize.rs Outdated
Comment thread crates/anonymize-core/src/redact.rs Outdated
Comment thread crates/anonymize-core/src/search.rs Outdated
Comment thread crates/anonymize-core/src/redact.rs Outdated
Comment thread crates/anonymize-core/src/normalize.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: c664ed09a6

Comment thread crates/anonymize-core/src/normalize.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: fcbb328f84

Comment thread crates/anonymize-core/src/prepared.rs Outdated
Comment thread crates/anonymize-core/src/resolution/boundary.rs Outdated
Comment thread crates/anonymize-core/src/redact.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: fd14f11c0c

Comment thread packages/anonymize/scripts/migration-fixture-perf.mjs Outdated
Comment thread crates/anonymize-napi/src/lib.rs Outdated
Comment thread crates/anonymize-core/src/resolution/sanitize.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 163ee75fca

Comment thread crates/anonymize-core/src/diagnostics.rs Outdated
Comment thread crates/anonymize-napi/src/lib.rs
Comment thread packages/anonymize/src/build-unified-search.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 4fc897746a

Comment thread packages/anonymize/scripts/migration-fixture-perf.mjs Outdated
Comment thread crates/anonymize-napi/src/lib.rs Outdated
Comment thread crates/anonymize-core/src/false_positives.rs Outdated
Comment thread crates/anonymize-adapter-contract/src/lib.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 31dc9df699

Comment thread crates/anonymize-core/src/search.rs
Comment thread packages/data/dictionaries/index.ts Outdated
Comment thread packages/data/dictionaries/index.ts
Comment thread crates/anonymize-core/src/prepared.rs Outdated
Comment thread crates/anonymize-core/src/triggers.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 32ef71ca81

Comment thread crates/anonymize-core/src/triggers.rs Outdated
Comment thread packages/anonymize/src/build-unified-search.ts
Comment thread crates/anonymize-core/src/prepared.rs Outdated
Comment thread crates/anonymize-core/src/prepared.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 4d6303598b

Comment thread crates/anonymize-core/src/address_seeds.rs Outdated
Comment thread crates/anonymize-core/src/triggers.rs Outdated
Comment thread crates/anonymize-py/src/lib.rs Outdated
Comment thread packages/anonymize/src/build-unified-search.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 2fe0ef7984

Comment thread crates/anonymize-core/src/false_positives.rs
Comment thread crates/anonymize-core/src/triggers.rs
Comment thread crates/anonymize-core/src/false_positives.rs
Comment thread crates/anonymize-core/src/triggers.rs Outdated
Comment thread crates/anonymize-core/src/false_positives.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 3b43389136

Comment thread packages/anonymize/src/native-node.ts Outdated
Comment thread crates/anonymize-core/src/false_positives.rs
Comment thread packages/anonymize/package.json
Comment thread crates/anonymize-core/src/prepared.rs Outdated
Comment thread crates/anonymize-core/src/triggers.rs Outdated
Comment thread crates/anonymize-core/src/triggers.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f76f9cbd85

Comment thread crates/anonymize-adapter-contract/src/lib.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: eac07ba045

Comment thread crates/anonymize-core/src/prepared.rs Outdated
Comment thread packages/anonymize/src/native-pipeline.ts
Comment thread packages/anonymize/src/build-unified-search.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 1d494e92f2

Comment thread packages/anonymize/src/build-unified-search.ts
Comment thread crates/anonymize-core/src/address_seeds.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 5688ee13d1

Comment thread packages/anonymize/package.json Outdated
Comment thread crates/anonymize-core/src/address_context.rs Outdated
Comment thread crates/anonymize-core/src/prepared.rs
Comment thread crates/anonymize-core/src/prepared.rs Outdated
Comment thread crates/anonymize-core/src/address_context.rs Outdated
Comment thread crates/anonymize-core/src/address_seeds.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 6435474f06

Comment thread crates/anonymize-core/src/hotwords.rs Outdated
Comment thread crates/anonymize-core/src/address_context.rs Outdated
Comment thread packages/anonymize/src/build-unified-search.ts Outdated
Comment thread crates/anonymize-core/src/dates.rs
Comment thread crates/anonymize-core/src/processors.rs Outdated
Comment thread crates/anonymize-core/src/address_seeds.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: b3641361a0

Comment thread packages/anonymize/src/pipeline-cache-key.ts
Comment thread crates/anonymize-core/src/address_context.rs Outdated
Comment thread packages/anonymize/src/native-pipeline.ts Outdated
Comment thread crates/anonymize-core/src/address_seeds.rs Outdated
Comment thread crates/anonymize-core/src/signatures.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 99460dbb97

Comment thread packages/anonymize/src/build-unified-search.ts Outdated
Comment thread packages/anonymize/src/build-unified-search.ts Outdated
Comment thread crates/anonymize-core/src/prepared.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: d343b8c25d

Comment thread packages/anonymize/src/native-pipeline.ts
Comment thread packages/anonymize/src/build-unified-search.ts Outdated
Comment thread packages/anonymize/src/detectors/regex.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f7ec00f449

Comment thread packages/anonymize/package.json
Comment thread crates/anonymize-napi/src/lib.rs
Comment thread crates/anonymize-core/src/money.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: c39163994a

Comment thread crates/anonymize-core/src/triggers.rs
Comment thread crates/anonymize-core/src/triggers.rs
Comment thread crates/anonymize-core/src/triggers.rs Outdated
Comment thread crates/anonymize-core/src/triggers.rs Outdated
Comment thread crates/anonymize-core/src/triggers.rs Outdated
Comment thread crates/anonymize-core/src/triggers.rs
Comment thread packages/anonymize/src/build-unified-search.ts Outdated
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds Rust workspace and lint configuration, core search and resolution primitives, detection rules for dates, money, legal forms, addresses, names, coreference, and hotwords, plus prepared-search orchestration and adapter-contract serialization.

Changes

Rust anonymization pipeline

| Layer | File(s) | Summary |
|---|---|---|
| Workspace and CI | .cargo/config.toml, Cargo.toml, clippy.toml, .gitignore, crates/anonymize-core/Cargo.toml, crates/anonymize-adapter-contract/Cargo.toml, .github/tools/*, .github/workflows/* | Adds workspace metadata, lint settings, crate manifests, ignore patterns, packlist validation, runtime version syncing, and CI workflow updates. |
| Core search and entity primitives | crates/anonymize-core/src/artifact_bytes.rs, crates/anonymize-core/src/byte_offsets.rs, crates/anonymize-core/src/search.rs, crates/anonymize-core/src/normalize.rs, crates/anonymize-core/src/placeholders.rs, crates/anonymize-core/src/resolution/*.rs, crates/anonymize-core/src/lib.rs | Adds artifact framing, byte-offset helpers, search indexing, normalization, placeholder allocation, pipeline entity types, boundary enforcement, merge/sanitize logic, diagnostics, and redaction exports. |
| Processors and anchored rules | crates/anonymize-core/src/processors.rs, crates/anonymize-core/src/anchored.rs, crates/anonymize-core/src/dates.rs, crates/anonymize-core/src/money.rs, crates/anonymize-core/src/legal_forms.rs | Adds regex and deny-list processors, anchored extraction, date parsing, monetary extraction, and legal-form detection. |
| Address, name, coreference, and hotword rules | crates/anonymize-core/src/address_context.rs, crates/anonymize-core/src/address_seeds.rs, crates/anonymize-core/src/name_corpus.rs, crates/anonymize-core/src/coreference.rs, crates/anonymize-core/src/hotwords.rs | Adds address context and seed extraction, supplemental name detection, coreference propagation, and hotword adjustments. |
| False-positive filtering | crates/anonymize-core/src/false_positives.rs | Adds false-positive rejection and normalization heuristics for detected entities. |
| Prepared search and adapter contract | crates/anonymize-core/src/prepared.rs, crates/anonymize-adapter-contract/src/lib.rs, crates/anonymize-adapter-contract/examples/native_adapter_perf.rs | Adds prepared-search build, matching, detection, redaction, package serialization, diagnostics mapping, and the native adapter performance example. |

Sequence Diagram(s)

sequenceDiagram
  participant PreparedSearch
  participant SearchIndex
  participant enforce_boundary_consistency
  participant sanitize_entities
  participant filter_entity_false_positives
  participant redact_text

  PreparedSearch->>SearchIndex: find_matches(full_text)
  SearchIndex-->>PreparedSearch: PreparedSearchMatches
  PreparedSearch->>enforce_boundary_consistency: normalize PipelineEntity spans
  PreparedSearch->>sanitize_entities: clean entity text
  PreparedSearch->>filter_entity_false_positives: drop rejected entities
  PreparedSearch->>redact_text: build redacted_text

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.30%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the primary change: introducing the new anonymize core crate. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 13

🧹 Nitpick comments (4)
crates/anonymize-adapter-contract/examples/native_adapter_perf.rs (1)

39-49: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Move operator JSON parsing out of the timed run loop.

runMs currently includes repeated serde_json parsing and operator conversion for every iteration/case, so the metric is not isolated to PreparedSearch::redact_static_entities.

Proposed refactor
+  let run_cases = payload
+    .cases
+    .iter()
+    .map(|item| -> Result<_, Box<dyn std::error::Error>> {
+      let operators = item
+        .operators_json
+        .as_deref()
+        .map(serde_json::from_str::<BindingOperatorConfig>)
+        .transpose()?;
+      let operators = operator_config_from_binding(operators)?;
+      Ok((&item.text, operators))
+    })
+    .collect::<Result<Vec<_>, _>>()?;
+
   let run_start = Instant::now();
   let mut entity_count = 0_usize;
   for _ in 0..payload.iterations {
-    for item in &payload.cases {
-      let operators = item
-        .operators_json
-        .as_deref()
-        .map(serde_json::from_str::<BindingOperatorConfig>)
-        .transpose()?;
-      let operators = operator_config_from_binding(operators)?;
-      let result = prepared.redact_static_entities(&item.text, &operators)?;
+    for (text, operators) in &run_cases {
+      let result = prepared.redact_static_entities(text, operators)?;
       entity_count = entity_count.saturating_add(result.redaction.entity_count);
     }
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/anonymize-adapter-contract/examples/native_adapter_perf.rs` around
lines 39 - 49, Move the operator JSON parsing and conversion out of the timed
section in native_adapter_perf.rs so runMs only measures
PreparedSearch::redact_static_entities. Precompute each case’s operators (the
serde_json::from_str::<BindingOperatorConfig> and operator_config_from_binding
work) before run_start is recorded, then reuse the parsed result inside the
payload.iterations loop. Keep the timing around the redact call only, and use
the existing payload.cases, prepared.redact_static_entities, and run_start flow
to locate the refactor.
crates/anonymize-core/src/prepared.rs (1)

1923-1956: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Dead else branch in validate_hotword_config.

Line 1928 returns Ok(()) when hotword_data.is_none(), so the let Some(data) = &config.hotword_data else { … } at lines 1932‑1936 can never hit its else arm. Collapse to a direct bind.

♻️ Simplify
-  if config.hotword_data.is_none() {
-    return Ok(());
-  }
-
-  let Some(data) = &config.hotword_data else {
-    return Err(Error::MissingStaticData {
-      field: "hotword_data",
-    });
-  };
+  let Some(data) = &config.hotword_data else {
+    return Ok(());
+  };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/anonymize-core/src/prepared.rs` around lines 1923 - 1956, The
`validate_hotword_config` function contains a dead `else` branch because it
already returns `Ok(())` when `config.hotword_data` is `None`, so the `let
Some(data) = &config.hotword_data else { ... }` fallback can never run. Simplify
the control flow by removing the unreachable `else` arm and binding `data`
directly from `config.hotword_data` in `validate_hotword_config`, keeping the
existing rule and hotword validation logic unchanged.
crates/anonymize-core/src/byte_offsets.rs (1)

42-47: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Tie slice to the text owned by ByteOffsets.

start/end are validated against self.text, but the range is applied to a separate full_text. Remove the extra parameter and slice self.text so callers cannot accidentally validate one string and read another.

Proposed refactor
   pub(crate) fn slice(
     &self,
-    full_text: &str,
     start: u32,
     end: u32,
   ) -> Result<String> {
@@
     Ok(
-      full_text
+      self.text
         .get(start_byte..end_byte)
         .ok_or(Error::InvalidSpan { start, end })?
         .to_owned(),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/anonymize-core/src/byte_offsets.rs` around lines 42 - 47, The
ByteOffsets::slice method currently validates offsets against self.text but
slices a separate full_text argument, which can desync validation from the
actual data being read. Remove the extra parameter from slice, update the
implementation to use self.text for the substring extraction, and adjust any
callers so they only pass start/end and cannot mix different source strings.
crates/anonymize-core/src/resolution/boundary.rs (1)

361-397: 🚀 Performance & Scalability | 🔵 Trivial

Avoid rescanning spans in word_start_at/word_end_at.
Both helpers linearly search the whole spans slice on every step of a per-character loop, which makes each call quadratic on long inputs, and fix_partial_words calls both helpers for every entity. Since spans is sorted, partition_point (one binary search per lookup) or a tracked index that only moves forward would keep the overall scan close to linear.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/anonymize-core/src/resolution/boundary.rs` around lines 361 - 397, The
helper functions word_start_at and word_end_at are rescanning the entire sorted
spans slice on every loop iteration, making fix_partial_words much slower on
long inputs. Update these helpers to avoid repeated linear searches by using the
ordering of spans with partition_point or by carrying a moving index as cursor
advances, and keep the logic in boundary.rs centered around the existing
word_start_at/word_end_at and fix_partial_words flow.
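
The partition_point suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code in boundary.rs: `WordSpan`, `span_containing`, and the signatures of `word_start_at`/`word_end_at` are assumptions made for the example, and spans are assumed to be sorted by start and non-overlapping.

```rust
/// Hypothetical word span with a half-open range `[start, end)`.
/// The real span type in boundary.rs may look different.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct WordSpan {
  start: u32,
  end: u32,
}

/// Finds the span containing `offset` with one binary search.
/// Assumes `spans` is sorted by `start` and spans do not overlap.
fn span_containing(spans: &[WordSpan], offset: u32) -> Option<WordSpan> {
  // Index of the first span that starts after `offset`.
  let idx = spans.partition_point(|span| span.start <= offset);
  // The only possible container is the span just before that index.
  let candidate = *spans.get(idx.checked_sub(1)?)?;
  (offset < candidate.end).then_some(candidate)
}

/// Start of the word covering `offset`, if any (assumed signature).
fn word_start_at(spans: &[WordSpan], offset: u32) -> Option<u32> {
  span_containing(spans, offset).map(|span| span.start)
}

/// End of the word covering `offset`, if any (assumed signature).
fn word_end_at(spans: &[WordSpan], offset: u32) -> Option<u32> {
  span_containing(spans, offset).map(|span| span.end)
}
```

Each lookup costs O(log n) instead of a full scan. A cursor that only moves forward would avoid even the binary search, at the price of threading state through the loop.
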
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 46-49: The Rust setup in the CI workflow is using the moving
stable channel, which can change formatting or lint behavior unexpectedly.
Update the Setup Rust step in the workflow or add a repo-level
rust-toolchain.toml so the toolchain version is explicitly pinned, and make sure
the rustup install/default commands use that fixed version consistently.

In `@crates/anonymize-adapter-contract/src/lib.rs`:
- Line 1816: The diagnostic offset conversion in convert_diagnostic_offsets is
leaving raw byte offsets in UTF-16 binding results when conversion fails, which
breaks the adapter contract. Update the UTF-16 paths that call
convert_diagnostic_offsets in the diagnostic binding flow so offsets are never
mixed, and make the conversion fail explicitly or otherwise ensure every
diagnostic offset is converted before returning result.diagnostics.events. Use
the existing symbols convert_diagnostic_offsets, *_to_utf16_binding, and
result.diagnostics.events to locate and fix all affected call sites.

In `@crates/anonymize-core/src/address_context.rs`:
- Line 51: The bare-house stopword matching in AddressContext is inconsistent
because `bare_house_stopwords` is stored without normalization while the regex
can capture capitalized words. Update the `AddressContext` setup and the
lookup/comparison path so both the configured stopwords and incoming values are
lowercased before being collected and checked, ensuring entries like `may` match
`May 1` consistently.
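
One way to read the normalization asked for above, as a hedged sketch: `BareHouseStopwords` is a hypothetical stand-in for the stopword part of `AddressContext`, whose real fields and constructor are not shown in this review.

```rust
use std::collections::HashSet;

/// Hypothetical holder for the configured bare-house stopwords.
struct BareHouseStopwords {
  words: HashSet<String>,
}

impl BareHouseStopwords {
  /// Lowercases every configured entry once, when the set is built.
  fn new<I, S>(configured: I) -> Self
  where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
  {
    let words = configured
      .into_iter()
      .map(|word| word.as_ref().to_lowercase())
      .collect();
    Self { words }
  }

  /// Lowercases the captured word before the lookup, so `May` matches `may`.
  fn contains(&self, captured: &str) -> bool {
    self.words.contains(&captured.to_lowercase())
  }
}
```

Normalizing on both sides means neither the data file nor the regex capture has to agree on casing.
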

In `@crates/anonymize-core/src/address_seeds.rs`:
- Around line 21-25: The AddressSeedData deserialization currently fails when
any of its list fields are omitted, because the vectors are required instead of
defaulting to empty. Update AddressSeedData so boundary_words, br_cep_cue_words,
and unit_abbreviations deserialize with defaults, matching the optional
static-data behavior used elsewhere in the crate; keep the fix focused on the
AddressSeedData struct and its serde annotations.

In `@crates/anonymize-core/src/anchored.rs`:
- Around line 93-107: The anchored extraction flow is using raw SearchMatch
offsets as if they were byte indices, which causes incorrect slicing in anchored
rules after non-ASCII text. Update AnchorSpan and the anchored path in
anchored.rs so anchor_span converts or carries validated byte indices, and
ensure extract plus the rule extraction path use ByteOffsets or equivalent
byte-safe slicing before any str::get calls. Keep the fix localized around
anchor_span, AnchorSpan, and pub(crate) fn extract so all anchored rule spans
remain aligned with the original text.

In `@crates/anonymize-core/src/dates.rs`:
- Around line 170-185: The date_entity helper is passing byte-based start/end
values into PipelineEntity::detected, which breaks the UTF-16 offset contract.
Update date_entity in dates.rs to convert the local &str byte positions back
into UTF-16 offsets before building the PipelineEntity, using the existing
str_slice/full_text context so detected date spans stay aligned after non-ASCII
text.
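
The byte-to-UTF-16 conversion that this and several later findings ask for could look roughly like the sketch below. `byte_to_utf16_offset` is a hypothetical helper name; the crate's own `ByteOffsets` type may already offer an equivalent.

```rust
use std::convert::TryFrom;

/// Converts a byte offset into `text` to a UTF-16 code-unit offset.
/// Returns `None` when `byte_offset` is out of range or falls inside a
/// multi-byte character.
fn byte_to_utf16_offset(text: &str, byte_offset: usize) -> Option<u32> {
  // `get` rejects out-of-range and non-boundary offsets for us.
  let prefix = text.get(..byte_offset)?;
  // Sum the UTF-16 width of every character before the offset.
  let units: usize = prefix.chars().map(char::len_utf16).sum();
  u32::try_from(units).ok()
}
```

For text such as `José 2024-01-05`, the date starts at byte 6 but at UTF-16 offset 5, which is the drift the finding describes.
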

In `@crates/anonymize-core/src/false_positives.rs`:
- Around line 619-633: The ambiguous address-component check in
is_only_ambiguous_component is still using the original text, so capitalized
terms like Street can slip through when filters are lowercase. Update the
matching path around find_ambiguous_component_occurrence and the stripped-text
check to compare case-insensitively, consistent with has_address_component’s
lowercasing behavior, and apply the same fix to the other affected branch in the
same function.

In `@crates/anonymize-core/src/legal_forms.rs`:
- Around line 109-145: The legal-form span collection is currently carrying byte
offsets from the candidate-building logic into the published `PipelineEntity`
spans, which violates the UTF-16 redaction contract. Update the span creation
flow in the legal-forms path so `Candidate` offsets are converted to UTF-16
before being emitted, and make sure the final `PipelineEntity` construction uses
those UTF-16 positions rather than raw slice indices. Keep the conversion
centralized near the code that builds and publishes candidates so the byte-based
Rust slicing remains internal while all external offsets stay UTF-16.

In `@crates/anonymize-core/src/money.rs`:
- Around line 553-567: The money_entity helper is returning byte-based start/end
positions instead of UTF-16 offsets, which can misalign redaction spans for
non-ASCII text. Update money_entity to convert the detected byte span into
UTF-16 offsets before calling PipelineEntity::detected, using the existing
full_text/start/end context and keeping the str_slice-derived detected value
unchanged.

In `@crates/anonymize-core/src/normalize.rs`:
- Around line 396-409: The generic identifier normalization in normalize.rs is
letting trailing prose be merged into the key when whitespace is allowed, so
update the token scanning logic around is_generic_identifier and the final
return from last_valid/compact to stop at the last valid token boundary instead
of accepting the fully expanded string. Adjust the loop that uses
is_identifier_separator and the final predicate(&compact) check so
label-specific stopping rules or a boundary check prevent cases like a valid
identifier followed by words from normalizing into one concatenated key.

In `@crates/anonymize-core/src/processors.rs`:
- Around line 1367-1372: The entity span extension logic in ExtendedName/related
offset handling is mixing byte lengths with PipelineEntity offsets, which causes
drift for non-ASCII text. Update the arithmetic in the affected offset-extension
paths to use UTF-16 code-unit counts for offset deltas, and keep byte indices
only for local string slicing via ByteOffsets. Review the ExtendedName
construction and the other offset-adjustment blocks referenced in the comment so
all added/subtracted suffix or district lengths are derived consistently from
UTF-16 length calculations.

In `@crates/anonymize-core/src/resolution/sanitize.rs`:
- Around line 108-117: The span adjustment in sanitize.rs is using UTF-8 byte
lengths for PipelineEntity offsets, which breaks the UTF-16 offset contract.
Update the logic around the display_text, start, and end calculations to measure
the trimmed prefix and cleaned text in UTF-16 units instead of byte_len, while
keeping the existing sanitization flow in sanitize() intact. Make sure the
cloned entity’s start/end fields are derived from UTF-16 code unit counts so
redaction stays aligned for non-ASCII text like José.

In `@crates/anonymize-core/src/search.rs`:
- Around line 128-130: `read_slots` in `search.rs` preallocates `slots` directly
from the serialized `count`, which can be attacker-controlled and cause an
oversized allocation. Update the deserialization flow around
`reader.read_usize()` and the `for _ in 0..count` loop to either validate
`count` against the remaining input before allocating or build `slots`
incrementally without `Vec::with_capacity(count)`.
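
A minimal sketch of the first option (validate `count` against the remaining input). The `Reader` type, `read_u32`, and the fixed 4-byte slot size are assumptions made for the example; the real reader in search.rs uses `read_usize` and its slot layout is not shown here.

```rust
/// Hypothetical minimal reader over a byte slice.
struct Reader<'a> {
  bytes: &'a [u8],
}

impl<'a> Reader<'a> {
  fn remaining(&self) -> usize {
    self.bytes.len()
  }

  /// Reads a little-endian u32, or `None` if fewer than 4 bytes remain.
  fn read_u32(&mut self) -> Option<u32> {
    if self.bytes.len() < 4 {
      return None;
    }
    let (head, tail) = self.bytes.split_at(4);
    self.bytes = tail;
    Some(u32::from_le_bytes([head[0], head[1], head[2], head[3]]))
  }
}

/// Assumed for this sketch: every slot is serialized as one u32.
const SLOT_BYTES: usize = 4;

fn read_slots(reader: &mut Reader<'_>) -> Option<Vec<u32>> {
  let count = reader.read_u32()? as usize;
  // Reject counts the remaining input cannot satisfy, so the allocation
  // below is bounded by the size of the input itself.
  if count > reader.remaining() / SLOT_BYTES {
    return None;
  }
  let mut slots = Vec::with_capacity(count);
  for _ in 0..count {
    slots.push(reader.read_u32()?);
  }
  Some(slots)
}
```

With variable-size slots, the same check works with the minimum encoded slot size as the divisor.
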

---

Nitpick comments:
In `@crates/anonymize-adapter-contract/examples/native_adapter_perf.rs`:
- Around line 39-49: Move the operator JSON parsing and conversion out of the
timed section in native_adapter_perf.rs so runMs only measures
PreparedSearch::redact_static_entities. Precompute each case’s operators (the
serde_json::from_str::<BindingOperatorConfig> and operator_config_from_binding
work) before run_start is recorded, then reuse the parsed result inside the
payload.iterations loop. Keep the timing around the redact call only, and use
the existing payload.cases, prepared.redact_static_entities, and run_start flow
to locate the refactor.

In `@crates/anonymize-core/src/byte_offsets.rs`:
- Around line 42-47: The ByteOffsets::slice method currently validates offsets
against self.text but slices a separate full_text argument, which can desync
validation from the actual data being read. Remove the extra parameter from
slice, update the implementation to use self.text for the substring extraction,
and adjust any callers so they only pass start/end and cannot mix different
source strings.

In `@crates/anonymize-core/src/prepared.rs`:
- Around line 1923-1956: The `validate_hotword_config` function contains a dead
`else` branch because it already returns `Ok(())` when `config.hotword_data` is
`None`, so the `let Some(data) = &config.hotword_data else { ... }` fallback can
never run. Simplify the control flow by removing the unreachable `else` arm and
binding `data` directly from `config.hotword_data` in `validate_hotword_config`,
keeping the existing rule and hotword validation logic unchanged.

In `@crates/anonymize-core/src/resolution/boundary.rs`:
- Around line 361-397: The helper functions word_start_at and word_end_at are
rescanning the entire sorted spans slice on every loop iteration, making
fix_partial_words much slower on long inputs. Update these helpers to avoid
repeated linear searches by using the ordering of spans with partition_point or
by carrying a moving index as cursor advances, and keep the logic in boundary.rs
centered around the existing word_start_at/word_end_at and fix_partial_words
flow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7048aa2c-0513-4cbb-ba50-7a27dbf8ddc6

📥 Commits

Reviewing files that changed from the base of the PR and between 4924c74 and 8b8e92e.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (146)
  • .cargo/config.toml
  • .github/tools/check-packlist.mjs
  • .github/tools/sync-runtime-version.mjs
  • .github/workflows/ci.yml
  • .github/workflows/dependency-review.yml
  • .gitignore
  • Cargo.toml
  • clippy.toml
  • crates/anonymize-adapter-contract/Cargo.toml
  • crates/anonymize-adapter-contract/examples/native_adapter_perf.rs
  • crates/anonymize-adapter-contract/src/lib.rs
  • crates/anonymize-core/Cargo.toml
  • crates/anonymize-core/data/address-final-abbrevs.txt
  • crates/anonymize-core/data/identifier-cues.txt
  • crates/anonymize-core/data/legal-period-suffixes.txt
  • crates/anonymize-core/src/address_context.rs
  • crates/anonymize-core/src/address_seeds.rs
  • crates/anonymize-core/src/anchored.rs
  • crates/anonymize-core/src/artifact_bytes.rs
  • crates/anonymize-core/src/byte_offsets.rs
  • crates/anonymize-core/src/coreference.rs
  • crates/anonymize-core/src/dates.rs
  • crates/anonymize-core/src/diagnostics.rs
  • crates/anonymize-core/src/false_positives.rs
  • crates/anonymize-core/src/hotwords.rs
  • crates/anonymize-core/src/legal_forms.rs
  • crates/anonymize-core/src/lib.rs
  • crates/anonymize-core/src/money.rs
  • crates/anonymize-core/src/name_corpus.rs
  • crates/anonymize-core/src/normalize.rs
  • crates/anonymize-core/src/placeholders.rs
  • crates/anonymize-core/src/prepared.rs
  • crates/anonymize-core/src/processors.rs
  • crates/anonymize-core/src/redact.rs
  • crates/anonymize-core/src/resolution/boundary.rs
  • crates/anonymize-core/src/resolution/common.rs
  • crates/anonymize-core/src/resolution/merge.rs
  • crates/anonymize-core/src/resolution/mod.rs
  • crates/anonymize-core/src/resolution/sanitize.rs
  • crates/anonymize-core/src/resolution/types.rs
  • crates/anonymize-core/src/search.rs
  • crates/anonymize-core/src/signatures.rs
  • crates/anonymize-core/src/triggers.rs
  • crates/anonymize-core/src/types.rs
  • crates/anonymize-core/src/validators.rs
  • crates/anonymize-core/src/zones.rs
  • crates/anonymize-core/tests/address_seed_parity.rs
  • crates/anonymize-core/tests/false_positive_parity.rs
  • crates/anonymize-core/tests/normalize.rs
  • crates/anonymize-core/tests/prepared.rs
  • crates/anonymize-core/tests/processors.rs
  • crates/anonymize-core/tests/redaction.rs
  • crates/anonymize-core/tests/resolution.rs
  • crates/anonymize-core/tests/search.rs
  • crates/anonymize-core/tests/trigger_parity.rs
  • crates/anonymize-napi/Cargo.toml
  • crates/anonymize-napi/build.rs
  • crates/anonymize-napi/src/lib.rs
  • crates/anonymize-py/Cargo.toml
  • crates/anonymize-py/build.rs
  • crates/anonymize-py/pyproject.toml
  • crates/anonymize-py/src/lib.rs
  • package.json
  • packages/anonymize/.gitignore
  • packages/anonymize/README.md
  • packages/anonymize/index.cjs
  • packages/anonymize/package.json
  • packages/anonymize/scripts/build-native-node.mjs
  • packages/anonymize/scripts/build-native-pipeline-package.mjs
  • packages/anonymize/scripts/dist-smoke.mjs
  • packages/anonymize/scripts/migration-fixture-perf.mjs
  • packages/anonymize/scripts/native-adapter-perf.mjs
  • packages/anonymize/src/__test__/countries.test.ts
  • packages/anonymize/src/__test__/dictionary-bundle.test.ts
  • packages/anonymize/src/__test__/load-dictionaries.ts
  • packages/anonymize/src/__test__/native-adapter-parity.test.ts
  • packages/anonymize/src/__test__/native-node.test.ts
  • packages/anonymize/src/__test__/pipeline-config.test.ts
  • packages/anonymize/src/build-unified-search.ts
  • packages/anonymize/src/context.ts
  • packages/anonymize/src/data/address-boundaries.json
  • packages/anonymize/src/data/address-context.json
  • packages/anonymize/src/data/address-jurisdiction-prefixes.json
  • packages/anonymize/src/data/address-stop-keywords.json
  • packages/anonymize/src/data/address-unit-abbreviations.json
  • packages/anonymize/src/data/ambiguous-country-surfaces.json
  • packages/anonymize/src/data/clause-noun-heads.json
  • packages/anonymize/src/data/coreference-org-determiners.json
  • packages/anonymize/src/data/defined-term-heads.json
  • packages/anonymize/src/data/deny-list-filters.json
  • packages/anonymize/src/data/false-positive-shapes.json
  • packages/anonymize/src/data/language-scopes.json
  • packages/anonymize/src/data/legal-form-rule-words.json
  • packages/anonymize/src/data/legal-role-heads.cs.json
  • packages/anonymize/src/data/name-corpus-cjk.json
  • packages/anonymize/src/data/name-corpus-particles.json
  • packages/anonymize/src/data/organization-indicators.json
  • packages/anonymize/src/data/organization-unit-heads.json
  • packages/anonymize/src/data/person-stopwords.json
  • packages/anonymize/src/data/signing-clauses.json
  • packages/anonymize/src/detectors/address-seeds.ts
  • packages/anonymize/src/detectors/countries.ts
  • packages/anonymize/src/detectors/deny-list.ts
  • packages/anonymize/src/detectors/legal-forms.ts
  • packages/anonymize/src/detectors/regex.ts
  • packages/anonymize/src/detectors/triggers.ts
  • packages/anonymize/src/filters/confidence-boost.ts
  • packages/anonymize/src/filters/false-positives.ts
  • packages/anonymize/src/filters/hotword-rules.ts
  • packages/anonymize/src/index-shared.ts
  • packages/anonymize/src/language-scope.ts
  • packages/anonymize/src/native-default-config.ts
  • packages/anonymize/src/native-node.ts
  • packages/anonymize/src/native-pipeline.ts
  • packages/anonymize/src/native.ts
  • packages/anonymize/src/pipeline-cache-key.ts
  • packages/anonymize/src/pipeline.ts
  • packages/anonymize/src/types.ts
  • packages/anonymize/tsdown.config.ts
  • packages/anonymize/wasm/package.json
  • packages/cli/package.json
  • packages/cli/src/dictionary-scope.ts
  • packages/corpus/package.json
  • packages/data/config/address-boundaries.json
  • packages/data/config/address-context.json
  • packages/data/config/address-jurisdiction-prefixes.json
  • packages/data/config/address-stop-keywords.json
  • packages/data/config/address-unit-abbreviations.json
  • packages/data/config/ambiguous-country-surfaces.json
  • packages/data/config/clause-noun-heads.json
  • packages/data/config/coreference-org-determiners.json
  • packages/data/config/defined-term-heads.json
  • packages/data/config/deny-list-filters.json
  • packages/data/config/false-positive-shapes.json
  • packages/data/config/language-scopes.json
  • packages/data/config/legal-form-rule-words.json
  • packages/data/config/legal-role-heads.cs.json
  • packages/data/config/name-corpus-cjk.json
  • packages/data/config/name-corpus-particles.json
  • packages/data/config/organization-indicators.json
  • packages/data/config/organization-unit-heads.json
  • packages/data/config/person-stopwords.json
  • packages/data/config/signing-clauses.json
  • packages/data/dictionaries/index.ts
  • packages/data/package.json
  • rustfmt.toml

Comment thread .github/workflows/ci.yml Outdated
Comment thread crates/anonymize-adapter-contract/src/lib.rs Outdated
Comment thread crates/anonymize-core/src/address_context.rs Outdated
Comment thread crates/anonymize-core/src/address_seeds.rs
Comment thread crates/anonymize-core/src/anchored.rs
Comment thread crates/anonymize-core/src/money.rs
Comment thread crates/anonymize-core/src/normalize.rs
Comment thread crates/anonymize-core/src/processors.rs Outdated
Comment thread crates/anonymize-core/src/resolution/sanitize.rs
Comment thread crates/anonymize-core/src/search.rs

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/anonymize-core/src/processors.rs`:
- Line 733: The deny-list gap slicing in processors.rs can fail when adjacent
entries in name_hits overlap, because the loop in the gap-building logic calls
offsets.slice(prev.end, next.start)? with prev.end > next.start. Update the
name-hit handling in this section to guard against overlapping spans before
slicing, either by skipping/merging overlapping hits or clamping the gap to a
valid span. Keep the fix localized to the gap construction around offsets.slice
and the name_hits iteration so the deny-list pass no longer aborts on
InvalidSpan.
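
The overlap guard could take roughly this shape. `Hit` and `gaps_between` are hypothetical names; the sketch only computes the gap ranges that would then be passed to `offsets.slice`, and it assumes the hits are sorted by start.

```rust
/// Hypothetical name hit with a half-open range `[start, end)`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Hit {
  start: u32,
  end: u32,
}

/// Returns the gaps between consecutive hits, skipping pairs that overlap
/// or touch. Assumes `hits` is sorted by `start`.
fn gaps_between(hits: &[Hit]) -> Vec<(u32, u32)> {
  let mut gaps = Vec::new();
  let Some(first) = hits.first() else {
    return gaps;
  };
  // Furthest end seen so far; this also handles a hit nested in an earlier one.
  let mut covered_end = first.end;
  for next in &hits[1..] {
    if next.start > covered_end {
      gaps.push((covered_end, next.start));
    }
    covered_end = covered_end.max(next.end);
  }
  gaps
}
```

Every returned pair satisfies start < end, so a later slice call cannot fail because the range is inverted.
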

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 96ee159f-0112-40db-82cb-d980969dcf71

📥 Commits

Reviewing files that changed from the base of the PR and between b05a87a and 343a82f.

📒 Files selected for processing (21)
  • .github/workflows/ci.yml
  • crates/anonymize-adapter-contract/examples/native_adapter_perf.rs
  • crates/anonymize-adapter-contract/src/lib.rs
  • crates/anonymize-core/src/address_context.rs
  • crates/anonymize-core/src/address_seeds.rs
  • crates/anonymize-core/src/byte_offsets.rs
  • crates/anonymize-core/src/coreference.rs
  • crates/anonymize-core/src/false_positives.rs
  • crates/anonymize-core/src/normalize.rs
  • crates/anonymize-core/src/prepared.rs
  • crates/anonymize-core/src/processors.rs
  • crates/anonymize-core/src/redact.rs
  • crates/anonymize-core/src/resolution/boundary.rs
  • crates/anonymize-core/src/resolution/sanitize.rs
  • crates/anonymize-core/src/resolution/types.rs
  • crates/anonymize-core/src/search.rs
  • crates/anonymize-core/src/triggers.rs
  • crates/anonymize-core/tests/false_positive_parity.rs
  • crates/anonymize-core/tests/prepared.rs
  • crates/anonymize-core/tests/redaction.rs
  • rust-toolchain.toml
🚧 Files skipped from review as they are similar to previous changes (14)
  • .github/workflows/ci.yml
  • crates/anonymize-adapter-contract/examples/native_adapter_perf.rs
  • crates/anonymize-core/src/resolution/types.rs
  • crates/anonymize-core/src/redact.rs
  • crates/anonymize-core/src/coreference.rs
  • crates/anonymize-core/src/resolution/sanitize.rs
  • crates/anonymize-core/src/resolution/boundary.rs
  • crates/anonymize-core/src/normalize.rs
  • crates/anonymize-core/src/address_context.rs
  • crates/anonymize-core/src/false_positives.rs
  • crates/anonymize-core/src/address_seeds.rs
  • crates/anonymize-core/src/prepared.rs
  • crates/anonymize-core/src/search.rs
  • crates/anonymize-adapter-contract/src/lib.rs

Comment thread crates/anonymize-core/src/processors.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a06b5289e0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/anonymize-core/src/coreference.rs Outdated
Comment thread packages/anonymize/src/build-unified-search.ts
Comment thread packages/anonymize/src/native-pipeline.ts
Comment thread crates/anonymize-core/src/prepared.rs
Comment thread crates/anonymize-core/src/normalize.rs Outdated
@jan-kubica jan-kubica force-pushed the codex/anonymize-core-redaction branch from 436e78c to 0aca8a7 Compare June 27, 2026 15:46

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 913d09e2c9

Comment on lines +203 to +204
let limit =
  advance_char_boundary(full_text, anchor, MAX_WITNESS_SCAN_BYTES);

P2: Measure the witness scan window in text units

When an IN WITNESS WHEREOF preamble contains multibyte prose before the terminating `.` or blank line, this byte cap can stop the native scan before the sentence terminator, even though the terminator is still within the TypeScript detector's 600 UTF-16-code-unit slice. In that case the Rust signature detector never calls try_emit_forward_lines, so the signer immediately below the preamble is left unredacted in native while the TS path emits it. Compute the 600-unit window in UTF-16 code units, matching the TypeScript offsets, and then map the result back to a byte boundary.
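
A hedged sketch of that mapping: walk forward from the anchor counting UTF-16 code units and return the byte index where the window ends. `advance_utf16_units` is a hypothetical helper, not the existing `advance_char_boundary`.

```rust
/// Returns the byte index reached by walking at most `max_units` UTF-16
/// code units forward from byte index `anchor`.
/// Returns `None` if `anchor` is not a char boundary inside `text`.
fn advance_utf16_units(
  text: &str,
  anchor: usize,
  max_units: usize,
) -> Option<usize> {
  let tail = text.get(anchor..)?;
  let mut units = 0usize;
  for (offset, ch) in tail.char_indices() {
    let next = units + ch.len_utf16();
    if next > max_units {
      // Including `ch` would exceed the window; stop just before it.
      return Some(anchor + offset);
    }
    units = next;
  }
  // The whole tail fits inside the window.
  Some(text.len())
}
```

One difference from a JavaScript string slice: a slice can cut a surrogate pair in half, while this sketch stops before a character that would straddle the limit. Whether that edge case matters for parity is left open here.
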

Comment on lines +332 to +336
first.is_uppercase()
  && chars.take(30).all(|ch| {
    ch.is_alphabetic()
      || matches!(ch, '\u{0300}'..='\u{036f}' | '.' | '\'' | '-' | '’')
  })

P2: Reject overlong signature name tokens

For signature candidates with a single long capitalized token, chars.take(30).all(...) only validates the first 30 trailing characters and ignores any remaining characters. The TypeScript CAP_TOKEN caps each token at one uppercase character plus 30 following characters, so a value like /s/ Supercalifragilisticexpialidociousxxxx Smith can be accepted and redacted only by native; also check that no characters remain after the 30-character tail.
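
A sketch of the stricter check, under assumptions: `is_cap_token` and `is_name_tail_char` are hypothetical names, the limit is counted in `char`s rather than UTF-16 units, and a lone capital letter is accepted because the comment does not say whether the TypeScript pattern requires a tail.

```rust
/// Maximum number of characters allowed after the leading uppercase letter.
const MAX_TAIL_CHARS: usize = 30;

/// Characters allowed after the first letter, mirroring the quoted predicate.
fn is_name_tail_char(ch: char) -> bool {
  ch.is_alphabetic()
    || matches!(ch, '\u{0300}'..='\u{036f}' | '.' | '\'' | '-' | '\u{2019}')
}

/// Accepts one uppercase letter followed by at most `MAX_TAIL_CHARS`
/// allowed characters, and rejects anything longer.
fn is_cap_token(token: &str) -> bool {
  let mut chars = token.chars();
  let Some(first) = chars.next() else {
    return false;
  };
  if !first.is_uppercase() {
    return false;
  }
  let mut tail_len = 0usize;
  for ch in chars {
    tail_len += 1;
    // Unlike `take(30).all(..)`, leftover characters cause a rejection.
    if tail_len > MAX_TAIL_CHARS || !is_name_tail_char(ch) {
      return false;
    }
  }
  true
}
```

The loop stops at the first invalid or excess character, so overlong tokens are rejected without scanning the rest.
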
