
perf: skip unused prepared regex compiles #118

Merged
jan-kubica merged 5 commits into main from codex/skip-fast-regex-individual-compile
Jun 29, 2026
Conversation

@jan-kubica jan-kubica commented Jun 29, 2026

Contributor

CC on behalf of @sok0

Summary

  • skip individual regex compilation when a pattern can stay on the prepared fast path
  • preserve existing prepared artifact validation and matching behavior

Checks

  • cargo fmt --all -- --check
  • cargo clippy -p stella-regex-set-core --all-targets --all-features -- -D warnings
  • cargo test -p stella-regex-set-core
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo test --workspace

Summary by CodeRabbit

  • New Features

    • Faster pattern preparation now skips extra work when safe, improving build performance for supported cases.
  • Bug Fixes

    • Added broader validation and property-based checks to ensure prepared and unprepared pattern handling produce the same results.
    • Improved reliability of pattern parsing checks in the fast path.
  • Chores

    • Updated third-party notices and software bill of materials to reflect current dependencies and licenses.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026



Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 462f19b9-661f-402a-8ee2-3bd681890377

📥 Commits

Reviewing files that changed from the base of the PR and between d3011b5 and 41d56c9.

📒 Files selected for processing (1)
  • crates/core/src/lib.rs
📝 Walkthrough


Adds can_skip_individual_fast_probe and related helpers (meta_regex_can_parse, PreparedMode::is_load) to allow RegexSet::build to skip per-pattern MetaRegex compilation when prepared mode is Load. Includes a new proptest property test and unit test. Provenance artifacts are regenerated to include proptest's transitive dependencies.

Changes

RegexSet fast-path optimization and tests

| Layer / File(s) | Summary |
| --- | --- |
| Fast-path skip helpers (crates/core/src/lib.rs) | Adds meta_regex_can_parse, PreparedMode::is_load, and can_skip_individual_fast_probe to determine whether per-pattern MetaRegex compilation can be skipped for fast-path patterns in Load prepared mode. |
| RegexSet::build routing refactor (crates/core/src/lib.rs) | Refactors per-pattern routing so can_skip_individual_fast_probe gates direct insertion into fast_info without calling MetaRegex::new; updates PatternInfo construction for both fast and slow branches. |
| proptest dependency and tests (crates/core/Cargo.toml, crates/core/src/lib.rs) | Adds proptest = "1" as a dev-dependency; adds test_case_error helper, a property test comparing prepared vs. unprepared RegexSet results across multiple APIs, and a unit test for can_skip_individual_fast_probe outcomes. |
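
As a rough illustration of the gate described in the summary above, here is a minimal sketch. It is not the crate's code: the three-variant `PreparedMode` without payloads, the boolean `parses` argument (standing in for the result of meta_regex_can_parse), and the exact signature of `can_skip_individual_fast_probe` are all assumptions made for this example.

```rust
/// Hypothetical stand-in for the crate's prepared-mode enum. The real
/// variants may carry data (the review thread matches on `Load { .. }`).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PreparedMode {
    None,
    Capture,
    Load,
}

impl PreparedMode {
    /// True only when a previously captured artifact is being loaded.
    pub fn is_load(self) -> bool {
        matches!(self, PreparedMode::Load)
    }
}

/// Assumed shape of the gate: skip the per-pattern compile probe only when
/// the pattern stays on the fast path, the build is a prepared load, and
/// the pattern at least parses (`parses` is a placeholder input here).
pub fn can_skip_individual_fast_probe(
    needs_slow: bool,
    mode: PreparedMode,
    parses: bool,
) -> bool {
    !needs_slow && mode.is_load() && parses
}
```

Under these assumptions, fresh and capture builds never take the shortcut, so they would still run the per-pattern probe and keep their fallback behavior.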

Provenance artifacts update

| Layer / File(s) | Summary |
| --- | --- |
| Regenerated provenance (provenance/THIRD-PARTY-NOTICES.txt, provenance/report.json, provenance/sbom.cdx.json) | Updates third-party notices, bumps dependency count from 46 to 75 in report.json, and regenerates the CycloneDX SBOM components and dependency graph to include proptest and its transitive crates. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: avoiding unnecessary prepared regex compilation for a performance improvement. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a meta_regex_can_parse helper to optimize RegexSet compilation by bypassing the full MetaRegex::new check for fast-path patterns. However, the reviewer identified a critical correctness issue where syntax validation alone does not guarantee successful compilation (e.g., due to size limits), potentially causing fallback path regressions. The reviewer suggested restricting this optimization to when loading from a prepared artifact.


Comment thread crates/core/src/lib.rs Outdated
Comment on lines +1941 to +1952
      if !needs_slow && meta_regex_can_parse(&core) {
        fast_cores.push(dfa_core);
        fast_info.push(PatternInfo {
          original_index: usize_to_u32("Pattern index", i)?,
          verifier,
          boundaries: eb,
          individual: None,
          has_internal_b: false,
          // Fast path patterns never query individual or fallback state.
          fancy_fallback: None,
        });
      } else if let Ok(individual) = MetaRegex::new(&core) {


high

Correctness Issue: Fallback Path Regression for Large/Complex Patterns

Using meta_regex_can_parse only verifies that the pattern is syntactically valid. It does not guarantee that MetaRegex::new (or MetaRegex::new_many) will successfully compile it, as compilation can still fail due to engine-specific constraints (such as the default 10MB size limit in regex-automata).

The Problem

Previously, if a pattern failed individual MetaRegex compilation (e.g., due to size limits), it would safely fall back to the fancy-regex path in the else block. With this change, such a pattern will now bypass the individual compilation check, get added to fast_cores, and subsequently cause the entire RegexSet compilation to fail during build_prepared_multi.

The Solution

We should only skip individual compilation when we are loading from a prepared artifact (PreparedMode::Load). In that mode, we already know the patterns successfully compiled when the artifact was originally captured, making it 100% safe to skip individual compilation. For unprepared compilation (PreparedMode::None or PreparedMode::Capture), we should still perform the individual compilation check to preserve the fallback behavior.

      let is_load = matches!(prepared, PreparedMode::Load { .. });
      let mut individual = None;
      let can_fast = !needs_slow && (is_load && meta_regex_can_parse(&core) || (!is_load && {
        if let Ok(ind) = MetaRegex::new(&core) {
          individual = Some(ind);
          true
        } else {
          false
        }
      }));

      if can_fast {
        fast_cores.push(dfa_core);
        fast_info.push(PatternInfo {
          original_index: usize_to_u32("Pattern index", i)?,
          verifier,
          boundaries: eb,
          individual: None,
          has_internal_b: false,
          // Fast path patterns never query individual or fallback state.
          fancy_fallback: None,
        });
      } else if let Ok(individual) = match individual {
        Some(ind) => Ok(ind),
        None => MetaRegex::new(&core),
      } {

Contributor Author


CC on behalf of @sok0

Agreed, implemented in 3572e34. Fresh and capture builds now still probe MetaRegex::new so unsupported patterns can fall back; only prepared loads use the parse-only fast shortcut. Added regression coverage for that mode split.

@jan-kubica jan-kubica marked this pull request as ready for review June 29, 2026 04:43

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c37fe32d2


Comment thread crates/core/src/lib.rs Outdated
Comment on lines +1941 to +1942
      if !needs_slow && meta_regex_can_parse(&core) {
        fast_cores.push(dfa_core);


P2: Preserve fallback for parseable build failures

When a fast-path pattern parses successfully but cannot be built by MetaRegex (for example, a very large alternation or other pattern that trips regex-automata build limits), this branch now pushes it into fast_cores and the later build_prepared_multi call returns an error. Before this change, the per-pattern MetaRegex::new(&core) failure sent that pattern through the fallback/chunking path below, so these parseable-but-too-large patterns could still be accepted; this change turns them into construction failures instead of falling back.
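
The failure mode described above can be modeled with a toy example. Everything here is hypothetical and uses no regex crate: `toy_parse` stands in for a syntax-only check, `toy_build` for a compile step that can fail on a size limit, and `TOY_SIZE_LIMIT` is an arbitrary number chosen for illustration. The sketch only contrasts the two routing strategies; it does not reproduce the crate's build logic.

```rust
/// Where a pattern ends up after routing (hypothetical, for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    Fast,
    Fallback,
}

/// Arbitrary toy limit standing in for an engine build limit.
const TOY_SIZE_LIMIT: usize = 16;

/// Syntax-only check: here, just balanced parentheses.
fn toy_parse(pattern: &str) -> bool {
    let mut depth = 0i32;
    for c in pattern.chars() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth < 0 {
                    return false;
                }
            }
            _ => {}
        }
    }
    depth == 0
}

/// Full build: parsing plus a size check that parsing alone cannot predict.
fn toy_build(pattern: &str) -> Result<(), String> {
    if !toy_parse(pattern) {
        return Err("parse error".to_string());
    }
    if pattern.len() > TOY_SIZE_LIMIT {
        return Err("size limit exceeded".to_string());
    }
    Ok(())
}

/// Parse-only gate: a parseable pattern is committed to the fast set, so a
/// later build failure surfaces as an error for the whole construction.
pub fn route_parse_only(pattern: &str) -> Result<Route, String> {
    if toy_parse(pattern) {
        toy_build(pattern)?; // stands in for the later combined build
        Ok(Route::Fast)
    } else {
        Ok(Route::Fallback)
    }
}

/// Probe gate: a failed per-pattern build sends the pattern to the fallback.
pub fn route_with_probe(pattern: &str) -> Route {
    if toy_build(pattern).is_ok() {
        Route::Fast
    } else {
        Route::Fallback
    }
}
```

In this model, a 32-character pattern parses but exceeds the toy limit: `route_parse_only` returns an error, while `route_with_probe` returns `Route::Fallback`.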


Contributor Author


CC on behalf of @sok0

Agreed, this is addressed in 3572e34. Fresh and capture builds still probe MetaRegex::new so parseable-but-too-large patterns keep the fallback path; only prepared loads use the parse-only shortcut. The branch also has property coverage comparing prepared and unprepared behavior for generated literal sets.
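
For readers unfamiliar with this style of coverage, the idea of comparing prepared and unprepared behavior over generated literal sets can be sketched as below. This is not the PR's test: it uses a small hand-rolled generator instead of proptest so the example stays standard-library only, and `matches_unprepared`, `matches_prepared`, and `prepared_agrees_with_unprepared` are hypothetical names with toy substring matching in place of the real RegexSet APIs.

```rust
use std::collections::BTreeMap;

/// Tiny linear congruential generator, a stand-in for proptest's generators.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }

    /// Random literal of length 1 to 3 over the alphabet {a, b, c}.
    fn literal(&mut self) -> String {
        let len = 1 + (self.next_u64() % 3) as usize;
        (0..len)
            .map(|_| (b'a' + (self.next_u64() % 3) as u8) as char)
            .collect()
    }
}

/// Reference path: test every literal directly against the haystack.
fn matches_unprepared(literals: &[String], haystack: &str) -> Vec<usize> {
    (0..literals.len())
        .filter(|&i| haystack.contains(literals[i].as_str()))
        .collect()
}

/// "Prepared" path: deduplicate literals first, then map hits back to the
/// original indices. It must report the same indices as the reference path.
fn matches_prepared(literals: &[String], haystack: &str) -> Vec<usize> {
    let mut prepared: BTreeMap<&str, Vec<usize>> = BTreeMap::new();
    for (i, lit) in literals.iter().enumerate() {
        prepared.entry(lit.as_str()).or_default().push(i);
    }
    let mut hits: Vec<usize> = prepared
        .iter()
        .filter(|(lit, _)| haystack.contains(**lit))
        .flat_map(|(_, idxs)| idxs.iter().copied())
        .collect();
    hits.sort_unstable();
    hits
}

/// Property: for every generated case, both paths report the same indices.
pub fn prepared_agrees_with_unprepared(seed: u64, cases: usize) -> bool {
    let mut rng = Lcg(seed);
    (0..cases).all(|_| {
        let n = 1 + (rng.next_u64() % 4) as usize;
        let literals: Vec<String> = (0..n).map(|_| rng.literal()).collect();
        let haystack: String = (0..4).map(|_| rng.literal()).collect();
        matches_unprepared(&literals, &haystack) == matches_prepared(&literals, &haystack)
    })
}
```

A real property test would also shrink failing inputs and exercise each public matching API, which this sketch leaves out.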

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
provenance/sbom.cdx.json (1)

3157-3190: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Restore the wasip2 license block.

Line 3157 introduces pkg:cargo/wasip2@1.0.4+wasi-0.2.12 without a licenses entry, but Line 78 of provenance/THIRD-PARTY-NOTICES.txt and Line 14 of provenance/report.json both treat it as a licensed dependency. That leaves the machine-readable SBOM incomplete and makes the reported licensed-dependency count overstated. Regenerate this component with the same SPDX expression carried by the notice file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@provenance/sbom.cdx.json` around lines 3157 - 3190, The SBOM entry for wasip2
is missing its licenses metadata, so the component in provenance/sbom.cdx.json
is incomplete. Update the wasip2 component block to restore the same SPDX
license expression already associated with this dependency in the notice/report
artifacts, and keep the existing identifiers like bom-ref, purl, and version
unchanged while adding the licenses entry back to the component definition.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/core/src/lib.rs`:
- Around line 118-119: The load-mode skip in meta_regex_can_parse is only doing
a syntax parse, which can disagree with MetaRegex::new(&core) and cause patterns
to be routed into fast_cores differently than prepare. Update
meta_regex_can_parse (and the Load-mode branch that uses it) to use the same
buildability check as MetaRegex construction, or otherwise persist and reuse the
original routing decision from the prepared artifact stream so both paths stay
aligned.

In `@provenance/sbom.cdx.json`:
- Around line 547-552: The SBOM is serializing combined licenses as a raw
license.id string instead of a valid SPDX expression, causing inconsistency with
the rest of the file. Update the SBOM generation path that emits these entries
so the combined-license cases use the same license.expression field as other
multi-license records, or split them into separate single-license entries. Make
sure the generator handles the affected combined-license values in
provenance/sbom.cdx.json consistently before regenerating the document.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: cd68aba3-fbc7-4e0c-8a61-213b000c600a

📥 Commits

Reviewing files that changed from the base of the PR and between 75b6a7f and d3011b5.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • crates/core/Cargo.toml
  • crates/core/src/lib.rs
  • provenance/THIRD-PARTY-NOTICES.txt
  • provenance/report.json
  • provenance/sbom.cdx.json

Comment thread crates/core/src/lib.rs
Comment thread provenance/sbom.cdx.json
@jan-kubica jan-kubica merged commit ed49294 into main Jun 29, 2026
23 checks passed
@jan-kubica jan-kubica deleted the codex/skip-fast-regex-individual-compile branch June 29, 2026 05:31
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 29, 2026