perf: skip unused prepared regex compiles by jan-kubica · Pull Request #118 · stella/regex-set

jan-kubica · 2026-06-29T04:40:37Z

CC on behalf of @sok0

Summary

- skip individual regex compilation when a pattern can stay on the prepared fast path
- preserve existing prepared artifact validation and matching behavior

Checks

- `cargo fmt --all -- --check`
- `cargo clippy -p stella-regex-set-core --all-targets --all-features -- -D warnings`
- `cargo test -p stella-regex-set-core`
- `cargo clippy --workspace --all-targets --all-features -- -D warnings`
- `cargo test --workspace`

Summary by CodeRabbit

New Features
- Faster pattern preparation now skips extra work when safe, improving build performance for supported cases.
Bug Fixes
- Added broader validation and property-based checks to ensure prepared and unprepared pattern handling produce the same results.
- Improved reliability of pattern parsing checks in the fast path.
Chores
- Updated third-party notices and software bill of materials to reflect current dependencies and licenses.

coderabbitai · 2026-06-29T04:40:45Z

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 462f19b9-661f-402a-8ee2-3bd681890377

📥 Commits

Reviewing files that changed from the base of the PR and between d3011b5 and 41d56c9.

📒 Files selected for processing (1)

crates/core/src/lib.rs

📝 Walkthrough

Walkthrough

Adds can_skip_individual_fast_probe and related helpers (meta_regex_can_parse, PreparedMode::is_load) to allow RegexSet::build to skip per-pattern MetaRegex compilation when prepared mode is Load. Includes a new proptest property test and unit test. Provenance artifacts are regenerated to include proptest's transitive dependencies.
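
The gate described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code in `crates/core/src/lib.rs`: the payload-free `PreparedMode` variants, the boolean `parses` argument (standing in for the result of `meta_regex_can_parse`), and the exact signature are assumptions made for the example.

```rust
/// Hypothetical stand-in for the crate's prepared-artifact mode.
/// The real enum may carry data in its variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PreparedMode {
    None,
    Capture,
    Load,
}

impl PreparedMode {
    /// True only when an existing prepared artifact is being loaded.
    fn is_load(self) -> bool {
        matches!(self, PreparedMode::Load)
    }
}

/// Assumed shape of the gate: skip the per-pattern probe only when the
/// pattern is not forced onto the slow path, the artifact is being loaded,
/// and the pattern at least parses.
fn can_skip_individual_fast_probe(needs_slow: bool, mode: PreparedMode, parses: bool) -> bool {
    !needs_slow && mode.is_load() && parses
}

fn main() {
    for mode in [PreparedMode::None, PreparedMode::Capture, PreparedMode::Load] {
        let skip = can_skip_individual_fast_probe(false, mode, true);
        println!("{:?}: skip individual probe = {}", mode, skip);
    }
}
```

Under these assumptions, only the `Load` mode skips the probe; fresh and capture builds keep the full per-pattern check.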

Changes

RegexSet fast-path optimization and tests

| Layer / File(s) | Summary |
|---|---|
| Fast-path skip helpers (`crates/core/src/lib.rs`) | Adds `meta_regex_can_parse`, `PreparedMode::is_load`, and `can_skip_individual_fast_probe` to determine whether per-pattern `MetaRegex` compilation can be skipped for fast-path patterns in `Load` prepared mode. |
| RegexSet::build routing refactor (`crates/core/src/lib.rs`) | Refactors per-pattern routing so `can_skip_individual_fast_probe` gates direct insertion into `fast_info` without calling `MetaRegex::new`; updates `PatternInfo` construction for both fast and slow branches. |
| proptest dependency and tests (`crates/core/Cargo.toml`, `crates/core/src/lib.rs`) | Adds `proptest = "1"` as a dev-dependency; adds `test_case_error` helper, a property test comparing prepared vs. unprepared `RegexSet` results across multiple APIs, and a unit test for `can_skip_individual_fast_probe` outcomes. |

Provenance artifacts update

| Layer / File(s) | Summary |
|---|---|
| Regenerated provenance (`provenance/THIRD-PARTY-NOTICES.txt`, `provenance/report.json`, `provenance/sbom.cdx.json`) | Updates third-party notices, bumps dependency count from 46 to 75 in report.json, and regenerates the CycloneDX SBOM components and dependency graph to include proptest and its transitive crates. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: avoiding unnecessary prepared regex compilation for a performance improvement. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist

Code Review

This pull request introduces a meta_regex_can_parse helper to optimize RegexSet compilation by bypassing the full MetaRegex::new check for fast-path patterns. However, the reviewer identified a critical correctness issue where syntax validation alone does not guarantee successful compilation (e.g., due to size limits), potentially causing fallback path regressions. The reviewer suggested restricting this optimization to when loading from a prepared artifact.

gemini-code-assist · 2026-06-29T04:41:56Z

+      if !needs_slow && meta_regex_can_parse(&core) {
+        fast_cores.push(dfa_core);
+        fast_info.push(PatternInfo {
+          original_index: usize_to_u32("Pattern index", i)?,
+          verifier,
+          boundaries: eb,
+          individual: None,
+          has_internal_b: false,
+          // Fast path patterns never query individual or fallback state.
+          fancy_fallback: None,
+        });
+      } else if let Ok(individual) = MetaRegex::new(&core) {


Correctness Issue: Fallback Path Regression for Large/Complex Patterns

Using meta_regex_can_parse only verifies that the pattern is syntactically valid. It does not guarantee that MetaRegex::new (or MetaRegex::new_many) will successfully compile it, as compilation can still fail due to engine-specific constraints (such as the default 10MB size limit in regex-automata).

The Problem

Previously, if a pattern failed individual MetaRegex compilation (e.g., due to size limits), it would safely fall back to the fancy-regex path in the else block. With this change, such a pattern will now bypass the individual compilation check, get added to fast_cores, and subsequently cause the entire RegexSet compilation to fail during build_prepared_multi.
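
The regression described above can be illustrated with a toy model. Everything here is hypothetical: `Pattern`, `SIZE_LIMIT`, and the routing functions are stand-ins invented for the example, not types or limits from the crate or from regex-automata. The point is only that a parse-only probe admits a pattern that the later combined build rejects.

```rust
/// Toy pattern: only its syntax validity and compiled size matter here.
struct Pattern {
    parses: bool,
    compiled_size: usize,
}

/// Hypothetical size limit standing in for an engine build limit.
const SIZE_LIMIT: usize = 100;

#[derive(Debug, PartialEq, Eq)]
enum Route {
    Fast,
    Fallback,
}

/// Stand-in for a full per-pattern build: needs valid syntax and a size under the limit.
fn builds(p: &Pattern) -> bool {
    p.parses && p.compiled_size <= SIZE_LIMIT
}

/// Routing with a full build probe: oversized patterns fall back.
fn route_with_build_probe(p: &Pattern) -> Route {
    if builds(p) { Route::Fast } else { Route::Fallback }
}

/// Routing with a parse-only probe: oversized patterns land on the fast path.
fn route_with_parse_probe(p: &Pattern) -> Route {
    if p.parses { Route::Fast } else { Route::Fallback }
}

/// The combined fast-path build fails if any fast-routed pattern cannot be built.
/// On success it returns the number of fast-path patterns.
fn build_set(patterns: &[Pattern], route: fn(&Pattern) -> Route) -> Result<usize, String> {
    let mut fast = 0;
    for (i, p) in patterns.iter().enumerate() {
        if route(p) == Route::Fast {
            if !builds(p) {
                return Err(format!("pattern {i} failed the combined build"));
            }
            fast += 1;
        }
    }
    Ok(fast)
}

fn main() {
    let patterns = [
        Pattern { parses: true, compiled_size: 10 },
        Pattern { parses: true, compiled_size: 1_000 }, // parseable but too large
    ];
    println!("build probe: {:?}", build_set(&patterns, route_with_build_probe));
    println!("parse probe: {:?}", build_set(&patterns, route_with_parse_probe));
}
```

With the build probe the oversized pattern is routed to the fallback and the set still builds; with the parse-only probe the same input turns into a construction error.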

The Solution

We should only skip individual compilation when we are loading from a prepared artifact (PreparedMode::Load). In that mode, we already know the patterns successfully compiled when the artifact was originally captured, making it 100% safe to skip individual compilation. For unprepared compilation (PreparedMode::None or PreparedMode::Capture), we should still perform the individual compilation check to preserve the fallback behavior.

let is_load = matches!(prepared, PreparedMode::Load { .. });
let mut individual = None;
let can_fast = !needs_slow
  && (is_load && meta_regex_can_parse(&core)
    || (!is_load && {
      if let Ok(ind) = MetaRegex::new(&core) {
        individual = Some(ind);
        true
      } else {
        false
      }
    }));
if can_fast {
  fast_cores.push(dfa_core);
  fast_info.push(PatternInfo {
    original_index: usize_to_u32("Pattern index", i)?,
    verifier,
    boundaries: eb,
    individual: None,
    has_internal_b: false,
    // Fast path patterns never query individual or fallback state.
    fancy_fallback: None,
  });
} else if let Ok(individual) = match individual {
  Some(ind) => Ok(ind),
  None => MetaRegex::new(&core),
} {

CC on behalf of @sok0

Agreed, implemented in 3572e34. Fresh and capture builds now still probe MetaRegex::new so unsupported patterns can fall back; only prepared loads use the parse-only fast shortcut. Added regression coverage for that mode split.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c37fe32d2

chatgpt-codex-connector · 2026-06-29T04:48:39Z

+      if !needs_slow && meta_regex_can_parse(&core) {
+        fast_cores.push(dfa_core);


Preserve fallback for parseable build failures

When a fast-path pattern parses successfully but cannot be built by MetaRegex (for example, a very large alternation or other pattern that trips regex-automata build limits), this branch now pushes it into fast_cores and the later build_prepared_multi call returns an error. Before this change, the per-pattern MetaRegex::new(&core) failure sent that pattern through the fallback/chunking path below, so these parseable-but-too-large patterns could still be accepted; this change turns them into construction failures instead of falling back.

CC on behalf of @sok0

Agreed, this is addressed in 3572e34. Fresh and capture builds still probe MetaRegex::new so parseable-but-too-large patterns keep the fallback path; only prepared loads use the parse-only shortcut. The branch also has property coverage comparing prepared and unprepared behavior for generated literal sets.
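
The idea behind that property coverage (generate literal sets, then require the prepared and unprepared paths to agree) can be approximated without any external crate. The sketch below is not the PR's proptest test: the xorshift generator, the `Prepared` struct, and the substring-based matchers are all hypothetical stand-ins chosen so the example stays self-contained.

```rust
/// Tiny deterministic generator (xorshift64) so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Random string over {a, b, c} with length in 1..=max_len.
    fn word(&mut self, max_len: u64) -> String {
        let len = 1 + self.next_u64() % max_len;
        (0..len)
            .map(|_| (b'a' + (self.next_u64() % 3) as u8) as char)
            .collect()
    }
}

/// Reference matcher: indices of literals contained in the haystack.
fn match_unprepared(literals: &[String], haystack: &str) -> Vec<usize> {
    literals
        .iter()
        .enumerate()
        .filter(|(_, lit)| haystack.contains(lit.as_str()))
        .map(|(i, _)| i)
        .collect()
}

/// Stand-in for a "prepared" matcher: literals are reordered ahead of time,
/// and hits are mapped back to their original indices.
struct Prepared {
    ordered: Vec<(usize, String)>,
}

impl Prepared {
    fn new(literals: &[String]) -> Self {
        let mut ordered: Vec<(usize, String)> = literals.iter().cloned().enumerate().collect();
        ordered.sort_by_key(|(_, lit)| lit.len());
        Prepared { ordered }
    }

    fn matches(&self, haystack: &str) -> Vec<usize> {
        let mut hits: Vec<usize> = self
            .ordered
            .iter()
            .filter(|(_, lit)| haystack.contains(lit.as_str()))
            .map(|(i, _)| *i)
            .collect();
        hits.sort_unstable();
        hits
    }
}

/// Property: for every generated case, both matchers report the same indices.
fn prepared_matches_unprepared(seed: u64, cases: usize) -> bool {
    let mut rng = XorShift(seed.max(1)); // xorshift state must be nonzero
    (0..cases).all(|_| {
        let count = 1 + rng.next_u64() % 4;
        let literals: Vec<String> = (0..count).map(|_| rng.word(3)).collect();
        let haystack = rng.word(12);
        Prepared::new(&literals).matches(&haystack) == match_unprepared(&literals, &haystack)
    })
}

fn main() {
    println!("equivalent: {}", prepared_matches_unprepared(0x5eed, 500));
}
```

A real property test would also shrink failing inputs to a minimal counterexample, which is what proptest adds over a hand-rolled loop like this one.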

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

provenance/sbom.cdx.json (1)
3157-3190: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Restore the wasip2 license block.

Line 3157 introduces pkg:cargo/wasip2@1.0.4+wasi-0.2.12 without a licenses entry, but Line 78 of provenance/THIRD-PARTY-NOTICES.txt and Line 14 of provenance/report.json both treat it as a licensed dependency. That leaves the machine-readable SBOM incomplete and makes the reported licensed-dependency count overstated. Regenerate this component with the same SPDX expression carried by the notice file.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@provenance/sbom.cdx.json` around lines 3157 - 3190, The SBOM entry for wasip2
is missing its licenses metadata, so the component in provenance/sbom.cdx.json
is incomplete. Update the wasip2 component block to restore the same SPDX
license expression already associated with this dependency in the notice/report
artifacts, and keep the existing identifiers like bom-ref, purl, and version
unchanged while adding the licenses entry back to the component definition.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/core/src/lib.rs`:
- Around line 118-119: The load-mode skip in meta_regex_can_parse is only doing
a syntax parse, which can disagree with MetaRegex::new(&core) and cause patterns
to be routed into fast_cores differently than prepare. Update
meta_regex_can_parse (and the Load-mode branch that uses it) to use the same
buildability check as MetaRegex construction, or otherwise persist and reuse the
original routing decision from the prepared artifact stream so both paths stay
aligned.

In `@provenance/sbom.cdx.json`:
- Around line 547-552: The SBOM is serializing combined licenses as a raw
license.id string instead of a valid SPDX expression, causing inconsistency with
the rest of the file. Update the SBOM generation path that emits these entries
so the combined-license cases use the same license.expression field as other
multi-license records, or split them into separate single-license entries. Make
sure the generator handles the affected combined-license values in
provenance/sbom.cdx.json consistently before regenerating the document.
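
One way a generator could make that distinction is sketched below. It assumes the CycloneDX `licenses` array accepts either a `license` object with an `id` or an entry with an `expression` string. The helper names are hypothetical, the operator check is a simplification of SPDX expression syntax, and the strings are formatted by hand without JSON escaping.

```rust
/// Returns true when the string looks like a compound SPDX expression
/// rather than a single license identifier (simplified check).
fn is_compound_spdx(license: &str) -> bool {
    [" OR ", " AND ", " WITH "]
        .iter()
        .any(|op| license.contains(*op))
}

/// Hypothetical helper: emit one entry for a CycloneDX `licenses` array.
/// Single identifiers use `license.id`; compound values use `expression`.
/// Assumes the input needs no JSON escaping.
fn license_entry(license: &str) -> String {
    if is_compound_spdx(license) {
        format!(r#"{{"expression":"{license}"}}"#)
    } else {
        format!(r#"{{"license":{{"id":"{license}"}}}}"#)
    }
}

fn main() {
    for license in ["MIT", "MIT OR Apache-2.0"] {
        println!("{}", license_entry(license));
    }
}
```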

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: cd68aba3-fbc7-4e0c-8a61-213b000c600a

📥 Commits

Reviewing files that changed from the base of the PR and between 75b6a7f and d3011b5.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (5)

crates/core/Cargo.toml
crates/core/src/lib.rs
provenance/THIRD-PARTY-NOTICES.txt
provenance/report.json
provenance/sbom.cdx.json

perf: skip unused fast regex compile

9c37fe3

gemini-code-assist Bot reviewed Jun 29, 2026

jan-kubica marked this pull request as ready for review June 29, 2026 04:43

fix: preserve prepared regex fallback probe

3572e34

chatgpt-codex-connector Bot reviewed Jun 29, 2026

jan-kubica and others added 2 commits June 29, 2026 06:59

test: add prepared regex property coverage

0d4e29b

chore: refresh provenance artifacts [skip ci]

d3011b5

coderabbitai Bot reviewed Jun 29, 2026

Comment thread crates/core/src/lib.rs

Comment thread provenance/sbom.cdx.json

fix: align prepared regex load fallback

41d56c9

jan-kubica merged commit ed49294 into main Jun 29, 2026
23 checks passed

jan-kubica deleted the codex/skip-fast-regex-individual-compile branch June 29, 2026 05:31

github-actions Bot locked and limited conversation to collaborators Jun 29, 2026

jan-kubica merged 5 commits into main from codex/skip-fast-regex-individual-compile