perf: improve prepared search diagnostics and warming by jan-kubica · Pull Request #75 · stella/text-search

jan-kubica · 2026-06-29T04:40:36Z

CC on behalf of @sok0

Summary

add build/find diagnostics for prepared text-search slots
warm lazy regex slots in parallel and keep small literal prefilters inline
parallelize long multi-engine searches and widen split literal chunks

Checks

cargo fmt --all -- --check
cargo clippy -p stella-text-search-core --all-targets --all-features -- -D warnings
cargo test -p stella-text-search-core
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace

coderabbitai · 2026-06-29T04:40:45Z

Warning

Review limit reached

@jan-kubica, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 23 minutes

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable usage-based reviews in Billing to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information, and refer to the rate limits docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a666826c-09dc-4bcb-bac5-e52b996d93aa

📥 Commits

Reviewing files that changed from the base of the PR and between 0e44094 and c9ea954.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

crates/core/Cargo.toml
crates/core/src/lib.rs
crates/core/tests/text_search.rs

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/parallel-lazy-regex-warm

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request introduces several enhancements to the text search engine, including parallel find capabilities, parallel lazy regex warming, inline literal prefilters, and detailed build and find statistics. It also adds a byte-windowed prefilter search to restrict regex matching to a radius around literal hits. The review feedback highlights three key areas for improvement: resolving a load imbalance in parallel warming by chunking only uninitialized lazy engines, eliminating a performance bottleneck in Unicode case-insensitive search by avoiding full-haystack allocations, and optimizing inline prefilter matching to use ASCII case-insensitive search whenever the needle is ASCII.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-29T04:43:06Z

  pub fn warm_lazy_regex(&self) -> Result<()> {
-    for engine in &self.engines {
-      warm_engine_lazy_regex(engine)?;
-    }
-    Ok(())
+    let lazy_count = self
+      .engines
+      .iter()
+      .filter(|engine| engine_has_uninitialized_lazy_regex(engine))
+      .count();
+    if lazy_count < PARALLEL_LAZY_REGEX_WARM_MIN_SLOTS {
+      return warm_engines_lazy_regex(&self.engines);
+    }
+
+    let workers = std::thread::available_parallelism()
+      .map_or(1, usize::from)
+      .min(lazy_count);
+    if workers <= 1 {
+      return warm_engines_lazy_regex(&self.engines);
+    }
+
+    let chunk_size = self.engines.len().div_ceil(workers);
+    std::thread::scope(|scope| {
+      let mut handles = Vec::with_capacity(workers);
+      for chunk in self.engines.chunks(chunk_size) {
+        handles.push(scope.spawn(move || warm_engines_lazy_regex(chunk)));
+      }
+      for handle in handles {
+        handle.join().map_err(|_| {
+          Error::BuildRegex(String::from("Lazy regex warm-up panicked"))
+        })??;
+      }
+      Ok(())
+    })
  }


Load Imbalance in Parallel Warming

Currently, chunk_size is calculated based on the total number of engines (self.engines.len()), but the loop only warms up uninitialized lazy regexes. If the uninitialized lazy regexes are clustered together (e.g., at the beginning of the engines list), one thread will end up doing all the warming sequentially while other threads do nothing.

To ensure optimal load balancing, we should chunk only the collected lazy_engines that actually need warming. Additionally, we can avoid iterating over all engines again in the sequential fallback paths by directly warming the collected lazy_engines.

pub fn warm_lazy_regex(&self) -> Result<()> { let lazy_engines: Vec<&EngineSlot> = self .engines .iter() .filter(|engine| engine_has_uninitialized_lazy_regex(engine)) .collect(); let lazy_count = lazy_engines.len(); if lazy_count < PARALLEL_LAZY_REGEX_WARM_MIN_SLOTS { for engine in lazy_engines { warm_engine_lazy_regex(engine)?; } return Ok(()); } let workers = std::thread::available_parallelism() .map_or(1, usize::from) .min(lazy_count); if workers <= 1 { for engine in lazy_engines { warm_engine_lazy_regex(engine)?; } return Ok(()); } let chunk_size = lazy_count.div_ceil(workers); std::thread::scope(|scope| { let mut handles = Vec::with_capacity(workers); for chunk in lazy_engines.chunks(chunk_size) { handles.push(scope.spawn(move || { for engine in chunk { warm_engine_lazy_regex(engine)?; } Ok::<(), Error>(()) })); } for handle in handles { handle.join().map_err(|_| { Error::BuildRegex(String::from("Lazy regex warm-up panicked")) })??; } Ok(()) }) }

CC on behalf of @sok0

Agreed, implemented in 26def7f. warm_lazy_regex now collects the uninitialized lazy regex slots first, warms only those slots, and chunks that smaller set for better load balancing.

gemini-code-assist · 2026-06-29T04:43:06Z

+fn contains_unicode_case_insensitive(haystack: &str, needle: &str) -> bool {
+  let needle_lower = needle.to_lowercase();
+  let haystack_lower = haystack.to_lowercase();
+  if haystack_lower.contains(&needle_lower) {
+    return true;
+  }
+
+  let needle_upper = needle.to_uppercase();
+  if needle_upper == needle_lower {
+    return false;
+  }
+  haystack.to_uppercase().contains(&needle_upper)
+}


Performance Bottleneck: Allocating the Entire Haystack for Unicode Case-Insensitive Search

Calling to_lowercase() and to_uppercase() on the entire haystack allocates new strings of the same size as the haystack. If the haystack is large (e.g., megabytes of text), this will cause significant memory allocations and CPU overhead, defeating the purpose of a fast inline prefilter.

We can implement a completely allocation-free case-insensitive substring search for the haystack by iterating over its character indices and performing character-by-character case-folded comparisons.

fn contains_unicode_case_insensitive(haystack: &str, needle: &str) -> bool { if needle.is_empty() { return true; } let needle_lower: Vec<char> = needle.chars().flat_map(|c| c.to_lowercase()).collect(); let needle_upper: Vec<char> = needle.chars().flat_map(|c| c.to_uppercase()).collect(); let has_upper = needle_lower != needle_upper; for (start_idx, _) in haystack.char_indices() { let mut haystack_iter = haystack[start_idx..].chars().flat_map(|c| c.to_lowercase()); let mut matched = true; for &n_char in &needle_lower { if haystack_iter.next() != Some(n_char) { matched = false; break; } } if matched { return true; } if has_upper { let mut haystack_iter = haystack[start_idx..].chars().flat_map(|c| c.to_uppercase()); let mut matched = true; for &n_char in &needle_upper { if haystack_iter.next() != Some(n_char) { matched = false; break; } } if matched { return true; } } } false }

CC on behalf of @sok0

Agreed, implemented in 26def7f. The Unicode case-insensitive inline prefilter now compares folded character streams without allocating a full folded haystack, with regression coverage for Unicode case folds.

gemini-code-assist · 2026-06-29T04:43:06Z

+fn inline_literal_prefilter_matches(
+  haystack: &str,
+  needle: &str,
+  case_insensitive: bool,
+) -> bool {
+  if haystack.contains(needle) {
+    return true;
+  }
+  if !case_insensitive {
+    return false;
+  }
+  if haystack.is_ascii() && needle.is_ascii() {
+    return contains_ignore_ascii_case(haystack.as_bytes(), needle.as_bytes());
+  }
+  contains_unicode_case_insensitive(haystack, needle)
+}


Unnecessary Unicode Case-Insensitive Search for ASCII Needles

Since UTF-8 is a self-synchronizing code, any ASCII byte sequence can never partially match a multi-byte UTF-8 character. Therefore, if the needle is ASCII, it is 100% safe and correct to use contains_ignore_ascii_case even if the haystack is non-ASCII.

This optimization avoids falling back to the expensive, allocating contains_unicode_case_insensitive path when the haystack contains Unicode characters but the prefilter needle is ASCII.

Suggested change

fn inline_literal_prefilter_matches(

haystack: &str,

needle: &str,

case_insensitive: bool,

) -> bool {

if haystack.contains(needle) {

return true;

}

if !case_insensitive {

return false;

}

if haystack.is_ascii() && needle.is_ascii() {

return contains_ignore_ascii_case(haystack.as_bytes(), needle.as_bytes());

}

contains_unicode_case_insensitive(haystack, needle)

}

fn inline_literal_prefilter_matches(

haystack: &str,

needle: &str,

case_insensitive: bool,

) -> bool {

if haystack.contains(needle) {

return true;

}

if !case_insensitive {

return false;

}

if needle.is_ascii() {

return contains_ignore_ascii_case(haystack.as_bytes(), needle.as_bytes());

}

contains_unicode_case_insensitive(haystack, needle)

}

CC on behalf of @sok0

Implemented with an adjustment in 26def7f: ASCII needles now use a byte-level case-insensitive scan as a fast positive path. For non-ASCII haystacks, a miss still falls through to Unicode folding so regex case-fold semantics are preserved, e.g. ſ matching s.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 26f44e1586

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-29T04:47:20Z

+      if touches_internal_window_edge(window, haystack.len(), start, end) {
+        continue;
+      }


Do not drop exact matches at window boundaries

When prefilter_window_bytes is Some(0), a valid exact cue match after byte 0 is skipped here. For example, a lazy regex foo with prefilter_any = ["foo"] scans the window containing just foo; the local regex match maps back to the window's start/end, so touches_internal_window_edge returns true and find_iter("xfoo") incorrectly returns no match even though the unbounded engine would match.

Useful? React with 👍 / 👎.

CC on behalf of @sok0

Agreed, implemented in 26def7f. Edge-touching window candidates now trigger full-context verification instead of being dropped, and exact internal-edge matches are covered by regression tests.

chatgpt-codex-connector · 2026-06-29T04:47:20Z

+      if !match_contains_prefilter_hit(start, end, &hits) {
+        continue;
+      }


Keep windowed matches near the prefilter hit

This check requires the regex match itself to contain the literal prefilter hit, but prefilter_any is a gate and prefilter_window_bytes is described as a radius around those hits. A contextual cue such as prefilter_any = ["Invoice"] with a date regex and a window large enough to include Invoice date 2026-06-29 will scan the date, then drop it because the date span does not contain Invoice, making the windowed mode miss valid nearby matches.

Useful? React with 👍 / 👎.

CC on behalf of @sok0

Agreed, implemented in 26def7f. Windowed mode now keeps matches inside the prefilter window rather than requiring the final match span to contain the cue, with a contextual date fixture and generated bounded/unbounded parity coverage.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 26def7fa61

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-29T05:04:52Z

+    }
+    LiteralPrefilter::Many(engine) => {
+      let packed = engine
+        .find_iter_packed_bytes(haystack)


Use overlapping hits for windowed prefilters

When a lazy regex uses prefilter_window_bytes, this non-overlapping Aho pass can miss overlapping cue occurrences, so valid matches may never be scanned. For example, with prefilter_any = ["aa"], radius 0, regex a, and haystack aaa, the prefilter reports only the first aa window (0..2) and misses the overlapping cue at 1..3, so the regex match at byte 2 is dropped even though the unwindowed lazy regex would return it.

Useful? React with 👍 / 👎.

CC on behalf of @sok0

Agreed, addressed in 9d1bafc. Windowed literal prefilters now collect overlapping cue hits for both single-literal and Aho-backed prefilters, and there is a regression comparing bounded and unbounded lazy regex results for the aa / aaa overlap case.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d1bafc647

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-29T05:22:25Z

+    let local = engine
+      .find_iter_packed_bytes(slice)
+      .map_err(|error| Error::BuildRegex(error.to_string()))?;


Verify zero-width context before dropping windowed matches

When a windowed lazy regex needs zero-width context outside the returned span, the truncated slice can produce no candidate at all, so the later touches_internal_window_edge fallback never runs. For example, regex (?<=x)foo with prefilter foo, radius 0, and haystack xfoo scans only foo and returns no local match, even though the full-context match foo is wholly inside the prefilter window. This makes find_iter/is_match false-negative for lookaround patterns unless the caller over-provisions the radius beyond the actual match span.

Useful? React with 👍 / 👎.

CC on behalf of @sok0

Agreed, addressed in c9ea954. Windowed lazy regex slots whose source contains lookaround now use full-context filtering over the prefilter windows, so zero-width context outside a radius-0 slice is preserved. Added a focused regression for (?<=x)foo with prefilter_any = ["foo"], radius 0, and xfoo, comparing bounded and unbounded behavior.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c9ea95431f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-29T05:32:04Z

+        .flatten()
+    });
+  let prefilter_window_needs_full_context = prefilter_window_bytes.is_some()
+    && values.iter().any(|pattern| has_lookaround(pattern));


Use full context for boundary-sensitive regex windows

When prefilter_window_bytes is set, this only forces full-haystack verification for lookaround patterns, but other zero-width assertions can also depend on bytes outside the scanned window. For example, a lazy regex \Bfoo with prefilter_any = ["foo"] and radius 0 should match xfoo; scanning only the foo slice produces no local candidate because \B is false at the slice start, so the later edge-touch fallback never runs and find_iter/is_match return a false negative.

Useful? React with 👍 / 👎.

jan-kubica added 6 commits June 28, 2026 00:10

perf: warm lazy regex slots in parallel

b928c4f

feat: expose text search build diagnostics

238a6a4

perf: inline small regex prefilters

a398e62

perf: parallelize long engine scans

28d307b

perf: instrument warm search slots

25174f4

perf: widen split literal chunks

26f44e1

gemini-code-assist Bot reviewed Jun 29, 2026

View reviewed changes

jan-kubica marked this pull request as ready for review June 29, 2026 04:43

chatgpt-codex-connector Bot reviewed Jun 29, 2026

View reviewed changes

fix: tighten prepared search windowing

26def7f

chatgpt-codex-connector Bot reviewed Jun 29, 2026

View reviewed changes

fix: include overlapping prefilter windows

9d1bafc

chatgpt-codex-connector Bot reviewed Jun 29, 2026

View reviewed changes

fix: keep lookaround context in regex windows

c9ea954

chatgpt-codex-connector Bot reviewed Jun 29, 2026

View reviewed changes

jan-kubica merged commit cb64e33 into main Jun 29, 2026
15 checks passed

jan-kubica deleted the codex/parallel-lazy-regex-warm branch June 29, 2026 05:32

github-actions Bot locked and limited conversation to collaborators Jun 29, 2026

Uh oh!

Conversation

jan-kubica commented Jun 29, 2026

Summary

Checks

Uh oh!

coderabbitai Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Load Imbalance in Parallel Warming

Uh oh!

jan-kubica Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Performance Bottleneck: Allocating the Entire Haystack for Unicode Case-Insensitive Search

Uh oh!

jan-kubica Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Unnecessary Unicode Case-Insensitive Search for ASCII Needles

Uh oh!

jan-kubica Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

jan-kubica Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

jan-kubica Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

jan-kubica Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

jan-kubica Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading