perf: improve prepared search diagnostics and warming#75
Conversation
|
Warning Review limit reached
Next review available in: 23 minutes Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable usage-based reviews in Billing to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information, and refer to the rate limits docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces several enhancements to the text search engine, including parallel find capabilities, parallel lazy regex warming, inline literal prefilters, and detailed build and find statistics. It also adds a byte-windowed prefilter search to restrict regex matching to a radius around literal hits. The review feedback highlights three key areas for improvement: resolving a load imbalance in parallel warming by chunking only uninitialized lazy engines, eliminating a performance bottleneck in Unicode case-insensitive search by avoiding full-haystack allocations, and optimizing inline prefilter matching to use ASCII case-insensitive search whenever the needle is ASCII.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| pub fn warm_lazy_regex(&self) -> Result<()> { | ||
| for engine in &self.engines { | ||
| warm_engine_lazy_regex(engine)?; | ||
| } | ||
| Ok(()) | ||
| let lazy_count = self | ||
| .engines | ||
| .iter() | ||
| .filter(|engine| engine_has_uninitialized_lazy_regex(engine)) | ||
| .count(); | ||
| if lazy_count < PARALLEL_LAZY_REGEX_WARM_MIN_SLOTS { | ||
| return warm_engines_lazy_regex(&self.engines); | ||
| } | ||
|
|
||
| let workers = std::thread::available_parallelism() | ||
| .map_or(1, usize::from) | ||
| .min(lazy_count); | ||
| if workers <= 1 { | ||
| return warm_engines_lazy_regex(&self.engines); | ||
| } | ||
|
|
||
| let chunk_size = self.engines.len().div_ceil(workers); | ||
| std::thread::scope(|scope| { | ||
| let mut handles = Vec::with_capacity(workers); | ||
| for chunk in self.engines.chunks(chunk_size) { | ||
| handles.push(scope.spawn(move || warm_engines_lazy_regex(chunk))); | ||
| } | ||
| for handle in handles { | ||
| handle.join().map_err(|_| { | ||
| Error::BuildRegex(String::from("Lazy regex warm-up panicked")) | ||
| })??; | ||
| } | ||
| Ok(()) | ||
| }) | ||
| } |
There was a problem hiding this comment.
Load Imbalance in Parallel Warming
Currently, chunk_size is calculated based on the total number of engines (self.engines.len()), but the loop only warms up uninitialized lazy regexes. If the uninitialized lazy regexes are clustered together (e.g., at the beginning of the engines list), one thread will end up doing all the warming sequentially while other threads do nothing.
To ensure optimal load balancing, we should chunk only the collected lazy_engines that actually need warming. Additionally, we can avoid iterating over all engines again in the sequential fallback paths by directly warming the collected lazy_engines.
pub fn warm_lazy_regex(&self) -> Result<()> {
let lazy_engines: Vec<&EngineSlot> = self
.engines
.iter()
.filter(|engine| engine_has_uninitialized_lazy_regex(engine))
.collect();
let lazy_count = lazy_engines.len();
if lazy_count < PARALLEL_LAZY_REGEX_WARM_MIN_SLOTS {
for engine in lazy_engines {
warm_engine_lazy_regex(engine)?;
}
return Ok(());
}
let workers = std::thread::available_parallelism()
.map_or(1, usize::from)
.min(lazy_count);
if workers <= 1 {
for engine in lazy_engines {
warm_engine_lazy_regex(engine)?;
}
return Ok(());
}
let chunk_size = lazy_count.div_ceil(workers);
std::thread::scope(|scope| {
let mut handles = Vec::with_capacity(workers);
for chunk in lazy_engines.chunks(chunk_size) {
handles.push(scope.spawn(move || {
for engine in chunk {
warm_engine_lazy_regex(engine)?;
}
Ok::<(), Error>(())
}));
}
for handle in handles {
handle.join().map_err(|_| {
Error::BuildRegex(String::from("Lazy regex warm-up panicked"))
})??;
}
Ok(())
})
}There was a problem hiding this comment.
CC on behalf of @sok0
Agreed, implemented in 26def7f. warm_lazy_regex now collects the uninitialized lazy regex slots first, warms only those slots, and chunks that smaller set for better load balancing.
| fn contains_unicode_case_insensitive(haystack: &str, needle: &str) -> bool { | ||
| let needle_lower = needle.to_lowercase(); | ||
| let haystack_lower = haystack.to_lowercase(); | ||
| if haystack_lower.contains(&needle_lower) { | ||
| return true; | ||
| } | ||
|
|
||
| let needle_upper = needle.to_uppercase(); | ||
| if needle_upper == needle_lower { | ||
| return false; | ||
| } | ||
| haystack.to_uppercase().contains(&needle_upper) | ||
| } |
There was a problem hiding this comment.
Performance Bottleneck: Allocating the Entire Haystack for Unicode Case-Insensitive Search
Calling to_lowercase() and to_uppercase() on the entire haystack allocates new strings of the same size as the haystack. If the haystack is large (e.g., megabytes of text), this will cause significant memory allocations and CPU overhead, defeating the purpose of a fast inline prefilter.
We can implement a completely allocation-free case-insensitive substring search for the haystack by iterating over its character indices and performing character-by-character case-folded comparisons.
fn contains_unicode_case_insensitive(haystack: &str, needle: &str) -> bool {
if needle.is_empty() {
return true;
}
let needle_lower: Vec<char> = needle.chars().flat_map(|c| c.to_lowercase()).collect();
let needle_upper: Vec<char> = needle.chars().flat_map(|c| c.to_uppercase()).collect();
let has_upper = needle_lower != needle_upper;
for (start_idx, _) in haystack.char_indices() {
let mut haystack_iter = haystack[start_idx..].chars().flat_map(|c| c.to_lowercase());
let mut matched = true;
for &n_char in &needle_lower {
if haystack_iter.next() != Some(n_char) {
matched = false;
break;
}
}
if matched {
return true;
}
if has_upper {
let mut haystack_iter = haystack[start_idx..].chars().flat_map(|c| c.to_uppercase());
let mut matched = true;
for &n_char in &needle_upper {
if haystack_iter.next() != Some(n_char) {
matched = false;
break;
}
}
if matched {
return true;
}
}
}
false
}There was a problem hiding this comment.
CC on behalf of @sok0
Agreed, implemented in 26def7f. The Unicode case-insensitive inline prefilter now compares folded character streams without allocating a full folded haystack, with regression coverage for Unicode case folds.
| fn inline_literal_prefilter_matches( | ||
| haystack: &str, | ||
| needle: &str, | ||
| case_insensitive: bool, | ||
| ) -> bool { | ||
| if haystack.contains(needle) { | ||
| return true; | ||
| } | ||
| if !case_insensitive { | ||
| return false; | ||
| } | ||
| if haystack.is_ascii() && needle.is_ascii() { | ||
| return contains_ignore_ascii_case(haystack.as_bytes(), needle.as_bytes()); | ||
| } | ||
| contains_unicode_case_insensitive(haystack, needle) | ||
| } |
There was a problem hiding this comment.
Unnecessary Unicode Case-Insensitive Search for ASCII Needles
Since UTF-8 is a self-synchronizing code, any ASCII byte sequence can never partially match a multi-byte UTF-8 character. Therefore, if the needle is ASCII, it is 100% safe and correct to use contains_ignore_ascii_case even if the haystack is non-ASCII.
This optimization avoids falling back to the expensive, allocating contains_unicode_case_insensitive path when the haystack contains Unicode characters but the prefilter needle is ASCII.
| fn inline_literal_prefilter_matches( | |
| haystack: &str, | |
| needle: &str, | |
| case_insensitive: bool, | |
| ) -> bool { | |
| if haystack.contains(needle) { | |
| return true; | |
| } | |
| if !case_insensitive { | |
| return false; | |
| } | |
| if haystack.is_ascii() && needle.is_ascii() { | |
| return contains_ignore_ascii_case(haystack.as_bytes(), needle.as_bytes()); | |
| } | |
| contains_unicode_case_insensitive(haystack, needle) | |
| } | |
| fn inline_literal_prefilter_matches( | |
| haystack: &str, | |
| needle: &str, | |
| case_insensitive: bool, | |
| ) -> bool { | |
| if haystack.contains(needle) { | |
| return true; | |
| } | |
| if !case_insensitive { | |
| return false; | |
| } | |
| if needle.is_ascii() { | |
| return contains_ignore_ascii_case(haystack.as_bytes(), needle.as_bytes()); | |
| } | |
| contains_unicode_case_insensitive(haystack, needle) | |
| } |
There was a problem hiding this comment.
CC on behalf of @sok0
Implemented with an adjustment in 26def7f: ASCII needles now use a byte-level case-insensitive scan as a fast positive path. For non-ASCII haystacks, a miss still falls through to Unicode folding so regex case-fold semantics are preserved, e.g. ſ matching s.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 26f44e1586
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if touches_internal_window_edge(window, haystack.len(), start, end) { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Do not drop exact matches at window boundaries
When prefilter_window_bytes is Some(0), a valid exact cue match after byte 0 is skipped here. For example, a lazy regex foo with prefilter_any = ["foo"] scans the window containing just foo; the local regex match maps back to the window's start/end, so touches_internal_window_edge returns true and find_iter("xfoo") incorrectly returns no match even though the unbounded engine would match.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
CC on behalf of @sok0
Agreed, implemented in 26def7f. Edge-touching window candidates now trigger full-context verification instead of being dropped, and exact internal-edge matches are covered by regression tests.
| if !match_contains_prefilter_hit(start, end, &hits) { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Keep windowed matches near the prefilter hit
This check requires the regex match itself to contain the literal prefilter hit, but prefilter_any is a gate and prefilter_window_bytes is described as a radius around those hits. A contextual cue such as prefilter_any = ["Invoice"] with a date regex and a window large enough to include Invoice date 2026-06-29 will scan the date, then drop it because the date span does not contain Invoice, making the windowed mode miss valid nearby matches.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
CC on behalf of @sok0
Agreed, implemented in 26def7f. Windowed mode now keeps matches inside the prefilter window rather than requiring the final match span to contain the cue, with a contextual date fixture and generated bounded/unbounded parity coverage.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 26def7fa61
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| LiteralPrefilter::Many(engine) => { | ||
| let packed = engine | ||
| .find_iter_packed_bytes(haystack) |
There was a problem hiding this comment.
Use overlapping hits for windowed prefilters
When a lazy regex uses prefilter_window_bytes, this non-overlapping Aho pass can miss overlapping cue occurrences, so valid matches may never be scanned. For example, with prefilter_any = ["aa"], radius 0, regex a, and haystack aaa, the prefilter reports only the first aa window (0..2) and misses the overlapping cue at 1..3, so the regex match at byte 2 is dropped even though the unwindowed lazy regex would return it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
CC on behalf of @sok0
Agreed, addressed in 9d1bafc. Windowed literal prefilters now collect overlapping cue hits for both single-literal and Aho-backed prefilters, and there is a regression comparing bounded and unbounded lazy regex results for the aa / aaa overlap case.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9d1bafc647
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let local = engine | ||
| .find_iter_packed_bytes(slice) | ||
| .map_err(|error| Error::BuildRegex(error.to_string()))?; |
There was a problem hiding this comment.
Verify zero-width context before dropping windowed matches
When a windowed lazy regex needs zero-width context outside the returned span, the truncated slice can produce no candidate at all, so the later touches_internal_window_edge fallback never runs. For example, regex (?<=x)foo with prefilter foo, radius 0, and haystack xfoo scans only foo and returns no local match, even though the full-context match foo is wholly inside the prefilter window. This makes find_iter/is_match false-negative for lookaround patterns unless the caller over-provisions the radius beyond the actual match span.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
CC on behalf of @sok0
Agreed, addressed in c9ea954. Windowed lazy regex slots whose source contains lookaround now use full-context filtering over the prefilter windows, so zero-width context outside a radius-0 slice is preserved. Added a focused regression for (?<=x)foo with prefilter_any = ["foo"], radius 0, and xfoo, comparing bounded and unbounded behavior.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c9ea95431f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .flatten() | ||
| }); | ||
| let prefilter_window_needs_full_context = prefilter_window_bytes.is_some() | ||
| && values.iter().any(|pattern| has_lookaround(pattern)); |
There was a problem hiding this comment.
Use full context for boundary-sensitive regex windows
When prefilter_window_bytes is set, this only forces full-haystack verification for lookaround patterns, but other zero-width assertions can also depend on bytes outside the scanned window. For example, a lazy regex \Bfoo with prefilter_any = ["foo"] and radius 0 should match xfoo; scanning only the foo slice produces no local candidate because \B is false at the slice start, so the later edge-touch fallback never runs and find_iter/is_match return a false negative.
Useful? React with 👍 / 👎.
CC on behalf of @sok0
Summary
Checks