
Fix recursion limit check and parallel execution bugs #13

Open
stevei101 wants to merge 1 commit into develop from fix-recursion-and-failfast-6220343659135641513

Conversation

@stevei101
Contributor

This PR fixes two bugs:

  1. Recursion Limit: The recursion limit check was performed before checking if the current node is END. This caused graphs that finished exactly on the limit to fail with RecursionLimit error. The fix moves the END check to the beginning of the loop.
  2. Parallel Execution: The FailFast and WaitN strategies in ParallelSubgraphs were implemented using sequential awaits, meaning they would wait for tasks in the order they were added, defeating the purpose of "fail fast" or "wait for first N". The fix uses select_all to process task completions as they happen.

PR created automatically by Jules for task 6220343659135641513 started by @stevei101

- In `src/runner.rs`, check for END node before recursion limit to allow finishing on the last iteration.
- In `src/orchestration/parallel.rs`, implement `FailFast` and `WaitN` strategies correctly using `futures::future::select_all` to wait for tasks as they complete instead of sequentially.
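The loop-ordering fix in `src/runner.rs` can be sketched with a minimal stand-in. Everything here (the `END` sentinel value, the `run` helper, and representing the graph as a flat path) is illustrative rather than the crate's actual API; it only shows why checking for END before the iteration limit lets a graph finish exactly at the limit.

```rust
// Hypothetical sketch: END sentinel and a path-as-graph stand-in,
// not the crate's real API.
const END: &str = "__end__";

// Walk `path` one node per iteration, enforcing `max_iterations`.
fn run(path: &[&str], max_iterations: usize) -> Result<usize, String> {
    let mut iterations = 0;
    let mut idx = 0;
    loop {
        let current_node = path[idx];
        // The fix: check for END *before* the limit, so reaching END
        // exactly at max_iterations is a success.
        if current_node == END {
            return Ok(iterations);
        }
        if iterations >= max_iterations {
            return Err("RecursionLimit".to_string());
        }
        iterations += 1;
        idx += 1;
    }
}

fn main() {
    // Exactly 3 steps then END, with the limit set to 3: succeeds.
    assert_eq!(run(&["a", "b", "c", END], 3), Ok(3));
    // One step too many still trips the limit.
    assert!(run(&["a", "b", "c", "d", END], 3).is_err());
    println!("ok");
}
```

With the checks in the old order, the first assertion would fail: the limit would trip on the iteration that was about to observe END.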

Co-authored-by: stevei101 <287896+stevei101@users.noreply.github.com>
@google-labs-jules

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

Copilot AI left a comment

Pull request overview

This PR addresses two distinct bugs in the oxidizedgraph execution engine: a recursion limit check ordering issue and inefficient parallel execution strategies.

Changes:

  • Reordered END node check to occur before iteration limit check in GraphRunner, preventing false RecursionLimit errors when graphs complete exactly at the max iteration boundary
  • Refactored FailFast and WaitN join strategies in ParallelSubgraphs to use futures::future::select_all instead of sequential awaits, enabling proper concurrent task processing based on completion order

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/runner.rs | Moved the END node check before the iteration limit check to fix the edge case where graphs finishing exactly at max_iterations would incorrectly fail |
| src/orchestration/parallel.rs | Replaced sequential await loops with select_all-based concurrent processing for the FailFast and WaitN strategies to handle task completion order correctly |


Comment thread src/runner.rs
Comment on lines +118 to +125
// Check for END node
if current_node == transitions::END {
info!(iterations = iterations, "Graph execution completed");
let guard = state
.read()
.map_err(|e| RuntimeError::InvalidState(e.to_string()))?;
return Ok(guard.clone());
}
Copilot AI (Feb 16, 2026)

This fix correctly addresses the recursion limit bug for the basic GraphRunner, but the same bug exists in two other runner implementations that were not fixed:

  1. src/checkpoint/runner.rs (lines 193-227)
  2. src/events/runner.rs (lines 185-236)

Both of these files still check the iteration limit BEFORE checking for the END node, meaning they will incorrectly fail with a RecursionLimit error when a graph finishes exactly on the max_iterations limit. The same reordering fix should be applied to those files to ensure consistency across all runner implementations.

Copilot uses AI. Check for mistakes.
Comment thread src/runner.rs
Comment on lines +118 to +125
// Check for END node
if current_node == transitions::END {
info!(iterations = iterations, "Graph execution completed");
let guard = state
.read()
.map_err(|e| RuntimeError::InvalidState(e.to_string()))?;
return Ok(guard.clone());
}
Copilot AI (Feb 16, 2026)

While the existing test_max_iterations verifies that exceeding the limit causes an error, there should be a test case that specifically validates the bug fix: a graph that finishes exactly at the max_iterations limit should succeed, not fail with a RecursionLimit error. This would ensure the END check correctly happens before the iteration limit check.

Consider adding a test like:

  • Set max_iterations to 5
  • Create a graph that executes exactly 5 iterations then reaches END
  • Verify it completes successfully without a RecursionLimit error

Comment thread src/orchestration/parallel.rs
Comment on lines 240 to 275
JoinStrategy::FailFast => {
    if handles.is_empty() {
        return Vec::new();
    }

    // Wrap each task so a panic is converted into a Failed result.
    let mut futures: Vec<_> = handles
        .into_iter()
        .map(|(id, h)| {
            let id_clone = id.clone();
            Box::pin(async move {
                let result = h.await.unwrap_or_else(|e| SubgraphResult::Failed {
                    subgraph_id: id_clone.clone(),
                    error: crate::error::RuntimeError::InvalidState(format!(
                        "Task panicked: {}",
                        e
                    )),
                });
                (id_clone, result)
            })
        })
        .collect();

    let mut results = Vec::new();
    // Process tasks in completion order; stop at the first failure.
    while !futures.is_empty() {
        let ((id, result), _, remaining) = futures::future::select_all(futures).await;
        futures = remaining;

        let is_failed = result.is_failed();
        results.push((id, result));

        if is_failed {
            return results;
        }
    }
    results
}
Copilot AI (Feb 16, 2026)

The FailFast strategy has been significantly refactored to use select_all for proper concurrent processing, but there are no tests validating this behavior. Tests should verify:

  1. That FailFast returns immediately when any subgraph fails (not waiting for all to complete)
  2. That results are collected in completion order, not spawn order
  3. That the first failure triggers an early return

This is important to ensure the bug fix actually resolves the issue described in the PR where sequential awaits defeated the purpose of fail-fast.
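As a rough illustration of the completion-order behavior the review asks to test, here is a self-contained sketch. It uses std threads and an mpsc channel as a stand-in for tokio tasks and `futures::future::select_all`; the `fail_fast_demo` name and the `Result`-based stand-in for `SubgraphResult` are hypothetical, not the crate's API.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Stand-in for SubgraphResult: Ok(id) = Completed, Err(id) = Failed.
fn fail_fast_demo() -> Vec<Result<&'static str, &'static str>> {
    let (tx, rx) = mpsc::channel();
    // Spawn order: the slow success first, the fast failure second.
    // Sequential awaits would block ~300 ms on "slow-ok"; collecting
    // in completion order surfaces the failure almost immediately.
    for (id, delay_ms, ok) in [("slow-ok", 300u64, true), ("fast-fail", 10, false)] {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(delay_ms));
            let _ = tx.send(if ok { Ok(id) } else { Err(id) });
        });
    }
    drop(tx);

    let mut results = Vec::new();
    for res in rx {
        let failed = res.is_err();
        results.push(res);
        if failed {
            break; // fail fast: stop on the first failure
        }
    }
    results
}

fn main() {
    // "fast-fail" finishes first even though it was spawned second,
    // and the slow task's result is never waited for.
    assert_eq!(fail_fast_demo(), vec![Err("fast-fail")]);
    println!("ok");
}
```

A real test against ParallelSubgraphs would follow the same shape: a slow succeeding subgraph spawned before a fast failing one, asserting that the returned results contain only the failure.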

Comment thread src/orchestration/parallel.rs
Comment on lines 277 to 316
JoinStrategy::WaitN(n) => {
    if handles.is_empty() {
        return Vec::new();
    }

    // Wrap each task so a panic is converted into a Failed result.
    let mut futures: Vec<_> = handles
        .into_iter()
        .map(|(id, h)| {
            let id_clone = id.clone();
            Box::pin(async move {
                let result = h.await.unwrap_or_else(|e| SubgraphResult::Failed {
                    subgraph_id: id_clone.clone(),
                    error: crate::error::RuntimeError::InvalidState(format!(
                        "Task panicked: {}",
                        e
                    )),
                });
                (id_clone, result)
            })
        })
        .collect();

    let mut results = Vec::new();
    let mut completed = 0;

    // Process tasks in completion order; stop once N have completed
    // successfully (failed results are recorded but do not count).
    while !futures.is_empty() {
        let ((id, result), _, remaining) = futures::future::select_all(futures).await;
        futures = remaining;

        if result.is_completed() {
            completed += 1;
        }
        results.push((id, result));

        if completed >= *n {
            return results;
        }
    }
    results
}
Copilot AI (Feb 16, 2026)

The WaitN strategy has been significantly refactored to use select_all for proper concurrent processing, but there are no tests validating this behavior. Tests should verify:

  1. That WaitN returns after exactly N successful completions (not waiting for all)
  2. That results are collected in completion order, not spawn order
  3. That failed subgraphs don't count toward the N completed threshold
  4. Edge cases like N greater than the number of subgraphs

This is important to ensure the bug fix resolves the issue where sequential awaits would wait for tasks in spawn order rather than completion order.
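The WaitN semantics the review wants covered can likewise be sketched with std threads and an mpsc channel standing in for tokio tasks and `select_all`; the `wait_n_demo` name and the `Result`-based stand-in for `SubgraphResult` are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Stand-in for SubgraphResult: Ok(id) = Completed, Err(id) = Failed.
// Return once `n` tasks have succeeded; failures are recorded but do
// not count toward the threshold.
fn wait_n_demo(n: usize) -> Vec<Result<&'static str, &'static str>> {
    let (tx, rx) = mpsc::channel();
    for (id, delay_ms, ok) in [
        ("a", 10u64, true),
        ("b", 80, false),
        ("c", 160, true),
        ("d", 500, true),
    ] {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(delay_ms));
            let _ = tx.send(if ok { Ok(id) } else { Err(id) });
        });
    }
    drop(tx);

    let mut results = Vec::new();
    let mut completed = 0;
    for res in rx {
        if res.is_ok() {
            completed += 1;
        }
        results.push(res);
        if completed >= n {
            break; // N successes reached; stop waiting for the rest
        }
    }
    results
}

fn main() {
    // Two successes ("a", "c") arrive before "d"; the failure "b" is
    // recorded in between but does not count toward N = 2.
    assert_eq!(wait_n_demo(2), vec![Ok("a"), Err("b"), Ok("c")]);
    println!("ok");
}
```

The same shape also exercises the reviewer's edge case: calling the demo with n larger than the number of spawned tasks simply drains every result and returns them all.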
