Make rlm subcalls parallel with thread safety and semaphore #136

Open: h4shk4t wants to merge 7 commits into alexzhang13:main from h4shk4t:parallel-rlm-subcalls

@h4shk4t h4shk4t commented Mar 11, 2026

Summary

Adds parallel execution support for rlm_query_batched, allowing child RLM subcalls to run concurrently instead of sequentially. This provides significant speedup when the parent model fans out multiple independent queries (e.g., answering 3 independent questions simultaneously).

Changes

rlm/core/rlm.py

  • New max_concurrent_subcalls parameter (default 1 = sequential, backward-compatible)
  • New event callbacks for live progress tracking: on_subcall_start, on_subcall_complete, on_iteration_start, on_iteration_complete (optional; I added these for easier debugging and can remove them if out of scope)
  • Thread-safe _subcall: added _state_lock to protect _cumulative_cost and error tracking when children run in parallel across threads
  • Child RLMs propagate concurrency settings and callbacks to their own children
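
As a rough illustration of the locking described above, the sketch below guards a shared cost counter and error list with a single lock while several worker threads report back. `CostTracker` and `record` are hypothetical names chosen for this example, not the PR's actual code; only the `_state_lock` idea comes from the description.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CostTracker:
    """Hypothetical stand-in for the parent RLM's shared bookkeeping."""

    def __init__(self):
        self._state_lock = threading.Lock()
        self.cumulative_cost = 0.0
        self.errors = []

    def record(self, cost, error=None):
        # The read-modify-write on shared state happens under the lock,
        # so concurrent children cannot interleave their updates.
        with self._state_lock:
            self.cumulative_cost += cost
            if error is not None:
                self.errors.append(error)


tracker = CostTracker()
with ThreadPoolExecutor(max_workers=4) as pool:
    # 100 simulated children, each reporting a cost of 0.5.
    list(pool.map(lambda _: tracker.record(0.5), range(100)))

print(tracker.cumulative_cost)
```

Without the lock, `+=` on a shared float is not guaranteed to be atomic across threads, which is presumably why the PR protects `_cumulative_cost` this way.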

rlm/environments/local_repl.py

  • New _rlm_query_batched_parallel: dispatches subcalls via a short-lived ThreadPoolExecutor, gathers them with as_completed, and returns results in original prompt order
  • Two-layer concurrency control:
      • Local: ThreadPoolExecutor(max_workers=max_concurrent_subcalls) bounds children per batch call
      • Global: process-wide threading.Semaphore(16) bounds total concurrent children across all depths, preventing thread/memory explosion with deep recursion
  • Thread-safe _pending_llm_calls appends via _calls_lock
  • Fixed _temp_cwd(): restores to self.original_cwd (set at init) instead of os.getcwd(), which could capture another thread's temp dir that later gets deleted ([Errno 2])
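
The dispatch pattern in the first bullet can be sketched as follows. This is a minimal sketch, not the PR's implementation: `run_batched` and `fake_subcall` are hypothetical names, and reporting a failed child as an `"Error: ..."` string in its slot is an assumption made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batched(prompts, subcall_fn, max_concurrent_subcalls=1):
    """Run subcall_fn over prompts and return results in prompt order."""
    # Sequential path: no pool for the default setting or a single prompt.
    if max_concurrent_subcalls <= 1 or len(prompts) <= 1:
        return [subcall_fn(p) for p in prompts]

    results = [None] * len(prompts)
    with ThreadPoolExecutor(max_workers=max_concurrent_subcalls) as pool:
        # Remember which prompt index each future belongs to.
        future_to_index = {
            pool.submit(subcall_fn, p): i for i, p in enumerate(prompts)
        }
        # as_completed yields in completion order; the index restores order.
        for future in as_completed(future_to_index):
            i = future_to_index[future]
            try:
                results[i] = future.result()
            except Exception as exc:
                # Assumption: a failed child is reported in place as a string.
                results[i] = f"Error: {exc}"
    return results


def fake_subcall(prompt):
    # Earlier prompts sleep longer, so they complete last.
    time.sleep(0.05 * (3 - int(prompt[-1])))
    if prompt == "q1":
        raise RuntimeError("boom")
    return prompt.upper()


answers = run_batched(["q0", "q1", "q2"], fake_subcall, max_concurrent_subcalls=3)
print(answers)
```

Writing each result into a preallocated slot keeps the output aligned with the input prompts even though the children finish in a different order.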

tests/test_rlm_query.py

  • TestRlmQueryBatchedParallel: 6 tests covering parallel speedup, result ordering, partial failure handling, pending call tracking, sequential fallback (max_concurrent=1), and single-prompt short-circuit
  • TestGlobalSemaphoreBounding: 2 tests verifying the global semaphore limits concurrent children and is shared across REPL instances

Design decisions

  • Sequential by default: max_concurrent_subcalls=1 has zero overhead — no locks, no thread pool, no semaphore. Existing behavior is unchanged.
  • No pool-within-pool deadlock: each depth level creates its own short-lived ThreadPoolExecutor. The global semaphore bounds work, not threads — workers acquire a slot, call subcall_fn, release.
  • No os.chdir lock: I explored Lock (deadlocks same-thread re-entry in sequential mode) and RLock (deadlocks cross-thread in parallel mode). The root cause was os.getcwd() capturing stale paths, fixed by always restoring to a stable directory.
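
The cwd fix in the last bullet can be sketched as a context manager that always restores to a directory captured once at construction. `ReplSketch` and `temp_cwd` are hypothetical names for this example; only the idea of restoring to `self.original_cwd` instead of a fresh `os.getcwd()` comes from the PR.

```python
import os
import tempfile
from contextlib import contextmanager


class ReplSketch:
    """Hypothetical minimal REPL that only models the cwd handling."""

    def __init__(self):
        # Captured once, before any temp directory is entered.
        self.original_cwd = os.getcwd()

    @contextmanager
    def temp_cwd(self):
        with tempfile.TemporaryDirectory() as tmp:
            try:
                os.chdir(tmp)
                yield tmp
            finally:
                # Restore to the stable directory, never to whatever
                # os.getcwd() returned at entry (possibly another
                # thread's temp dir that is about to be deleted).
                os.chdir(self.original_cwd)


repl = ReplSketch()
with repl.temp_cwd() as tmp:
    inside_ok = os.path.samefile(os.getcwd(), tmp)
after = os.getcwd()
print(inside_ok, after == repl.original_cwd)
```

Note that os.chdir is still process-wide, so this does not isolate threads from each other's working directory; it only guarantees that the restore target always exists.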

Usage

```python
from rlm import RLM

# Sequential (default): identical to previous behavior
rlm = RLM(
    backend="azure_openai",
    backend_kwargs={"model_name": "gpt-4o-mini"},
    max_depth=2,
)
result = rlm.completion("Answer these 3 questions using rlm_query_batched...")
# Children run one at a time: total ~30s for 3 subcalls @ ~10s each

# Parallel: just set max_concurrent_subcalls
rlm = RLM(
    backend="azure_openai",
    backend_kwargs={"model_name": "gpt-4o-mini"},
    max_depth=2,
    max_concurrent_subcalls=4,  # <-- this is the only change
)
result = rlm.completion("Answer these 3 questions using rlm_query_batched...")
```

@alexzhang13
Owner

I'll look at this when I get time, this is pretty important to get in IMO (assuming it's correct)

Ashutosh Srivastava and others added 3 commits March 17, 2026 22:51
…wd race, default=1

- Remove global subcall semaphore and related API (set_global_max_subcalls, etc.)
  per maintainer feedback — concurrency bounded by ThreadPoolExecutor(max_workers) only
- Fix _temp_cwd race condition: restore to self.original_cwd instead of os.getcwd()
  which could point to a deleted temp dir under concurrent execution ([Errno 2])
- Change default max_concurrent_subcalls from 4 to 1 (sequential, backward-compatible)
- Scope max_concurrent_subcalls env_kwargs pass to local environment only
- Append pending_llm_calls in prompt order for deterministic metadata
- Remove duplicate semaphore-dependent tests, fix default assertions (4→1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Made-with: Cursor
@h4shk4t h4shk4t force-pushed the parallel-rlm-subcalls branch from 6b60ec9 to dbb701f Compare April 6, 2026 19:57
@alexzhang13
Owner

alexzhang13 commented Apr 21, 2026

Fix linting issues first, otherwise lgtm!
