[core] Add opt-in swap accounting to memory monitor and scheduler#63793
[core] Add opt-in swap accounting to memory monitor and scheduler#63793preneond wants to merge 9 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces an opt-in configuration (RAY_count_swap_in_memory_monitor) to fold swap space into Ray's memory monitoring, node resource calculations, and dashboard metrics across both C++ and Python components. The review feedback highlights critical robustness improvements, specifically recommending exception handling when parsing memory.swap.max in C++ to prevent crashes on extremely large values, and wrapping psutil.swap_memory() calls in Python with try-except blocks to avoid failures in restricted or containerized environments.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
da2364f to
58ffe4c
Compare
|
Hi @preneond, thanks for the PR! I have a couple of question/comments.
|
Hi @Kunchd! On the opt-in question: I wanted to keep it opt-in because swap can slow Ray down, and on many hosts swap can be enabled for reasons unrelated to Ray (OS-level panic buffer, ...). An always-on default would silently push tasks into swap on those systems. Happy to flip the default to on if you think that's the better tradeoff or remove the flag completely On the resource isolation inconsistency: Good catch - the C++ user-slice OOM path doesn't include swap, so advertising RAM+swap on the Python side is inconsistent. I'd like to go with option 1, I will add swap accounting to resource isolation's monitoring system |
aedd43e to
24fd2cd
Compare
1efb1a0 to
6f723a7
Compare
6f723a7 to
3d73019
Compare
2a1a11b to
256ecdf
Compare
256ecdf to
bcdb1f0
Compare
Kunchd
left a comment
There was a problem hiding this comment.
Thanks for accounting for swap in resource isolation mode as well! In terms of whether we want to support swap out of the box, I agree that it should be opt in to avoid unintended behaviors.
Since supporting swap may introduce unexpected performance regression and edge case, could you provide a benchmark result of whether running with both resource isolation on and off while swap is enabled causes unintended ray OOM kills even when swap space is available in the PR description. Please provide the script for the benchmark if possible as well. Thanks!
| cgroup_swap_max = int(val) | ||
| if cgroup_swap_max > _INT64_MAX: | ||
| # Overflows int64; C++ treats this as unlimited and adds no swap. | ||
| return 0, 0 |
There was a problem hiding this comment.
Adding no swap when the limit is unlimited doesn't really make sense. Shouldn't we be returning the available host swap size in this case?
There was a problem hiding this comment.
Yeah, fair. I picked (0,0) to match the C++ side and to keep Python/C++ from diverging (cursor[bot] had flagged that earlier: #63793 (comment) and #63793 (comment)), but you're right that semantically that's still wrong — "unlimited" should mean host swap, not zero. I'll change both Python and C++ to fall back to host swap when memory.swap.max is "max" or overflows int64. Zero stays zero only when the kernel explicitly says swap is disabled.
There was a problem hiding this comment.
Addressed in 6ad3d1f407 (and the user-slice OOM threshold follow-up at 21aea5c9ac so Python, root-cgroup C++, and the user-slice path all agree).
| # against psutil itself raising on stripped containers / unsupported | ||
| # kernels — callers shouldn't have to know which exception types psutil | ||
| # may throw (RuntimeError, NotImplementedError, OSError, …). | ||
| try: |
There was a problem hiding this comment.
If the user has swap accounting enabled but we failed to fetch the swap size available, we should explicitly raise an exception. This is the startup path, and we want to fail fast if something is misconfigured instead of quietly failing.
There was a problem hiding this comment.
Makes sense. I'll drop the blanket except in the helper so psutil raises. The startup path (resource_and_label_spec.resolve) won't catch it, so misconfig fails fast. The dashboard reporter is a periodic loop so I'll keep a catch there but log the exception.
| if _COUNT_SWAP_IN_MEMORY_MONITOR: | ||
| try: | ||
| swap_total, _ = get_cgroup_aware_swap_memory() | ||
| except (OSError, NotImplementedError): |
There was a problem hiding this comment.
Would psutil or anything inside the call actually throw NotImplementedError?
There was a problem hiding this comment.
psutil can raise NotImplementedError on platforms where swap_memory() isn't supported, but the helper currently swallows everything so this except is dead code anyway. Going to drop the try/except entirely per your other comment — let it crash on startup if swap lookup fails.
| logger.warning( | ||
| "Failed to retrieve swap memory info; swap will " | ||
| "not be counted toward node memory capacity.", | ||
| exc_info=True, |
There was a problem hiding this comment.
We should log the exception's string representation as well.
There was a problem hiding this comment.
Addressed in 9d970dc632 — the original except here got removed (per your fail-fast comment), and str(e) is now used on the analogous catch in the dashboard reporter.
| try: | ||
| swap_total, swap_used = get_cgroup_aware_swap_memory() | ||
| total += swap_total | ||
| used += swap_used |
There was a problem hiding this comment.
This actually feels a little misleading. When swap space starts being used, users will observe a noticeable performance degradation. Instead of treating swap as extra memory in the metrics, we probably want to separate these out to a separate metric.
There was a problem hiding this comment.
Agreed. I'll revert _get_mem_usage to RAM-only and add a separate _get_swap_usage plus node_swap_used / node_swap_total / node_swap_utilization gauges and a swap field on the wire payload.
| MemoryUsageSnapshot memory_snapshot = MemoryMonitorUtils::TakeSystemMemoryUsageSnapshot( | ||
| cgroup_path, | ||
| MemoryMonitorUtils::kProcDirectory, | ||
| /*include_swap=*/false); |
There was a problem hiding this comment.
How does swap space interact with cgroup constraints? When memory usage approach a cgroup constraint, as the constraint get triggered before the memory spills to disk, or are we guaranteed that the memory spills before the constraint is triggered?
There was a problem hiding this comment.
memory.high is a soft throttle — crossing it triggers direct reclaim, which includes swapping anonymous pages out (subject to memory.swap.max) before the cgroup can keep allocating. So swap-out happens as part of hitting memory.high, not before or after it.
memory.max is the hard cap. On v2 it's RAM-only; swap is governed separately by memory.swap.max. We don't write memory.swap.max on the user slice so it inherits from the parent.
That's why include_swap=false here — we want the RAM-only number for the throttle, and the kernel uses available swap to absorb pressure before stalling. Will expand the inline comment to capture this.
…ited cursor[bot] caught a mismatch I introduced in the previous commit: GetMemoryThreshold added host swap to the OOM trigger for an unlimited user-slice swap.max, but TakeCgroupMemorySnapshot still left swap_max_bytes / swap_used_bytes at zero. TakeUserSliceMemoryUsageSnapshot fed those zeros into the per-tick used/total, so the trigger sat at RAM+host_swap while per-tick usage was RAM-only — the monitor would never see usage grow to the threshold and OOM kills would fire late or not at all. Apply the same fallback the threshold path uses: - "max" / non-numeric / int64-overflow → snapshot.swap_max_bytes = host SwapTotal (from GetHostSwapBytes). - swap.current is read whenever the cgroup has any swap budget (numeric non-zero OR unlimited). The kernel publishes swap.current regardless of the cap, so it gives the per-cgroup actual usage we want — host SwapTotal-SwapFree is host-wide and wouldn't isolate the user slice. - swap.max == 0 (swap explicitly disabled) keeps the old behavior: skip the swap.current read so a stale value can't surface as used. Plumbed proc_dir through TakeCgroupMemorySnapshot so tests can mock /proc/meminfo. Existing TestUserSliceUnlimitedSwapNotAddedToTotal renamed and flipped; new TestUserSliceOverflowSwapFallsBackToHostSwap covers the std::stoll out_of_range path. Closes: ray-project#63793 (comment) Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
Introduce the RAY_count_swap_in_memory_monitor config flag (default false) and wire it through MemoryMonitorUtils so the raylet threshold OOM monitor can include swap when computing total and used memory. * memory_monitor_utils.cc::GetCGroupMemoryBytes — cgroup v2 swap.max / swap.current added to total/used; cgroup v1 memsw counters replace the RAM-only counters when the flag is on. * memory_monitor_utils.cc::GetLinuxMemoryBytes — /proc/meminfo SwapTotal and SwapTotal-SwapFree folded into total/used. * swap.max parse safety: all-digit pre-check + std::stoll inside try/catch so kernel ULLONG_MAX / "max" sentinels do not crash the raylet, and swap.current is only consulted when swap.max > 0. Tests cover cgroup v2 swap, cgroup v1 memsw, /proc/meminfo swap, the parse-overflow guard, and the flag-off no-op for each path. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
Introduce ray._common.utils.get_cgroup_aware_swap_memory as the single
source of truth for swap totals seen by ResourceAndLabelSpec and the
dashboard reporter agent. The helper's invariant: once a cgroup branch
is taken, every returned field is cgroup-scoped — host psutil values
are never mixed into a cgroup result. cgroup v2 reads memory.swap.max
and memory.swap.current; cgroup v1 derives swap from
memory.memsw.{limit,usage}_in_bytes minus the RAM counters; the
no-cgroup branch falls back to psutil.swap_memory().
Fixes several divergences from the C++ raylet monitor's enforcement:
* Scheduler previously read host psutil.swap_memory().total directly,
so in containers it advertised cgroup_ram + host_swap while the C++
OOM killer only enforced the cgroup limit.
* Dashboard fell back to host swap when memory.swap.max was the literal
"max" sentinel. C++ adds no swap in that case; helper now returns 0.
* cgroup v1 memsw without a RAM limit double-counted RAM as swap.
Helper now returns 0 in that ambiguous case.
* Once a cgroup branch was committed, a read or parse error leaked host
swap through the except handler; both branches now return (0, 0).
* memory.swap.max values that overflow int64 (kernel ULLONG_MAX
sentinel) parsed via int() and were clamped to host.total. Helper now
treats them as unusable and returns 0, matching C++ stoll out_of_range.
* When memory.swap.max parses to 0 (swap disabled), the helper no
longer reads memory.swap.current — prevents impossible used > total.
* The host clamp `min(cgroup_swap_max, host.total)` was dropped from
both branches so totals mirror what the raylet OOM monitor enforces.
The dashboard reporter applies a `used = max(0, min(used, total))` clamp
mirroring MemoryMonitorUtils::GetCGroupMemoryBytes so the Node Memory
graph cannot report negative available or percent > 100%.
Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
…nitor
When --enable-resource-isolation is on, the active enforcement path is
ThresholdMemoryMonitor::IsResourceIsolationThresholdExceeded →
TakeUserSliceMemoryUsageSnapshot, which previously summed
user.anon + user.shmem + system.shmem and compared against host RAM —
ignoring memory.swap.* for both user and system cgroups. Combined with
the scheduler advertising RAM + cgroup swap, this caused the OOM killer
to enforce a lower ceiling than the cluster was scheduling against.
Changes:
* memory_monitor_interface.h: extend CgroupMemorySnapshot with
swap_used_bytes and swap_max_bytes, defaulted to 0 so flag-off paths
stay byte-identical to the pre-change behavior.
* memory_monitor_utils.cc: TakeCgroupMemorySnapshot now reads
memory.swap.{max,current} for the cgroup when
count_swap_in_memory_monitor is on, mirroring the parse safety the v2
branch of GetCGroupMemoryBytes already uses — all-digit check before
std::stoll, try/catch around overflow / invalid input, and skip
swap.current when swap.max is 0.
* TakeUserSliceMemoryUsageSnapshot folds the user slice's swap into
both used and total. By analogy with how system.anon is excluded
(only system.shmem is included for indeterminacy reasons), the system
slice's swap.{max,current} is NOT credited to the user-slice budget —
swap is per-cgroup deterministic, so system swap belongs to raylet,
not to user workers.
* TakeUserSliceMemoryUsageSnapshot now calls GetLinuxMemoryBytes with
include_swap=false so host-level /proc/meminfo swap is not
double-counted against the per-slice cgroup swap added above.
* GetLinuxMemoryBytes takes a new include_swap parameter (default true)
so existing callers preserve behavior; only the user-slice path opts
out.
* GetMemoryThreshold: under resource isolation, when the flag is on,
add the user cgroup's memory.swap.max to the resolved threshold. The
isolation override uses memory.high (a RAM-only kernel constraint)
as the base threshold, so without this addition the per-tick
used_bytes (which now includes user-slice swap) would be compared
against a RAM-only ceiling and trigger the OOM killer prematurely.
Same parse safety pattern as the v2 swap path; "max" / overflow /
read failure all degrade silently to adding nothing.
Tests mirror the existing v2 cases in memory_monitor_utils_test.cc:
swap added when flag is on, ignored when flag is off, "max" sentinel
not added, swap.max == 0 ignores a stale swap.current, missing swap
files degrade to the pre-flag behavior. The "swap added" test mocks
non-zero system-cgroup swap and asserts it is NOT folded into used or
total. Also pins the flag off in the pre-existing
TestTakeUserSliceMemoryUsageSnapshotValidPaths… test so its assertion
against TakeSystemMemoryUsageSnapshot's host total is robust to test
ordering — RayConfig is process-global.
Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
When cgroup v2 `memory.swap.max` is the kernel's "unlimited" sentinel
("max", non-numeric, or a numeric value that overflows int64), the cgroup
imposes no swap cap, so the practical limit is whatever the host has —
not zero. Both the Python helper and the C++ memory monitor now read
host swap (from psutil / /proc/meminfo) in this case.
`memory.swap.max == 0` (swap explicitly disabled) still returns 0 — the
kernel is saying "no swap for this cgroup", which is distinct from
"unlimited".
Extends test_utils.py and the C++ gtest with parametrized coverage for
"max", garbage, empty, and overflow values. Adds a string overload of
MockCgroupv2Swap so tests can inject content std::stoll rejects.
Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
When RAY_count_swap_in_memory_monitor=1 is set, the operator has explicitly opted into swap accounting — silently swallowing the swap lookup failure means scheduling and the OOM monitor behave as if swap is off, which is the exact opposite of what was requested. - _get_host_swap_memory no longer wraps psutil.swap_memory() in a blanket except; psutil's native exception (OSError / NotImplementedError / RuntimeError) propagates. - resource_and_label_spec.resolve() drops its try/except. Misconfig crashes raylet startup with the native error. - Cgroup branches keep their (OSError, ValueError) handling — that protects the "never mix host swap into a cgroup result" invariant. - Periodic callers (e.g. dashboard reporter, addressed next) catch their own exceptions; they shouldn't take down metrics on transient failure. Test flipped from "silently logs warning" to "raises OSError"; adds a companion test that flag-off path never calls the swap helper. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
Folding swap into the Node Memory graph hid the perf-degradation signal operators need once swap actually starts getting used. Split it out: - _get_mem_usage reverts to RAM-only. - New _get_swap_usage returns (total, used, percent) using the same cgroup-aware helper. Reads RAY_count_swap_in_memory_monitor; returns zeros when off. Wraps exceptions with str(e) in the log line per Kunchd's review feedback — periodic loop, so log-and-continue beats taking down the metrics path. - Wire payload gets a new "swap" field alongside "mem"/"host_mem"/ "cgroup_mem". - New Prometheus gauges node_swap_used / node_swap_total / node_swap_utilization, emitted only when swap_total > 0 so we don't add a constant-zero series to dashboards. Trade: the OOM killer and scheduler `memory` resource still count RAM+swap when the flag is on, so the kill threshold no longer matches the Node Memory graph 1:1. The swap-specific graph carries that signal instead. Frontend panel left as a follow-up; this commit ships the Prometheus + wire-payload contract. STATS_TEMPLATE gets the new "swap" key and tests cover both the "flag off / no swap" (no records) and "flag on with non-zero swap" (all three gauges emitted) cases. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
The previous comment said memory.high was RAM-only and that swap must not be folded in, but didn't explain why or how swap actually interacts with the throttle. Expand to capture the kernel ordering Kunchd asked about in review: - memory.high is the soft-throttle line. Crossing it triggers direct reclaim, which itself includes swapping anonymous pages out (subject to memory.swap.max). Swap-out happens AS PART OF hitting memory.high, not before or after. - memory.max is the hard cap. Under cgroup v2 it's RAM-only; swap is governed by a separate memory.swap.max, which we don't write on the user slice — so it inherits from the parent (typically unlimited). - include_swap=false here is what lets the kernel use available swap to absorb pressure before stalling. Inflating total_bytes by swap would push memory.high past host RAM and silently disable the throttle. Comment-only change; no behavior diff. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
The earlier commit changed GetCGroupMemoryBytes (root cgroup) and the Python helper to fall back to host swap when memory.swap.max is "max" or overflows int64. GetMemoryThreshold's resource-isolation branch read the *user* cgroup's swap.max and still treated "max" / overflow as "add nothing" — meaning the scheduling capacity and dashboard advertised RAM+swap while the user-slice OOM monitor enforced RAM only. Mirror the same semantics: when the user-slice swap.max is the "unlimited" sentinel, the user slice inherits the parent budget (typically unlimited at the root), so fall back to host swap. Python, the root-cgroup C++ path, and the user-slice C++ path now all agree. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
…ited cursor[bot] caught a mismatch I introduced in the previous commit: GetMemoryThreshold added host swap to the OOM trigger for an unlimited user-slice swap.max, but TakeCgroupMemorySnapshot still left swap_max_bytes / swap_used_bytes at zero. TakeUserSliceMemoryUsageSnapshot fed those zeros into the per-tick used/total, so the trigger sat at RAM+host_swap while per-tick usage was RAM-only — the monitor would never see usage grow to the threshold and OOM kills would fire late or not at all. Apply the same fallback the threshold path uses: - "max" / non-numeric / int64-overflow → snapshot.swap_max_bytes = host SwapTotal (from GetHostSwapBytes). - swap.current is read whenever the cgroup has any swap budget (numeric non-zero OR unlimited). The kernel publishes swap.current regardless of the cap, so it gives the per-cgroup actual usage we want — host SwapTotal-SwapFree is host-wide and wouldn't isolate the user slice. - swap.max == 0 (swap explicitly disabled) keeps the old behavior: skip the swap.current read so a stale value can't surface as used. Plumbed proc_dir through TakeCgroupMemorySnapshot so tests can mock /proc/meminfo. Existing TestUserSliceUnlimitedSwapNotAddedToTotal renamed and flipped; new TestUserSliceOverflowSwapFallsBackToHostSwap covers the std::stoll out_of_range path. Closes: ray-project#63793 (comment) Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
7d8ddf0 to
fc20c12
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit fc20c12. Configure here.
| # the OOM monitor enforces. | ||
| swap_total = max(0, memsw_limit - ram_limit) | ||
| swap_used = 0 if ram_usage is None else max(0, memsw_usage - ram_usage) | ||
| return swap_total, swap_used |
There was a problem hiding this comment.
cgroup v1 memsw accounting mismatch
Medium Severity
With RAY_count_swap_in_memory_monitor enabled on cgroup v1, the Python scheduler helper treats memsw as swap-only (via memsw − ram) added to RAM-based capacity, while the C++ OOM path switches totals and usage to combined memsw counters. Missing memory.limit_in_bytes yields (0, 0) swap in Python but full memsw limits in C++.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit fc20c12. Configure here.
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Isolation threshold skips swap lookup
Medium Severity
Under resource isolation, per-tick user-slice usage always folds in user cgroup swap via direct memory.swap.* reads, but GetMemoryThreshold only extends memory.high with swap when GetUserCgroupConstraintValue succeeds. If that lookup fails while swap files remain readable, used includes swap while the kill threshold does not.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit fc20c12. Configure here.


Why
Ray's OOM monitor kills tasks based on RAM alone, while the Linux OOM killer only fires when both RAM and swap are exhausted. On nodes provisioned with swap as overflow, Ray kills workers long before the kernel would. The scheduler's
memoryresource and the dashboard's Node Memory graph have the same RAM-only blind spot, and under--enable-resource-isolationthe user-slice threshold ignores swap too — so scheduling, reporting, and enforcement all disagree.What
Adds opt-in
RAY_count_swap_in_memory_monitor(defaultfalse). When on, swap is folded into:memory.memsw.*or cgroup v2memory.swap.{max,current};/proc/meminfoswap on bare metal.memoryresource and dashboard Node Memory — via sharedget_cgroup_aware_swap_memory()so containers see cgroup-scoped swap, not host swap.TakeUserSliceMemoryUsageSnapshotnow readsuser.memory.swap.{max,current}and folds them into both used and total. Without this,ray statusadvertises RAM + swap while the monitor enforces RAM only.memory.high(RAM-only kernel constraint) is not inflated by swap —GetCGroupMemoryBytestakesinclude_swapso the cgroup-manager path stays RAM-only. Flag-off is the existing behavior.Commits
Test plan
"max"sentinel, missing files, user-slice swap fold, flag-off pinning.--memorybypass, dashboard_get_mem_usageclampsused ≤ total.RAY_count_swap_in_memory_monitor=1 ray start --head --enable-resource-isolation—ray status, dashboard, and OOM threshold all reflect RAM + cgroup swap.🤖 Generated with Claude Code