[core] Add opt-in swap accounting to memory monitor and scheduler by preneond · Pull Request #63793 · ray-project/ray

preneond · 2026-06-02T09:14:52Z

Why

Ray's OOM monitor kills tasks based on RAM alone, while the Linux OOM killer only fires when both RAM and swap are exhausted. On nodes provisioned with swap as overflow, Ray kills workers long before the kernel would. The scheduler's memory resource and the dashboard's Node Memory graph have the same RAM-only blind spot, and under --enable-resource-isolation the user-slice threshold ignores swap too — so scheduling, reporting, and enforcement all disagree.

What

Adds opt-in RAY_count_swap_in_memory_monitor (default false). When on, swap is folded into:

C++ memory monitor — cgroup v1 memory.memsw.* or cgroup v2 memory.swap.{max,current}; /proc/meminfo swap on bare metal.
Scheduler memory resource and dashboard Node Memory — via shared get_cgroup_aware_swap_memory() so containers see cgroup-scoped swap, not host swap.
Resource-isolation user-slice OOM monitor — TakeUserSliceMemoryUsageSnapshot now reads user.memory.swap.{max,current} and folds them into both used and total. Without this, ray status advertises RAM + swap while the monitor enforces RAM only.

memory.high (RAM-only kernel constraint) is not inflated by swap — GetCGroupMemoryBytes takes include_swap so the cgroup-manager path stays RAM-only. Flag-off is the existing behavior.

Commits

Add opt-in swap accounting to memory monitor (C++ + tests)
Make swap accounting cgroup-aware in scheduler and dashboard (Python helper)
Extend swap accounting to resource-isolation user-slice OOM monitor

Test plan

C++ gtests: cgroup v2 swap, cgroup v1 memsw, "max" sentinel, missing files, user-slice swap fold, flag-off pinning.
Python unit tests: scheduler swap inclusion, explicit --memory bypass, dashboard _get_mem_usage clamps used ≤ total.
Manual: RAY_count_swap_in_memory_monitor=1 ray start --head --enable-resource-isolation — ray status, dashboard, and OOM threshold all reflect RAM + cgroup swap.

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces an opt-in configuration (RAY_count_swap_in_memory_monitor) to fold swap space into Ray's memory monitoring, node resource calculations, and dashboard metrics across both C++ and Python components. The review feedback highlights critical robustness improvements, specifically recommending exception handling when parsing memory.swap.max in C++ to prevent crashes on extremely large values, and wrapping psutil.swap_memory() calls in Python with try-except blocks to avoid failures in restricted or containerized environments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Kunchd · 2026-06-03T20:33:55Z

Hi @preneond, thanks for the PR! I have a couple of question/comments.

Is there a reason why swap accounting needs to be opt-in via an environment variable versus always on (but returns 0 when swap is not available)?
The proposed change is only applied to the memory monitor in default mode and does not modify the monitor system that kicks in when resource isolation is enabled via --enable-resource-isolation. However, resource_and_label_spec's modification will actually cause swap to be added when resolving the available memory on the host even when resource isolation is enabled: https://github.com/ray-project/ray/pull/63793/changes#diff-4e76259ffbe6651a00db2c2c67f218671b74effaf8a41559964ee37892553b66R405-R409. Please either also add swap accounting to resource isolation's monitoring system, or make sure that the changes are only applied to the default memory monitoring system.

preneond · 2026-06-09T11:14:21Z

Hi @preneond, thanks for the PR! I have a couple of question/comments.

Is there a reason why swap accounting needs to be opt-in via an environment variable versus always on (but returns 0 when swap is not available)?

The proposed change is only applied to the memory monitor in default mode and does not modify the monitor system that kicks in when resource isolation is enabled via --enable-resource-isolation. However, resource_and_label_spec's modification will actually cause swap to be added when resolving the available memory on the host even when resource isolation is enabled: https://github.com/ray-project/ray/pull/63793/changes#diff-4e76259ffbe6651a00db2c2c67f218671b74effaf8a41559964ee37892553b66R405-R409. Please either also add swap accounting to resource isolation's monitoring system, or make sure that the changes are only applied to the default memory monitoring system.

Hi @Kunchd!

On the opt-in question: I wanted to keep it opt-in because swap can slow Ray down, and on many hosts swap can be enabled for reasons unrelated to Ray (OS-level panic buffer, ...). An always-on default would silently push tasks into swap on those systems. Happy to flip the default to on if you think that's the better tradeoff or remove the flag completely

On the resource isolation inconsistency: Good catch - the C++ user-slice OOM path doesn't include swap, so advertising RAM+swap on the Python side is inconsistent. I'd like to go with option 1, I will add swap accounting to resource isolation's monitoring system

preneond · 2026-06-12T13:47:47Z

@Kunchd ready for re-review, let me know about the opt-in flag whether I should set is as default or enable it completely (see my comment here). Thank you!

Kunchd

Thanks for accounting for swap in resource isolation mode as well! In terms of whether we want to support swap out of the box, I agree that it should be opt in to avoid unintended behaviors.

Since supporting swap may introduce unexpected performance regression and edge case, could you provide a benchmark result of whether running with both resource isolation on and off while swap is enabled causes unintended ray OOM kills even when swap space is available in the PR description. Please provide the script for the benchmark if possible as well. Thanks!

Kunchd · 2026-06-18T17:09:11Z

+            cgroup_swap_max = int(val)
+            if cgroup_swap_max > _INT64_MAX:
+                # Overflows int64; C++ treats this as unlimited and adds no swap.
+                return 0, 0


Adding no swap when the limit is unlimited doesn't really make sense. Shouldn't we be returning the available host swap size in this case?

Yeah, fair. I picked (0,0) to match the C++ side and to keep Python/C++ from diverging (cursor[bot] had flagged that earlier: #63793 (comment) and #63793 (comment)), but you're right that semantically that's still wrong — "unlimited" should mean host swap, not zero. I'll change both Python and C++ to fall back to host swap when memory.swap.max is "max" or overflows int64. Zero stays zero only when the kernel explicitly says swap is disabled.

Addressed in 6ad3d1f407 (and the user-slice OOM threshold follow-up at 21aea5c9ac so Python, root-cgroup C++, and the user-slice path all agree).

Kunchd · 2026-06-18T17:12:31Z

+    # against psutil itself raising on stripped containers / unsupported
+    # kernels — callers shouldn't have to know which exception types psutil
+    # may throw (RuntimeError, NotImplementedError, OSError, …).
+    try:


If the user has swap accounting enabled but we failed to fetch the swap size available, we should explicitly raise an exception. This is the startup path, and we want to fail fast if something is misconfigured instead of quietly failing.

Makes sense. I'll drop the blanket except in the helper so psutil raises. The startup path (resource_and_label_spec.resolve) won't catch it, so misconfig fails fast. The dashboard reporter is a periodic loop so I'll keep a catch there but log the exception.

Addressed in 91d1089a27.

Kunchd · 2026-06-18T17:13:58Z

+            if _COUNT_SWAP_IN_MEMORY_MONITOR:
+                try:
+                    swap_total, _ = get_cgroup_aware_swap_memory()
+                except (OSError, NotImplementedError):


Would psutil or anything inside the call actually throw NotImplementedError?

psutil can raise NotImplementedError on platforms where swap_memory() isn't supported, but the helper currently swallows everything so this except is dead code anyway. Going to drop the try/except entirely per your other comment — let it crash on startup if swap lookup fails.

Addressed in 91d1089a27.

Kunchd · 2026-06-18T17:14:21Z

+                    logger.warning(
+                        "Failed to retrieve swap memory info; swap will "
+                        "not be counted toward node memory capacity.",
+                        exc_info=True,


We should log the exception's string representation as well.

Addressed in 9d970dc632 — the original except here got removed (per your fail-fast comment), and str(e) is now used on the analogous catch in the dashboard reporter.

Kunchd · 2026-06-18T17:18:34Z

+            try:
+                swap_total, swap_used = get_cgroup_aware_swap_memory()
+                total += swap_total
+                used += swap_used


This actually feels a little misleading. When swap space starts being used, users will observe a noticeable performance degradation. Instead of treating swap as extra memory in the metrics, we probably want to separate these out to a separate metric.

Agreed. I'll revert _get_mem_usage to RAM-only and add a separate _get_swap_usage plus node_swap_used / node_swap_total / node_swap_utilization gauges and a swap field on the wire payload.

Addressed in 9d970dc632.

Kunchd · 2026-06-18T17:23:13Z

+  MemoryUsageSnapshot memory_snapshot = MemoryMonitorUtils::TakeSystemMemoryUsageSnapshot(
+      cgroup_path,
+      MemoryMonitorUtils::kProcDirectory,
+      /*include_swap=*/false);


How does swap space interact with cgroup constraints? When memory usage approach a cgroup constraint, as the constraint get triggered before the memory spills to disk, or are we guaranteed that the memory spills before the constraint is triggered?

memory.high is a soft throttle — crossing it triggers direct reclaim, which includes swapping anonymous pages out (subject to memory.swap.max) before the cgroup can keep allocating. So swap-out happens as part of hitting memory.high, not before or after it.

memory.max is the hard cap. On v2 it's RAM-only; swap is governed separately by memory.swap.max. We don't write memory.swap.max on the user slice so it inherits from the parent.

That's why include_swap=false here — we want the RAM-only number for the throttle, and the kernel uses available swap to absorb pressure before stalling. Will expand the inline comment to capture this.

Addressed in 8b8d393647.

…ited cursor[bot] caught a mismatch I introduced in the previous commit: GetMemoryThreshold added host swap to the OOM trigger for an unlimited user-slice swap.max, but TakeCgroupMemorySnapshot still left swap_max_bytes / swap_used_bytes at zero. TakeUserSliceMemoryUsageSnapshot fed those zeros into the per-tick used/total, so the trigger sat at RAM+host_swap while per-tick usage was RAM-only — the monitor would never see usage grow to the threshold and OOM kills would fire late or not at all. Apply the same fallback the threshold path uses: - "max" / non-numeric / int64-overflow → snapshot.swap_max_bytes = host SwapTotal (from GetHostSwapBytes). - swap.current is read whenever the cgroup has any swap budget (numeric non-zero OR unlimited). The kernel publishes swap.current regardless of the cap, so it gives the per-cgroup actual usage we want — host SwapTotal-SwapFree is host-wide and wouldn't isolate the user slice. - swap.max == 0 (swap explicitly disabled) keeps the old behavior: skip the swap.current read so a stale value can't surface as used. Plumbed proc_dir through TakeCgroupMemorySnapshot so tests can mock /proc/meminfo. Existing TestUserSliceUnlimitedSwapNotAddedToTotal renamed and flipped; new TestUserSliceOverflowSwapFallsBackToHostSwap covers the std::stoll out_of_range path. Closes: ray-project#63793 (comment) Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

Introduce the RAY_count_swap_in_memory_monitor config flag (default false) and wire it through MemoryMonitorUtils so the raylet threshold OOM monitor can include swap when computing total and used memory. * memory_monitor_utils.cc::GetCGroupMemoryBytes — cgroup v2 swap.max / swap.current added to total/used; cgroup v1 memsw counters replace the RAM-only counters when the flag is on. * memory_monitor_utils.cc::GetLinuxMemoryBytes — /proc/meminfo SwapTotal and SwapTotal-SwapFree folded into total/used. * swap.max parse safety: all-digit pre-check + std::stoll inside try/catch so kernel ULLONG_MAX / "max" sentinels do not crash the raylet, and swap.current is only consulted when swap.max > 0. Tests cover cgroup v2 swap, cgroup v1 memsw, /proc/meminfo swap, the parse-overflow guard, and the flag-off no-op for each path. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

Introduce ray._common.utils.get_cgroup_aware_swap_memory as the single source of truth for swap totals seen by ResourceAndLabelSpec and the dashboard reporter agent. The helper's invariant: once a cgroup branch is taken, every returned field is cgroup-scoped — host psutil values are never mixed into a cgroup result. cgroup v2 reads memory.swap.max and memory.swap.current; cgroup v1 derives swap from memory.memsw.{limit,usage}_in_bytes minus the RAM counters; the no-cgroup branch falls back to psutil.swap_memory(). Fixes several divergences from the C++ raylet monitor's enforcement: * Scheduler previously read host psutil.swap_memory().total directly, so in containers it advertised cgroup_ram + host_swap while the C++ OOM killer only enforced the cgroup limit. * Dashboard fell back to host swap when memory.swap.max was the literal "max" sentinel. C++ adds no swap in that case; helper now returns 0. * cgroup v1 memsw without a RAM limit double-counted RAM as swap. Helper now returns 0 in that ambiguous case. * Once a cgroup branch was committed, a read or parse error leaked host swap through the except handler; both branches now return (0, 0). * memory.swap.max values that overflow int64 (kernel ULLONG_MAX sentinel) parsed via int() and were clamped to host.total. Helper now treats them as unusable and returns 0, matching C++ stoll out_of_range. * When memory.swap.max parses to 0 (swap disabled), the helper no longer reads memory.swap.current — prevents impossible used > total. * The host clamp `min(cgroup_swap_max, host.total)` was dropped from both branches so totals mirror what the raylet OOM monitor enforces. The dashboard reporter applies a `used = max(0, min(used, total))` clamp mirroring MemoryMonitorUtils::GetCGroupMemoryBytes so the Node Memory graph cannot report negative available or percent > 100%. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

…nitor When --enable-resource-isolation is on, the active enforcement path is ThresholdMemoryMonitor::IsResourceIsolationThresholdExceeded → TakeUserSliceMemoryUsageSnapshot, which previously summed user.anon + user.shmem + system.shmem and compared against host RAM — ignoring memory.swap.* for both user and system cgroups. Combined with the scheduler advertising RAM + cgroup swap, this caused the OOM killer to enforce a lower ceiling than the cluster was scheduling against. Changes: * memory_monitor_interface.h: extend CgroupMemorySnapshot with swap_used_bytes and swap_max_bytes, defaulted to 0 so flag-off paths stay byte-identical to the pre-change behavior. * memory_monitor_utils.cc: TakeCgroupMemorySnapshot now reads memory.swap.{max,current} for the cgroup when count_swap_in_memory_monitor is on, mirroring the parse safety the v2 branch of GetCGroupMemoryBytes already uses — all-digit check before std::stoll, try/catch around overflow / invalid input, and skip swap.current when swap.max is 0. * TakeUserSliceMemoryUsageSnapshot folds the user slice's swap into both used and total. By analogy with how system.anon is excluded (only system.shmem is included for indeterminacy reasons), the system slice's swap.{max,current} is NOT credited to the user-slice budget — swap is per-cgroup deterministic, so system swap belongs to raylet, not to user workers. * TakeUserSliceMemoryUsageSnapshot now calls GetLinuxMemoryBytes with include_swap=false so host-level /proc/meminfo swap is not double-counted against the per-slice cgroup swap added above. * GetLinuxMemoryBytes takes a new include_swap parameter (default true) so existing callers preserve behavior; only the user-slice path opts out. * GetMemoryThreshold: under resource isolation, when the flag is on, add the user cgroup's memory.swap.max to the resolved threshold. The isolation override uses memory.high (a RAM-only kernel constraint) as the base threshold, so without this addition the per-tick used_bytes (which now includes user-slice swap) would be compared against a RAM-only ceiling and trigger the OOM killer prematurely. Same parse safety pattern as the v2 swap path; "max" / overflow / read failure all degrade silently to adding nothing. Tests mirror the existing v2 cases in memory_monitor_utils_test.cc: swap added when flag is on, ignored when flag is off, "max" sentinel not added, swap.max == 0 ignores a stale swap.current, missing swap files degrade to the pre-flag behavior. The "swap added" test mocks non-zero system-cgroup swap and asserts it is NOT folded into used or total. Also pins the flag off in the pre-existing TestTakeUserSliceMemoryUsageSnapshotValidPaths… test so its assertion against TakeSystemMemoryUsageSnapshot's host total is robust to test ordering — RayConfig is process-global. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

When cgroup v2 `memory.swap.max` is the kernel's "unlimited" sentinel ("max", non-numeric, or a numeric value that overflows int64), the cgroup imposes no swap cap, so the practical limit is whatever the host has — not zero. Both the Python helper and the C++ memory monitor now read host swap (from psutil / /proc/meminfo) in this case. `memory.swap.max == 0` (swap explicitly disabled) still returns 0 — the kernel is saying "no swap for this cgroup", which is distinct from "unlimited". Extends test_utils.py and the C++ gtest with parametrized coverage for "max", garbage, empty, and overflow values. Adds a string overload of MockCgroupv2Swap so tests can inject content std::stoll rejects. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

When RAY_count_swap_in_memory_monitor=1 is set, the operator has explicitly opted into swap accounting — silently swallowing the swap lookup failure means scheduling and the OOM monitor behave as if swap is off, which is the exact opposite of what was requested. - _get_host_swap_memory no longer wraps psutil.swap_memory() in a blanket except; psutil's native exception (OSError / NotImplementedError / RuntimeError) propagates. - resource_and_label_spec.resolve() drops its try/except. Misconfig crashes raylet startup with the native error. - Cgroup branches keep their (OSError, ValueError) handling — that protects the "never mix host swap into a cgroup result" invariant. - Periodic callers (e.g. dashboard reporter, addressed next) catch their own exceptions; they shouldn't take down metrics on transient failure. Test flipped from "silently logs warning" to "raises OSError"; adds a companion test that flag-off path never calls the swap helper. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

Folding swap into the Node Memory graph hid the perf-degradation signal operators need once swap actually starts getting used. Split it out: - _get_mem_usage reverts to RAM-only. - New _get_swap_usage returns (total, used, percent) using the same cgroup-aware helper. Reads RAY_count_swap_in_memory_monitor; returns zeros when off. Wraps exceptions with str(e) in the log line per Kunchd's review feedback — periodic loop, so log-and-continue beats taking down the metrics path. - Wire payload gets a new "swap" field alongside "mem"/"host_mem"/ "cgroup_mem". - New Prometheus gauges node_swap_used / node_swap_total / node_swap_utilization, emitted only when swap_total > 0 so we don't add a constant-zero series to dashboards. Trade: the OOM killer and scheduler `memory` resource still count RAM+swap when the flag is on, so the kill threshold no longer matches the Node Memory graph 1:1. The swap-specific graph carries that signal instead. Frontend panel left as a follow-up; this commit ships the Prometheus + wire-payload contract. STATS_TEMPLATE gets the new "swap" key and tests cover both the "flag off / no swap" (no records) and "flag on with non-zero swap" (all three gauges emitted) cases. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

The previous comment said memory.high was RAM-only and that swap must not be folded in, but didn't explain why or how swap actually interacts with the throttle. Expand to capture the kernel ordering Kunchd asked about in review: - memory.high is the soft-throttle line. Crossing it triggers direct reclaim, which itself includes swapping anonymous pages out (subject to memory.swap.max). Swap-out happens AS PART OF hitting memory.high, not before or after. - memory.max is the hard cap. Under cgroup v2 it's RAM-only; swap is governed by a separate memory.swap.max, which we don't write on the user slice — so it inherits from the parent (typically unlimited). - include_swap=false here is what lets the kernel use available swap to absorb pressure before stalling. Inflating total_bytes by swap would push memory.high past host RAM and silently disable the throttle. Comment-only change; no behavior diff. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

The earlier commit changed GetCGroupMemoryBytes (root cgroup) and the Python helper to fall back to host swap when memory.swap.max is "max" or overflows int64. GetMemoryThreshold's resource-isolation branch read the *user* cgroup's swap.max and still treated "max" / overflow as "add nothing" — meaning the scheduling capacity and dashboard advertised RAM+swap while the user-slice OOM monitor enforced RAM only. Mirror the same semantics: when the user-slice swap.max is the "unlimited" sentinel, the user slice inherits the parent budget (typically unlimited at the root), so fall back to host swap. Python, the root-cgroup C++ path, and the user-slice C++ path now all agree. Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

…ited cursor[bot] caught a mismatch I introduced in the previous commit: GetMemoryThreshold added host swap to the OOM trigger for an unlimited user-slice swap.max, but TakeCgroupMemorySnapshot still left swap_max_bytes / swap_used_bytes at zero. TakeUserSliceMemoryUsageSnapshot fed those zeros into the per-tick used/total, so the trigger sat at RAM+host_swap while per-tick usage was RAM-only — the monitor would never see usage grow to the threshold and OOM kills would fire late or not at all. Apply the same fallback the threshold path uses: - "max" / non-numeric / int64-overflow → snapshot.swap_max_bytes = host SwapTotal (from GetHostSwapBytes). - swap.current is read whenever the cgroup has any swap budget (numeric non-zero OR unlimited). The kernel publishes swap.current regardless of the cap, so it gives the per-cgroup actual usage we want — host SwapTotal-SwapFree is host-wide and wouldn't isolate the user slice. - swap.max == 0 (swap explicitly disabled) keeps the old behavior: skip the swap.current read so a stale value can't surface as used. Plumbed proc_dir through TakeCgroupMemorySnapshot so tests can mock /proc/meminfo. Existing TestUserSliceUnlimitedSwapNotAddedToTotal renamed and flipped; new TestUserSliceOverflowSwapFallsBackToHostSwap covers the std::stoll out_of_range path. Closes: ray-project#63793 (comment) Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

^{Reviewed by Cursor Bugbot for commit fc20c12. Configure here.}

cursor · 2026-06-22T13:23:38Z

+            # the OOM monitor enforces.
+            swap_total = max(0, memsw_limit - ram_limit)
+            swap_used = 0 if ram_usage is None else max(0, memsw_usage - ram_usage)
+            return swap_total, swap_used


cgroup v1 memsw accounting mismatch

Medium Severity

With RAY_count_swap_in_memory_monitor enabled on cgroup v1, the Python scheduler helper treats memsw as swap-only (via memsw − ram) added to RAM-based capacity, while the C++ OOM path switches totals and usage to combined memsw counters. Missing memory.limit_in_bytes yields (0, 0) swap in Python but full memsw limits in C++.

Additional Locations (1)

src/ray/common/memory_monitor_utils.cc#L284-L323

^{Reviewed by Cursor Bugbot for commit fc20c12. Configure here.}

cursor · 2026-06-22T13:23:38Z

+          }
+        }
+      }
+    }


Isolation threshold skips swap lookup

Medium Severity

Under resource isolation, per-tick user-slice usage always folds in user cgroup swap via direct memory.swap.* reads, but GetMemoryThreshold only extends memory.high with swap when GetUserCgroupConstraintValue succeeds. If that lookup fails while swap files remain readable, used includes swap while the kill threshold does not.

Additional Locations (1)

src/ray/common/memory_monitor_utils.cc#L147-L190

^{Reviewed by Cursor Bugbot for commit fc20c12. Configure here.}

preneond requested a review from a team as a code owner June 2, 2026 09:14

gemini-code-assist Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/ray/common/memory_monitor_utils.cc Outdated

Comment thread python/ray/_private/resource_and_label_spec.py Outdated

Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated

cursor Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/ray/common/memory_monitor_test_fixture.cc

preneond marked this pull request as draft June 2, 2026 09:52

preneond force-pushed the feat/count-swap-in-memory-monitor branch from da2364f to 58ffe4c Compare June 2, 2026 09:55

preneond marked this pull request as ready for review June 2, 2026 13:49

ray-gardener Bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling community-contribution Contributed by the community labels Jun 2, 2026