Skip to content

[core] Add opt-in swap accounting to memory monitor and scheduler#63793

Open
preneond wants to merge 9 commits into
ray-project:masterfrom
preneond:feat/count-swap-in-memory-monitor
Open

[core] Add opt-in swap accounting to memory monitor and scheduler#63793
preneond wants to merge 9 commits into
ray-project:masterfrom
preneond:feat/count-swap-in-memory-monitor

Conversation

@preneond

@preneond preneond commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Why

Ray's OOM monitor kills tasks based on RAM alone, while the Linux OOM killer only fires when both RAM and swap are exhausted. On nodes provisioned with swap as overflow, Ray kills workers long before the kernel would. The scheduler's memory resource and the dashboard's Node Memory graph have the same RAM-only blind spot, and under --enable-resource-isolation the user-slice threshold ignores swap too — so scheduling, reporting, and enforcement all disagree.

What

Adds opt-in RAY_count_swap_in_memory_monitor (default false). When on, swap is folded into:

  1. C++ memory monitor — cgroup v1 memory.memsw.* or cgroup v2 memory.swap.{max,current}; /proc/meminfo swap on bare metal.
  2. Scheduler memory resource and dashboard Node Memory — via shared get_cgroup_aware_swap_memory() so containers see cgroup-scoped swap, not host swap.
  3. Resource-isolation user-slice OOM monitorTakeUserSliceMemoryUsageSnapshot now reads user.memory.swap.{max,current} and folds them into both used and total. Without this, ray status advertises RAM + swap while the monitor enforces RAM only.

memory.high (RAM-only kernel constraint) is not inflated by swap — GetCGroupMemoryBytes takes include_swap so the cgroup-manager path stays RAM-only. Flag-off is the existing behavior.

Commits

  1. Add opt-in swap accounting to memory monitor (C++ + tests)
  2. Make swap accounting cgroup-aware in scheduler and dashboard (Python helper)
  3. Extend swap accounting to resource-isolation user-slice OOM monitor

Test plan

  • C++ gtests: cgroup v2 swap, cgroup v1 memsw, "max" sentinel, missing files, user-slice swap fold, flag-off pinning.
  • Python unit tests: scheduler swap inclusion, explicit --memory bypass, dashboard _get_mem_usage clamps used ≤ total.
  • Manual: RAY_count_swap_in_memory_monitor=1 ray start --head --enable-resource-isolationray status, dashboard, and OOM threshold all reflect RAM + cgroup swap.

🤖 Generated with Claude Code

@preneond preneond requested a review from a team as a code owner June 2, 2026 09:14

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an opt-in configuration (RAY_count_swap_in_memory_monitor) to fold swap space into Ray's memory monitoring, node resource calculations, and dashboard metrics across both C++ and Python components. The review feedback highlights critical robustness improvements, specifically recommending exception handling when parsing memory.swap.max in C++ to prevent crashes on extremely large values, and wrapping psutil.swap_memory() calls in Python with try-except blocks to avoid failures in restricted or containerized environments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/ray/common/memory_monitor_utils.cc Outdated
Comment thread python/ray/_private/resource_and_label_spec.py Outdated
Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
Comment thread src/ray/common/memory_monitor_test_fixture.cc
@preneond preneond marked this pull request as draft June 2, 2026 09:52
@preneond preneond force-pushed the feat/count-swap-in-memory-monitor branch from da2364f to 58ffe4c Compare June 2, 2026 09:55
@preneond preneond marked this pull request as ready for review June 2, 2026 13:49
@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling community-contribution Contributed by the community labels Jun 2, 2026
@Kunchd

Kunchd commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Hi @preneond, thanks for the PR! I have a couple of question/comments.

  • Is there a reason why swap accounting needs to be opt-in via an environment variable versus always on (but returns 0 when swap is not available)?
  • The proposed change is only applied to the memory monitor in default mode and does not modify the monitor system that kicks in when resource isolation is enabled via --enable-resource-isolation. However, resource_and_label_spec's modification will actually cause swap to be added when resolving the available memory on the host even when resource isolation is enabled: https://github.com/ray-project/ray/pull/63793/changes#diff-4e76259ffbe6651a00db2c2c67f218671b74effaf8a41559964ee37892553b66R405-R409. Please either also add swap accounting to resource isolation's monitoring system, or make sure that the changes are only applied to the default memory monitoring system.

@preneond

preneond commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Hi @preneond, thanks for the PR! I have a couple of question/comments.

  • Is there a reason why swap accounting needs to be opt-in via an environment variable versus always on (but returns 0 when swap is not available)?
  • The proposed change is only applied to the memory monitor in default mode and does not modify the monitor system that kicks in when resource isolation is enabled via --enable-resource-isolation. However, resource_and_label_spec's modification will actually cause swap to be added when resolving the available memory on the host even when resource isolation is enabled: https://github.com/ray-project/ray/pull/63793/changes#diff-4e76259ffbe6651a00db2c2c67f218671b74effaf8a41559964ee37892553b66R405-R409. Please either also add swap accounting to resource isolation's monitoring system, or make sure that the changes are only applied to the default memory monitoring system.

Hi @Kunchd!

On the opt-in question: I wanted to keep it opt-in because swap can slow Ray down, and on many hosts swap can be enabled for reasons unrelated to Ray (OS-level panic buffer, ...). An always-on default would silently push tasks into swap on those systems. Happy to flip the default to on if you think that's the better tradeoff or remove the flag completely

On the resource isolation inconsistency: Good catch - the C++ user-slice OOM path doesn't include swap, so advertising RAM+swap on the Python side is inconsistent. I'd like to go with option 1, I will add swap accounting to resource isolation's monitoring system

Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
Comment thread python/ray/_private/resource_and_label_spec.py
@preneond preneond force-pushed the feat/count-swap-in-memory-monitor branch from aedd43e to 24fd2cd Compare June 9, 2026 14:37
Comment thread python/ray/_common/utils.py
Comment thread python/ray/_common/utils.py Outdated
Comment thread python/ray/_common/utils.py Outdated
Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
Comment thread python/ray/_common/utils.py Outdated
Comment thread python/ray/_common/utils.py
Comment thread python/ray/_common/utils.py
@preneond preneond marked this pull request as draft June 9, 2026 20:17
@preneond preneond force-pushed the feat/count-swap-in-memory-monitor branch 3 times, most recently from 1efb1a0 to 6f723a7 Compare June 9, 2026 21:31
@preneond preneond marked this pull request as ready for review June 9, 2026 21:32
@preneond preneond force-pushed the feat/count-swap-in-memory-monitor branch from 6f723a7 to 3d73019 Compare June 10, 2026 07:21
Comment thread src/ray/common/memory_monitor_utils.cc
@preneond preneond force-pushed the feat/count-swap-in-memory-monitor branch 2 times, most recently from 2a1a11b to 256ecdf Compare June 10, 2026 07:49
@preneond preneond force-pushed the feat/count-swap-in-memory-monitor branch from 256ecdf to bcdb1f0 Compare June 10, 2026 12:01
@preneond

Copy link
Copy Markdown
Contributor Author

@Kunchd ready for re-review, let me know about the opt-in flag whether I should set is as default or enable it completely (see my comment here). Thank you!

@Kunchd Kunchd left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for accounting for swap in resource isolation mode as well! In terms of whether we want to support swap out of the box, I agree that it should be opt in to avoid unintended behaviors.

Since supporting swap may introduce unexpected performance regression and edge case, could you provide a benchmark result of whether running with both resource isolation on and off while swap is enabled causes unintended ray OOM kills even when swap space is available in the PR description. Please provide the script for the benchmark if possible as well. Thanks!

Comment thread python/ray/_common/utils.py Outdated
cgroup_swap_max = int(val)
if cgroup_swap_max > _INT64_MAX:
# Overflows int64; C++ treats this as unlimited and adds no swap.
return 0, 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding no swap when the limit is unlimited doesn't really make sense. Shouldn't we be returning the available host swap size in this case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair. I picked (0,0) to match the C++ side and to keep Python/C++ from diverging (cursor[bot] had flagged that earlier: #63793 (comment) and #63793 (comment)), but you're right that semantically that's still wrong — "unlimited" should mean host swap, not zero. I'll change both Python and C++ to fall back to host swap when memory.swap.max is "max" or overflows int64. Zero stays zero only when the kernel explicitly says swap is disabled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6ad3d1f407 (and the user-slice OOM threshold follow-up at 21aea5c9ac so Python, root-cgroup C++, and the user-slice path all agree).

Comment thread python/ray/_common/utils.py Outdated
# against psutil itself raising on stripped containers / unsupported
# kernels — callers shouldn't have to know which exception types psutil
# may throw (RuntimeError, NotImplementedError, OSError, …).
try:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user has swap accounting enabled but we failed to fetch the swap size available, we should explicitly raise an exception. This is the startup path, and we want to fail fast if something is misconfigured instead of quietly failing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I'll drop the blanket except in the helper so psutil raises. The startup path (resource_and_label_spec.resolve) won't catch it, so misconfig fails fast. The dashboard reporter is a periodic loop so I'll keep a catch there but log the exception.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 91d1089a27.

if _COUNT_SWAP_IN_MEMORY_MONITOR:
try:
swap_total, _ = get_cgroup_aware_swap_memory()
except (OSError, NotImplementedError):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would psutil or anything inside the call actually throw NotImplementedError?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

psutil can raise NotImplementedError on platforms where swap_memory() isn't supported, but the helper currently swallows everything so this except is dead code anyway. Going to drop the try/except entirely per your other comment — let it crash on startup if swap lookup fails.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 91d1089a27.

logger.warning(
"Failed to retrieve swap memory info; swap will "
"not be counted toward node memory capacity.",
exc_info=True,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should log the exception's string representation as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9d970dc632 — the original except here got removed (per your fail-fast comment), and str(e) is now used on the analogous catch in the dashboard reporter.

try:
swap_total, swap_used = get_cgroup_aware_swap_memory()
total += swap_total
used += swap_used

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually feels a little misleading. When swap space starts being used, users will observe a noticeable performance degradation. Instead of treating swap as extra memory in the metrics, we probably want to separate these out to a separate metric.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'll revert _get_mem_usage to RAM-only and add a separate _get_swap_usage plus node_swap_used / node_swap_total / node_swap_utilization gauges and a swap field on the wire payload.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9d970dc632.

MemoryUsageSnapshot memory_snapshot = MemoryMonitorUtils::TakeSystemMemoryUsageSnapshot(
cgroup_path,
MemoryMonitorUtils::kProcDirectory,
/*include_swap=*/false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does swap space interact with cgroup constraints? When memory usage approach a cgroup constraint, as the constraint get triggered before the memory spills to disk, or are we guaranteed that the memory spills before the constraint is triggered?

@preneond preneond Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory.high is a soft throttle — crossing it triggers direct reclaim, which includes swapping anonymous pages out (subject to memory.swap.max) before the cgroup can keep allocating. So swap-out happens as part of hitting memory.high, not before or after it.

memory.max is the hard cap. On v2 it's RAM-only; swap is governed separately by memory.swap.max. We don't write memory.swap.max on the user slice so it inherits from the parent.

That's why include_swap=false here — we want the RAM-only number for the throttle, and the kernel uses available swap to absorb pressure before stalling. Will expand the inline comment to capture this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 8b8d393647.

Comment thread src/ray/common/memory_monitor_utils.cc
preneond added a commit to preneond/ray that referenced this pull request Jun 22, 2026
…ited

cursor[bot] caught a mismatch I introduced in the previous commit:
GetMemoryThreshold added host swap to the OOM trigger for an unlimited
user-slice swap.max, but TakeCgroupMemorySnapshot still left
swap_max_bytes / swap_used_bytes at zero. TakeUserSliceMemoryUsageSnapshot
fed those zeros into the per-tick used/total, so the trigger sat at
RAM+host_swap while per-tick usage was RAM-only — the monitor would
never see usage grow to the threshold and OOM kills would fire late or
not at all.

Apply the same fallback the threshold path uses:

- "max" / non-numeric / int64-overflow → snapshot.swap_max_bytes =
  host SwapTotal (from GetHostSwapBytes).
- swap.current is read whenever the cgroup has any swap budget
  (numeric non-zero OR unlimited). The kernel publishes swap.current
  regardless of the cap, so it gives the per-cgroup actual usage we
  want — host SwapTotal-SwapFree is host-wide and wouldn't isolate the
  user slice.
- swap.max == 0 (swap explicitly disabled) keeps the old behavior:
  skip the swap.current read so a stale value can't surface as used.

Plumbed proc_dir through TakeCgroupMemorySnapshot so tests can mock
/proc/meminfo. Existing TestUserSliceUnlimitedSwapNotAddedToTotal
renamed and flipped; new TestUserSliceOverflowSwapFallsBackToHostSwap
covers the std::stoll out_of_range path.

Closes: ray-project#63793 (comment)
Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
preneond added 9 commits June 22, 2026 15:20
Introduce the RAY_count_swap_in_memory_monitor config flag (default false)
and wire it through MemoryMonitorUtils so the raylet threshold OOM monitor
can include swap when computing total and used memory.

* memory_monitor_utils.cc::GetCGroupMemoryBytes — cgroup v2 swap.max /
  swap.current added to total/used; cgroup v1 memsw counters replace the
  RAM-only counters when the flag is on.
* memory_monitor_utils.cc::GetLinuxMemoryBytes — /proc/meminfo SwapTotal
  and SwapTotal-SwapFree folded into total/used.
* swap.max parse safety: all-digit pre-check + std::stoll inside
  try/catch so kernel ULLONG_MAX / "max" sentinels do not crash the
  raylet, and swap.current is only consulted when swap.max > 0.

Tests cover cgroup v2 swap, cgroup v1 memsw, /proc/meminfo swap, the
parse-overflow guard, and the flag-off no-op for each path.

Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
Introduce ray._common.utils.get_cgroup_aware_swap_memory as the single
source of truth for swap totals seen by ResourceAndLabelSpec and the
dashboard reporter agent. The helper's invariant: once a cgroup branch
is taken, every returned field is cgroup-scoped — host psutil values
are never mixed into a cgroup result. cgroup v2 reads memory.swap.max
and memory.swap.current; cgroup v1 derives swap from
memory.memsw.{limit,usage}_in_bytes minus the RAM counters; the
no-cgroup branch falls back to psutil.swap_memory().

Fixes several divergences from the C++ raylet monitor's enforcement:

* Scheduler previously read host psutil.swap_memory().total directly,
  so in containers it advertised cgroup_ram + host_swap while the C++
  OOM killer only enforced the cgroup limit.
* Dashboard fell back to host swap when memory.swap.max was the literal
  "max" sentinel. C++ adds no swap in that case; helper now returns 0.
* cgroup v1 memsw without a RAM limit double-counted RAM as swap.
  Helper now returns 0 in that ambiguous case.
* Once a cgroup branch was committed, a read or parse error leaked host
  swap through the except handler; both branches now return (0, 0).
* memory.swap.max values that overflow int64 (kernel ULLONG_MAX
  sentinel) parsed via int() and were clamped to host.total. Helper now
  treats them as unusable and returns 0, matching C++ stoll out_of_range.
* When memory.swap.max parses to 0 (swap disabled), the helper no
  longer reads memory.swap.current — prevents impossible used > total.
* The host clamp `min(cgroup_swap_max, host.total)` was dropped from
  both branches so totals mirror what the raylet OOM monitor enforces.

The dashboard reporter applies a `used = max(0, min(used, total))` clamp
mirroring MemoryMonitorUtils::GetCGroupMemoryBytes so the Node Memory
graph cannot report negative available or percent > 100%.

Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
…nitor

When --enable-resource-isolation is on, the active enforcement path is
ThresholdMemoryMonitor::IsResourceIsolationThresholdExceeded →
TakeUserSliceMemoryUsageSnapshot, which previously summed
user.anon + user.shmem + system.shmem and compared against host RAM —
ignoring memory.swap.* for both user and system cgroups. Combined with
the scheduler advertising RAM + cgroup swap, this caused the OOM killer
to enforce a lower ceiling than the cluster was scheduling against.

Changes:

* memory_monitor_interface.h: extend CgroupMemorySnapshot with
  swap_used_bytes and swap_max_bytes, defaulted to 0 so flag-off paths
  stay byte-identical to the pre-change behavior.
* memory_monitor_utils.cc: TakeCgroupMemorySnapshot now reads
  memory.swap.{max,current} for the cgroup when
  count_swap_in_memory_monitor is on, mirroring the parse safety the v2
  branch of GetCGroupMemoryBytes already uses — all-digit check before
  std::stoll, try/catch around overflow / invalid input, and skip
  swap.current when swap.max is 0.
* TakeUserSliceMemoryUsageSnapshot folds the user slice's swap into
  both used and total. By analogy with how system.anon is excluded
  (only system.shmem is included for indeterminacy reasons), the system
  slice's swap.{max,current} is NOT credited to the user-slice budget —
  swap is per-cgroup deterministic, so system swap belongs to raylet,
  not to user workers.
* TakeUserSliceMemoryUsageSnapshot now calls GetLinuxMemoryBytes with
  include_swap=false so host-level /proc/meminfo swap is not
  double-counted against the per-slice cgroup swap added above.
* GetLinuxMemoryBytes takes a new include_swap parameter (default true)
  so existing callers preserve behavior; only the user-slice path opts
  out.
* GetMemoryThreshold: under resource isolation, when the flag is on,
  add the user cgroup's memory.swap.max to the resolved threshold. The
  isolation override uses memory.high (a RAM-only kernel constraint)
  as the base threshold, so without this addition the per-tick
  used_bytes (which now includes user-slice swap) would be compared
  against a RAM-only ceiling and trigger the OOM killer prematurely.
  Same parse safety pattern as the v2 swap path; "max" / overflow /
  read failure all degrade silently to adding nothing.

Tests mirror the existing v2 cases in memory_monitor_utils_test.cc:
swap added when flag is on, ignored when flag is off, "max" sentinel
not added, swap.max == 0 ignores a stale swap.current, missing swap
files degrade to the pre-flag behavior. The "swap added" test mocks
non-zero system-cgroup swap and asserts it is NOT folded into used or
total. Also pins the flag off in the pre-existing
TestTakeUserSliceMemoryUsageSnapshotValidPaths… test so its assertion
against TakeSystemMemoryUsageSnapshot's host total is robust to test
ordering — RayConfig is process-global.

Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
When cgroup v2 `memory.swap.max` is the kernel's "unlimited" sentinel
("max", non-numeric, or a numeric value that overflows int64), the cgroup
imposes no swap cap, so the practical limit is whatever the host has —
not zero. Both the Python helper and the C++ memory monitor now read
host swap (from psutil / /proc/meminfo) in this case.

`memory.swap.max == 0` (swap explicitly disabled) still returns 0 — the
kernel is saying "no swap for this cgroup", which is distinct from
"unlimited".

Extends test_utils.py and the C++ gtest with parametrized coverage for
"max", garbage, empty, and overflow values. Adds a string overload of
MockCgroupv2Swap so tests can inject content std::stoll rejects.

Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
When RAY_count_swap_in_memory_monitor=1 is set, the operator has
explicitly opted into swap accounting — silently swallowing the swap
lookup failure means scheduling and the OOM monitor behave as if swap
is off, which is the exact opposite of what was requested.

- _get_host_swap_memory no longer wraps psutil.swap_memory() in a
  blanket except; psutil's native exception (OSError /
  NotImplementedError / RuntimeError) propagates.
- resource_and_label_spec.resolve() drops its try/except. Misconfig
  crashes raylet startup with the native error.
- Cgroup branches keep their (OSError, ValueError) handling — that
  protects the "never mix host swap into a cgroup result" invariant.
- Periodic callers (e.g. dashboard reporter, addressed next) catch
  their own exceptions; they shouldn't take down metrics on transient
  failure.

Test flipped from "silently logs warning" to "raises OSError"; adds a
companion test that flag-off path never calls the swap helper.

Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
Folding swap into the Node Memory graph hid the perf-degradation signal
operators need once swap actually starts getting used. Split it out:

- _get_mem_usage reverts to RAM-only.
- New _get_swap_usage returns (total, used, percent) using the same
  cgroup-aware helper. Reads RAY_count_swap_in_memory_monitor; returns
  zeros when off. Wraps exceptions with str(e) in the log line per
  Kunchd's review feedback — periodic loop, so log-and-continue beats
  taking down the metrics path.
- Wire payload gets a new "swap" field alongside "mem"/"host_mem"/
  "cgroup_mem".
- New Prometheus gauges node_swap_used / node_swap_total /
  node_swap_utilization, emitted only when swap_total > 0 so we don't
  add a constant-zero series to dashboards.

Trade: the OOM killer and scheduler `memory` resource still count
RAM+swap when the flag is on, so the kill threshold no longer matches
the Node Memory graph 1:1. The swap-specific graph carries that signal
instead. Frontend panel left as a follow-up; this commit ships the
Prometheus + wire-payload contract.

STATS_TEMPLATE gets the new "swap" key and tests cover both the
"flag off / no swap" (no records) and "flag on with non-zero swap"
(all three gauges emitted) cases.

Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
The previous comment said memory.high was RAM-only and that swap must
not be folded in, but didn't explain why or how swap actually interacts
with the throttle. Expand to capture the kernel ordering Kunchd asked
about in review:

- memory.high is the soft-throttle line. Crossing it triggers direct
  reclaim, which itself includes swapping anonymous pages out (subject
  to memory.swap.max). Swap-out happens AS PART OF hitting memory.high,
  not before or after.
- memory.max is the hard cap. Under cgroup v2 it's RAM-only; swap is
  governed by a separate memory.swap.max, which we don't write on the
  user slice — so it inherits from the parent (typically unlimited).
- include_swap=false here is what lets the kernel use available swap to
  absorb pressure before stalling. Inflating total_bytes by swap would
  push memory.high past host RAM and silently disable the throttle.

Comment-only change; no behavior diff.

Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
The earlier commit changed GetCGroupMemoryBytes (root cgroup) and the
Python helper to fall back to host swap when memory.swap.max is "max"
or overflows int64. GetMemoryThreshold's resource-isolation branch read
the *user* cgroup's swap.max and still treated "max" / overflow as
"add nothing" — meaning the scheduling capacity and dashboard
advertised RAM+swap while the user-slice OOM monitor enforced RAM only.

Mirror the same semantics: when the user-slice swap.max is the
"unlimited" sentinel, the user slice inherits the parent budget
(typically unlimited at the root), so fall back to host swap. Python,
the root-cgroup C++ path, and the user-slice C++ path now all agree.

Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
…ited

cursor[bot] caught a mismatch I introduced in the previous commit:
GetMemoryThreshold added host swap to the OOM trigger for an unlimited
user-slice swap.max, but TakeCgroupMemorySnapshot still left
swap_max_bytes / swap_used_bytes at zero. TakeUserSliceMemoryUsageSnapshot
fed those zeros into the per-tick used/total, so the trigger sat at
RAM+host_swap while per-tick usage was RAM-only — the monitor would
never see usage grow to the threshold and OOM kills would fire late or
not at all.

Apply the same fallback the threshold path uses:

- "max" / non-numeric / int64-overflow → snapshot.swap_max_bytes =
  host SwapTotal (from GetHostSwapBytes).
- swap.current is read whenever the cgroup has any swap budget
  (numeric non-zero OR unlimited). The kernel publishes swap.current
  regardless of the cap, so it gives the per-cgroup actual usage we
  want — host SwapTotal-SwapFree is host-wide and wouldn't isolate the
  user slice.
- swap.max == 0 (swap explicitly disabled) keeps the old behavior:
  skip the swap.current read so a stale value can't surface as used.

Plumbed proc_dir through TakeCgroupMemorySnapshot so tests can mock
/proc/meminfo. Existing TestUserSliceUnlimitedSwapNotAddedToTotal
renamed and flipped; new TestUserSliceOverflowSwapFallsBackToHostSwap
covers the std::stoll out_of_range path.

Closes: ray-project#63793 (comment)
Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>
@preneond preneond force-pushed the feat/count-swap-in-memory-monitor branch from 7d8ddf0 to fc20c12 Compare June 22, 2026 13:20

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit fc20c12. Configure here.

# the OOM monitor enforces.
swap_total = max(0, memsw_limit - ram_limit)
swap_used = 0 if ram_usage is None else max(0, memsw_usage - ram_usage)
return swap_total, swap_used

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cgroup v1 memsw accounting mismatch

Medium Severity

With RAY_count_swap_in_memory_monitor enabled on cgroup v1, the Python scheduler helper treats memsw as swap-only (via memsw − ram) added to RAM-based capacity, while the C++ OOM path switches totals and usage to combined memsw counters. Missing memory.limit_in_bytes yields (0, 0) swap in Python but full memsw limits in C++.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit fc20c12. Configure here.

}
}
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isolation threshold skips swap lookup

Medium Severity

Under resource isolation, per-tick user-slice usage always folds in user cgroup swap via direct memory.swap.* reads, but GetMemoryThreshold only extends memory.high with swap when GetUserCgroupConstraintValue succeeds. If that lookup fails while swap files remain readable, used includes swap while the kill threshold does not.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit fc20c12. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants