Skip to content

Bypass safetensors mmap when --disable-mmap is set#13609

Closed
johnnynunez wants to merge 1 commit into
Comfy-Org:masterfrom
johnnynunez:fix/safetensors-no-mmap-loader
Closed

Bypass safetensors mmap when --disable-mmap is set#13609
johnnynunez wants to merge 1 commit into
Comfy-Org:masterfrom
johnnynunez:fix/safetensors-no-mmap-loader

Conversation

@johnnynunez
Copy link
Copy Markdown

@johnnynunez johnnynunez commented Apr 29, 2026

Summary

  • Adds a temporary ComfyUI-side safetensors loader that does not call safetensors.safe_open() when --disable-mmap is set.
  • Parses the safetensors header directly, reads each tensor from disk with readinto, and wraps the bytes with torch.frombuffer before moving to the requested device.
  • Removes the previous tensor.to(..., copy=True) workaround, which still retained the mmap-backed storage and could increase peak memory instead of reducing it.

Motivation

safetensors.safe_open() currently always mmaps the underlying file. On unified CPU/GPU memory systems such as NVIDIA Grace Blackwell / DGX Spark, Apple Silicon, AMD APUs, Jetson/IGX, and integrated GPUs, mmap-backed file pages and device/framework tensors can live in the same physical memory pool. Loading large models can therefore require roughly two resident copies of the checkpoint and OOM even when the system has enough memory for one model copy.

This PR makes ComfyUI's existing --disable-mmap flag actually bypass safetensors mmap for .safetensors / .sft files. Peak memory becomes one model copy plus one tensor in flight while the upstream safetensors fix is being reviewed.

Related:

Test plan

  • python3 -c "import ast; ast.parse(open('comfy/utils.py').read()); print('syntax OK')"
  • Cursor linter diagnostics for comfy/utils.py: no errors
  • git diff --check -- comfy/utils.py

Load safetensors through a direct read path under --disable-mmap so unified-memory systems avoid retaining mmap-backed file pages alongside framework tensors.

Made-with: Cursor
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

The change adds a new safetensors loader function that manually reads file headers and streams tensor payloads into memory, bypassing Rust mmap. This loader is integrated into load_torch_file to activate when DISABLE_MMAP is enabled for .safetensors or .sft files, allowing tensors to be constructed from buffers and optionally transferred to requested devices. The prior workaround that copied tensors inside the safe_open retrieval loop is removed. The modification adds approximately 75 lines while removing 4.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding logic to bypass safetensors mmap when --disable-mmap is set.
Linked Issues check ✅ Passed The PR implements the required non-mmap safetensors loader and removes the ineffective copy=True workaround, directly addressing the memory duplication issue in #10896 on unified-memory systems.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing the non-mmap safetensors loader and removing the previous workaround; no unrelated modifications detected.
Description check ✅ Passed The pull request description clearly and specifically explains the changes, motivation, and test plan for implementing a non-mmap safetensors loader.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/utils.py`:
- Around line 149-156: The code reads header_size from the file and trusts it;
to prevent OOM attacks you must cap header_size to the same hard limit used by
safetensors_header() before calling f.read(header_size): validate that
header_size is an int within (0, MAX_HEADER_SIZE] (or the existing constant/name
used in safetensors_header), raise a ValueError if out of range, then proceed to
read and decode header_data and json.loads as before; update the logic around
header_size/header_data in the function containing this diff to reuse the
existing safetensors_header size limit and checks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fc606afb-141f-472f-82af-a32d086fc2af

📥 Commits

Reviewing files that changed from the base of the PR and between fce0398 and 1ff3648.

📒 Files selected for processing (1)
  • comfy/utils.py

Comment thread comfy/utils.py
@johnnynunez
Copy link
Copy Markdown
Author

Additional local validation for the temporary no-mmap path:

uv run --with torch --with safetensors --with packaging --with numpy --with pillow --with tqdm --with einops python - <<'PY'
# Creates a temporary safetensors file, forces comfy.utils.DISABLE_MMAP = True,
# monkeypatches comfy.utils.safetensors.safe_open to raise if called, then
# verifies load_torch_file(..., return_metadata=True/False) preserves tensors
# and metadata for f32/f16/bf16/i64/u8/empty tensors.
PY

Result:

no-mmap load_torch_file functional test passed

This specifically verifies that --disable-mmap takes the manual parser path and does not reach safetensors.safe_open().

@johnnynunez
Copy link
Copy Markdown
Author

Additional comparison with comfy.utils.DISABLE_MMAP = False on a local synthetic 256 MiB .safetensors tensor:

=== mmap ===
baseline                        194.4 MiB peak
after load_torch_file           195.6 MiB peak
after touching tensor           452.5 MiB peak

=== mmap_clone ===
baseline                        193.9 MiB peak
after load_torch_file           195.1 MiB peak
after touching tensor           452.0 MiB peak
after clone/copy                708.2 MiB peak

=== nommap ===
baseline                        193.5 MiB peak
after load_torch_file           450.0 MiB peak
after touching tensor           451.3 MiB peak

=== nommap_clone ===
baseline                        193.2 MiB peak
after load_torch_file           449.7 MiB peak
after touching tensor           451.0 MiB peak
after clone/copy                707.2 MiB peak

Interpretation: with DISABLE_MMAP=False, load_torch_file() goes through safetensors.safe_open(). The mmap-backed tensor is lazy at first, then becomes resident once touched. Holding that mmap-backed storage plus a second materialized tensor copy produces the expected ~2x tensor-size peak. The no-mmap path materializes the tensor immediately without retaining a file mapping; on CPU, explicitly cloning still creates a second copy as expected, while device loads can release the per-tensor staging buffer after .to(device).

@johnnynunez
Copy link
Copy Markdown
Author

Clear mmap on/off comparison from local reproduction with a synthetic 256 MiB .safetensors tensor.

In ComfyUI terms:

  • mmap=True means comfy.utils.DISABLE_MMAP = False, so load_torch_file() uses safetensors.safe_open().
  • mmap=False means comfy.utils.DISABLE_MMAP = True, so this PR's manual no-mmap parser is used.
=== mmap=True / DISABLE_MMAP=False ===
baseline                        194.4 MiB peak
after load_torch_file           195.6 MiB peak
after touching tensor           452.5 MiB peak

=== mmap=True + second tensor copy ===
baseline                        193.9 MiB peak
after load_torch_file           195.1 MiB peak
after touching tensor           452.0 MiB peak
after clone/copy                708.2 MiB peak

=== mmap=False / DISABLE_MMAP=True ===
baseline                        193.5 MiB peak
after load_torch_file           450.0 MiB peak
after touching tensor           451.3 MiB peak

=== mmap=False + second tensor copy ===
baseline                        193.2 MiB peak
after load_torch_file           449.7 MiB peak
after touching tensor           451.0 MiB peak
after clone/copy                707.2 MiB peak

Interpretation:

  • With mmap=True, safe_open() is lazy: load_torch_file() barely changes RSS until the mmap-backed tensor is touched.
  • Once touched, the file-backed mapping becomes resident (~+256 MiB here).
  • If a second materialized tensor copy exists at the same time, peak becomes roughly baseline + 2x tensor size (~708 MiB here).
  • With mmap=False, this PR materializes one tensor copy directly and does not retain a file mapping. On CPU, explicitly cloning still creates a second copy as expected; for device loads, the per-tensor staging buffer can be released after .to(device).

@rattus128
Copy link
Copy Markdown
Contributor

We just cut comfy-aimdo 0.3.0 with aarch64 support. does it help? This should bring the dynamic-vram feature into play which very deliberately avoids this copy. The thinking in aimdo is the sole copy of the model is the mmap copy.

If you have a look at load_safetensors there is the notion of the _comfy_tensor_file_slice, which is a little piece of metadata, which allows the model loader to bypass the mmap completely without needing the upfront model deep-copy either. Currently its only used for pins, however on DGX spark it may actually make sense to load the entire model with this path, that is, in some new mode, try using the _comfy_tensor_file_slice to read to temp buffer then to GPU. That would take you RAM footprint for the model to 0 and leave you only with the VRAM copy (which is how it should be on DGX spark)

@rattus128 rattus128 self-assigned this Apr 29, 2026
@johnnynunez-nv
Copy link
Copy Markdown

We just cut comfy-aimdo 0.3.0 with aarch64 support. does it help? This should bring the dynamic-vram feature into play which very deliberately avoids this copy. The thinking in aimdo is the sole copy of the model is the mmap copy.

If you have a look at load_safetensors there is the notion of the _comfy_tensor_file_slice, which is a little piece of metadata, which allows the model loader to bypass the mmap completely without needing the upfront model deep-copy either. Currently its only used for pins, however on DGX spark it may actually make sense to load the entire model with this path, that is, in some new mode, try using the _comfy_tensor_file_slice to read to temp buffer then to GPU. That would take you RAM footprint for the model to 0 and leave you only with the VRAM copy (which is how it should be on DGX spark)

that is good, i didn't know that. Let's community to try it

@Balaxxe
Copy link
Copy Markdown

Balaxxe commented Apr 29, 2026

Clear mmap on/off comparison from local reproduction with a synthetic 256 MiB .safetensors tensor.

In ComfyUI terms:

  • mmap=True means comfy.utils.DISABLE_MMAP = False, so load_torch_file() uses safetensors.safe_open().
  • mmap=False means comfy.utils.DISABLE_MMAP = True, so this PR's manual no-mmap parser is used.
=== mmap=True / DISABLE_MMAP=False ===
baseline                        194.4 MiB peak
after load_torch_file           195.6 MiB peak
after touching tensor           452.5 MiB peak

=== mmap=True + second tensor copy ===
baseline                        193.9 MiB peak
after load_torch_file           195.1 MiB peak
after touching tensor           452.0 MiB peak
after clone/copy                708.2 MiB peak

=== mmap=False / DISABLE_MMAP=True ===
baseline                        193.5 MiB peak
after load_torch_file           450.0 MiB peak
after touching tensor           451.3 MiB peak

=== mmap=False + second tensor copy ===
baseline                        193.2 MiB peak
after load_torch_file           449.7 MiB peak
after touching tensor           451.0 MiB peak
after clone/copy                707.2 MiB peak

Interpretation:

  • With mmap=True, safe_open() is lazy: load_torch_file() barely changes RSS until the mmap-backed tensor is touched.
  • Once touched, the file-backed mapping becomes resident (~+256 MiB here).
  • If a second materialized tensor copy exists at the same time, peak becomes roughly baseline + 2x tensor size (~708 MiB here).
  • With mmap=False, this PR materializes one tensor copy directly and does not retain a file mapping. On CPU, explicitly cloning still creates a second copy as expected; for device loads, the per-tensor staging buffer can be released after .to(device).

Cold notwithstanding, finally got time to test PR #13609. Quick report.

Setup: same Spark / GB10 / Comfy 0.20.1 / PyTorch 2.9.1+cu130 box from my earlier comment. Chroma1-HD bf16 + 6 LoRAs + T5xxl_fp16 + ae VAE. Ran with --disable-dynamic-vram --disable-mmap, which exercises the new load_safetensors_no_mmap path.

Loader works as advertised. Process settles after load:

RssAnon: 20.15 GiB   (single resident copy of all weights)
VmRSS:   21.00 GiB

Logs confirm load_safetensors_no_mmap was hit for every safetensors file in the workflow. So the file-backed mmap + materialized tensor coexistence from #10896 is gone on this path.

Two things came out of the testing that may be worth folding back:

  1. Peak during load is still ~2x model size. VmHWM hit 34.75 GiB even though steady state is 21 GiB. Source isn't the file mmap any more, it's the tensor.to('cuda') step inside the model patcher creating a second resident copy on UMA before the CPU-side reference releases. Different doubling than the one this PR addresses, but it shows up in the same metric and confuses observation. This is what @rattus128 was pointing at with the _comfy_tensor_file_slice zero-RAM-footprint suggestion. PR Bypass safetensors mmap when --disable-mmap is set #13609 fixes the file-vs-tensor doubling cleanly; the CPU-vs-GPU doubling still needs a separate path.

  2. Page cache hangs around after load. readinto populates the kernel page cache, and on Spark UMA those pages stay charged against the unified pool until something reclaims them. After loading Chroma + T5 + LoRAs I saw Cached: 32.5 GiB worth of safetensors files just sitting there. A one-line os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED) after each file's read drops it back to ~5 GiB. Patch I'm running locally:

    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    except (AttributeError, OSError):
        pass

    Belongs upstream-ish (probably in safetensors PR Grids & combinatorial outputs #759 too) but easy to add here while the upstream review is pending.

@Balaxxe
Copy link
Copy Markdown

Balaxxe commented Apr 29, 2026

We just cut comfy-aimdo 0.3.0 with aarch64 support. does it help? This should bring the dynamic-vram feature into play which very deliberately avoids this copy. The thinking in aimdo is the sole copy of the model is the mmap copy.

If you have a look at load_safetensors there is the notion of the _comfy_tensor_file_slice, which is a little piece of metadata, which allows the model loader to bypass the mmap completely without needing the upfront model deep-copy either. Currently its only used for pins, however on DGX spark it may actually make sense to load the entire model with this path, that is, in some new mode, try using the _comfy_tensor_file_slice to read to temp buffer then to GPU. That would take you RAM footprint for the model to 0 and leave you only with the VRAM copy (which is how it should be on DGX spark)

This has fully addressed the issue. Thanks!

@rattus128
Copy link
Copy Markdown
Contributor

@Balaxxe @johnnynunez-nv @johnnynunez , please take a look at #13802 . I am getting a big improvement in DGX spark loads as that PR takes the RAM footprint even lower again and seems to solve slowness issues. Let me know either way if you try it as if its good we can push this to master. Note the custom aimdo wheel for DGX listed in the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Loading .safetensors files requires double memory on DGX Spark

4 participants