[pull] main from pytorch:main#979
Merged
pull[bot] merged 32 commits into MPACT-ORG:main on Mar 18, 2026
Conversation
The main idea is to decouple runtime-estimation gathering into a separate function that can be used without running the OverlapScheduler; its results can then be fed into the scheduler to avoid measuring twice. The use case is simulations of scheduling. Runtime estimations depend on fusion_regions, and we do not want to recalculate them a second time in OverlapScheduler. For now they are added as a second output. Pull Request resolved: #175174 Approved by: https://github.com/eellison
Fixes intel/torch-xpu-ops#2269 Fixes intel/torch-xpu-ops#2356 Fixes intel/torch-xpu-ops#2687

The root cause of all the failed test cases from the mentioned issues is that the float64 type is not properly handled by the `addmm_out` and `baddbmm_out` operators. Both implementations use a `onednn::matmul` call with added post-operations (`attr.append_post_eltwise` and `attr.append_post_sum`). The problem is that, to the best of my knowledge, those post-operations do not support float64; only the basic matrix multiplication of `onednn::matmul` supports float64. So adding any post-operation without float64 support results in numerical precision issues.

This PR changes the approach to split the single `onednn::matmul` call into separate add, multiply-by-scalar, and matrix-multiplication operations. All of them support float64, so the results are correct.

This change required slightly different handling of the in-place case, where the `self` (input) and `result` tensors are the same object. In the original approach, `onednn::matmul` performed all the operations, so passing `self` and `result` to `matmul` produced the correct result without any problems. When `onednn::matmul` is used only for the matrix multiplication and the scalar multiplication and add are performed separately, `self` can no longer be read after the `matmul`, because it aliases `result` and `result` now holds the matrix-multiplication result. That is why an additional copy is needed for the in-place version. Pull Request resolved: #174590 Approved by: https://github.com/astachowiczhabana, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui
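The decomposition can be sketched in plain Python (lists of lists stand in for tensors; `addmm_decomposed` is an illustrative name, not the oneDNN API), including the defensive copy that keeps the in-place case correct:

```python
# Sketch of addmm as separate add / scale / matmul steps.
# All names and the list-of-lists representation are illustrative only.
def addmm_decomposed(self_, mat1, mat2, beta=1.0, alpha=1.0):
    # Copy `self_` first: in the in-place case `result` aliases `self_`,
    # and after the matmul `self_` would already hold the matmul result.
    saved = [row[:] for row in self_]
    n, k, m = len(mat1), len(mat2), len(mat2[0])
    # plain matrix multiplication (the only step onednn::matmul performs now)
    mm = [[sum(mat1[i][t] * mat2[t][j] for t in range(k)) for j in range(m)]
          for i in range(n)]
    # separate scale and add steps replace the matmul post-operations
    return [[beta * saved[i][j] + alpha * mm[i][j] for j in range(m)]
            for i in range(n)]
```

With `beta=2, alpha=1`, the result is `beta * self + alpha * (mat1 @ mat2)` computed without any fused post-ops.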
…oups (#176225) Fix _validate_no_duplicate_params false positive from id() reuse _validate_no_duplicate_params detects tied weights placed in separate FSDP groups by comparing id(param) values captured at FSDPParam creation time. However, Python reuses id() values for garbage-collected objects. In layer-by-layer init patterns (materialize block, fully_shard, GC, next block), freed parameters' id() values get reused by new parameters in later blocks, triggering a spurious "already managed by another FSDP group" error. The fix replaces id(param) with a monotonically increasing UID stamped on the parameter object itself. For tied weights, both modules share the same nn.Parameter object, so they see the same UID and the duplicate check still fires correctly. For different parameters that happen to recycle the same id() after GC, each gets its own fresh UID. Add a test that materializes and shards blocks one at a time with gc.collect() between each, verifying that forward and backward complete without a false-positive duplicate parameter error. Fixes #176076 Authored with Claude. Pull Request resolved: #176225 Approved by: https://github.com/weifengpy
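The UID-stamping idea can be sketched in plain Python (attribute and helper names are hypothetical; the real fix lives in FSDP's internals):

```python
import itertools

# Sketch of the fix: stamp a monotonically increasing UID on the parameter
# object itself instead of comparing id(), which Python may reuse once an
# object has been garbage-collected.
_uid_counter = itertools.count()

def param_uid(param):
    if not hasattr(param, "_fsdp_uid"):          # hypothetical attribute name
        param._fsdp_uid = next(_uid_counter)
    return param._fsdp_uid

class FakeParam:  # stand-in for nn.Parameter
    pass

tied = FakeParam()
# Tied weights: two modules share the same object, so they see the same UID
# and the duplicate check still fires.
same = param_uid(tied) == param_uid(tied)
# Distinct parameters always get fresh UIDs, even if id() were recycled.
distinct = param_uid(FakeParam()) != param_uid(FakeParam())
```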
Summary: When `batched_grad_copy=True` is passed to DDP (or Reducer directly), individual per-parameter gradient-to-bucket copy and div operations are deferred and flushed as a single `_foreach_copy_` + flat `div_` when a bucket becomes ready (pending == 0). This reduces N per-parameter kernel launches (mul_out/copy_/div_) down to 2 kernels per bucket. The optimization is behind a flag and disabled by default. It handles both the non-alias path (grads copied into bucket views, e.g. when `set_to_none=True` destroys the bucket view alias every iteration) and the alias path (grads already alias bucket views via `gradient_as_bucket_view=True`), where per-variable `div_` is deferred to a single flat `div_` on the bucket tensor.

Read the changes in this order:
1. reducer.hpp — new flag, member variable, Bucket::deferred_copy_indices
2. reducer.cpp — constructor, mark_variable_ready_dense (defer copies), mark_variable_ready (flush when bucket ready), reset_bucket_counting
3. init.cpp — pybind binding
4. distributed.py — Python DDP parameter plumbing
5. test_c10d_gloo.py — unit tests

Authored with Claude.

Test Plan:
```
buck run fbcode//mode/opt fbcode//caffe2/test/distributed:c10d -- test_c10d_gloo.ReducerTest
```
All 10 ReducerTest tests pass (6 existing + 4 new):
```
test_batched_grad_copy_basic ... ok
test_batched_grad_copy_matches_default ... ok
test_batched_grad_copy_with_bucket_view ... ok
test_batched_grad_copy_with_set_to_none ... ok
test_forward_backward ... ok
test_forward_backward_optimizer ... ok
test_forward_backward_unused_parameters ... ok
test_multi_dtype_multi_bucket ... ok
test_multi_dtype_single_bucket ... ok
test_single_dtype_single_bucket ... ok

Ran 10 tests in 0.257s
```
Differential Revision: D94854325 Pull Request resolved: #176638 Approved by: https://github.com/wconstab
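The defer-and-flush idea can be sketched in plain Python (the real code lives in the C++ Reducer; the class and method names below are illustrative only):

```python
# Sketch of batched_grad_copy: instead of one copy + div kernel per
# parameter, record deferred copies and flush them in two batched steps
# when the bucket becomes ready.
class Bucket:
    def __init__(self, offsets, numel, world_size):
        self.flat = [0.0] * numel       # stand-in for the flat bucket tensor
        self.offsets = offsets          # per-parameter offsets into `flat`
        self.deferred = []              # like Bucket::deferred_copy_indices
        self.world_size = world_size

    def mark_ready(self, idx, grad):
        # defer the copy instead of launching one kernel per parameter
        self.deferred.append((idx, grad))

    def flush(self):
        # one batched copy (stand-in for _foreach_copy_) ...
        for idx, grad in self.deferred:
            off = self.offsets[idx]
            self.flat[off:off + len(grad)] = grad
        # ... plus one flat div_ over the whole bucket tensor
        self.flat = [v / self.world_size for v in self.flat]
        self.deferred.clear()
```

After `flush()`, the bucket holds the already-averaged gradients, matching what N per-parameter copy/div launches would have produced.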
…/inductor/aoti_torch/c/shim.h` (#177594) Authored with Claude Code; supersedes #177106. According to Claude, it was passing on x86 inadvertently because "On x86, the generated kernel code pulls in ATen vectorized headers (AVX2/AVX512) that transitively include c10/util/Exception.h, accidentally defining TORCH_CHECK_MSG before the macro is expanded. So the bug is masked." This change uses the `#include <torch/headeronly/util/Exception.h>` include and `STD_TORCH_CHECK_MSG` instead. Pull Request resolved: #177594 Approved by: https://github.com/desertfire, https://github.com/Skylion007
… mode + replication padding (#177166) Fixes #170079

## Context
`torch.compile(ReplicationPad1d(...), fullgraph=True)` crashes when `torch.use_deterministic_algorithms(True)` is set on CUDA. The error: Dynamo can't trace through `importlib.import_module`. The deterministic code path exists because the native `replication_pad1d_backward` CUDA kernel uses `atomicAdd` (non-deterministic). `functional.py` calls `_replication_pad` — a Python decomposition using `_unsafe_index`, whose backward uses `index_put` (deterministic).

## Dynamo limitations encountered
Three separate Dynamo tracing barriers prevented calling `_replication_pad` directly:

### 1. `importlib.import_module` is marked as skipped
```python
@torch.compile(fullgraph=True)
def fn(x):
    import importlib
    return importlib.import_module("torch").sin(x)

fn(torch.randn(3))  # Unsupported: function marked as skipped
```

### 2. `elementwise_dtypes` returns non-Tensor (from `@pw_cast_for_opmath`)
```python
@torch.compile(fullgraph=True)
def fn(x):
    from torch._prims_common import elementwise_dtypes, ELEMENTWISE_TYPE_PROMOTION_KIND
    dt, _ = elementwise_dtypes(x, type_promotion_kind=ELEMENTWISE_TYPE_PROMOTION_KIND.DEFAULT)
    return x.to(dt)

fn(torch.randn(3))  # Unsupported: torch.* op returned non-Tensor
```

### 3. `torch._check` with closure lambda
```python
@torch.compile(fullgraph=True)
def fn(x):
    dim = x.dim()
    torch._check(dim in (2, 3), lambda: f"expected 2D or 3D, got {dim}D")
    return x + 1

fn(torch.randn(3, 3))  # Unsupported: Can't extract message from torch._check()
```

## Iteration log
| # | Approach | Who | Tests | Reviewer pushback | Why it failed |
|---|----------|-----|-------|-------------------|---------------|
| 1 | Replace `importlib` with `from...import` | Claude | bilinear/trilinear pass, replicate fails | "why do we need bilinear/trilinear tests?" — scoped fix to reported bug only | Hit limitation #2: `@pw_cast_for_opmath` |
| 2 | Skip decomposition under compile via `is_compiling()`, rely on AOTAutograd's `@register_decomposition` | Claude | forward-only `backend="eager"` passes | "can you verify at inductor level this is actually deterministic?" — inspect AOT graph | No backward decomposition registered; backward still uses native `replication_pad1d_backward` (non-deterministic) |
| 3 | Unwrap `@pw_cast_for_opmath` via `__wrapped__` | Claude | N/A — fails immediately | N/A | Hit limitation #3: `torch._check()` closure |
| 4 | `@nonstrict_trace` — Dynamo skips body, AOTAutograd traces through | Reviewer suggestion | `backend="aot_eager"`, forward + backward under `DeterministicGuard(True)` | N/A — fix is correct | N/A |

## Key insight
The fix isn't about making Dynamo trace the decomposition or skipping it entirely — it's about putting the boundary in the right place. Dynamo doesn't need to see inside; AOTAutograd does. `@nonstrict_trace` is exactly this boundary.

Each "obvious" fix had passing tests that weren't testing the right thing. Only when the reviewer pushed for backward determinism verification and AOT graph inspection did the weaknesses surface. The backward completing without error under `DeterministicGuard(True)` proves determinism — PyTorch explicitly raises `RuntimeError` if any non-deterministic CUDA kernel executes under this mode.

Authored with Claude. Pull Request resolved: #177166 Approved by: https://github.com/mlazos, https://github.com/williamwen42
#177720) Add a `python_value_for_identity()` hook to `VariableTracker` that returns the underlying Python object for identity (`is`) comparison. The base implementation delegates to `as_python_constant()`; VTs that wrap a real Python object but don't support `as_python_constant()` (e.g. `UserDefinedObjectVariable`, `NNModuleVariable`, `ObjectVariable`, `StreamVariable`, `EventVariable`) override it to return `self.value` directly. This lets `handle_is` compare any two VTs generically — including cross-type comparisons that previously fell through to a graph break — and removes 5 redundant type-specific handler entries that each reimplemented the same pattern with slightly different attribute access (`.value`, `.fn`, `get_submodule()`). Also consolidates the duplicate `NO_SUCH_SUBOBJ` sentinel definitions from `misc.py` and `user_defined.py` into `base.py`. Authored with Claude. Pull Request resolved: #177720 Approved by: https://github.com/guilhermeleobas
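The hook pattern can be sketched as follows (class names mirror the PR, but the bodies are heavily simplified stand-ins, not Dynamo's actual implementation):

```python
# Minimal sketch of the python_value_for_identity() hook.
class VariableTracker:
    def as_python_constant(self):
        raise NotImplementedError

    def python_value_for_identity(self):
        # base implementation: delegate to as_python_constant()
        return self.as_python_constant()

class UserDefinedObjectVariable(VariableTracker):
    # wraps a real Python object but doesn't support as_python_constant()
    def __init__(self, value):
        self.value = value

    def python_value_for_identity(self):
        # override: return the wrapped object directly
        return self.value

def handle_is(left, right):
    # one generic `is` comparison replaces the type-specific handler entries
    return left.python_value_for_identity() is right.python_value_for_identity()
```

Because both VTs expose the underlying object through the same hook, cross-type comparisons no longer need per-type special cases.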
Enables tracing requires_grad_() on intermediate tensors when the autograd effects (backward/grad) are consumed within the compiled region. Previously, requires_grad_() always graph-broke.
The key pattern this unlocks:
```python
def fn(x, targets):
    # Forward computation before detach (e.g. transformer layers)
    h = x * 2 + 1
    x_detached = h.detach().requires_grad_()
    chunksz = x_detached.shape[0] // 2
    total_loss = torch.tensor(0.0)
    for start in range(0, x_detached.shape[0], chunksz):
        chunk = x_detached[start : start + chunksz]
        chunk_targets = targets[start : start + chunksz]
        logits = chunk @ torch.eye(chunk.shape[-1])
        loss = torch.nn.functional.cross_entropy(logits, chunk_targets)
        loss.backward()
        total_loss = total_loss + loss.detach()
    # Propagate chunked grad back through the forward computation
    h.backward(x_detached.grad)
    return x.grad, total_loss
```
Pull Request resolved: #176984
Approved by: https://github.com/zou3519
* `x[0:s1]` where `x.size(0)` = `s0-1` should produce `Min(s1, s0-1)`
* Before this PR, it would produce `u0`.

Pull Request resolved: #175819 Approved by: https://github.com/laithsakka
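The intended rule can be illustrated with a plain-Python helper (hypothetical name) that mirrors Python's slice semantics for a non-negative start, which is exactly the `Min(stop, size)` shape the PR wants instead of an unbacked symbol:

```python
# Output length of x[start:stop] along a dimension of length `size`,
# matching Python/PyTorch basic-slicing semantics (step == 1).
def slice_output_len(start, stop, size):
    start = max(start, 0)
    return max(0, min(stop, size) - start)
```

For `start == 0` this reduces to `min(stop, size)`, i.e. `Min(s1, s0-1)` when `size == s0 - 1`.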
…n available (#174594)" (#177703) (#177703) Summary: X-link: pytorch/benchmark#2674 Original commit changeset: dd82c6ba3b37910568b2a25674149c1653768ec2 This reverts D94545016. Test Plan: IN CI Reviewed By: huydhn Differential Revision: D96365273 Pull Request resolved: #177703 Approved by: https://github.com/huydhn
…e TestNN with instantiate_device_type_tests (#166396) For #114850, we will port ATen unit tests to Intel GPU. This PR works on the test/test_nn.py TestNN class, for single-GPU tests only. We enable Intel GPU with the following methods while trying our best to keep the original code style:
1. Use torch.accelerator to extend CUDA-specific tests to XPU.
2. Add the skipIfXPU decorator for cases with known issues on Intel GPU.
3. Enable 'xpu' for some test paths.
Pull Request resolved: #166396 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/mikaylagawarecki
Similar to #176881 and #151441, this PR adds an SVE fast exponential implementation, intended for cases where outputs will be downcast to FP16/BF16 (e.g. attention softmax). The implementation is similar to exp_u20, but:
- approximates exp(r) - 1 as r instead of r + 0.5 r^2
- does not split the natural log (ln) into high/low parts
- avoids special-case code by clamping exp(x) to 0 for x < -87.346 and to inf for x > 88.717

## Accuracy
Tested in a similar fashion to #17688 by iterating over all possible FP32 bit patterns and calculating the ULP difference between:
- `fexp_u20` with inputs in FP32, outputs converted to BF16/FP16
- `std::exp` with inputs in FP32, outputs converted to BF16/FP16

From the accuracy study above, this exp is:
- accurate within a maximum of 1 ULP for FP16
- accurate within a maximum of 1 ULP for BF16 for inputs in [-87.346, max_float], and clamps inputs < -87.346 to zero

## Performance
Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V1 cores (with SVE256):

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +7.20% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | +0.38% (noise) |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +4.32% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +3.38% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.35% |

Pull Request resolved: #177645 Approved by: https://github.com/Skylion007
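A rough plain-Python sketch of the clamping plus `exp(r) - 1 ≈ r` idea (this is an illustration, not the SVE kernel; a 1/64 range reduction is assumed here so that `|r| <= ln2/128`, which keeps the linear approximation accurate enough for FP16/BF16 outputs):

```python
import math

def fast_exp_sketch(x):
    # Clamp range as described in the PR: exp(x) underflows below -87.346
    # and overflows above 88.717 in FP32, avoiding special-case code.
    if x < -87.346:
        return 0.0
    if x > 88.717:
        return math.inf
    # Range reduction: x = n * (ln2 / 64) + r, so
    # exp(x) = 2**(n/64) * exp(r), and exp(r) - 1 is approximated by r.
    INV_L = 64.0 / math.log(2.0)
    L = math.log(2.0) / 64.0
    n = round(x * INV_L)          # |r| <= ln2/128, error of 1+r is ~r^2/2
    r = x - n * L
    return math.pow(2.0, n / 64.0) * (1.0 + r)
```

With `|r| <= ln2/128`, the relative error of `1 + r` versus `exp(r)` is about `1.5e-5`, well under the FP16 and BF16 rounding granularity.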
Fix #176970 Pull Request resolved: #177674 Approved by: https://github.com/eellison
…tity hook (#177720)" This reverts commit e10d2d3. Reverted #177720 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#177720 (comment)))
Resolving the device via _get_pg_default_device() returns cpu for a multi-device PG, while the NCCL estimator can always choose cuda as the device. The bug was seen in rigi, which creates a multi-device PG; because of it, the NCCL estimator was not used for multi-device PGs. Testing: added new unit tests covering the fake, multi-device, and single-device cases. Pull Request resolved: #175896 Approved by: https://github.com/IvanKobzarev
Fix #145598

## Summary
1. What is the root-cause problem?
The `torch._refs.add` and `torch._refs.sub` alpha-validation path assumes at least one promoted operand is tensor-like and reads `.dtype` from it. Under `torch.compile`, scalar-only `add/sub(..., alpha=..., out=...)` calls for the `aten.add.out` / `aten.sub.out` path reach the ref decomposition with two promoted Python scalars and an `out` tensor, so fake/meta execution raises when it tries to read `b.dtype` from a scalar.
2. What is the proposed fix?
Teach the shared alpha-validation path to derive its dtype from a tensor operand when one exists, and otherwise fall back to the promoted scalar type when both operands are Python scalars. The PR also extends the existing Dynamo `add/sub alpha out` regression coverage to include scalar-only inputs.
3. Why is the proposed fix the right long-term fix?
It keeps dtype validation aligned with the post-promotion operands that the ref decomposition already operates on, without special-casing Dynamo or the out variant. It fixes the reported compile failure at the decomposition boundary and covers the same shared add/sub logic with a targeted regression test.

Drafted via @codex, published after manual review by @bobrenjc93 Pull Request resolved: #177677 Approved by: https://github.com/Lucaskabela
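The dtype-derivation rule can be sketched in plain Python (the helper name and the `FakeTensor` stand-in are hypothetical; the real logic lives in `torch._refs`):

```python
# Sketch of the fix: prefer a tensor operand's dtype; with two Python
# scalars, fall back to the promoted scalar type instead of reading
# `.dtype` off a scalar (which is what raised before).
def alpha_check_dtype(a, b):
    for operand in (a, b):
        if hasattr(operand, "dtype"):   # tensor-like operand wins
            return operand.dtype
    # both operands are Python scalars: promote int < float < complex
    if any(isinstance(x, complex) for x in (a, b)):
        return complex
    if any(isinstance(x, float) for x in (a, b)):
        return float
    return int

class FakeTensor:      # minimal tensor-like stand-in for illustration
    dtype = "float32"
```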
…177546) pad_mm pads mm operands to alignment boundaries, then slices the output back to the original size. This slice is a view that inherits the padded base strides (e.g., stride (4, 1) instead of (2, 1) for a 3x2 output). Two bugs allowed these padded strides to leak through to user-visible outputs:
1. `record_original_output_strides` ran after `joint_graph_passes` (which includes pad_mm), so it captured already-padded strides as the "original" target strides.
2. The stride enforcement in `GraphLowering.run_node` used `require_stride_order` for view outputs, which only checks stride ordering (both (4,1) and (2,1) are row-major). For user-visible outputs, we need `require_exact_strides` to enforce the actual stride values.

Fixes: pytorch/alerting-infra#3219 Authored with Claude. Pull Request resolved: #177546 Approved by: https://github.com/desertfire, https://github.com/v0i0, https://github.com/benjaminglass1
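The stride mismatch is easy to see with a small helper (plain Python; `contiguous_strides` is a hypothetical name, not an Inductor API):

```python
# Row-major strides a freshly allocated tensor of `shape` would have.
def contiguous_strides(shape):
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)

# A 3x2 slice out of a padded 3x4 base keeps the base's strides:
padded_view_strides = (4, 1)            # inherited from the 3x4 base
expected = contiguous_strides((3, 2))   # what the user-visible output needs

# An order-only check (like require_stride_order) sees no difference,
# because both stride tuples are row-major:
same_order = (sorted(range(2), key=lambda i: -padded_view_strides[i])
              == sorted(range(2), key=lambda i: -expected[i]))
```

An exact-value check (like `require_exact_strides`) does catch the difference, which is why it is needed for user-visible outputs.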
…77746) Document how to use torch.mps.compile_shader to JIT-compile and test Metal kernels in isolation, covering dispatch semantics, constant parameters, multi-kernel pipeline debugging strategy, and pitfalls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Pull Request resolved: #177746 Approved by: https://github.com/dcci, https://github.com/Skylion007
…174361) Fix #174049 Fix #151098 The scale in SDPA is passed as a scalar into the pattern matcher. Some SDPA pattern replacement graphs use the default value (sqrt(head_dim)), but the eager code may not use the default. In that case there will be an accuracy issue under torch.compile. I added a check for scalar values. We may add support for a non-default scale in the future if we meet such cases in a real workload (maybe following the approach used for dropout). With this PR in place, we may revert #172951, which tried to solve the same issue by disabling the pattern matcher. Pull Request resolved: #174361 Approved by: https://github.com/eellison
…uts for OSS by default (#173668) Addresses issue: #173313 Differential Revision: [D96456766](https://our.internmc.facebook.com/intern/diff/D96456766) Pull Request resolved: #173668 Approved by: https://github.com/bdhirsh, https://github.com/zou3519
…7395) Summary: x and y can be of different integer types (long and int) at the same time, which causes functools.reduce to fail. Test Plan: f1041623442 failed. f1045088335 with a custom ien.lower succeeded. Reviewed By: desertfire Differential Revision: D95253698 Pull Request resolved: #177395 Approved by: https://github.com/PaulZhang12, https://github.com/Skylion007, https://github.com/mlazos, https://github.com/desertfire
[ROCm] Enable test_embedding_backward_dynamic_shapes_large_grid on ROCm The test already contains a ROCm-specific assertion path that correctly validates the total-threads grid limit (num_blocks * num_warps * warp_size) instead of CUDA's block-count-only limit. The @skipIfRocm was added as a temporary workaround for a regression in ROCm 7.2, but the test logic was subsequently updated to handle ROCm's platform differences correctly. Remove the skip so the test can validate ROCm grid limit behavior. Pull Request resolved: #176567 Approved by: https://github.com/laithsakka, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
The test fails on rocm with: ``` ----------------------------------------- Captured stderr call ----------------------------------------- E0220 18:38:26.967000 6455 torch/_dynamo/utils.py:3413] Accuracy failed: allclose not within tol=0.5 ======================================= short test summary info ======================================== FAILED [2.4376s] test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_rms_norm_sharing_weights_split_reductions_False_bfloat16 - AssertionError: False is not true ``` Tested on AMD Radeon Pro V710. Pull Request resolved: #175427 Approved by: https://github.com/jeffdaily
Ubuntu is working on packaging the ROCm stack natively (i.e., in FHS directories). In the case of Ubuntu, HIP is found in `/usr/lib/x86_64-linux-gnu/cmake/hip`. When looking for HIP, check the `CMAKE_LIBRARY_ARCHITECTURE` path as well. Pull Request resolved: #175349 Approved by: https://github.com/jeffdaily
Fix #143702 ## Root cause The Meta/FakeTensor `aten::as_strided` path used an unchecked view construction helper that skipped eager argument validation. That let invalid negative strides survive tracing and reach generated code, where the compiled path could crash instead of raising the same runtime error as eager mode. ## Proposed fix - Factor the eager `as_strided` argument validation into a shared helper. - Reuse that validation in the unchecked Meta/FakeTensor path so it still avoids storage-bound guards, but rejects invalid negative strides and storage offsets immediately. - Add a FakeTensor regression test for the empty-tensor negative-stride case from the issue. ## Why this is the right long term fix This keeps eager and Meta/FakeTensor semantics aligned at the validation boundary without reintroducing the storage-size guards that the unchecked path intentionally avoids. Sharing the validation logic in one helper also reduces the chance that eager and tracing behavior drift apart again. Drafted via @codex, published after manual review by @bobrenjc93 Pull Request resolved: #177678 Approved by: https://github.com/aorenste
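A minimal sketch of the shared-validation idea, assuming the checks eager mode performs (the helper names are hypothetical; the real helper lives in the C++ `as_strided` path):

```python
# Sketch: one validation helper shared by the eager and Meta/FakeTensor
# paths, rejecting invalid inputs without any storage-bound guards.
def validate_as_strided_args(size, stride, storage_offset=0):
    if any(st < 0 for st in stride):
        raise RuntimeError("as_strided: Negative strides are not supported")
    if storage_offset < 0:
        raise RuntimeError(f"Tensor: invalid storage offset {storage_offset}")

def rejects(size, stride, storage_offset=0):
    # small probe used below: does validation raise for these arguments?
    try:
        validate_as_strided_args(size, stride, storage_offset)
        return False
    except RuntimeError:
        return True
```

The point of sharing one helper is that the tracing path cannot silently drift from eager behavior again: both paths reject the empty-tensor negative-stride case immediately.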
The CUDA caching allocator's TraceEntry records a mempool_ field, but this was never serialized into the snapshot output from torch.cuda.memory._snapshot(). This makes it impossible to associate trace events (alloc/free) with specific memory pools when analyzing snapshots. This PR adds `pool_id` as a `(int, int)` tuple to each trace entry in both serialization paths (pybind and pickle), matching the format already used by segment_pool_id on segments. Also documents the previously-undocumented segment_pool_id field on the Segment TypedDict. Pull Request resolved: #177717 Approved by: https://github.com/malfet
## Summary
Adds a new **"Allocated Memory (incl. Private Pools)"** tab to the CUDA Memory
Visualizer (`MemoryViz.js`). This tab visualizes memory from both the default
allocator pool and private `MemPool`s in a single stacked-area timeline.
This should be useful for viewing CUDA Graph memory, because that memory is allocated in private pools.
Perf: tried on an 80 MB pickle file with 241384 entries; it loads in ~3 sec on my MacBook Air.
### Problem
The existing "Active Memory Timeline" tab treats every allocation independently.
When using private `MemPool`s, individual blocks are freed and reallocated
within the pool, but the pool's reserved memory is never returned to the
system. The existing view doesn't show this — freed pool blocks simply
disappear, making it hard to understand actual memory pressure from private
pools.
### Solution
Each private pool is rendered as a single **envelope** — a gray band whose
height equals the pool's high-water mark. The envelope only grows (never
shrinks), reflecting that private pool memory is reserved until the pool's
segment is explicitly freed. Individual block allocations within the pool are
shown as **colored stripes** (at 50% opacity) inside the envelope, appearing
when active and disappearing when freed.
Default-pool allocations are rendered exactly as before (full-opacity colored
blocks in the stacked area chart).
### How to review
All changes are in `torch/utils/viz/MemoryViz.js`. The two commits can be
reviewed in order:
1. **`Add a tab to show private pool memory view`** — Adds the tab entry,
`include_private_inactive` parameter plumbing, `isPrivatePool` helper,
`find_pool_id` binary search for mapping block addresses to pool IDs,
`segment_pool_id` propagation through the data model, and pool_id display
in segment/block hover text.
2. **`add activities inside the pool`** — Adds the pool envelope and stripe
logic inside `process_alloc_data`: per-pool state tracking (`max`, `active`,
`block_stack`), envelope creation/growth with animated transitions,
stripe lifecycle (create on alloc, close on free, shift on repack),
and stripe-envelope synchronization when the global stack shifts.
### Detailed behavior
Given this sequence (from `test.py`):
```python
torch.cuda.memory._record_memory_history()
# 1. Default-pool allocation
x = torch.empty(1024, device="cuda", dtype=torch.uint8)
# 2. Private-pool allocation
m = torch.cuda.MemPool()
with torch.cuda.use_mem_pool(m):
    y = torch.empty(1024, device="cuda", dtype=torch.uint8)
# 3-4. Free tensors — y's memory becomes inactive in the private pool
del x
del y
# 5. Another default-pool allocation
z = torch.empty(1024, device="cuda", dtype=torch.uint8)
# 6-7. More private-pool allocations — w reuses y's space, w2 forces growth
with torch.cuda.use_mem_pool(m):
    w = torch.empty(1024, device="cuda", dtype=torch.uint8)
    w2 = torch.empty(1024, device="cuda", dtype=torch.uint8)
snapshot = torch.cuda.memory._snapshot()
```
The visualization shows:
1. `alloc x` (default pool) → x appears as a normal colored block
2. `alloc y` (pool (0,1)) → pool envelope appears, grows to 1024; y stripe inside
3. `free x` → x disappears, pool envelope shifts down
4. `free y` → y stripe disappears; envelope stays at 1024 (gray unused space)
5. `alloc z` (default pool) → z appears on top of the pool envelope
6. `alloc w` (pool (0,1)) → w stripe appears inside pool (reuses space, no growth)
7. `alloc w2` (pool (0,1)) → pool envelope grows from 1024 to 2048; z shifts up; w2 stripe appears
Key design decisions:
- Pool envelope height = high-water mark of active block sizes (monotonically non-decreasing)
- Pool stripes use the same color scheme as regular allocations but at 50% opacity
- Hovering the gray envelope shows `"Private Pool (x,y): capacity {size}"`
- Hovering a stripe shows the normal block context (address, size, stack trace, pool_id)
- Growth animations shift elements above the pool over 3 timesteps; stripes are created after the shift completes to avoid visual overlap
- A yellow banner warns that MemPools must not be deleted before `torch.cuda.memory._snapshot()` is called
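The envelope's only-grows behavior can be sketched as follows (a plain-Python illustration of the high-water-mark rule, not the actual MemoryViz.js code):

```python
# Envelope height after each pool event: the high-water mark of active
# bytes in the pool, which is monotonically non-decreasing.
def envelope_heights(events):
    # events: (kind, size) pairs with kind in {"alloc", "free"}
    active = hwm = 0
    heights = []
    for kind, size in events:
        active += size if kind == "alloc" else -size
        hwm = max(hwm, active)   # the envelope never shrinks
        heights.append(hwm)
    return heights
```

Replaying the test sequence above (alloc y, free y, alloc w, alloc w2) gives heights 1024, 1024, 1024, 2048: the envelope stays at 1024 through the free and the reuse, and grows only when w2 pushes active bytes past the previous peak.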
This PR was authored with Claude.
Example tab view:
gray bar starts when mem pool is allocated.
<img width="1467" height="612" alt="Screenshot 2026-03-12 at 3 46 51 PM" src="https://github.com/user-attachments/assets/929315a1-c1c3-42cb-b349-c52012a8591d" />
and from a real workload:
The gray bar at the bottom is the private pool that cuda graph use. The colored striped inside are the activities inside the private pool.
<img width="1367" height="472" alt="Screenshot 2026-03-12 at 2 37 46 PM" src="https://github.com/user-attachments/assets/177c85dd-0364-4013-a45b-56434d66bc07" />
## Test plan
- Manual testing with `test.py` which exercises both default and private pool
allocations, frees, reuse, and pool growth
- Verified in browser that:
- Pool envelope appears on first private pool allocation
- Envelope grows when active blocks exceed high-water mark
- Envelope never shrinks when blocks are freed
- Stripes appear/disappear correctly on alloc/free
- Default-pool blocks stack correctly above/below the pool envelope
- No visual overlap between pool stripes and default-pool blocks
- Hover context shows pool info for envelopes, block info for stripes
- Detail slider and minimap work correctly with pool data
- Other tabs ("Active Memory Timeline", "Allocator State History", etc.)
are unaffected
Pull Request resolved: #177289
Approved by: https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/divyanshk
…77596) Fixes #177318 inference_mode excludes autograd dispatch keys from TLS, causing functorch TensorWrappers to lack autograd metadata. The transforms already handle no_grad by saving prev_grad_mode and enabling grad; this extends the same treatment to inference_mode for grad, vjp, and jvp. Vmap and functionalize are not addressed in this PR. Add _disable_inference_mode() to surgically disable inference_mode without clobbering grad_mode/fw_grad_mode (which inference_mode(False) would do, breaking the prev_grad_mode invariant). Wrap grad_increment_nesting, jvp_increment_nesting, and vjp's backward closure. My hope is that this can unblock inference_mode workflows where one needs to compute gradients of frozen weights with respect to inputs (e.g., particle positions) rather than model parameters. This happens, for example, in molecular dynamics with a neural-net potential function: compute the forces for the particle update steps. Pull Request resolved: #177596 Approved by: https://github.com/aorenste
The Metal kernels for minimum/maximum were already introduced in #169407 but the dispatch entries still routed to the MPSGraph implementations. Switch the dispatch to use the shared stub path and remove the now-dead MPSGraph code, including the `MPSGraph (PyTorchFixups)` category that worked around an MPSGraph bug with NaN propagation on integral types. Also fix the Metal functors to use `c10::metal::min`/`c10::metal::max` which properly propagate NaN, instead of the bare `min`/`max` which resolve to `::metal::min`/`::metal::max` (no NaN propagation). Un-xfail `test_python_ref__refs_clamp_[max|min]` for MPS (that probably tested NaN propagation) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Pull Request resolved: #177747 Approved by: https://github.com/dcci, https://github.com/kurtamohler ghstack dependencies: #177746
…177701) Fixes #177771 Pull Request resolved: #177701 Approved by: https://github.com/ngimel
Test that the AOT autograd cache persists across separate processes by running two subprocesses with a shared cache directory and verifying the second one gets a cache hit. Pull Request resolved: #177397 Approved by: https://github.com/aorenste, https://github.com/zou3519
) This PR addresses and fixes #176910. I've arranged this PR into 4 commits to make it easier to reproduce the issues and verify that they have been resolved: 1. Adds a test that the loss is set to zero correctly, which fails. 2. Patches the CTC loss to fix the above test. 3. Adds a test that the gradients are set to zero correctly, which fails. 4. Patches the CTC loss backwards function to fix the above test. Both tests employ the private function `torch._use_cudnn_ctc_loss` to ensure the cuDNN backend is used. The second test additionally uses `torch._cudnn_ctc_loss` with `deterministic=False`. The reason for this is that I was unable to find examples that genuinely produced infinite gradients from cuDNN's deterministic implementation (which is the only one available through the public API), even when `zero_infinity=False` (likely due to this backend's unusual behavior, as described in #176910). Please let me know if this approach should be modified. Pull Request resolved: #176911 Approved by: https://github.com/eqy
Created by
pull[bot] (v2.0.0-alpha.4)