
chore: nightly sync main into dev (06_05_2026) #4659

Open

svcnvidia-nemo-ci wants to merge 108 commits into dev from main2dev/06_05_2026

Conversation

@svcnvidia-nemo-ci

Summary

Nightly sync of main into dev.

  • 102 commits merged from main
  • Python lines: +33575 / -8262 across 253 files
  • Merge strategy: git merge origin/main -X theirs --no-edit, with manual reconciliation of the known conflicts listed below.
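The merge strategy can be reproduced end to end in a throwaway repository. The sketch below is a minimal illustration of how `-X theirs` resolves a conflicting hunk in main's favor, not the actual sync script; the repo layout and file names are invented:

```python
import pathlib
import subprocess
import tempfile

def run(args, cwd):
    """Run a git command, raising on failure."""
    subprocess.run(args, cwd=cwd, check=True, capture_output=True)

# Build a toy repo with a conflicting edit on main and dev.
repo = tempfile.mkdtemp()
run(["git", "init", "-q", "-b", "main"], repo)
run(["git", "config", "user.email", "ci@example.com"], repo)
run(["git", "config", "user.name", "ci"], repo)
f = pathlib.Path(repo, "module.py")
f.write_text("VERSION = 'base'\n")
run(["git", "add", "module.py"], repo)
run(["git", "commit", "-qm", "base"], repo)
run(["git", "checkout", "-qb", "dev"], repo)
f.write_text("VERSION = 'dev'\n")
run(["git", "commit", "-qam", "dev edit"], repo)
run(["git", "checkout", "-q", "main"], repo)
f.write_text("VERSION = 'main'\n")
run(["git", "commit", "-qam", "main edit"], repo)

# The sync strategy: merge main into dev, favoring main's side of conflicts.
run(["git", "checkout", "-q", "dev"], repo)
run(["git", "merge", "main", "-X", "theirs", "--no-edit"], repo)
print(f.read_text())
```

Note that `-X theirs` only biases conflicting hunks; non-conflicting changes from both sides still land normally, which is why the manual reconciliation below is limited to the known semantic conflicts.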

Files taken from main

  • megatron/core/optimizer/layer_wise_optimizer.py (no-op; identical between main and dev currently)

Files kept on dev (overriding the skill's default of taking main's version)

The skill recommends taking main's version of these files for known semantic conflicts, but in this sync the situation is reversed: dev's versions are the more current ones. Main's versions reference args.hybrid_context_parallel, while dev renamed that flag to args.dynamic_context_parallel (commit cde56a4, "Fix for rope when enabling THD + Dynamic-CP; use the naming Dynamic-CP"). Taking main's versions would therefore raise AttributeError at runtime.

  • megatron/training/training.py
  • megatron/training/utils.py
  • megatron/training/initialize.py
  • megatron/training/datasets/data_samplers.py
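The failure mode behind keeping dev's versions can be shown with a minimal sketch. An argparse namespace stands in for Megatron's real argument parsing here; only the two flag names come from the rename described above:

```python
import argparse

# Dev renamed the flag, so its namespace only carries the new attribute.
parser = argparse.ArgumentParser()
parser.add_argument("--dynamic-context-parallel", action="store_true")
args = parser.parse_args([])

# Code taken from main would still read the old name and fail at runtime.
try:
    _ = args.hybrid_context_parallel
    raised = False
except AttributeError:
    raised = True
print("AttributeError raised:", raised)  # → AttributeError raised: True
```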

Files deleted in main, accepted as deletion

These were legacy GPT loaders removed in main #4322 ("remove legacy GPT code"). Nothing in the merged tree references them.

  • tools/checkpoint/loader_legacy.py
  • tools/checkpoint/loader_llama_mistral.py

Files deleted in dev, NOT restored

megatron/core/pipeline_parallel/hybrid_cp_schedule.py was intentionally removed in dev (commit cde56a4) as part of the dynamic-CP refactor. Not restored, since the merged tree uses dev's wrap_data_iterator mechanism — no caller imports BalancedCPScheduler or HybridCPDataLoaderWrapper.
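The "no caller imports" claim can be checked mechanically. A hedged sketch of such an audit follows; the helper name is invented and the scan root is illustrative:

```python
import pathlib
import re

def find_stale_references(root, symbols):
    """Return (file, line number, line) for every reference to a removed symbol."""
    pattern = re.compile("|".join(re.escape(s) for s in symbols))
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), 1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits

# An empty result means nothing in the merged tree still references the
# deleted classes, e.g.:
# find_stale_references("megatron", ["BalancedCPScheduler", "HybridCPDataLoaderWrapper"])
```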

Dependency triple kept on dev

Per the skill's hard rule: pyproject.toml, uv.lock, docker/Dockerfile.ci.dev were restored from origin/dev. Dev's nvidia-resiliency-ext pinned revision (15a8515) was verified to contain all APIs the merged tree imports (get_write_results_queue, CheckpointMetadataCache, CachedMetadataFileSystemReader, etc.). No git-source reconciliation required.
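API-presence verification of this kind can be scripted. Below is a minimal generic sketch, demonstrated against a stdlib module since nvidia-resiliency-ext is not assumed installed here; against the real dependency you would pass its module path and the names the merged tree imports:

```python
import importlib

def missing_apis(module_name, required_names):
    """Return the subset of required_names absent from the module (empty == OK)."""
    module = importlib.import_module(module_name)
    return [name for name in required_names if not hasattr(module, name)]

# Stdlib demonstration: both names exist, so nothing is reported missing.
print(missing_apis("json", ["loads", "dumps"]))  # → []
```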

API mismatch detection

After initially taking main's versions of several files (later reverted), the following call sites were audited:

  • multi_latent_attention.py calls off_interface.group_offload() and off_interface.group_commit() — both exist on dev's FineGrainedActivationOffloadingInterface
  • gpt_model.py and hybrid_model.py call init_chunk_handler with 6 keyword arguments — matches dev's signature
  • _resolve_cu_seqlens exists on dev's GatedDeltaNet
  • _is_distopt_quantized_param exists on dev's DistributedOptimizer
  • CudaGraphScope exists in dev's enums.py

No active mismatches remain.
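Signature checks like the init_chunk_handler one above can be automated with inspect rather than eyeballed. A small sketch, where the six-keyword sample function is a hypothetical stand-in and not dev's actual signature:

```python
import inspect

def binds(fn, **kwargs):
    """True if fn's signature accepts exactly these keyword arguments."""
    try:
        inspect.signature(fn).bind(**kwargs)
        return True
    except TypeError:
        return False

# Hypothetical stand-in with six keyword parameters, mirroring the audit above.
def init_chunk_handler(a=None, b=None, c=None, d=None, e=None, f=None):
    pass

print(binds(init_chunk_handler, a=1, b=2, c=3, d=4, e=5, f=6))  # → True
print(binds(init_chunk_handler, a=1, unknown=2))                # → False
```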

Linting

  • black --config pyproject.toml (24.10.0): no diff
  • isort (5.13.2): no diff
  • pylint on changed megatron/core/ files (84 files): 10.00/10

Remerge diff

Remerge diff stat (file-level summary)
Date:   Wed May 6 21:32:39 2026 +0000

    chore: nightly sync main into dev (06_05_2026)

 .github/workflows/cicd-main.yml                    |    5 -
 docker/Dockerfile.ci.dev                           |    4 -
 docs/conf.py                                       |   18 +-
 .../detxoify_lm/generate_samples_gpt.py            |   76 +-
 .../gpt/gpt_dynamic_inference_with_coordinator.py  |    6 +-
 examples/mimo/train.py                             |    6 +-
 examples/multimodal/layer_specs.py                 |    2 +-
 examples/multimodal/model.py                       |   85 +-
 examples/post_training/modelopt/convert_model.py   |   19 +-
 examples/post_training/modelopt/export.py          |    5 +-
 examples/post_training/modelopt/finetune.py        |   67 +-
 examples/post_training/modelopt/generate.py        |   27 +-
 examples/post_training/modelopt/mmlu.py            |   45 +-
 .../modelopt/offline_feature_extract.py            |   56 +-
 examples/post_training/modelopt/prune.py           |   13 +-
 examples/post_training/modelopt/quantize.py        |   55 +-
 examples/post_training/modelopt/validate.py        |   32 +-
 gpt_builders.py                                    |   77 +-
 hybrid_builders.py                                 |    4 +-
 megatron/core/datasets/readme.md                   |   64 --
 megatron/core/transformer/mlp.py                   |    4 -
 megatron/core/transformer/moe/fused_a2a.py         |   13 -
 megatron/core/transformer/moe/moe_layer.py         |    8 -
 megatron/core/transformer/moe/token_dispatcher.py  |    4 -
 megatron/core/transformer/transformer_config.py    |   27 -
 megatron/core/transformer/transformer_layer.py     |   13 -
 megatron/elastification/arguments.py               |    6 +-
 megatron/elastification/flextron_utils.py          |   11 +-
 megatron/elastification/pretrain_hybrid_flex.py    |  136 ++-
 .../elastification/router/hybrid_flex_router.py    |    7 +-
 megatron/legacy/model/__init__.py                  |    5 -
 megatron/post_training/arguments.py                |    7 +-
 megatron/post_training/model_builder.py            |   55 +-
 megatron/training/activation_logging.py            |   37 +-
 megatron/training/argument_utils.py                |   90 +-
 megatron/training/arguments.py                     |  589 +----------
 megatron/training/async_utils.py                   |    4 +-
 megatron/training/checkpointing.py                 |   33 +-
 megatron/training/config/__init__.py               |   27 +-
 megatron/training/config/container.py              |   40 +-
 megatron/training/config/instantiate_utils.py      |   46 +-
 megatron/training/config/training_config.py        |   24 +-
 megatron/training/config/utils.py                  |   13 +-
 megatron/training/config/yaml_utils.py             |   10 +-
 megatron/training/datasets/data_samplers.py        |   51 +-
 megatron/training/training.py                      |  261 +----
 megatron/training/utils.py                         |    9 -
 model_provider.py                                  |   12 +-
 pretrain_bert.py                                   |   32 +-
 pretrain_gpt.py                                    |   42 +-
 pretrain_hybrid.py                                 |   65 +-
 pretrain_mamba.py                                  |  363 -------
 pretrain_t5.py                                     |    2 +-
 pretrain_vlm.py                                    |   10 +-
 pyproject.toml                                     |   19 +-
 .../unit_tests/fusions/test_mla_yarn_rope_apply.py |   10 -
 tests/unit_tests/models/test_hybrid_moe_model.py   |   16 -
 tools/checkpoint/checkpoint_inspector.py           |    9 +-
 tools/checkpoint/convert.py                        |   62 +-
 tools/checkpoint/dist_checkpoint_io.py             |   45 +-
 tools/checkpoint/gpt_hybrid_conversion.py          |  171 +--
 tools/checkpoint/loader_legacy.py                  |  416 --------
 tools/checkpoint/loader_llama_mistral.py           |  751 -------------
 tools/checkpoint/loader_mixtral_hf.py              |   12 +-
 tools/checkpoint/remap_gpt_dsa_to_mamba.py         |    5 -
 tools/prepare_cache.py                             |    9 +-
 tools/preprocess_data.py                           |  217 ++--
 tools/preprocess_mmdata.py                         |  160 ++-
 train_rl.py                                        |   20 +-
 uv.lock                                            | 1114 +++-----------------
 70 files changed, 1258 insertions(+), 4500 deletions(-)

Full diff omitted to keep the PR body compact (~10k lines). Reviewers can run git show --remerge-diff 431ac5df05104bc1d5015f5ac1842285d1c5e6ee locally or browse the merge commit on GitHub.

minitu and others added 30 commits April 22, 2026 18:02
@svcnvidia-nemo-ci (Author)

/ok to test 3f10d85

@svcnvidia-nemo-ci (Author)

/ok to test b83a102

@svcnvidia-nemo-ci (Author)

/ok to test 46ee761

@Phlip79 Phlip79 marked this pull request as ready for review May 7, 2026 07:02
@Phlip79 Phlip79 requested review from a team as code owners May 7, 2026 07:02
@svcnvidia-nemo-ci (Author)

Superseded by today's nightly sync.

@Phlip79 Phlip79 reopened this May 8, 2026
# Conflicts:
#	megatron/core/distributed/param_and_grad_buffer.py
@Phlip79 (Member) commented May 8, 2026

/ok to test 676f3fa

@FDecaYed (Contributor) commented May 8, 2026

/ok to test 0cb4ec3

@Phlip79 (Member) commented May 8, 2026

/ok to test 2207908

@svcnvidia-nemo-ci (Author)

Superseded by today's nightly sync.

@Phlip79 Phlip79 reopened this May 8, 2026

Labels

complexity: high · Run functional tests · Run MBridge tests (attach this for testing this PR against MBridge main)
