Closed
108 commits
9a3c927
Fix nvtx_decorator to check _nvtx_enabled at call time (#4184)
minitu Apr 22, 2026
60f71e1
fix merges_file typo in megatron_hf_tokenizer (#4392)
chelseajohn Apr 22, 2026
c9dfe34
Enable NullTokenizer for pretraining to reduce I/O access (#4057)
asolergi-nv Apr 22, 2026
7073492
docs: Add SECURITY.md (#4431)
chtruong814 Apr 22, 2026
40627d0
Mamba inference opt (#4414)
wdykas Apr 22, 2026
55b8111
DDP refactoring: Extract parameter layout computation into optimizer …
deepakn94 Apr 22, 2026
90e09b6
Update PR template with explicit request for issue (#4409)
Phlip79 Apr 22, 2026
ab2b33d
Misc inference fixes (#4397)
sidsingh-nvidia Apr 23, 2026
60408d5
Rename Mamba to Hybrid outside megatron/core (#4159)
Phlip79 Apr 23, 2026
a52014c
Include mtp layers in token per expert logging (#4412)
Mellonta Apr 23, 2026
32275b2
fix: NVRx async compatibility and defer resiliency import (#4420)
sbak5 Apr 23, 2026
9bb35a8
ci: add base_sha to codecov/codecov-action upload step (#4445)
ko3n1g Apr 23, 2026
3034d86
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Apr 24, 2026
f78ed05
fix(checkpoint_inspector): allow empty --param-to-param-group-map-jso…
DAISY-gh Apr 24, 2026
4d6cdd5
Add the YARN support for hybrid_model (#4244)
guihong-nv Apr 24, 2026
41ffa83
[training migration] Add container class for config dataclasses (#4227)
maanug-nv Apr 24, 2026
a1165fa
Inference: Fix broken functional tests on gitlab (#4454)
sidsingh-nvidia Apr 24, 2026
d4cacef
SafeUnpickler class for safe pickle usage (#4319)
dimapihtar Apr 24, 2026
109feda
get rid of weights_only=False (#4434)
dimapihtar Apr 24, 2026
64870c1
Inference | Per-block MoE routing storage for prefix caching (#4301)
lmcafee-nvidia Apr 24, 2026
017e684
Add troubleshooting tip for 'access forbidden' (#4449)
balasaajay Apr 24, 2026
3d7bcd3
Fix checkpoint loading with rerun state machine (#4448)
YangFei1990 Apr 24, 2026
9b02206
Add misc CUDA graph sugar to CudaGraphManager (#4425)
tdene Apr 24, 2026
35f76df
Inference: Add the embedding and output layer in the full_iteration_i…
sidsingh-nvidia Apr 24, 2026
481efd0
Important bugfixes in local CG implementation that were leading to lo…
jiemingz Apr 24, 2026
e9abb6c
fix: Replace polynomial rolling hash with SHA-256 for prefix caching …
lmcafee-nvidia Apr 24, 2026
377af02
feat(ckpt): expose validate_access_integrity knob on dist-ckpt load (…
asolergi-nv Apr 24, 2026
241a5ca
Fix multivalidation (#3388)
RPrenger Apr 25, 2026
f2dcd42
Add missing knob for reduce_scatter_with_fp32_accumulation (#4410)
WanZzzzzz Apr 25, 2026
03f4111
Enable CUDA graphs for MTP inference (#4260)
santhnm2 Apr 26, 2026
1879dc2
chore(beep boop 🤖): Bump (main) (2026-04-27)
github-actions[bot] Apr 27, 2026
970c254
checkpoint integrity verification (#4305)
dimapihtar Apr 27, 2026
ebd70d3
Fix cache gating (#4455)
wdykas Apr 27, 2026
0447347
[Main] Fix FusedAdam.use_decoupled_grad mis-set for Megatron-FSDP. (#…
cspades Apr 27, 2026
8c5cf05
add permute fusion into hybrid ep (#4089)
Autumn1998 Apr 28, 2026
42e396e
Add ColocatedBridgeCommunicator for heterogeneous TP/DP MIMO training…
yashaswikarnati Apr 28, 2026
6fd6652
Fix incorrect bias display in extra_repr of Column/RowParallelLinear …
HelloWorldBeginner Apr 28, 2026
c8a4bfd
Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelin…
joapolarbear Apr 28, 2026
374fa85
ci: Fix event name reference in CI workflow condition for merge group…
balasaajay Apr 28, 2026
9c15290
Add manual sync workflow from main to dev (#4165)
Phlip79 Apr 28, 2026
9816140
fix: handle list-format quant_cfg from ModelOpt PR #1094 (#4187)
ChenhanYu Apr 28, 2026
9e98259
ci: also add Run MBridge tests label in nightly sync workflow (#4499)
Phlip79 Apr 28, 2026
533dc75
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Apr 29, 2026
1c4e537
[training migration] Add serialization features to config container (…
maanug-nv Apr 29, 2026
f4a49cf
Fix conflict with inference graphs (#4504)
tdene Apr 29, 2026
251c6e9
chore: rotate oncall schedule
github-actions[bot] Apr 29, 2026
c5201a0
Add tools/prepare_cache.py for offline GPT dataset cache preparation …
asolergi-nv Apr 29, 2026
cb3d5d9
[build] fix: move mamba-ssm and causal-conv1d to optional [ssm] extra…
ko3n1g Apr 29, 2026
4e208a8
mamba: avoid redundant HBM reloads in causal_conv1d_update shift loop…
wdykas Apr 29, 2026
3f59bbb
Standardize misc graph interface (#4485)
tdene Apr 29, 2026
29864b2
Fix inference graph override in RL flow (#4323)
tdene Apr 29, 2026
b23aa3f
Unify and refactor Megatron-FSDP documentation. (#4418)
cspades Apr 29, 2026
51ea07e
Revert "ci: add base_sha to codecov/codecov-action upload step (#4445…
chtruong814 Apr 29, 2026
cfee04e
Skills for running unit tests and working with slurm (#4502)
yashaswikarnati Apr 29, 2026
0d98cb8
Reorganize order of operations in inference context and text generati…
tdene Apr 29, 2026
0c52c39
ci: Update CI workflow conditions to include merge group handling (#4…
balasaajay Apr 30, 2026
6ba794b
ci: add base_sha to codecov/codecov-action upload step (#4540)
chtruong814 Apr 30, 2026
580d53a
Fix release tests: remove --global-batch-size conflicting with --step…
deepakn94 Apr 30, 2026
77afc60
docs: use @file-path notation for file references in skills (#4542)
ko3n1g Apr 30, 2026
1a83320
Support YAML quant recipe in PTQ and remove first/last layer modifier…
jenchen13 Apr 30, 2026
12f18da
Avoid nsys profile crash with CUDA graphs (#4541)
tdene Apr 30, 2026
dcb2bd2
fix(ci): add retry with backoff to approve-test-queue bot (#4559)
ko3n1g Apr 30, 2026
bfd4574
New allgathervdispatcher for inference and simplify old dispatcher. …
sidsingh-nvidia Apr 30, 2026
83e7466
Fixes for modelopt examples and SFTTokenizer for transformers v5 (#4450)
jenchen13 Apr 30, 2026
3460bba
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] May 1, 2026
2d862fe
Adding code for Flextron (#4429)
sheliang-nv May 1, 2026
3b1521e
Fix partial cudagraphs + HybridEP not properly triggering DDP hook (#…
jiemingz May 1, 2026
9776b58
Ignore pytorch link anchors (#4582)
maanug-nv May 1, 2026
4e0f636
MoE dispatcher fixes: size NVLS dispatcher buffers from actual tensor…
mathemakitten May 1, 2026
74c857b
Finalize all builders in preprocess_data, not just the last key (#4573)
sayalinvidia May 2, 2026
a6cf566
refactor(skills): add when_to_use frontmatter, split ci-test-system, …
ko3n1g May 2, 2026
396bee1
Make last_token_logits graphable (#4552)
tdene May 2, 2026
0031752
fix(ci): correct off-by-one in total_steps_evaluated formula (#4591)
ko3n1g May 2, 2026
0afcfbf
Add fault injection support via nvidia_resiliency_ext. (#4370)
hexinw-nvidia May 2, 2026
cf736dc
Guard vocab reduce_scatter on TP > 1 (#4565)
mathemakitten May 2, 2026
342dd59
Move inference context bookkeeping to CPU with ContextGPUView (#4306)
lmcafee-nvidia May 3, 2026
fc43cb8
Enable InJob restart on failures. (#4594)
hexinw-nvidia May 3, 2026
442a936
Enable shared expert overlap with allgatherv in inference (#4570)
sidsingh-nvidia May 4, 2026
bb979dd
Add vLLM grouped gemm backend for MoE inference (#4566)
santhnm2 May 4, 2026
99abdc8
Move KD teacher loading to after Float16Module (#4394)
AAnoosheh May 4, 2026
0efa47a
ci: update gpt3_7b_tp4_pp1_memory_speed gb200 golden values (#4601)
ko3n1g May 4, 2026
0c479ee
Fix inference unit test (#4589)
maanug-nv May 4, 2026
cf21d70
Checkpoint conversion between GPT_model and Hybrid_model (#4482)
guihong-nv May 4, 2026
c8fde51
ci: add cadence input for test filtering in CI workflows (#4561)
balasaajay May 4, 2026
fa9c714
Handle SSM sharded tensor merge OOM with CPU fallback (#4442)
returnL May 4, 2026
2194f51
Fix `mtp_use_repeated_layer` behavior for GPT models (#3965)
rkarimimahab May 4, 2026
878228f
FlashInfer sampling (#2456)
tdene May 5, 2026
7924242
Fix main2dev workflow (#4610)
Phlip79 May 5, 2026
f4a0710
Add logic to enable chunked MLP during training (#3656)
pengdurice May 5, 2026
c817dad
Inference bug-fixes: Re-enable EP syncs for the legacy A2A dispatcher…
sidsingh-nvidia May 5, 2026
0b2b572
Remove invalid `timeout` argument for dist.barrier (#4512)
zhaoyinglia May 5, 2026
4397e07
Fix buffers in refit (#4580)
wdykas May 5, 2026
4858caf
Named validation sets (#4578)
RPrenger May 5, 2026
ae65776
Fix Hang in tests (#4575)
wdykas May 5, 2026
b819ac7
Single commit for main2dev nightly (#4614)
Phlip79 May 5, 2026
40d024b
convert tokenizer args to config (#4406)
dimapihtar May 5, 2026
190c833
Siddharth/fix ep sync (#4607)
wdykas May 5, 2026
6e5fb47
mmiranda working on another set of broken links (#4534)
megnvidia May 5, 2026
b25a76e
Fix gradient corruption with layerwise param all-gather overlap (#4609)
deepakn94 May 6, 2026
c325855
test: mark TestFusedApplyMLARope::test_forward_backward_for_q flaky_i…
ko3n1g May 6, 2026
fd443f2
chore: rotate oncall schedule
github-actions[bot] May 6, 2026
39ec5eb
remove legacy GPT code (#4322)
dimapihtar May 6, 2026
431ac5d
chore: nightly sync main into dev (06_05_2026)
May 6, 2026
46ee761
fix: post-CI corrections
May 6, 2026
676f3fa
Merge remote-tracking branch 'origin/dev' into main2dev/06_05_2026
Phlip79 May 8, 2026
d019432
restore some missing changes post merge due to PR not merged to main …
FDecaYed May 8, 2026
0cb4ec3
Merge branch 'dev' into main2dev/06_05_2026
FDecaYed May 8, 2026
2207908
fix: correct misplaced colon in moe_layer.py inference guard
Phlip79 May 8, 2026
14 changes: 14 additions & 0 deletions .claude/settings.json
@@ -0,0 +1,14 @@
+{
+  "hooks": {
+    "UserPromptSubmit": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "printf '{\"hookSpecificOutput\":{\"hookEventName\":\"UserPromptSubmit\",\"additionalContext\":\"MANDATORY WORKFLOW — never skip or reorder: (1) Read the artifact first (commit, file, error, PR). (2) Identify and invoke the relevant skill via the Skill tool BEFORE forming any answer or plan — even when the answer seems obvious. (3) Only then answer using the skill context. Skipping step 2 is not allowed.\"}}'"
+          }
+        ]
+      }
+    ]
+  }
+}
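The hook command above hand-escapes a JSON object inside a `printf` string. As a rough illustration of the shape of that object, the sketch below builds the same structure with `json.dumps`, which takes care of quoting. The key names come from the diff; the function name `hook_payload` and the sample context string are hypothetical and not part of the PR.

```python
import json


def hook_payload(additional_context: str) -> str:
    """Serialize the hook output object; json.dumps handles quoting and escaping."""
    payload = {
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": additional_context,
        }
    }
    # Compact separators mimic the whitespace-free object in the printf string.
    return json.dumps(payload, separators=(",", ":"))


# Hypothetical context text containing quotes that would need manual escaping.
text = hook_payload('Read the artifact first, then invoke the relevant "skill".')
print(text)
```

Parsing the result back with `json.loads` is a cheap way to confirm that a hand-written payload of this kind is well-formed before committing it.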
7 changes: 7 additions & 0 deletions .github/actions/action.yml
@@ -61,6 +61,10 @@ inputs:
     description: "Platform to run tests on (e.g. dgx_h100, dgx_gb200)"
     required: false
     default: "dgx_h100"
+  cadence:
+    description: "Trigger cadence for cadence filter (pr|nightly|mergegroup). Empty disables filter."
+    required: false
+    default: ""
 runs:
   using: "composite"
   steps:
@@ -136,6 +140,9 @@ runs:
 if [ "${{ inputs.lightweight }}" == "true" ]; then
   ARGS+=(--enable-lightweight-mode)
 fi
+if [ -n "${{ inputs.cadence }}" ]; then
+  ARGS+=(--cadence ${{ inputs.cadence }})
+fi

 export PYTHONPATH=$(pwd)
 export NEMORUN_HOME=$(pwd)
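The shell hunk above appends `--cadence` only when the input is non-empty, so an empty default leaves the filter disabled. A minimal Python sketch of the same pattern follows. The three cadence values are taken from the input's description; the function name `build_args` and the validation step are assumptions added for illustration, since the diff does not show how the receiving script checks the value.

```python
from typing import List

# Values listed in the action input's description string.
VALID_CADENCES = ("pr", "nightly", "mergegroup")


def build_args(base_args: List[str], cadence: str = "") -> List[str]:
    """Return a copy of base_args, adding --cadence only for a non-empty value."""
    args = list(base_args)
    if cadence:
        # Hypothetical validation; not shown anywhere in the diff.
        if cadence not in VALID_CADENCES:
            raise ValueError(f"unknown cadence: {cadence!r}")
        args += ["--cadence", cadence]
    return args


with_filter = build_args(["--enable-lightweight-mode"], "nightly")
without_filter = build_args(["--enable-lightweight-mode"])
print(with_filter)
print(without_filter)
```

Passing the flag and its value as two separate list elements avoids the word-splitting concerns that come with an unquoted shell expansion.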
2 changes: 1 addition & 1 deletion .github/copy-pr-bot.yaml
@@ -1,4 +1,4 @@
 enabled: true
 auto_sync_draft: false
 auto_sync_ready: true
-trustees_override: ["AAnoosheh", "ArEsKay3", "Autumn1998", "BestJuly", "BoxiangW", "CarlosGomes98", "ChenhanYu", "Connor-XY", "FDecaYed", "HaochenYuan", "ISEEKYAN", "JRD971000", "Mellonta", "Phlip79", "QiZhangNV", "RPrenger", "ShriyaRishab", "Victarry", "WanZzzzzz", "Wohox", "YangFei1990", "ZhiyuLi-Nvidia", "ahmadki", "aklife97", "ananthsub", "aroshanghias-nvd", "asolergi-nv", "buptzyb", "chtruong814", "cjld", "cspades", "cuichenx", "deepakn94", "dimapihtar", "dingqingy-nv", "duncanriach", "erhoo82", "ericharper", "fanshiqing", "faradawn", "fitsumreda", "frsun-nvda", "gautham-kollu", "gdengk", "guihong-nv", "guyueh1", "hexinw-nvidia", "huvunvidia", "hxbai", "ilml", "jalbericiola", "janEbert", "jaredcasper", "jenchen13", "jiemingz", "jingqiny-99", "jkamalu", "jon-barker", "jstjohn", "kajalj22", "kanz-nv", "keshavb96", "kevalmorabia97", "ko3n1g", "ksivaman", "kunlunl", "kvareddy", "kwyss-nvidia", "layalir", "lhb8125", "lmcafee-nvidia", "maanug-nv", "mathemakitten", "matthieule", "mchrzanowski", "mehraakash", "minitu", "mkhona-nvidia", "nanz-nv", "parthmannan", "prajwal1210", "pthombre", "rhewett-nv", "rogerwaleffe", "sajadn", "sanandaraj5597", "sancha", "santhnm2", "sbak5", "shanmugamr1992", "sharathts", "sheliang-nv", "shengf-nv", "shifangx", "shjwudp", "sidsingh-nvidia", "skyw", "sraman-rgb", "sudhakarsingh27", "tdene", "theothermike", "thomasdhc", "tomlifu", "trintamaki", "tylerpoon", "wdykas", "wplf", "wujingyue", "xiaoyao0115", "xuwchen", "yanring", "yaox12", "yaoyu-33", "yashaswikarnati", "yeyu-nvidia", "yobibyte", "youngeunkwon0405", "yueshen2016", "yuzhongw-nvidia", "zhongbozhu"]
+trustees_override: ["AAnoosheh", "ArEsKay3", "Autumn1998", "BestJuly", "BoxiangW", "CarlosGomes98", "ChenhanYu", "Connor-XY", "FDecaYed", "HaochenYuan", "ISEEKYAN", "JRD971000", "Mellonta", "Phlip79", "QiZhangNV", "RPrenger", "ShriyaRishab", "Victarry", "WanZzzzzz", "Wohox", "YangFei1990", "ZhiyuLi-Nvidia", "ahmadki", "aklife97", "ananthsub", "aroshanghias-nvd", "asolergi-nv", "balasaajay", "buptzyb", "chtruong814", "cjld", "cspades", "cuichenx", "deepakn94", "dimapihtar", "dingqingy-nv", "duncanriach", "erhoo82", "ericharper", "fanshiqing", "faradawn", "fitsumreda", "frsun-nvda", "gautham-kollu", "gdengk", "guihong-nv", "guyueh1", "hexinw-nvidia", "huvunvidia", "hxbai", "ilml", "jalbericiola", "janEbert", "jaredcasper", "jenchen13", "jiemingz", "jingqiny-99", "jkamalu", "jon-barker", "jstjohn", "kajalj22", "kanz-nv", "kevalmorabia97", "ko3n1g", "ksivaman", "kunlunl", "kvareddy", "kwyss-nvidia", "layalir", "lhb8125", "lmcafee-nvidia", "maanug-nv", "mathemakitten", "matthieule", "mchrzanowski", "mehraakash", "minitu", "mkhona-nvidia", "nanz-nv", "ntajbakhsh", "parthmannan", "prajwal1210", "pthombre", "rhewett-nv", "rogerwaleffe", "sajadn", "sanandaraj5597", "sancha", "santhnm2", "sbak5", "shanmugamr1992", "sharathts", "sheliang-nv", "shengf-nv", "shifangx", "shjwudp", "sidsingh-nvidia", "skyw", "sraman-rgb", "sudhakarsingh27", "tdene", "theothermike", "thomasdhc", "tomlifu", "trintamaki", "tylerpoon", "wdykas", "wplf", "wujingyue", "xiaoyao0115", "xuantengh", "xuwchen", "yanring", "yaox12", "yaoyu-33", "yashaswikarnati", "yeyu-nvidia", "yobibyte", "youngeunkwon0405", "yueshen2016", "yuzhongw-nvidia", "zhongbozhu"]
16 changes: 8 additions & 8 deletions .github/oncall_schedule.json
@@ -1,12 +1,4 @@
 [
-  {
-    "user": "asolergi-nv",
-    "date": "2026-04-22"
-  },
-  {
-    "user": "maanug-nv",
-    "date": "2026-04-29"
-  },
   {
     "user": "dimapihtar",
     "date": "2026-05-06"
@@ -46,5 +38,13 @@
   {
     "user": "wujingyue",
     "date": "2026-07-08"
+  },
+  {
+    "user": "Connor-XY",
+    "date": "2026-07-15"
+  },
+  {
+    "user": "Phlip79",
+    "date": "2026-07-22"
   }
 ]
9 changes: 9 additions & 0 deletions .github/pull_request_template.md
@@ -3,6 +3,15 @@

 :warning: For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

+## Issue tracking
+
+For PRs from open-source community contributors:
+
+- **New features**: a linked issue is **required**. Please open a [feature request](https://github.com/NVIDIA/Megatron-LM/issues/new?template=feature_request.md) and reference it here before submitting the PR.
+- **Small updates (bug fixes, minor improvements)**: a linked issue is **recommended** and will accelerate the PR review process.
+
+Linked issue: <!-- e.g. Fixes #1234 / Related to #1234 -->
+
 ## Contribution process

 ### Pre-checks
70 changes: 52 additions & 18 deletions .github/workflows/cicd-approve-test-queue.yml
@@ -65,6 +65,7 @@ jobs:
 import json
 import requests
 import re
+import time

 # GitHub API configuration
 GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
@@ -88,21 +89,38 @@ jobs:
     "X-GitHub-Api-Version": "2022-11-28",
 }

-def make_request(endpoint, method="GET", data=None):
-    """Make a request to the GitHub API with error handling."""
+def make_request(endpoint, method="GET", data=None, max_retries=5):
+    """Make a request to the GitHub API with retry on transient errors."""
     url = f"{API_BASE}/{endpoint}"
-    try:
-        if method == "GET":
-            response = requests.get(url, headers=headers)
-        else:
-            response = requests.post(url, headers=headers, json=data)
-        response.raise_for_status()
-        return response.json()
-    except requests.exceptions.RequestException as e:
-        print(f"Error making request to {endpoint}: {str(e)}")
-        if hasattr(e.response, 'text'):
-            print(f"Response: {e.response.text}")
-        return None
+    for attempt in range(max_retries):
+        try:
+            if method == "GET":
+                response = requests.get(url, headers=headers, timeout=30)
+            else:
+                response = requests.post(url, headers=headers, json=data, timeout=30)
+            if response.status_code == 429:
+                retry_after = int(response.headers.get("Retry-After", 2 ** attempt))
+                print(f"Rate limited on {endpoint}, retrying in {retry_after}s (attempt {attempt + 1}/{max_retries})")
+                time.sleep(retry_after)
+                continue
+            if response.status_code >= 500:
+                delay = 2 ** attempt
+                print(f"Server error {response.status_code} on {endpoint}, retrying in {delay}s (attempt {attempt + 1}/{max_retries})")
+                time.sleep(delay)
+                continue
+            response.raise_for_status()
+            return response.json()
+        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
+            delay = 2 ** attempt
+            print(f"Transient error on {endpoint}: {e}, retrying in {delay}s (attempt {attempt + 1}/{max_retries})")
+            time.sleep(delay)
+        except requests.exceptions.RequestException as e:
+            print(f"Error making request to {endpoint}: {str(e)}")
+            if hasattr(e, 'response') and e.response is not None:
+                print(f"Response: {e.response.text}")
+            return None
+    print(f"Max retries ({max_retries}) exceeded for {endpoint}")
+    return None

 def is_internal_contributor(pr_info):
     """Return True if the PR author is a member of NVIDIA or NVIDIA-NeMo org (is_org_member)."""
@@ -166,8 +184,16 @@ jobs:

 # Get current running and queued workflows
 print("Fetching workflow runs...")
-queued_workflow_runs = make_request("actions/runs?status=queued").get("workflow_runs", [])
-in_progress_workflow_runs = make_request("actions/runs?status=in_progress").get("workflow_runs", [])
+queued_resp = make_request("actions/runs?status=queued")
+if queued_resp is None:
+    print("Failed to fetch queued workflow runs after retries, exiting")
+    exit(1)
+queued_workflow_runs = queued_resp.get("workflow_runs", [])
+in_progress_resp = make_request("actions/runs?status=in_progress")
+if in_progress_resp is None:
+    print("Failed to fetch in-progress workflow runs after retries, exiting")
+    exit(1)
+in_progress_workflow_runs = in_progress_resp.get("workflow_runs", [])

 # For external contributors, enforce a single global concurrency limit across ALL branches.
 # For internal contributors, enforce per-branch limits as before.
@@ -199,7 +225,11 @@ jobs:

 # Get waiting CI workflows for test environment
 print("Fetching deployments...")
-pending_workflows = make_request("actions/runs?status=waiting").get("workflow_runs", [])
+waiting_resp = make_request("actions/runs?status=waiting")
+if waiting_resp is None:
+    print("Failed to fetch waiting workflow runs after retries, exiting")
+    exit(1)
+pending_workflows = waiting_resp.get("workflow_runs", [])
 print("Pending workflows:", len(pending_workflows))
 pending_workflows = [run for run in pending_workflows
                      if run["name"] == "CICD Megatron-LM" and matches_queue(run, "${{ matrix.branch }}", CONTRIBUTOR_TYPE)]
@@ -220,7 +250,11 @@ jobs:
 print(f"Approving workflow {workflow_name} with Run Id: {workflow_id}")

 deployment_url = f"actions/runs/{workflow_id}/pending_deployments"
-deployment = make_request(deployment_url)[0]
+deployments = make_request(deployment_url)
+if not deployments:
+    print(f"Failed to fetch pending deployments for run {workflow_id}")
+    exit(1)
+deployment = deployments[0]
 environment_id = deployment["environment"]["id"]

 # Approve the deployment
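The retry policy that the `make_request` change introduces can be exercised without network access. The sketch below is a simplified stand-in, not the workflow's code: it replaces `requests` with a caller-supplied `send` callable returning a `(status, headers, body)` tuple, and it injects the `sleep` function so the delays can be recorded. The names `request_with_retry`, `send`, and `Response` are hypothetical. The retry rules themselves (honor `Retry-After` on 429, back off `2 ** attempt` seconds on 5xx and transport errors, give up on other errors) mirror the diff.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical stand-in for an HTTP response: (status code, headers, parsed body).
Response = Tuple[int, Dict[str, str], Any]


def request_with_retry(
    send: Callable[[], Response],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Call send() until success, a non-retryable error, or retries run out."""
    for attempt in range(max_retries):
        try:
            status, headers, body = send()
        except (ConnectionError, TimeoutError):
            # Transient transport failure: exponential backoff (1s, 2s, 4s, ...).
            sleep(2 ** attempt)
            continue
        if status == 429:
            # Rate limited: prefer the server's Retry-After hint when present.
            sleep(int(headers.get("Retry-After", 2 ** attempt)))
            continue
        if status >= 500:
            sleep(2 ** attempt)
            continue
        if status >= 400:
            return None  # Other client errors are not retried.
        return body
    return None  # Retries exhausted.


# Simulated endpoint: one 503, one 429 with Retry-After, then success.
responses = iter([
    (503, {}, None),
    (429, {"Retry-After": "7"}, None),
    (200, {}, {"workflow_runs": []}),
])
delays = []
result = request_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Because the function returns `None` both for non-retryable errors and for exhausted retries, callers need an explicit `None` check before indexing or calling `.get(...)` on the result, which is what the later hunks in this file add. One caveat that applies to the sketch and to the diff alike: `Retry-After` may also carry an HTTP date rather than a number of seconds, in which case `int(...)` raises `ValueError`.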