Add report generation strategy for the MegatronRun #787
juntaowww wants to merge 7 commits into NVIDIA:main
Conversation
📝 Walkthrough
Adds MegatronRunReportGenerationStrategy: parses Megatron-Run stdout to extract per-iteration times and per-GPU TFLOP/s, computes statistics over recent iterations, writes a CSV report, exports and registers the strategy, and adds unit tests.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks | ✅ Passed checks (2 passed)
Greptile Summary: Adds a report generation strategy for the MegatronRun workload.
Confidence Score: 5/5
Important Files Changed
amaslenn left a comment:
We are in the code freeze stage now, this PR will be merged later.
src/cloudai/workloads/megatron_run/report_generation_strategy.py
```python
# Keep only the last 10 iterations for statistics (to exclude warmup)
if len(iter_times_ms) > 10:
    iter_times_ms = iter_times_ms[-10:]
```
I'm wondering if taking the last 10 iterations is the most relevant. What if the training has some ups and downs (as I have already seen)? Maybe just skipping the warmup, say the first 20 iterations, is enough?
Yes, skipping the warmup stage makes more sense; I have updated the code to skip the first 20 iterations. Originally I was following the format of the Megatron-Bridge report. Maybe we will later need to unify the formats for computing statistics.
The last 10 iterations is what the GPU perf team uses. I have seen those runs on IB clusters, and towards the end they mostly remain stable.
Which one would be better? To make things consistent, should we keep the last 10 iterations?
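[Editor's note] For concreteness, the two windowing options under discussion differ only in how the parsed list is sliced; a minimal illustration, not the PR's actual code:

```python
iter_times_ms: list[float] = [...]  # per-iteration times parsed from stdout, in order

last_10 = iter_times_ms[-10:]     # GPU perf team convention: keep only the final 10 iterations
post_warmup = iter_times_ms[20:]  # reviewer's suggestion: drop the first 20 warmup iterations
```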
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/megatron_run/report_generation_strategy.py (1)
30-121: Iteration-time parsing fails when TFLOP/s isn’t logged.
`ITERATION_REGEX` requires TFLOP/s, so logs that only print iteration time will be ignored; `can_handle_directory()` then returns False and `get_metric("iteration-time")` returns METRIC_ERROR despite valid data. Consider making TFLOP/s optional and skipping warmup by iteration count, not by list position.

🔧 Proposed fix:
```diff
-ITERATION_REGEX = re.compile(
-    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
-    r".*?"
-    r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?)",
-    re.IGNORECASE,
-)
+ITERATION_REGEX = re.compile(
+    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
+    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
+    re.IGNORECASE,
+)

     def _extract(self, log_path: Path) -> tuple[list[float], list[float]]:
         """Extract iteration times (ms) and GPU TFLOPS from the log file."""
-        iter_times_ms: list[float] = []
-        gpu_tflops: list[float] = []
+        records: list[tuple[float, float | None]] = []
         with log_path.open("r", encoding="utf-8", errors="ignore") as f:
             for line in f:
                 m = ITERATION_REGEX.search(line)
                 if m:
                     try:
-                        iter_times_ms.append(float(m.group(1)))
-                        gpu_tflops.append(float(m.group(2)))
+                        iter_time = float(m.group(1))
+                        tflops = float(m.group(2)) if m.group(2) is not None else None
+                        records.append((iter_time, tflops))
                     except (ValueError, TypeError):
                         logging.debug("Failed to parse iteration metrics line: %s", line.rstrip("\n"))
         # Skip the first 20 iterations for statistics (to exclude warmup)
-        if len(iter_times_ms) > 20:
-            iter_times_ms = iter_times_ms[20:]
-            gpu_tflops = gpu_tflops[20:]
-        return iter_times_ms, gpu_tflops
+        if len(records) > 20:
+            records = records[20:]
+        iter_times_ms = [t for t, _ in records]
+        gpu_tflops = [g for _, g in records if g is not None]
+        return iter_times_ms, gpu_tflops
```
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_run/report_generation_strategy.py`:
- Around lines 139-167: the CSV currently writes zeros when gpu_tflops is empty. Change the branch that sets tflops_avg/tflops_median/tflops_min/tflops_max/tflops_std to set those values to empty strings (or the existing METRIC_ERROR sentinel used by get_metric) so the CSV shows missing TFLOP/s data explicitly rather than as zeros. Keep the same writer.writerow call that writes the "tflops_per_gpu" row, but use the new empty/sentinel values when gpu_tflops is falsy.
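[Editor's note] A minimal sketch of that change, assuming the stats are collected into a dict before the writer.writerow call; the helper name and dict layout are illustrative, and only METRIC_ERROR and the field names come from the comment above:

```python
import statistics

METRIC_ERROR = "METRIC_ERROR"  # assumed to match the sentinel get_metric already returns

def tflops_fields(gpu_tflops: list[float]) -> dict[str, float | str]:
    """Build the tflops_per_gpu row values, using the sentinel when no data was parsed."""
    if not gpu_tflops:
        # Previously these defaulted to 0.0, which reads as a real measurement.
        return dict.fromkeys(("avg", "median", "min", "max", "std"), METRIC_ERROR)
    return {
        "avg": statistics.fmean(gpu_tflops),
        "median": statistics.median(gpu_tflops),
        "min": min(gpu_tflops),
        "max": max(gpu_tflops),
        "std": statistics.stdev(gpu_tflops) if len(gpu_tflops) > 1 else 0.0,
    }
```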
In
`@tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py`:
- Around lines 32-83: the megatron_run_tr and megatron_run_tr_no_data fixtures only provide 3 iterations, so they never exercise the warmup-skip behavior. Add a new fixture (e.g., megatron_run_tr_with_warmup) that builds a TestRun whose stdout.txt contains at least 21 iteration log lines formatted like the existing stdout_content, so the first 20 iterations are present and can be skipped. Then add a corresponding test that uses this fixture to assert that the report generation logic ignores the first 20 iterations (reference MegatronRunTestDefinition, TestRun, and the stdout.txt written in the fixtures).
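[Editor's note] A rough sketch of such a fixture and test, assuming pytest's built-in tmp_path; the log-line format mirrors the regex quoted earlier, and the fixture and test names are placeholders rather than the PR's actual code:

```python
from pathlib import Path

import pytest

LOG_LINE = (
    "elapsed time per iteration (ms): {ms:.1f} | "
    "throughput per GPU (TFLOP/s/GPU): {tflops:.1f}\n"
)

@pytest.fixture
def warmup_stdout(tmp_path: Path) -> Path:
    """stdout.txt with 25 iterations: 20 slow warmup steps, then 5 steady-state steps."""
    lines = [LOG_LINE.format(ms=5000.0, tflops=100.0) for _ in range(20)]
    lines += [LOG_LINE.format(ms=1000.0, tflops=500.0) for _ in range(5)]
    out = tmp_path / "stdout.txt"
    out.write_text("".join(lines))
    return out

def test_first_20_iterations_are_skipped(warmup_stdout: Path) -> None:
    # Hypothetical: build a TestRun around warmup_stdout's parent directory,
    # run the strategy, and assert the reported averages reflect only the
    # five steady-state iterations (1000 ms, 500 TFLOP/s), not the warmup.
    ...
```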
tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py
| "agent": self.agent, | ||
| "agent_steps": self.agent_steps, | ||
| "agent_metrics": self.agent_metrics, | ||
| "agent_metrics": self.agent_metrics if "agent_metrics" in self.model_fields_set else None, |
That would change default values to None. Why is it needed?
If it is None here, then the agent_metrics defined in the test config can propagate. Otherwise, if agent_metrics is not defined in the scenario config, the final merged config would always be [default] even though agent_metrics is set in the test config.
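[Editor's note] A small self-contained illustration of the Pydantic v2 behavior this relies on; the model and field default here are hypothetical:

```python
from pydantic import BaseModel

class AgentConfig(BaseModel):
    agent_metrics: list[str] = ["default"]

explicit = AgentConfig(agent_metrics=["iteration-time"])
implicit = AgentConfig()

# model_fields_set contains only the fields the caller actually passed in,
# so an explicitly set value can be told apart from the class default.
print("agent_metrics" in explicit.model_fields_set)  # True
print("agent_metrics" in implicit.model_fields_set)  # False
```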
Summary
Adds a new report generation strategy for the MegatronRun workload to extract iteration time and GPU TFLOP/s metrics from training logs. This enables Design Space Exploration (DSE) capabilities for MegatronRun workloads.
Also fixes the configuration overwrite behavior for `agent_metrics` in test and scenario TOML setups. Originally, if `agent_metrics` was set in the test configuration but not in the scenario configuration, it would be overwritten by `[default,]`. Now `agent_metrics` from the test configuration propagates correctly when it is not set in the scenario configuration.

Test Plan
Adds the following to the test configurations:
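[Editor's note] The actual snippet is not shown in the PR body; a hypothetical sketch of what such an entry might look like, where everything except `agent_metrics` and the `iteration-time` metric name (seen in `get_metric("iteration-time")` above) is assumed:

```toml
# Hypothetical sketch only; field names and values are assumed, not from the PR.
name = "megatron_run_dse"
test_template_name = "MegatronRun"
agent_metrics = ["iteration-time"]
```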
`megatron_run_report.csv` would be generated and `trajectory.csv` would reflect the observations and rewards.