Add report generation strategy for the MegatronRun #787
juntaowww wants to merge 7 commits into NVIDIA:main
Conversation
📝 Walkthrough
Adds MegatronRunReportGenerationStrategy: parses Megatron-Run stdout to extract per-iteration times and per-GPU TFLOP/s, computes statistics over recent iterations, writes a CSV report, exports and registers the strategy, and adds unit tests.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks | ✅ Passed checks (2 passed)
Greptile Summary: Adds a report generation strategy for the MegatronRun workload.
Confidence Score: 5/5
Important Files Changed
amaslenn left a comment:
We are in the code freeze stage now, this PR will be merged later.
src/cloudai/workloads/megatron_run/report_generation_strategy.py
```python
# Keep only the last 10 iterations for statistics (to exclude warmup)
if len(iter_times_ms) > 10:
    iter_times_ms = iter_times_ms[-10:]
```
I'm wondering if taking the last 10 iterations is the most relevant. What if the training has some ups and downs (as I have already seen)? Maybe just skipping the warmup, say the first 20 iterations, is enough?
Yes, skipping the warmup stage makes more sense; I have updated the code to skip the first 20 iterations. Originally I was following the format of the Megatron-Bridge report. Maybe we will later need to unify the formats for computing statistics.
The last 10 iterations is what the GPU perf team uses. I have seen those runs on IB clusters, and towards the end they mostly remain stable.
Which one would be better? To make things consistent, should we keep the last 10 iterations?
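[Editor's note] For concreteness, the two windowing options under discussion differ only in how the parsed list is sliced; a minimal illustration, not the PR's actual code:

```python
iter_times_ms: list[float] = [...]  # per-iteration times parsed from stdout, in order

last_10 = iter_times_ms[-10:]     # GPU perf team convention: keep only the final 10 iterations
post_warmup = iter_times_ms[20:]  # reviewer's suggestion: drop the first 20 warmup iterations
```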
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/megatron_run/report_generation_strategy.py (1)
30-121: Iteration-time parsing fails when TFLOP/s isn’t logged.
`ITERATION_REGEX` requires TFLOP/s, so logs that only print iteration time will be ignored; `can_handle_directory()` then returns False and `get_metric("iteration-time")` returns METRIC_ERROR despite valid data. Consider making TFLOP/s optional and skipping warmup by iteration count, not by list position.

🔧 Proposed fix:
```diff
-ITERATION_REGEX = re.compile(
-    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
-    r".*?"
-    r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?)",
-    re.IGNORECASE,
-)
+ITERATION_REGEX = re.compile(
+    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
+    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
+    re.IGNORECASE,
+)

     def _extract(self, log_path: Path) -> tuple[list[float], list[float]]:
         """Extract iteration times (ms) and GPU TFLOPS from the log file."""
-        iter_times_ms: list[float] = []
-        gpu_tflops: list[float] = []
+        records: list[tuple[float, float | None]] = []
         with log_path.open("r", encoding="utf-8", errors="ignore") as f:
             for line in f:
                 m = ITERATION_REGEX.search(line)
                 if m:
                     try:
-                        iter_times_ms.append(float(m.group(1)))
-                        gpu_tflops.append(float(m.group(2)))
+                        iter_time = float(m.group(1))
+                        tflops = float(m.group(2)) if m.group(2) is not None else None
+                        records.append((iter_time, tflops))
                     except (ValueError, TypeError):
                         logging.debug("Failed to parse iteration metrics line: %s", line.rstrip("\n"))
         # Skip the first 20 iterations for statistics (to exclude warmup)
-        if len(iter_times_ms) > 20:
-            iter_times_ms = iter_times_ms[20:]
-            gpu_tflops = gpu_tflops[20:]
-        return iter_times_ms, gpu_tflops
+        if len(records) > 20:
+            records = records[20:]
+        iter_times_ms = [t for t, _ in records]
+        gpu_tflops = [g for _, g in records if g is not None]
+        return iter_times_ms, gpu_tflops
```
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_run/report_generation_strategy.py`:
- Around lines 139-167: the CSV currently writes zeros when gpu_tflops is empty. Change the branch that sets tflops_avg/tflops_median/tflops_min/tflops_max/tflops_std to set those values to empty strings (or the existing METRIC_ERROR sentinel used by get_metric) so the CSV shows missing TFLOP/s data explicitly rather than as zeros. Keep the same writer.writerow call that writes the "tflops_per_gpu" row, but use the new empty/sentinel values when gpu_tflops is falsy.
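[Editor's note] A minimal sketch of that change, assuming the stats are collected into a dict before the writer.writerow call; the helper name and dict layout are illustrative, and only METRIC_ERROR and the field names come from the comment above:

```python
import statistics

METRIC_ERROR = "METRIC_ERROR"  # assumed to match the sentinel get_metric already returns

def tflops_fields(gpu_tflops: list[float]) -> dict[str, float | str]:
    """Build the tflops_per_gpu row values, using the sentinel when no data was parsed."""
    if not gpu_tflops:
        # Previously these defaulted to 0.0, which reads as a real measurement.
        return dict.fromkeys(("avg", "median", "min", "max", "std"), METRIC_ERROR)
    return {
        "avg": statistics.fmean(gpu_tflops),
        "median": statistics.median(gpu_tflops),
        "min": min(gpu_tflops),
        "max": max(gpu_tflops),
        "std": statistics.stdev(gpu_tflops) if len(gpu_tflops) > 1 else 0.0,
    }
```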
In
`@tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py`:
- Around lines 32-83: the megatron_run_tr and megatron_run_tr_no_data fixtures only provide 3 iterations, so they never exercise the warmup-skip behavior. Add a new fixture (e.g., megatron_run_tr_with_warmup) that builds a TestRun whose stdout.txt contains at least 21 iteration log lines formatted like the existing stdout_content, so the first 20 iterations are present and can be skipped. Then add a corresponding test that uses this fixture to assert that the report generation logic ignores the first 20 iterations (reference MegatronRunTestDefinition, TestRun, and the stdout.txt written in the fixtures).
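[Editor's note] A rough sketch of such a fixture and test, assuming pytest's built-in tmp_path; the log-line format mirrors the regex quoted earlier, and the fixture and test names are placeholders rather than the PR's actual code:

```python
from pathlib import Path

import pytest

LOG_LINE = (
    "elapsed time per iteration (ms): {ms:.1f} | "
    "throughput per GPU (TFLOP/s/GPU): {tflops:.1f}\n"
)

@pytest.fixture
def warmup_stdout(tmp_path: Path) -> Path:
    """stdout.txt with 25 iterations: 20 slow warmup steps, then 5 steady-state steps."""
    lines = [LOG_LINE.format(ms=5000.0, tflops=100.0) for _ in range(20)]
    lines += [LOG_LINE.format(ms=1000.0, tflops=500.0) for _ in range(5)]
    out = tmp_path / "stdout.txt"
    out.write_text("".join(lines))
    return out

def test_first_20_iterations_are_skipped(warmup_stdout: Path) -> None:
    # Hypothetical: build a TestRun around warmup_stdout's parent directory,
    # run the strategy, and assert the reported averages reflect only the
    # five steady-state iterations (1000 ms, 500 TFLOP/s), not the warmup.
    ...
```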
tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py
| "agent": self.agent, | ||
| "agent_steps": self.agent_steps, | ||
| "agent_metrics": self.agent_metrics, | ||
| "agent_metrics": self.agent_metrics if "agent_metrics" in self.model_fields_set else None, |
That would change default values to None. Why is it needed?
If it is None here, then the agent_metrics defined in the test config can propagate. Otherwise, if agent_metrics is not defined in the scenario config, the final merged config would always be [default] even though agent_metrics is set in the test config.
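[Editor's note] A small self-contained illustration of the Pydantic v2 behavior this relies on; the model and field default here are hypothetical:

```python
from pydantic import BaseModel

class AgentConfig(BaseModel):
    agent_metrics: list[str] = ["default"]

explicit = AgentConfig(agent_metrics=["iteration-time"])
implicit = AgentConfig()

# model_fields_set contains only the fields the caller actually passed in,
# so an explicitly set value can be told apart from the class default.
print("agent_metrics" in explicit.model_fields_set)  # True
print("agent_metrics" in implicit.model_fields_set)  # False
```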
Summary
Adds a new report generation strategy for the MegatronRun workload to extract iteration time and GPU TFLOP/s metrics from training logs. This enables Design Space Exploration (DSE) capabilities for MegatronRun workloads.
Also fixes the configuration overwrite behavior for `agent_metrics` in test and scenario TOML setups. Originally, if `agent_metrics` was set in the test configuration but not in the scenario configuration, it would be overwritten by `[default,]`. Now `agent_metrics` from the test configuration propagates correctly when it is not set in the scenario configuration.

Test Plan
Adds the following to the test configurations:
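[Editor's note] The actual snippet is not shown in the PR body; a hypothetical sketch of what such an entry might look like, where everything except `agent_metrics` and the `iteration-time` metric name (seen in `get_metric("iteration-time")` above) is assumed:

```toml
# Hypothetical sketch only; field names and values are assumed, not from the PR.
name = "megatron_run_dse"
test_template_name = "MegatronRun"
agent_metrics = ["iteration-time"]
```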
`megatron_run_report.csv` would be generated and `trajectory.csv` would reflect the observations and rewards.