feat: make rocenv tool lite/full mode configurable via additional_context#125
Open
coketaste wants to merge 7 commits into
Open
feat: make rocenv tool lite/full mode configurable via additional_context#125coketaste wants to merge 7 commits into
coketaste wants to merge 7 commits into
Conversation
…text
Allow users to collect full system environment details (including
hardware_information, bios_settings, dmesg logs, amdgpu modinfo) by
setting rocenv_mode to "full" in additional_context. Defaults to "lite"
for backward compatibility.
Usage: --additional-context "{'rocenv_mode': 'full'}"
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Makes the rocEnvTool collection mode configurable via additional_context.rocenv_mode (default lite, opt-in full) instead of hardcoding --lite. The shell script accepts a second argument controlling the flag, and both the local (container_runner) and K8s (k8s_scripts/k8s_template_context) code paths read rocenv_mode from context and pass it through. Invalid values fall back to lite with a warning.
Changes:
run_rocenv_tool.shaccepts a mode arg and conditionally applies--lite.ContainerRunner.gather_system_env_detailsandKubernetesScriptsMixin.gather_system_env_detailsbuild args as"<output_name> <mode>"with validation.- K8s template context extracts
rocenv_modefromadditional_contextand forwards it.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/scripts/common/pre_scripts/run_rocenv_tool.sh |
Adds ROCENV_MODE second arg; conditionally sets LITE_FLAG. |
src/madengine/execution/container_runner.py |
Reads rocenv_mode from context, validates, appends to args. |
src/madengine/deployment/k8s_scripts.py |
Adds rocenv_mode param with validation; updates args/log. |
src/madengine/deployment/k8s_template_context.py |
Pulls rocenv_mode from additional_context and forwards. |
tests/unit/test_container_runner.py |
New tests for default/lite/full/invalid rocenv_mode (local path). |
tests/unit/test_k8s.py |
New tests for default/lite/full/invalid rocenv_mode (K8s path). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add a dummy model whose Dockerfile installs lshw, dmidecode, kmod, and util-linux so rocenv_tool.py full mode can collect all 4 extra sections (hardware_information, bios_settings, dmsg_gpu_drm_atom_logs, amdgpu_modinfo) inside the container. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Add dump handlers for the 4 sections only collected in full mode: - hardware_information (lshw) - bios_settings (dmidecode) - dmsg_gpu_drm_atom_logs (dmesg) - amdgpu_modinfo (modinfo) Previously these sections were collected into .txt files but silently skipped by the CSV parser because it had no matching handlers. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
When rocenv_mode is 'full', detect missing tools (lshw, dmidecode, kmod, dmesg) and attempt apt-get install before running rocenv_tool.py. Only installs what's actually missing, degrades gracefully if network or permissions are unavailable. This eliminates the need for a special Dockerfile to use full mode — any model container can collect all system diagnostic sections. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…tOS) Wire MAD_GUEST_OS and a third run_rocenv_tool.sh argument from additional_context so full mode uses apt on UBUNTU and microdnf/dnf/yum on CENTOS instead of assuming Debian-based images everywhere. K8s pre-scripts pass the same guest_os. Co-authored-by: Cursor <cursoragent@cursor.com>
Use apt-get/yum without sudo when root in trace.sh RPD setup, add build deps for rocmProfileData, and constrain dummy NCCL init (HIP/IB/socket) to avoid topology hangs. Relax rccl_trace log assertion for minor NCCL log format drift. Co-authored-by: Cursor <cursoragent@cursor.com>
Upstream rocmProfileData/rpd_tracer/Makefile uses `xxd -i` to embed tableSchema.cmd/utilitySchema.cmd as C arrays. The rocm/pytorch base image lacks xxd, so `make rpd` failed with exit 127 and the e2e test saw no trace.rpd. Add xxd on Ubuntu and vim-common (provides xxd) on CentOS. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Comment on lines
17
to
+51
| os='' | ||
| if command -v apt >/dev/null 2>&1; then | ||
| if command -v apt-get >/dev/null 2>&1; then | ||
| os=ubuntu | ||
| elif command -v yum >/dev/null 2>&1; then | ||
| os=centos | ||
| else | ||
| echo 'Unable to detect Host OS in pre_script (need apt or yum for RPD dependencies)' >&2 | ||
| echo 'Unable to detect Host OS in pre_script (need apt-get or yum for RPD dependencies)' >&2 | ||
| exit 1 | ||
| fi | ||
| if [ "$os" == 'ubuntu' ]; then | ||
| sudo apt update | ||
| sudo apt install -y sqlite3 libsqlite3-dev libfmt-dev python3-pip nlohmann-json3-dev | ||
| if [ "$(id -u)" -eq 0 ]; then | ||
| apt-get update -qq | ||
| DEBIAN_FRONTEND=noninteractive apt-get install -y -qq \ | ||
| sqlite3 libsqlite3-dev libfmt-dev python3-pip nlohmann-json3-dev \ | ||
| git build-essential pkg-config xxd | ||
| elif command -v sudo >/dev/null 2>&1; then | ||
| sudo apt-get update -qq | ||
| sudo DEBIAN_FRONTEND=noninteractive apt-get install -y -qq \ | ||
| sqlite3 libsqlite3-dev libfmt-dev python3-pip nlohmann-json3-dev \ | ||
| git build-essential pkg-config xxd | ||
| else | ||
| echo 'RPD pre-script: need root or sudo for apt-get' >&2 | ||
| exit 1 | ||
| fi | ||
| elif [ "$os" == 'centos' ]; then | ||
| sudo yum install -y libsqlite3x-devel.x86_64 fmt-devel python3-pip json-devel | ||
| if [ "$(id -u)" -eq 0 ]; then | ||
| yum install -y gcc gcc-c++ make git \ | ||
| libsqlite3x-devel.x86_64 fmt-devel python3-pip json-devel vim-common | ||
| elif command -v sudo >/dev/null 2>&1; then | ||
| sudo yum install -y gcc gcc-c++ make git \ | ||
| libsqlite3x-devel.x86_64 fmt-devel python3-pip json-devel vim-common | ||
| else | ||
| echo 'RPD pre-script: need root or sudo for yum' >&2 | ||
| exit 1 | ||
| fi |
Comment on lines
+224
to
+287
| def dump_hardware_information_in_csv(self, log_path): | ||
| """Parse lshw output, extracting key hardware fields.""" | ||
| lines = self.get_log_file_data(log_path) | ||
| info_list = [] | ||
| info_list.append(lines[0].rstrip()) | ||
| keywords = ("product:", "vendor:", "serial:", "width:", "size:", | ||
| "description:", "capabilities:", "clock:") | ||
| for j in range(1, len(lines)): | ||
| line = lines[j].rstrip() | ||
| if not line.strip(): | ||
| continue | ||
| line_lower = line.strip().lower() | ||
| for kw in keywords: | ||
| if line_lower.startswith(kw): | ||
| key, _, value = line.strip().partition(":") | ||
| info_list.append(key.strip() + "|" + value.strip()) | ||
| break | ||
| return info_list | ||
|
|
||
| def dump_bios_settings_in_csv(self, log_path): | ||
| """Parse dmidecode output, extracting key:value pairs.""" | ||
| lines = self.get_log_file_data(log_path) | ||
| info_list = [] | ||
| info_list.append(lines[0].rstrip()) | ||
| for j in range(1, len(lines)): | ||
| line = lines[j].rstrip() | ||
| if not line.strip() or line.startswith("Handle ") or line.startswith("#"): | ||
| continue | ||
| if ":" in line and line[0] in (" ", "\t"): | ||
| key, _, value = line.strip().partition(":") | ||
| value = value.strip() | ||
| if value: | ||
| info_list.append(key.strip() + "|" + value) | ||
| return info_list | ||
|
|
||
| def dump_dmsg_gpu_drm_atom_logs_in_csv(self, log_path): | ||
| """Parse dmesg filtered log lines.""" | ||
| lines = self.get_log_file_data(log_path) | ||
| info_list = [] | ||
| info_list.append(lines[0].rstrip()) | ||
| count = 0 | ||
| for j in range(1, len(lines)): | ||
| line = lines[j].rstrip() | ||
| if not line.strip(): | ||
| continue | ||
| info_list.append("Log|" + line.strip()) | ||
| count += 1 | ||
| if count >= 50: | ||
| break | ||
| return info_list | ||
|
|
||
| def dump_amdgpu_modinfo_in_csv(self, log_path): | ||
| """Parse modinfo output (key:value per line, like lscpu).""" | ||
| lines = self.get_log_file_data(log_path) | ||
| info_list = [] | ||
| info_list.append(lines[0].rstrip()) | ||
| for j in range(1, len(lines)): | ||
| line = lines[j].rstrip() | ||
| if not line.strip(): | ||
| continue | ||
| if ":" in line: | ||
| key, _, value = line.partition(":") | ||
| info_list.append(key.strip() + "|" + value.strip()) | ||
| return info_list |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--lite/full collection mode configurable throughadditional_contextinstead of being hardcoded to--litelite(backward compatible) — users opt intofullmode via--additional-context "{'rocenv_mode': 'full'}"hardware_information,bios_settings,dmsg_gpu_drm_atom_logs,amdgpu_modinfoMotivation
The
--liteflag was hardcoded inrun_rocenv_tool.sh, making it impossible to collect the full system environment data (BIOS dumps, kernel dmesg, hardware enumeration, amdgpu moduleinfo) through madengine. These sections are valuable for debugging hardware-level issues on bare-metal nodes.
Changes
run_rocenv_tool.sh$2as mode arg, conditionally apply--litecontainer_runner.pyrocenv_modefrom context, pass as second argk8s_scripts.pyrocenv_modeparameter, pass as second argk8s_template_context.pyrocenv_modefromadditional_context, pass to mixinUsage