Skip to content

feat: make rocenv tool lite/full mode configurable via additional_context#125

Open
coketaste wants to merge 7 commits into
developfrom
coketaste/roc-env-tool-full
Open

feat: make rocenv tool lite/full mode configurable via additional_context#125
coketaste wants to merge 7 commits into
developfrom
coketaste/roc-env-tool-full

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

Summary

  • Make the rocenv tool's --lite/full collection mode configurable through additional_context instead of being hardcoded to --lite
  • Default remains lite (backward compatible) — users opt into full mode via --additional-context "{'rocenv_mode': 'full'}"
  • Full mode adds 4 extra system diagnostic sections: hardware_information, bios_settings, dmsg_gpu_drm_atom_logs, amdgpu_modinfo

Motivation

The --lite flag was hardcoded in run_rocenv_tool.sh, making it impossible to collect the full system environment data (BIOS dumps, kernel dmesg, hardware enumeration, amdgpu module
info) through madengine. These sections are valuable for debugging hardware-level issues on bare-metal nodes.

Changes

File Change
run_rocenv_tool.sh Accept $2 as mode arg, conditionally apply --lite
container_runner.py Read rocenv_mode from context, pass as second arg
k8s_scripts.py Accept rocenv_mode parameter, pass as second arg
k8s_template_context.py Extract rocenv_mode from additional_context, pass to mixin

Usage

# Full mode (collects all 32 sections):
madengine run --tags my_model --additional-context "{'rocenv_mode': 'full'}"

# Default lite mode (28 sections, unchanged behavior):
madengine run --tags my_model

Test plan

- 8 new unit tests covering default/lite/full/invalid modes for both local and K8s paths (434 total pass)
- Verified end-to-end on bare metal: rocenv_mode: 'full' correctly omits --lite flag and attempts full collection
- Note: full-mode sections require lshw, dmidecode, dmesg, modinfo in the container — these are typically only available on bare-metal hosts

…text

Allow users to collect full system environment details (including
hardware_information, bios_settings, dmesg logs, amdgpu modinfo) by
setting rocenv_mode to "full" in additional_context. Defaults to "lite"
for backward compatibility.

Usage: --additional-context "{'rocenv_mode': 'full'}"

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 13, 2026 14:08
@coketaste coketaste self-assigned this May 13, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Makes the rocEnvTool collection mode configurable via additional_context.rocenv_mode (default lite, opt-in full) instead of hardcoding --lite. The shell script accepts a second argument controlling the flag, and both the local (container_runner) and K8s (k8s_scripts/k8s_template_context) code paths read rocenv_mode from context and pass it through. Invalid values fall back to lite with a warning.

Changes:

  • run_rocenv_tool.sh accepts a mode arg and conditionally applies --lite.
  • ContainerRunner.gather_system_env_details and KubernetesScriptsMixin.gather_system_env_details build args as "<output_name> <mode>" with validation.
  • K8s template context extracts rocenv_mode from additional_context and forwards it.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/madengine/scripts/common/pre_scripts/run_rocenv_tool.sh Adds ROCENV_MODE second arg; conditionally sets LITE_FLAG.
src/madengine/execution/container_runner.py Reads rocenv_mode from context, validates, appends to args.
src/madengine/deployment/k8s_scripts.py Adds rocenv_mode param with validation; updates args/log.
src/madengine/deployment/k8s_template_context.py Pulls rocenv_mode from additional_context and forwards.
tests/unit/test_container_runner.py New tests for default/lite/full/invalid rocenv_mode (local path).
tests/unit/test_k8s.py New tests for default/lite/full/invalid rocenv_mode (K8s path).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coketaste and others added 2 commits May 13, 2026 09:15
Add a dummy model whose Dockerfile installs lshw, dmidecode, kmod, and
util-linux so rocenv_tool.py full mode can collect all 4 extra sections
(hardware_information, bios_settings, dmsg_gpu_drm_atom_logs, amdgpu_modinfo)
inside the container.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Add dump handlers for the 4 sections only collected in full mode:
- hardware_information (lshw)
- bios_settings (dmidecode)
- dmsg_gpu_drm_atom_logs (dmesg)
- amdgpu_modinfo (modinfo)

Previously these sections were collected into .txt files but silently
skipped by the CSV parser because it had no matching handlers.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 13, 2026 14:22
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

When rocenv_mode is 'full', detect missing tools (lshw, dmidecode,
kmod, dmesg) and attempt apt-get install before running rocenv_tool.py.
Only installs what's actually missing, degrades gracefully if network
or permissions are unavailable. This eliminates the need for a special
Dockerfile to use full mode — any model container can collect all
system diagnostic sections.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Comment thread src/madengine/scripts/common/pre_scripts/run_rocenv_tool.sh Outdated
…tOS)

Wire MAD_GUEST_OS and a third run_rocenv_tool.sh argument from additional_context
so full mode uses apt on UBUNTU and microdnf/dnf/yum on CENTOS instead of
assuming Debian-based images everywhere. K8s pre-scripts pass the same guest_os.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 15, 2026 03:25
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

Cemberk
Cemberk previously approved these changes May 15, 2026
Copy link
Copy Markdown

@Cemberk Cemberk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Use apt-get/yum without sudo when root in trace.sh RPD setup, add build deps
for rocmProfileData, and constrain dummy NCCL init (HIP/IB/socket) to avoid
topology hangs. Relax rccl_trace log assertion for minor NCCL log format drift.

Co-authored-by: Cursor <cursoragent@cursor.com>
Upstream rocmProfileData/rpd_tracer/Makefile uses `xxd -i` to embed
tableSchema.cmd/utilitySchema.cmd as C arrays. The rocm/pytorch base
image lacks xxd, so `make rpd` failed with exit 127 and the e2e test
saw no trace.rpd. Add xxd on Ubuntu and vim-common (provides xxd) on
CentOS.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 16, 2026 04:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment on lines 17 to +51
os=''
if command -v apt >/dev/null 2>&1; then
if command -v apt-get >/dev/null 2>&1; then
os=ubuntu
elif command -v yum >/dev/null 2>&1; then
os=centos
else
echo 'Unable to detect Host OS in pre_script (need apt or yum for RPD dependencies)' >&2
echo 'Unable to detect Host OS in pre_script (need apt-get or yum for RPD dependencies)' >&2
exit 1
fi
if [ "$os" == 'ubuntu' ]; then
sudo apt update
sudo apt install -y sqlite3 libsqlite3-dev libfmt-dev python3-pip nlohmann-json3-dev
if [ "$(id -u)" -eq 0 ]; then
apt-get update -qq
DEBIAN_FRONTEND=noninteractive apt-get install -y -qq \
sqlite3 libsqlite3-dev libfmt-dev python3-pip nlohmann-json3-dev \
git build-essential pkg-config xxd
elif command -v sudo >/dev/null 2>&1; then
sudo apt-get update -qq
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y -qq \
sqlite3 libsqlite3-dev libfmt-dev python3-pip nlohmann-json3-dev \
git build-essential pkg-config xxd
else
echo 'RPD pre-script: need root or sudo for apt-get' >&2
exit 1
fi
elif [ "$os" == 'centos' ]; then
sudo yum install -y libsqlite3x-devel.x86_64 fmt-devel python3-pip json-devel
if [ "$(id -u)" -eq 0 ]; then
yum install -y gcc gcc-c++ make git \
libsqlite3x-devel.x86_64 fmt-devel python3-pip json-devel vim-common
elif command -v sudo >/dev/null 2>&1; then
sudo yum install -y gcc gcc-c++ make git \
libsqlite3x-devel.x86_64 fmt-devel python3-pip json-devel vim-common
else
echo 'RPD pre-script: need root or sudo for yum' >&2
exit 1
fi
Comment on lines +224 to +287
def dump_hardware_information_in_csv(self, log_path):
"""Parse lshw output, extracting key hardware fields."""
lines = self.get_log_file_data(log_path)
info_list = []
info_list.append(lines[0].rstrip())
keywords = ("product:", "vendor:", "serial:", "width:", "size:",
"description:", "capabilities:", "clock:")
for j in range(1, len(lines)):
line = lines[j].rstrip()
if not line.strip():
continue
line_lower = line.strip().lower()
for kw in keywords:
if line_lower.startswith(kw):
key, _, value = line.strip().partition(":")
info_list.append(key.strip() + "|" + value.strip())
break
return info_list

def dump_bios_settings_in_csv(self, log_path):
"""Parse dmidecode output, extracting key:value pairs."""
lines = self.get_log_file_data(log_path)
info_list = []
info_list.append(lines[0].rstrip())
for j in range(1, len(lines)):
line = lines[j].rstrip()
if not line.strip() or line.startswith("Handle ") or line.startswith("#"):
continue
if ":" in line and line[0] in (" ", "\t"):
key, _, value = line.strip().partition(":")
value = value.strip()
if value:
info_list.append(key.strip() + "|" + value)
return info_list

def dump_dmsg_gpu_drm_atom_logs_in_csv(self, log_path):
"""Parse dmesg filtered log lines."""
lines = self.get_log_file_data(log_path)
info_list = []
info_list.append(lines[0].rstrip())
count = 0
for j in range(1, len(lines)):
line = lines[j].rstrip()
if not line.strip():
continue
info_list.append("Log|" + line.strip())
count += 1
if count >= 50:
break
return info_list

def dump_amdgpu_modinfo_in_csv(self, log_path):
"""Parse modinfo output (key:value per line, like lscpu)."""
lines = self.get_log_file_data(log_path)
info_list = []
info_list.append(lines[0].rstrip())
for j in range(1, len(lines)):
line = lines[j].rstrip()
if not line.strip():
continue
if ":" in line:
key, _, value = line.partition(":")
info_list.append(key.strip() + "|" + value.strip())
return info_list
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants