More robust bench execution for Dynamo over k8s by amaslenn · Pull Request #793 · NVIDIA/cloudai

amaslenn · 2026-01-30T10:34:30Z

Summary

Stream API can be sensitive to network issues, delays, broken ping, etc. Using kubectl is more robust for this case.

Test Plan

CI
Manual run on an internal cluster.

Additional Notes

—

coderabbitai · 2026-01-30T10:34:48Z

📝 Walkthrough

Walkthrough

The change replaces in-cluster Kubernetes API streaming with a subprocess.run call executing kubectl exec directly. Command output is written to genai_perf.log with enhanced error handling, and exception handling is broadened from specific API exceptions to general exceptions with debug-level logging.

Changes

Cohort / File(s)	Summary
Kubernetes Execution Method `src/cloudai/systems/kubernetes/kubernetes_system.py`	Replaced lazy.k8s.stream-based execution with subprocess.run calling kubectl exec; modified output capture to write stdout/stderr to genai_perf.log; broadened exception handling and shifted error logging to debug level; added exit code logging.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 The stream once flowed through API gates,
Now subprocess hops and navigates,
kubectl exec takes the lead with grace,
Output captured in its rightful place,
Exceptions caught with softer paws! 🌿

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: replacing Stream API with kubectl for more robust Kubernetes bench execution for Dynamo.
Description check	✅ Passed	The description is directly related to the changeset, explaining the motivation (Stream API sensitivity to network issues) and the solution (using kubectl instead).

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch am/kube-exec

Tip

🧪 Unit Test Generation v2 is now available!

We have significantly improved our unit test generation capabilities.

To enable: Add this to your .coderabbit.yaml configuration:

reviews:
  finishing_touches:
    unit_tests:
      enabled: true

Try it out by using the @coderabbitai generate unit tests command on your code files or under ✨ Finishing Touches on the walkthrough!

Have feedback? Share your thoughts on our Discord thread!

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

greptile-apps · 2026-01-30T10:37:24Z

Greptile Overview

Greptile Summary

Replaced Kubernetes Python Stream API with kubectl exec subprocess for executing genai-perf benchmarks in Dynamo frontend pods. This improves robustness against network issues and API timeouts.

Key Changes:

Switched from lazy.k8s.stream.stream() to subprocess.run() with kubectl
Changed exception handling from specific ApiException to generic Exception
Added separate stderr output capture in logs
Downgraded error logging to debug level

Issues Found:

Return code not validated - failures proceed silently
Error logging downgraded from error to debug level reduces visibility
Generic exception handling masks specific failure types

Confidence Score: 3/5

Safe to merge with minor improvements recommended for error handling
The approach is sound and addresses the stated network robustness issue, but missing return code validation and downgraded error logging could hide failures in production
Pay close attention to kubernetes_system.py:316-324 for error handling improvements

Important Files Changed

Filename	Overview
src/cloudai/systems/kubernetes/kubernetes_system.py	Replaced Kubernetes Stream API with kubectl subprocess for better network resilience, but missing return code validation

Sequence Diagram

sequenceDiagram
    participant System as KubernetesSystem
    participant Subprocess as subprocess.run
    participant Kubectl as kubectl exec
    participant Pod as Frontend Pod
    participant GenAI as genai-perf

    System->>System: _run_genai_perf(job)
    System->>System: _get_dynamo_pod_by_role("frontend")
    System->>System: Build kubectl_exec_cmd
    System->>Subprocess: run(kubectl_exec_cmd, timeout=600)
    Subprocess->>Kubectl: Execute kubectl exec
    Kubectl->>Pod: Connect to pod
    Pod->>GenAI: Run genai-perf command
    GenAI-->>Pod: stdout/stderr output
    Pod-->>Kubectl: Return output
    Kubectl-->>Subprocess: Return result
    Subprocess-->>System: Return result object
    System->>System: Log return code
    System->>System: Write stdout to genai_perf.log
    alt stderr exists
        System->>System: Append stderr to log
    end
    System->>Subprocess: kubectl cp results
    System->>System: Copy genai-perf results locally

greptile-apps

_{1 file reviewed, 3 comments}

_{Edit Code Review Agent Settings | Greptile}

src/cloudai/systems/kubernetes/kubernetes_system.py

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 316-324: The current subprocess.run call in kubernetes_system.py
(the try block around kubectl_exec_cmd) only logs result.returncode and swallows
failures; change it to either pass check=True to subprocess.run or explicitly
check result.returncode and raise a subprocess.CalledProcessError when non-zero,
and re-raise TimeoutExpired and other exceptions instead of only logging them so
failures propagate; update the except block to log and then raise the caught
exception (so _genai_perf_completed is only set when the command actually
succeeds).
- Around line 313-316: The kubectl subprocess command built in kubectl_exec_cmd
does not pass the resolved kubeconfig, so kubectl may target the wrong cluster
when self.kube_config_path differs from the default; update the construction of
kubectl_exec_cmd (the list used in subprocess.run for running genai_perf_cmd) to
include the flag "--kubeconfig" followed by self.kube_config_path (or its
resolved path used by load_kube_config()) before the "exec" invocation so the
subprocess uses the same kubeconfig as the API client.

src/cloudai/systems/kubernetes/kubernetes_system.py

More robust bench execution for Dynamo over k8s

b221d29

amaslenn requested review from alexmanle, jeffnvidia and srivatsankrishnan as code owners January 30, 2026 10:34

greptile-apps bot reviewed Jan 30, 2026

View reviewed changes

src/cloudai/systems/kubernetes/kubernetes_system.py Show resolved Hide resolved

src/cloudai/systems/kubernetes/kubernetes_system.py Show resolved Hide resolved

src/cloudai/systems/kubernetes/kubernetes_system.py Show resolved Hide resolved

coderabbitai bot reviewed Jan 30, 2026

View reviewed changes

src/cloudai/systems/kubernetes/kubernetes_system.py Show resolved Hide resolved

src/cloudai/systems/kubernetes/kubernetes_system.py Show resolved Hide resolved

coderabbitai bot mentioned this pull request Jan 30, 2026

Pass --kubeconfig to all kubectl subprocess invocations #794

Open

jeffnvidia approved these changes Feb 2, 2026

View reviewed changes

amaslenn merged commit 88d6f5c into main Feb 2, 2026
5 checks passed

amaslenn deleted the am/kube-exec branch February 2, 2026 09:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More robust bench execution for Dynamo over k8s#793

More robust bench execution for Dynamo over k8s#793
amaslenn merged 1 commit intomainfrom
am/kube-exec

amaslenn commented Jan 30, 2026

Uh oh!

coderabbitai bot commented Jan 30, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

greptile-apps bot commented Jan 30, 2026

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

amaslenn commented Jan 30, 2026

Summary

Test Plan

Additional Notes

Uh oh!

coderabbitai bot commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Jan 30, 2026 •

edited

Loading