[mellanox_firmware] PCC collector (opt in) #104

Open
DanGoldberg wants to merge 7 commits into NVIDIA:main from DanGoldberg:ppcc-collector

@DanGoldberg DanGoldberg commented Apr 28, 2026

Add a new PCC collector.
This collection is disabled by default and requires adding "-k mellanox_firmware.pcc=true" to the command line to enable it.

Key changes:

  • PCC collection via PPCC register with either mlxreg or mstreg
  • BaseTool: added a subdir parameter that saves output under a dedicated directory within the mellanox_firmware output directory, to reduce clutter
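
A minimal sketch of how an optional subdir argument could steer where an artifact is written. The helper name `artifact_path` and the file names are hypothetical and are not taken from this PR; the actual BaseTool code may resolve paths differently.

```python
from pathlib import Path
from typing import Optional


def artifact_path(plugin_dir: Path, filename: str,
                  subdir: Optional[str] = None) -> Path:
    """Return the path an artifact would be written to (hypothetical helper).

    Without subdir the file lands directly in plugin_dir; with subdir it
    lands in plugin_dir/subdir, which is created on demand.
    """
    target_dir = plugin_dir / subdir if subdir else plugin_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / filename
```

With this shape, callers that omit subdir keep their current output location, so only the new PCC artifacts move into their own directory.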

Signed-off-by: Dan Goldberg <dgoldberg@nvidia.com>


Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines

  • Is the commit message split over multiple lines and hard-wrapped at 72 characters?
  • Is the subject and message clear and concise?
  • Does the subject start with [plugin_name] if submitting a plugin patch or a [section_name] if part of the core sosreport code?
  • Does the commit contain a Signed-off-by: First Lastname <email@example.com>?
  • Are any related Issues or existing PRs properly referenced via a Closes (Issue) or Resolved (PR) line?
  • Are all passwords or private data gathered by this PR obfuscated?

Summary by CodeRabbit

  • New Features
    • Added Programmable Congestion Control (PCC) collection support to gather detailed firmware register data for enhanced diagnostics.
    • PCC collection is optional and disabled by default; can be enabled through plugin configuration.
    • Improved output organization with subdirectory support for collected artifacts.

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough

This pull request adds PCC (Programmable Congestion Control) collection to the Mellanox firmware suite. A new PccCollector orchestrates firmware register data collection across devices via the mlxreg or mstreg tools, and runs only when a configurable plugin option is enabled.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Collector Orchestration: `sos/report/mellanox_firmware_suite/collectors/collector_manager.py`, `sos/report/mellanox_firmware_suite/collectors/pcc_collector.py` | New `collect_pcc_info()` method added to manager; comprehensive new `PccCollector` class defines `PpccCommand` enum and methods to fetch algo info, status, and parameter metadata/counters with conditional execution based on device context and command return codes. |
| Tool Integration: `sos/report/mellanox_firmware_suite/tools/MFT/mlxreg.py`, `sos/report/mellanox_firmware_suite/tools/MSTFlint/mstreg.py` | New `ppcc_get()` methods added to both `MlxregTool` and `MstregTool`, converting operation mappings into command arguments and forwarding results with optional subdirectory placement. |
| Infrastructure: `sos/report/mellanox_firmware_suite/tools/base_tool.py` | `execute_cmd()` and `_run_command()` signatures extended with optional `subdir` parameter to support organized artifact collection in subdirectories. |
| Plugin Configuration: `sos/report/plugins/mellanox_firmware.py` | New `pcc` boolean option added to plugin configuration (defaults to `False`) to gate PCC collection enablement. |
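
The "converting operation mappings into command arguments" step in the Tool Integration row could look roughly like the sketch below. The function names, the field names (`local_port`, `algo_slot`, `cmd_type`), and the flag layout are all assumptions for illustration; they are not taken from the PR and should be checked against the installed tool's help output.

```python
from typing import List, Mapping


def format_fields(fields: Mapping[str, int]) -> str:
    """Render {"a": 1, "b": 2} as "a=1,b=2", keeping insertion order."""
    return ",".join(f"{name}={value}" for name, value in fields.items())


def build_ppcc_get_args(indexes: Mapping[str, int],
                        op: Mapping[str, int]) -> List[str]:
    """Build the argument list for a hypothetical PPCC "get" call.

    The flag layout is an assumption for illustration; verify it against
    the register tool actually installed before relying on it.
    """
    args = ["--reg_name", "PPCC", "--get"]
    if indexes:
        args += ["--indexes", format_fields(indexes)]
    if op:
        args += ["--op", format_fields(op)]
    return args


example = build_ppcc_get_args(
    indexes={"local_port": 1, "algo_slot": 0},  # illustrative field names
    op={"cmd_type": 0},                          # placeholder value
)
```

Keeping the mapping-to-string conversion in one helper would let both tool wrappers share it, since they differ only in the binary they invoke.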

Sequence Diagram(s)

sequenceDiagram
    participant CM as CollectorManager
    participant PC as PccCollector
    participant Tool as MlxregTool/<br/>MstregTool
    participant Device as Device
    participant FS as Filesystem

    CM->>CM: collect_pcc_info() invoked
    CM->>CM: Check if "pcc" option enabled
    alt pcc enabled
        loop for each device context
            CM->>PC: PccCollector instantiated with context
            PC->>Tool: _collect_with_mft()/mstflint()
            Tool->>Device: ppcc_get(GET_ALGO_INFO_ARRAY)
            Device-->>Tool: algo slots returned
            Tool-->>PC: algo list parsed
            loop for each algo slot
                PC->>Tool: ppcc_get(GET_ALGO_STATUS)
                Device-->>Tool: status response
                Tool-->>PC: status value evaluated
                alt counter_en is true
                    PC->>Tool: ppcc_get(GET_COUNTER_INFO, ...)
                    Device-->>Tool: counter metadata
                    Tool-->>PC: metadata collected
                end
                PC->>Tool: ppcc_get(GET_PARAM_INFO, ...)
                Device-->>Tool: parameter metadata
                Tool-->>PC: parameters collected
            end
            PC->>FS: Write collected PPCC data to files
        end
    end
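
The per-device flow in the diagram can be sketched with a stand-in tool. `FakeTool`, `collect_pcc`, and the enum values are hypothetical; the real `PccCollector` and the real PPCC command encoding are not shown on this page and may differ.

```python
from enum import Enum
from typing import Dict, List, Tuple


class PpccCommand(Enum):
    # Values are placeholders, not the real PPCC command encoding.
    GET_ALGO_INFO_ARRAY = "algo_info_array"
    GET_ALGO_STATUS = "algo_status"
    GET_COUNTER_INFO = "counter_info"
    GET_PARAM_INFO = "param_info"


class FakeTool:
    """Stand-in for MlxregTool/MstregTool that returns canned (rc, output)."""

    def __init__(self, slots: Dict[int, bool]):
        self.slots = slots  # algo slot -> counter_en flag
        self.calls: List[Tuple[PpccCommand, int]] = []

    def ppcc_get(self, command: PpccCommand, algo_slot: int = 0):
        self.calls.append((command, algo_slot))
        if command is PpccCommand.GET_ALGO_INFO_ARRAY:
            return 0, sorted(self.slots)
        if command is PpccCommand.GET_ALGO_STATUS:
            return 0, {"counter_en": self.slots[algo_slot]}
        return 0, f"{command.value} for slot {algo_slot}"


def collect_pcc(tool) -> Dict[int, Dict[str, str]]:
    """Walk the algo slots; fetch counters only when counter_en is set."""
    rc, slots = tool.ppcc_get(PpccCommand.GET_ALGO_INFO_ARRAY)
    if rc != 0:
        return {}
    collected: Dict[int, Dict[str, str]] = {}
    for slot in slots:
        rc, status = tool.ppcc_get(PpccCommand.GET_ALGO_STATUS, slot)
        if rc != 0:
            continue  # skip slots whose status cannot be read
        entry: Dict[str, str] = {}
        if status["counter_en"]:
            _, entry["counters"] = tool.ppcc_get(
                PpccCommand.GET_COUNTER_INFO, slot)
        _, entry["params"] = tool.ppcc_get(PpccCommand.GET_PARAM_INFO, slot)
        collected[slot] = entry
    return collected
```

As in the diagram, parameter metadata is requested for every slot, while counter metadata is requested only for slots that report `counter_en`.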

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰✨ A PCC collector hops into the scene,
With PPCC registers—algorithms pristine!
Parsing statuses, collecting with flair,
Through mlxreg and mstreg, everywhere!
Firmware insights now hopped into place,
In organized files, at a quickened pace! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 4.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding a new PCC collector to the mellanox_firmware plugin that is optional (opt-in). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.5)
sos/report/mellanox_firmware_suite/tools/MFT/mlxreg.py
sos/report/mellanox_firmware_suite/tools/base_tool.py
sos/report/mellanox_firmware_suite/collectors/pcc_collector.py
  • 3 others

@DanGoldberg DanGoldberg marked this pull request as ready for review April 30, 2026 13:37

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sos/report/mellanox_firmware_suite/tools/base_tool.py`:
- Around line 49-52: The cache early-return in execute_cmd is skipping the
required side-effect of writing artifacts when filename or subdir is provided;
modify execute_cmd so that when filename or subdir is set it does not
short-circuit on a cache hit but still calls _collect_cmd_output (or an
equivalent helper) to create/write the file into the requested subdir, while
still returning the cached stdout when appropriate; use the existing
cache_key/self.ctx.cache logic to retrieve stdout but ensure
_collect_cmd_output(cmd, output, filename, subdir) is invoked before returning
if filename or subdir are non-empty.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cf45634c-fd07-4972-a83c-11ebe05c891d

📥 Commits

Reviewing files that changed from the base of the PR and between d690337 and 7006fb3.

📒 Files selected for processing (6)
  • sos/report/mellanox_firmware_suite/collectors/collector_manager.py
  • sos/report/mellanox_firmware_suite/collectors/pcc_collector.py
  • sos/report/mellanox_firmware_suite/tools/MFT/mlxreg.py
  • sos/report/mellanox_firmware_suite/tools/MSTFlint/mstreg.py
  • sos/report/mellanox_firmware_suite/tools/base_tool.py
  • sos/report/plugins/mellanox_firmware.py

Comment on lines 49 to +52
         if get_cached and cache_key in self.ctx.cache:
             return self.ctx.cache[cache_key]

-        rc, output = self._run_command(cmd, timeout, filename)
+        rc, output = self._run_command(cmd, timeout, filename, subdir=subdir)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't short-circuit archival calls on cache hits.

When filename/subdir is provided, execute_cmd() has a required side effect: it must write the collected artifact. The cache guard returns before _collect_cmd_output() runs, so a repeated call for the same command can hand back cached stdout and never create the requested file in the requested subdirectory.

Suggested fix
-        if get_cached and cache_key in self.ctx.cache:
+        if filename is None and get_cached and cache_key in self.ctx.cache:
             return self.ctx.cache[cache_key]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        if get_cached and cache_key in self.ctx.cache:
-            return self.ctx.cache[cache_key]
-        rc, output = self._run_command(cmd, timeout, filename, subdir=subdir)
+        if filename is None and get_cached and cache_key in self.ctx.cache:
+            return self.ctx.cache[cache_key]
+        rc, output = self._run_command(cmd, timeout, filename, subdir=subdir)
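
For comparison, here is a sketch of a different way to address the same finding: reuse the cached output on a hit but still write the requested artifact, instead of bypassing the cache whenever a filename is given. `ToolSketch` and its helpers are hypothetical stand-ins and do not reproduce the real BaseTool; the command is simulated rather than executed.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


class ToolSketch:
    """Toy model of a cached command runner that always honors filename."""

    def __init__(self, out_dir: Path):
        self.out_dir = out_dir
        self.cache: Dict[str, Tuple[int, str]] = {}
        self.runs = 0  # counts simulated command executions

    def _run(self, cmd: str) -> Tuple[int, str]:
        # Simulated execution; a real tool would invoke the command here.
        self.runs += 1
        return 0, f"output of {cmd}"

    def _write(self, output: str, filename: str,
               subdir: Optional[str]) -> None:
        target = self.out_dir / subdir if subdir else self.out_dir
        target.mkdir(parents=True, exist_ok=True)
        (target / filename).write_text(output)

    def execute_cmd(self, cmd: str, filename: Optional[str] = None,
                    subdir: Optional[str] = None,
                    get_cached: bool = True) -> Tuple[int, str]:
        if get_cached and cmd in self.cache:
            rc, output = self.cache[cmd]  # cache hit: skip re-running
        else:
            rc, output = self._run(cmd)
            self.cache[cmd] = (rc, output)
        if filename is not None:
            # The artifact is written on hits and misses alike.
            self._write(output, filename, subdir)
        return rc, output
```

The trade-off is that this variant avoids re-running a possibly slow register read, while the one-line guard in the suggested fix is a smaller change that always collects fresh output for archived calls.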
