Skip to content

Using semaphore to sync with all peer processes in finalization stage#114

Closed
rocm-devops wants to merge 8 commits into
amd-stagingfrom
huanran/finalization_processes_sync
Closed

Using semaphore to sync with all peer processes in finalization stage#114
rocm-devops wants to merge 8 commits into
amd-stagingfrom
huanran/finalization_processes_sync

Conversation

@rocm-devops

Copy link
Copy Markdown

[rocprofv3] Implement synchronization using POSIX semaphore in finalization

PR Details

Associated Jira Ticket Number/Link

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Continuous Integration

Technical details

This PR is to fix https://ontrack-internal.amd.com/browse/SWDEV-539022. In some workload, during the exiting stage, the framework uses a fast-fail pattern: if one process finishes, it kills all peer processes. E.g. in vLLM:
vllm/executor/multiproc_worker_utils.py:127:

 # Cleanup any remaining workers
            if logger:
                logger.info("Killing local vLLM worker processes")
            for worker in self.workers:
                worker.kill_worker()

This fix is using a POSIX semaphore to sync with peer processes, wait all peers finish and keep executing.

Added/updated tests?

  • Yes
  • No, Does not apply to this PR.

Updated CHANGELOG?

  • Yes
  • No, Does not apply to this PR.

Added/Updated documentation?

  • Yes
  • No, Does not apply to this PR.

@rocm-devops

Copy link
Copy Markdown
Author

Manual review required for ce3efa
Assigned Auditors: @udaykikr

@rocm-devops

Copy link
Copy Markdown
Author

Manual review required for 4794bd
Assigned Auditors: @yuxuanli

@rocm-devops

Copy link
Copy Markdown
Author

My concerns lie in the while loop (e.g. is it possible for one process to get beyond this function but still exist when another process reads <ppid>/children -- resulting in an infinite loop) and whether it is possible that the parent PID could have unrelated child processes that have daemon-like runtimes -- resulting in a hang. At the very least, there needs to be a rocprofv3 option, e.g. --process-sync, which enables this. I am extremely hesitant to have this be the default and for it to be unconditional because apps do strange things.

I need some more time to review but first, the --process-sync option should be added

Thanks for the comments. The --process-sync is added and default is false (not sync).

@rocm-devops

Copy link
Copy Markdown
Author

Code Coverage Report

Code Coverage Report

Tests Only

code coverage tests.png

Samples Only

code coverage samples.png

Tests + Samples

code coverage all.png

@jayhawk-commits

Copy link
Copy Markdown
Contributor

Imported to ROCm/rocm-systems.

ammallya pushed a commit that referenced this pull request Oct 28, 2025
…y Mi300 UndefinedBehaviorSanitizer job (#114)

* Initial fix for runtime error in id_decode.hpp:set_dim_in_rec()

* actual fix: corrected the handling of case where dim==1 (ROCPROFILER_DIMENSION_NONE)

* removing magic numbers

* minor fix

* fix for invalid bool value at runtime

* clang format

* build fix

---------

Co-authored-by: Welton, Benjamin <Benjamin.Welton@amd.com>
Co-authored-by: Benjamin Welton <bewelton@amd.com>

[ROCm/rocprofiler-sdk commit: cffda33]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants