Add function args serialization to RCCL buffer records #129

Closed
rocm-devops wants to merge 8 commits into amd-staging from mkuriche/rccl-api-args

Conversation

@rocm-devops

PR Details

Adds RCCL function parameter serialization. Requested by RCCL team.

There is no ticket for it yet, but the RCCL team also asked whether NVTX ranges are supported. That request is outside the scope of this ticket and PR; it is noted here for reference.

Associated Jira Ticket Number/Link

SWDEV-528449
SWDEV-527517

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Continuous Integration

Technical details

Adds serialization of RCCL function arguments to the RCCL buffer records.
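
For readers unfamiliar with the idea, a minimal sketch of argument serialization follows. It is not the code in this PR: `arg_list_t`, `stringify`, `serialize_args`, and `record_all_reduce` are hypothetical names, and the stand-in function does not use the real RCCL signature. It only illustrates turning each call argument into a (name, value) string pair that could be attached to a record.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record payload: one (name, value) string pair per argument.
using arg_list_t = std::vector<std::pair<std::string, std::string>>;

// Convert a single argument value to text via operator<<.
template <typename T>
std::string stringify(const T& value)
{
    std::ostringstream oss;
    oss << value;
    return oss.str();
}

// Base case: no arguments left to serialize.
inline void serialize_args(arg_list_t&) {}

// Recursive case: consume one (name, value) pair, then the remaining pairs.
template <typename T, typename... Rest>
void serialize_args(arg_list_t& out, const char* name, const T& value, const Rest&... rest)
{
    out.emplace_back(name, stringify(value));
    serialize_args(out, rest...);
}

// Hypothetical stand-in for a traced collective call (assumed parameters,
// not the real RCCL API).
inline arg_list_t record_all_reduce(std::size_t count, int datatype, int op)
{
    arg_list_t args;
    serialize_args(args, "count", count, "datatype", datatype, "op", op);
    return args;
}
```

In this sketch, calling `record_all_reduce(1024, 7, 0)` yields three pairs in call order, with every value stored as text so a consumer can print the arguments without knowing their types.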

Added/updated tests?

  • Yes
  • No, Does not apply to this PR.

Updated CHANGELOG?

  • Yes
  • No, Does not apply to this PR.

Added/Updated documentation?

  • Yes
  • No, Does not apply to this PR.

@rocm-devops

Author

Code Coverage Report

Tests Only

code coverage tests.png

Samples Only

code coverage samples.png

Tests + Samples

code coverage all.png

@rocm-devops

Author

The RCCL test is failing because the following amd_comgr calls return an error code:

code_object.cpp:153] amd_comgr_lookup_code_object(data_object, query_list.data(), query_list.size()) failed with error code 2 :: INVALID_ARGUMENT
code_object.cpp:196] amd_comgr_set_data(binary_data, isa_offset.size, static_cast<const char*>(bin_offset)) returned error code 2 :: INVALID_ARGUMENT :: binary_data=859417280, isa=(amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-, 0, 0), fat_bin=0x7f3e49a1e000
code_object.cpp:196] amd_comgr_set_data(binary_data, isa_offset.size, static_cast<const char*>(bin_offset)) returned error code 2 :: INVALID_ARGUMENT :: binary_data=843925536, isa=(amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-, 0, 0), fat_bin=0x7f3e49a1e000

This is probably caused by the (new?) ISA amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-.

@rocm-devops

Author

Created #536 for some failing tests.

@amd-hsivasun

Imported to rocm-systems

Labels

ready for peer review: PR needs initial review

Projects

None yet

3 participants