Strip out hermetic llvm from the rocm toolchain. Use specific target for rocm gpu by alekstheod · Pull Request #251 · google-ml-infra/rules_ml_toolchain

alekstheod · 2026-04-23T19:12:36Z

This PR strips hermetic LLVM out of the ROCm toolchain. It removes the wrapper that dynamically
selected the compiler and instead introduces a new target type for ROCm GPU compilation.

The following analysis was generated by AI:

ROCm Toolchain Architecture: Feature-based vs Wrapper-based Compilation

Overview

This document compares two approaches for ROCm/HIP compilation in Bazel:

Previous approach: 590-line Python wrapper that intercepted compilation commands
Current approach: Feature-based using Bazel's native cc_common API and custom rocm_compile rule

TL;DR

The current feature-based approach is about 69% less code (590 lines down to roughly 180 for the rule and feature), faster, more maintainable, and uses idiomatic Bazel patterns. The wrapper approach was functional but reimplemented functionality that Bazel already provides.

Approach Comparison

Previous: hipcc_wrapper (590 lines)

Architecture:

cc_library → Bazel selects ROCm toolchain → hipcc_wrapper intercepts
          → Wrapper parses flags & env vars → Routes to hipcc or clang

Implementation:

590-line Python script
Detected compilation mode (GPU/CPU/linking) via flag inspection
Manually parsed and filtered compiler flags
Environment variable indirection (HIPCC_PATH, ROCM_PATH, etc.)
Reimplemented Bazel's flag handling in Python
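
To make "mode detection via flag inspection" concrete, here is a minimal sketch of what such logic could look like. It is not the removed wrapper's code: the `Mode` enum, the `detect_mode` name, and the rule that a missing `-c` means linking are assumptions made for illustration. Only the `-x rocm` marker comes from this description.

```python
from enum import Enum


class Mode(Enum):
    GPU_COMPILE = "gpu_compile"
    CPU_COMPILE = "cpu_compile"
    LINK = "link"


def detect_mode(args):
    """Classify a compiler invocation by inspecting its flags (hypothetical logic)."""
    # Assumption: without "-c" the driver is being asked to link.
    if "-c" not in args:
        return Mode.LINK
    # "-x rocm" is the marker that BUILD files attached to GPU sources.
    for flag, value in zip(args, args[1:]):
        if flag == "-x" and value == "rocm":
            return Mode.GPU_COMPILE
    return Mode.CPU_COMPILE


print(detect_mode(["-x", "rocm", "-c", "kernel.cu.cc", "-o", "kernel.o"]))
print(detect_mode(["-c", "stream_executor.cc", "-o", "stream_executor.o"]))
print(detect_mode(["kernel.o", "stream_executor.o", "-o", "app"]))
```

Every decision here depends on string conventions that the wrapper and the BUILD files have to agree on, which is the coupling the feature-based approach removes.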

Pros:

Could use standard cc_library for GPU code (with -x rocm flag)
Single script handled all modes

Cons:

590 lines of Python to maintain
Extra process overhead (Python interpreter on every compilation)
Imperative logic (hard to understand and modify)
Manual string parsing (error-prone)
Anti-pattern in Bazel (wrappers are discouraged)
Hard to debug (multiple execution layers)
Not declarative or composable

Current: rocm_compile Rule + Features (~180 lines)

Architecture:

rocm_library → rocm_compile rule → cc_toolchain features → hipcc directly
                                 → cc_library wraps .o files

Implementation:

rocm_compile rule (~80 lines): Custom Bazel rule that uses cc_common API
rocm_hipcc_feature (~100 lines): Declarative feature defining compiler flags and environment
rocm_library macro (~50 lines): User-friendly wrapper

Pros:

~180 total lines for the rule and feature (about a 69% reduction from 590)
Zero wrapper overhead - direct Bazel action → hipcc execution
Uses Bazel's native mechanisms: cc_common API, feature system, compilation contexts
Declarative flags in features (composable, overridable)
Explicit separation: GPU code clearly marked with rocm_library
Better debuggability: Can see exact hipcc command in Bazel logs
Type-safe flag handling via Bazel APIs
More maintainable: Declarative > Imperative
Idiomatic Bazel: Follows best practices

Cons:

Requires rocm_library macro instead of plain cc_library for GPU code
- Note: This is actually not a limitation - see "Real-world Usage" below
Full cc_toolchain infrastructure defined but not used for automatic resolution
- Note: We still use the infrastructure for everything else - see "What cc_toolchain Provides" below

Detailed Comparison

1. Code Complexity

Metric	Previous	Current	Reduction
Wrapper code	590 lines	0 lines	100%
Rule code	0 lines	80 lines	-
Feature code	0 lines	100 lines	-
Total	590 lines	180 lines	69%
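
As a quick check of the arithmetic, the reduction can be recomputed from the line counts quoted in this description. The counts are the approximate figures given here, not measurements of the repository, and the macro's ~50 lines are listed under "Implementation" but not in the table.

```python
def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


wrapper_lines = 590
rule_lines = 80
feature_lines = 100
macro_lines = 50  # quoted for the rocm_library macro, not part of the table

without_macro = reduction_percent(wrapper_lines, rule_lines + feature_lines)
with_macro = reduction_percent(
    wrapper_lines, rule_lines + feature_lines + macro_lines
)

print(f"excluding macro: {without_macro:.1f}%")  # excluding macro: 69.5%
print(f"including macro: {with_macro:.1f}%")  # including macro: 61.0%
```

Either way the new implementation is well under half the size of the wrapper.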

2. Performance

Previous:

Bazel → fork Python → parse args → execute hipcc

Python interpreter overhead on every compilation
String parsing overhead

Current:

Bazel → execute hipcc directly

Direct execution via Bazel action
No intermediate processes
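
The per-action cost of an interpreter launch can be estimated with a small measurement. This is a rough sketch, not a benchmark of the removed wrapper: it only times starting an interpreter that does nothing, and the function name and run count are arbitrary choices.

```python
import subprocess
import sys
import time


def mean_interpreter_startup(runs=5):
    """Mean wall-clock seconds to start a Python interpreter that does nothing."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Launch a fresh interpreter, as a wrapper-based action would per compile.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


if __name__ == "__main__":
    print(f"mean startup: {mean_interpreter_startup() * 1000:.1f} ms")
```

The result varies by machine. Whether it matters depends on how it compares with the hipcc compile time of a typical kernel.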

3. User Experience

Previous:

cc_library(
    name = "gpu_kernel",
    srcs = ["kernel.cu.cc"],
    copts = ["-x", "rocm"],  # Must remember this magic flag!
)

Current:

rocm_library(
    name = "gpu_kernel",
    srcs = ["kernel.cu.cc"],  # Intent is explicit
)

The current approach is more intuitive - the macro name clearly indicates GPU code.

4. Real-world Usage Pattern

In practice, GPU kernels are almost always kept separate from CPU code:

# GPU kernels - separate .cu.cc files
rocm_library(
    name = "cub_sort_kernel",
    srcs = ["cub_sort_kernel.cu.cc"],
)

# CPU code - normal .cc files
cc_library(
    name = "stream_executor",
    srcs = ["stream_executor.cc"],
    deps = [":cub_sort_kernel"],  # Links GPU objects
)

The "limitation" of requiring rocm_library matches actual usage patterns:

GPU kernels typically live in separate .cu.cc files
They're compiled separately anyway
Mixed GPU+CPU in a single source file is extremely rare
So the requirement rarely constrains real code

5. Maintainability

Adding a new compiler flag:

Previous:

# Edit 590-line wrapper
def filter_flags_for_mode(args, mode):
    # ... complex logic ...
    if new_flag in args:
        # ... handle edge cases ...

Current:

# Edit declarative feature
rocm_hipcc_feature(
    compiler_flags = [
        # ... existing flags ...
        "--new-flag",  # Just add it
    ],
)

The current approach is dramatically easier to maintain.

6. Correctness

Previous: Manual flag parsing

# Wrapper manually parses/filters flags
filtered_args = []
for arg in args:
    if arg.startswith("--offload-arch"):
        if is_linking_mode:
            continue  # Strip for linking
    # ... 100s of lines of logic ...

Prone to parsing bugs
Hard to ensure correctness across all cases
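
A runnable version of the filtering idea above could look like the sketch below. The function name and signature are hypothetical. Only the rule "drop `--offload-arch` when linking" is taken from the snippet, and the real wrapper handled many more cases.

```python
def filter_flags(args, is_linking_mode):
    """Return args with --offload-arch flags removed when linking."""
    filtered_args = []
    for arg in args:
        if is_linking_mode and arg.startswith("--offload-arch"):
            continue  # Strip for linking
        filtered_args.append(arg)
    return filtered_args


args = ["--offload-arch=gfx908", "--offload-arch=gfx90a", "-O2", "kernel.o"]
print(filter_flags(args, is_linking_mode=True))   # ['-O2', 'kernel.o']
print(filter_flags(args, is_linking_mode=False))  # unchanged
```

Each rule of this kind is a string match that has to stay in sync with the compiler's option syntax, and the wrapper accumulated many such rules.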

Current: Bazel API handles it

compiler_flags = cc_common.get_memory_inefficient_command_line(
    feature_configuration=feature_configuration,
    action_name="c++-compile",
    variables=compile_variables,
)

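Battle-tested by Google
Flags arrive as a structured list, with no string re-parsing in the rule
Flag expansion and ordering are handled by Bazel rather than by custom code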

7. Debuggability

Previous:

Compilation fails
→ Check Bazel logs
→ Find wrapper Python command
→ Understand wrapper logic
→ Check environment variables
→ Trace through wrapper
→ Find actual hipcc command

Current:

Compilation fails
→ Check Bazel logs
→ See exact hipcc command immediately

Fewer layers = easier debugging.

8. What cc_toolchain Actually Provides

The current approach uses the cc_toolchain infrastructure for many critical things:

✅ File dependencies (cc_toolchain.all_files)

ROCm toolkit files (hipcc, clang headers, libraries)
Sysroot files (hermetic or local)
All transitive dependencies

✅ Compilation context integration (cc_common.merge_compilation_contexts)

Include paths from all dependencies
System include paths
Quote includes
Defines
Header files from the entire dependency graph

✅ Built-in include directories (cxx_builtin_include_directories)

Critical for header discovery
Sysroot integration (hermetic vs local)

✅ Feature system

Different configurations (dbg, opt, fastbuild)
Conditional compilation flags
Integration with Bazel's standard features
Future extensibility - add flags declaratively
Feature inheritance and composition

✅ Environment variables (cc_common.get_environment_variables)

HIPCC_PATH, ROCM_PATH, HIPCC_VERSION, ROCM_CLANG_VERSION
Set declaratively in features

✅ Compiler flags (cc_common.get_memory_inefficient_command_line)

All flags from features
Correct ordering
Mode-specific flags (compile vs link)

The only thing we don't use is automatic toolchain resolution. But manual selection is actually more explicit and clearer - the rocm_compile rule states exactly which toolchain it uses.

Without the cc_toolchain infrastructure, we'd need to manually track all of the above. The wrapper approach had to reimplement much of this logic.
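
The way features contribute flags and environment variables can be pictured with a toy model. This is a deliberately simplified sketch, not Bazel's implementation: the `Feature` class, `command_line`, and `environment` are names invented here, and the `/opt/rocm` path is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    enabled: bool = True
    # Maps an action name (e.g. "c++-compile") to the flags this feature adds.
    flags: dict = field(default_factory=dict)
    env: dict = field(default_factory=dict)


def command_line(features, action_name):
    """Concatenate the flags of enabled features, in declaration order."""
    result = []
    for feature in features:
        if feature.enabled:
            result.extend(feature.flags.get(action_name, []))
    return result


def environment(features):
    """Merge the environment entries of enabled features."""
    env = {}
    for feature in features:
        if feature.enabled:
            env.update(feature.env)
    return env


features = [
    Feature("opt", enabled=False, flags={"c++-compile": ["-O2"]}),
    Feature("dbg", flags={"c++-compile": ["-g"]}),
    Feature(
        "rocm_hipcc_feature",
        flags={"c++-compile": ["--offload-arch=gfx90a", "-fno-gpu-rdc"]},
        env={"ROCM_PATH": "/opt/rocm"},  # placeholder path
    ),
]

print(command_line(features, "c++-compile"))
print(environment(features))
```

Switching a configuration on or off changes the resulting command line without any flag-parsing code, which is the property the rule relies on.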

Architecture Details

rocm_compile Rule

The rocm_compile rule is the heart of the implementation:

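# Simplified sketch of the rule implementation.
def _rocm_compile_impl(ctx):
    # Get ROCm toolchain (manual selection)
    cc_toolchain = ctx.attr._cc_toolchain[cc_common.CcToolchainInfo]

    # Merge compilation contexts from dependencies
    compilation_contexts = [dep[CcInfo].compilation_context for dep in ctx.attr.deps]
    merged_context = cc_common.merge_compilation_contexts(
        compilation_contexts=compilation_contexts,
    )

    # Get feature configuration
    feature_configuration = cc_common.configure_features(
        ctx=ctx,
        cc_toolchain=cc_toolchain,
        requested_features=ctx.features,
        unsupported_features=ctx.disabled_features,
    )

    # Build the variables (include paths, defines) that the features expand
    compile_variables = cc_common.create_compile_variables(
        feature_configuration=feature_configuration,
        cc_toolchain=cc_toolchain,
        include_directories=merged_context.includes,
        quote_include_directories=merged_context.quote_includes,
        system_include_directories=merged_context.system_includes,
        preprocessor_defines=merged_context.defines,
    )

    # Get compiler flags from features
    compiler_flags = cc_common.get_memory_inefficient_command_line(
        feature_configuration=feature_configuration,
        action_name="c++-compile",
        variables=compile_variables,
    )

    # Get environment variables from features
    env_vars = cc_common.get_environment_variables(
        feature_configuration=feature_configuration,
        action_name="c++-compile",
        variables=compile_variables,
    )

    # Get hipcc executable
    hipcc = ctx.files._hipcc[0]

    # Compile each source into its own object file
    objects = []
    for src in ctx.files.srcs:
        obj = ctx.actions.declare_file(src.basename + ".o")
        objects.append(obj)
        ctx.actions.run(
            executable=hipcc,
            # Arguments must be strings, so pass file paths rather than File objects
            arguments=["-x", "hip", "-c"] + compiler_flags + [src.path, "-o", obj.path],
            inputs=depset(
                direct=[src] + ctx.files.hdrs,
                transitive=[merged_context.headers, cc_toolchain.all_files],
            ),
            outputs=[obj],
            env=env_vars,
        )

    # Expose the object files so a cc_library can list this target in srcs
    return [DefaultInfo(files=depset(objects))]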

Key points:

Uses cc_common API throughout
Leverages Bazel's compilation context merging
Gets flags and environment declaratively from features
Direct action execution (no wrapper)
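
The command that each action ends up running can be modelled outside Bazel. The sketch below only mirrors the argument order used by the rule above. The helper name, the hipcc path, and the output directory are hypothetical placeholders.

```python
import posixpath
import shlex


def hipcc_command(hipcc, src, compiler_flags, out_dir):
    """Build the argv for compiling one HIP source file to an object file."""
    obj = posixpath.join(out_dir, posixpath.basename(src) + ".o")
    return [hipcc, "-x", "hip", "-c"] + list(compiler_flags) + [src, "-o", obj]


command = hipcc_command(
    hipcc="external/rocm/bin/hipcc",  # placeholder path
    src="kernels/cub_sort_kernel.cu.cc",
    compiler_flags=["--offload-arch=gfx90a", "-fno-gpu-rdc"],
    out_dir="bazel-out/bin/kernels",  # placeholder path
)
print(shlex.join(command))
```

Because the action invokes hipcc directly, a line of this shape is what appears in Bazel's logs (for example with `--subcommands`), with no wrapper invocation in between.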

rocm_hipcc_feature

Declarative feature defining all ROCm-specific configuration:

rocm_hipcc_feature(
    name = "rocm_hipcc_feature",
    enabled = True,
    compiler_flags = [
        "--rocm-path={rocm_path}",
        "--offload-arch=gfx908",
        "--offload-arch=gfx90a",
        "-fno-gpu-rdc",
        "-D__HIP_PLATFORM_AMD__",
        # ... all ROCm flags ...
    ],
    env_sets = {
        "HIPCC_PATH": "{hipcc_path}",
        "ROCM_PATH": "{rocm_path}",
        # ... environment variables ...
    },
)

Benefits:

Easy to read and modify
Composable with other features
Can be overridden per-target if needed
No imperative logic
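
The `{rocm_path}` and `{hipcc_path}` entries above are placeholders. How the toolchain fills them is not described here, so the sketch below simply assumes `str.format`-style substitution. The helper name and the install paths are hypothetical.

```python
def expand_feature(compiler_flags, env_sets, **values):
    """Fill {placeholders} in flag and environment templates."""
    flags = [flag.format(**values) for flag in compiler_flags]
    env = {key: template.format(**values) for key, template in env_sets.items()}
    return flags, env


flags, env = expand_feature(
    compiler_flags=[
        "--rocm-path={rocm_path}",
        "--offload-arch=gfx908",
        "-D__HIP_PLATFORM_AMD__",
    ],
    env_sets={"HIPCC_PATH": "{hipcc_path}", "ROCM_PATH": "{rocm_path}"},
    # Hypothetical install locations, used only for illustration.
    rocm_path="/opt/rocm",
    hipcc_path="/opt/rocm/bin/hipcc",
)
print(flags)
print(env)
```

Flags without placeholders pass through unchanged, so adding a plain flag stays a one-line edit.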

rocm_library Macro

User-friendly wrapper:

def rocm_library(name, srcs, hdrs = [], deps = [], **kwargs):
    rocm_compile(
        name = name + "_rocm_objects",
        srcs = srcs,
        hdrs = hdrs,
        deps = deps,
    )
    
    cc_library(
        name = name,
        hdrs = hdrs,
        srcs = [":" + name + "_rocm_objects"],
        deps = deps,
        **kwargs
    )

Compiles GPU code with rocm_compile, then wraps .o files in standard cc_library for linking.
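
To see which targets one `rocm_library` call produces, the macro can be mimicked in plain Python with stand-in rule functions that record their arguments. This is a model for illustration only: the `declared` list and the stub `rocm_compile` and `cc_library` functions are not Bazel APIs.

```python
declared = []


def rocm_compile(**attrs):
    """Stand-in for the real rule: record the target instead of building it."""
    declared.append(("rocm_compile", attrs))


def cc_library(**attrs):
    """Stand-in for the native rule: record the target instead of building it."""
    declared.append(("cc_library", attrs))


def rocm_library(name, srcs, hdrs=None, deps=None, **kwargs):
    hdrs = hdrs or []
    deps = deps or []
    objects = name + "_rocm_objects"
    rocm_compile(name=objects, srcs=srcs, hdrs=hdrs, deps=deps)
    cc_library(name=name, hdrs=hdrs, srcs=[":" + objects], deps=deps, **kwargs)


rocm_library(name="gpu_kernel", srcs=["kernel.cu.cc"])
for kind, attrs in declared:
    print(kind, attrs["name"], attrs["srcs"])
# rocm_compile gpu_kernel_rocm_objects ['kernel.cu.cc']
# cc_library gpu_kernel [':gpu_kernel_rocm_objects']
```

Dependents keep referring to `:gpu_kernel`. The `_rocm_objects` target is an internal detail of the macro.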

Local Sysroot Support

Both approaches face the same fundamental constraint: You cannot mix hermetic headers with system libraries (ABI mismatch → undefined symbols or segfaults).

The current approach handles this correctly:

ROCm toolchain: Supports local_sysroot (for system ROCm case)
- Uses hermetic headers (needed for device code)
- Can link against system libraries via local_sysroot_default_libs feature
Hermetic LLVM toolchain: Stays fully hermetic
- Prevents ABI mismatches
- Ensures exec tools (tblgen, etc.) don't crash

This is an architectural constraint, not an implementation issue.

Migration Path

Converting from previous to current approach is straightforward:

Before:

cc_library(
    name = "kernel",
    srcs = ["kernel.cu.cc"],
    copts = ["-x", "rocm"],
)

After:

rocm_library(
    name = "kernel",
    srcs = ["kernel.cu.cc"],
)

The new syntax is actually clearer about intent.
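
Finding the targets that need this change can be partly automated. The sketch below is a hypothetical helper, not part of this PR. It assumes the BUILD file happens to parse as Python, which holds for simple files like the examples here but not for every BUILD file, and it only recognizes a literal `copts` list.

```python
import ast


def find_targets_to_migrate(build_text):
    """Return names of cc_library targets whose copts contain "-x", "rocm"."""
    names = []
    for node in ast.walk(ast.parse(build_text)):
        is_cc_library = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "cc_library"
        )
        if not is_cc_library:
            continue
        kwargs = {kw.arg: kw.value for kw in node.keywords}
        copts = kwargs.get("copts")
        if not isinstance(copts, ast.List):
            continue
        values = [e.value for e in copts.elts if isinstance(e, ast.Constant)]
        # Look for the adjacent pair "-x", "rocm" in the literal list.
        if any(a == "-x" and b == "rocm" for a, b in zip(values, values[1:])):
            name = kwargs.get("name")
            if isinstance(name, ast.Constant):
                names.append(name.value)
    return names


BUILD_TEXT = '''
cc_library(
    name = "kernel",
    srcs = ["kernel.cu.cc"],
    copts = ["-x", "rocm"],
)

cc_library(
    name = "stream_executor",
    srcs = ["stream_executor.cc"],
    deps = [":kernel"],
)
'''

print(find_targets_to_migrate(BUILD_TEXT))  # ['kernel']
```

Targets that build `copts` through variables or `select()` would be missed and need a manual pass.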

Conclusion

The current feature-based approach is superior in every meaningful metric:

Aspect	Winner	Margin
Code size	Current	69% reduction
Performance	Current	No Python overhead
Maintainability	Current	Declarative vs imperative
Correctness	Current	Uses battle-tested Bazel APIs
Debuggability	Current	Fewer layers
User experience	Current	More intuitive
Bazel alignment	Current	Idiomatic vs anti-pattern

The wrapper was a reasonable first attempt, but the feature-based architecture is the correct long-term solution.

References

cc/rocm/rocm_compile.bzl - Main compilation rule
cc/rocm/features/rocm_hipcc_feature.bzl - Feature definition
cc/rocm/rocm_library.bzl - User-facing macro
cc/impls/linux_x86_64_linux_x86_64_rocm/BUILD - Toolchain configuration

alekstheod force-pushed the introduce_hipcc_compiler_options_env_variable_for_rocm branch 4 times, most recently from 652aabc to a41717e Compare April 26, 2026 18:16

alekstheod mentioned this pull request Apr 27, 2026

[ROCm] Use hermetic llvm with stripped out hipcc wrapper script openxla/xla#41619

Draft

alekstheod force-pushed the introduce_hipcc_compiler_options_env_variable_for_rocm branch 15 times, most recently from eea81de to da8e920 Compare May 4, 2026 08:31

alekstheod force-pushed the introduce_hipcc_compiler_options_env_variable_for_rocm branch 7 times, most recently from e0c44e9 to 434825e Compare May 8, 2026 13:17

[ROCm] Implement hermetic rocm toolchain (google-ml-infra#241)

cb1d447

alekstheod force-pushed the introduce_hipcc_compiler_options_env_variable_for_rocm branch from 434825e to cb1d447 Compare May 8, 2026 14:01

alekstheod wants to merge 1 commit into google-ml-infra:main from alekstheod:introduce_hipcc_compiler_options_env_variable_for_rocm
