Enable using Kokkos with CUDA backend for MIRCO and in general #2012

Draft
PhilipOesterlePekrun wants to merge 4 commits into 4C-multiphysics:main from PhilipOesterlePekrun:mirco/KokkosCudaCleanedUp

PhilipOesterlePekrun commented May 4, 2026

Member

@mayrmt

Description and Context

This PR adds build-system support for compiling 4C with CUDA-enabled Kokkos. For this purpose, this PR introduces a compiler wrapper, clangcuda++, which should be used as the CMAKE_CXX_COMPILER (or, in the case of MPI, the OMPI_CXX backend), as well as some CMake additions which are optionally enabled.

This change is based on my attempts to get 4C to compile while using MIRCO (library) with Kokkos' CUDA backend for GPU offloading. Several 4C translation units fail to compile with the kokkos_launch_compiler or nvcc_wrapper provided by Kokkos because NVCC does not seem to be able to compile some more "advanced" C++ code, which we have a fair bit of in 4C since it is a large project. Clang happens to compile CUDA or CUDA-related code much better than Nvidia's own compiler (don't ask me why), but still has some issues by itself. Using the clangcuda++ wrapper as the CMAKE_CXX_COMPILER (or, in the case of MPI, the OMPI_CXX backend), along with the corresponding target properties and compile definitions by setting FOUR_C_CLANGCUDA=ON, allows compiling all of 4C with CUDA-enabled Kokkos.

The current implementation distinguishes between CUDA host-side compilation and CUDA device compilation. At the moment, 4C itself only requires CUDA host-side compilation, since it does not yet contain raw Kokkos device kernels such as Kokkos::parallel_for, KOKKOS_LAMBDA, etc. in its own sources. However, this becomes possible by simply marking a target or source with FOUR_C_CLANGCUDA_DEVICE_COMPILE, which allows GPU offloading anywhere in 4C. As long as Trilinos is built with TPETRA_INST_CUDA=OFF, this also does not conflict with any existing MPI parallelism (NO will be KokkosSerial).
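As a rough illustration of the host/device distinction, a wrapper could pick its mode from the compile definitions found on the command line. The following is a simplified sketch and not the actual utilities/clangcuda++ script: the function name select_cuda_mode and the "plain" fallback label are hypothetical, and only the two definitions FOUR_C_CLANGCUDA_HOST_ONLY and FOUR_C_CLANGCUDA_DEVICE_COMPILE are taken from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: choose a clang CUDA mode from the compile definitions
# passed to the wrapper. The function name and the "plain" label are assumptions.
select_cuda_mode() {
  local host_only=0 device=0 arg
  for arg in "$@"; do
    case "$arg" in
      -DFOUR_C_CLANGCUDA_HOST_ONLY) host_only=1 ;;
      -DFOUR_C_CLANGCUDA_DEVICE_COMPILE) device=1 ;;
    esac
  done

  if (( host_only && device )); then
    # Contradictory request: refuse instead of guessing.
    echo "error: both host-only and device compilation requested" >&2
    return 1
  elif (( device )); then
    echo "device"      # a wrapper would add: -x cuda
  elif (( host_only )); then
    echo "host-only"   # a wrapper would add: -x cuda --cuda-host-only
  else
    echo "plain"       # arguments would be passed through unchanged
  fi
}
```

For example, `select_cuda_mode -O2 -DFOUR_C_CLANGCUDA_HOST_ONLY -c foo.cpp` reports the host-only mode, while passing both definitions at once is rejected with a non-zero exit status.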

The changes were verified by compiling 4C successfully with the relevant Kokkos CUDA-enabled setup and I also tested small Kokkos kernel examples to confirm that host-side and device CUDA compilation are distinguished correctly. I have documented how Trilinos, MIRCO, and 4C should be compiled for this combination to work. I'll just attach that to this PR for now: DocumentKokkosCuda_4C_Trilinos_MIRCO.zip

Once imcs-compsim/MIRCO#146 is merged, FetchContent will work without issue for that MIRCO state.

My questions:

  • Should I make a build test?
  • Should I make a framework test (if we can test with GPUs)?
  • Should I make a small example or tutorial to show exactly how to enable using raw Kokkos device kernels?

Related Issues and Pull Requests

Blocked by imcs-compsim/MIRCO#146

Docker and Workflow tests

I have added a Dockerfile, Trilinos installation scripts, and a workflow .yml which are similar to the trilinos_develop concept that we have, simply because it is also using (two) special Trilinos installations. I'm not sure whether we should always test what I called trilinos_kokkosparallel on each pull request or not. I also made it so it pushes to the same docker as trilinos_develop with a different tag. These are design choices, so please tell me if you have an opinion on it.

Signed-off-by: Philip Oesterle-Pekrun <philipoesterlepekrun@gmail.com>
@PhilipOesterlePekrun PhilipOesterlePekrun marked this pull request as draft May 4, 2026 10:32
@PhilipOesterlePekrun PhilipOesterlePekrun force-pushed the mirco/KokkosCudaCleanedUp branch from da5207d to 575c3b3 Compare May 4, 2026 12:53
  #else

- using ExecutionSpace = Kokkos::DefaultExecutionSpace;
+ using ExecutionSpace = Kokkos::DefaultHostExecutionSpace;

Contributor

Why is this changed to host?

Member Author

Sorry for the late reply @isteinbrecher.

This was causing an error related to ArborX when compiling (ArborX_compileFailure.log), but I just realized that this is because I let 4C fetch ArborX, and ArborX's CUDA support is off by default (ArborX also uses CUDA through Kokkos, but I think you need to enable it explicitly in ArborX as well). You can build ArborX with CUDA and then it should work. However, I'm not sure whether this might lead to oversubscription of the GPU with our typical use case of MPI (multiple MPI ranks per compute node). I'll look into this further, but of course for my use case I can also turn ArborX off because it is not exactly related, so you're right, I'll amend that.

Member

For now, I'd be fine with switching to host execution space here. This preserves prior behavior, so everything remains exactly as it was for ArborX.

Member Author

I think nothing in the typical MIRCO execution path uses the geometric search, so I can actually just undo the change and turn off ArborX.

Member

Yes, MIRCO-based simulations do not use ArborX at the moment. Still, I support switching to the host execution space, just to make clear that ArborX works on the host only so far.

@maxfirmbach Any thoughts on this?

Member Author

We have decided to keep it as DefaultExecutionSpace and throw a warning when compiling with FOUR_C_CLANGCUDA and FOUR_C_WITH_ARBORX.

@maxfirmbach

Contributor

@PhilipOesterlePekrun Thanks for the work! I really like the efforts, yet I also have some concerns. The current PR comes with quite a few assumptions and restrictions we should clearly state and discuss (maybe something for the next community meeting?).

@maxfirmbach

maxfirmbach commented May 6, 2026

Contributor

Concerning your questions:

  • I think a build test is necessary (currently this PR is clang related, do we have any clues about gcc?). Maybe with Mirco and ArborX activated, which can actually use GPUs under the hood.
  • Testing would be nice, yet I don't know if it makes too much sense. Only a handful of tests can make use of GPUs.

For my personal taste, the current state has too many constraints to get things to work. Are there parts we can work on separately to make the use of CUDA easier in the near future?

@PhilipOesterlePekrun

Member Author

@maxfirmbach Thanks for the feedback. I tried to keep it as isolated as possible, but yes it does change some global CMake files (though everything is conditional upon the global option I introduced). We can definitely discuss this in the community meeting (or even earlier).

Regarding your second comment, the constraint here is really that GCC cannot compile CUDA-related code but the kokkos_launch_compiler or nvcc_wrapper use NVCC which cannot compile some of our code--hence, Clang is required to compile this. My solution really does work, but I would be happy about any suggestions to integrate it more cleanly.

I do have a general question, @isteinbrecher and @maxfirmbach, to clarify the purpose of my PR.
It was my understanding that nobody ever really used device-enabled Kokkos (CUDA or HIP etc.) with 4C. But, that was I guess an assumption, so I should clearly ask: have you ever used device-enabled Kokkos with 4C, or do you know of any instance of that? I had to introduce the changes in this PR to make it work (for the reasons I stated), but obviously my solution came from a lot of trial and error and maybe someone else has done it better (but not pushed it to 4C?).

@mayrmt

mayrmt commented May 7, 2026

Member

@PhilipOesterlePekrun To my knowledge, nobody ever ran 4C on device so far.

Copilot AI left a comment

Contributor

Pull request overview

This PR adds build-system support to compile 4C with a CUDA-enabled Kokkos installation by introducing a clangcuda++ compiler wrapper and a FOUR_C_CLANGCUDA CMake option that adjusts compile definitions/launchers for clang-based CUDA host/device compilation. It also updates ArborX/Kokkos usage to explicitly run on the host when Kokkos’ default execution space may be CUDA, and refreshes MIRCO integration settings.

Changes:

  • Added utilities/clangcuda++ wrapper to drive clang CUDA host-only/device compilation based on compile definitions and file extensions.
  • Introduced FOUR_C_CLANGCUDA global option and applied related target/global launcher settings + FOUR_C_CLANGCUDA_HOST_ONLY compile definition in key build targets.
  • Switched geometric search ArborX execution/memory spaces to Kokkos::DefaultHostExecutionSpace to avoid unintended CUDA default execution space usage; updated MIRCO fetch configuration and git tag.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| utilities/clangcuda++ | New clang CUDA compiler wrapper handling host-only/device compilation modes. |
| src/cut/4C_cut_pointgraph.cpp | Adds a clang-CUDA-related preprocessor workaround before including Boost graphviz. |
| src/core/geometric_search/src/4C_geometric_search_distributed_tree.cpp | Forces ArborX distributed tree build/query to use host execution space. |
| src/core/geometric_search/src/4C_geometric_search_bvh.hpp | Forces BVH execution/memory space selection to host execution space. |
| cmake/setup_global_options.cmake | Adds FOUR_C_CLANGCUDA global option. |
| cmake/functions/four_c_auto_define_module.cmake | Applies launcher overrides + host-only define to module object libraries when enabled. |
| cmake/configure/configure_Trilinos.cmake | Disables compiler/rule launchers when FOUR_C_CLANGCUDA is enabled. |
| cmake/configure/configure_MIRCO.cmake | Updates MIRCO fetch configuration and moves dependency registration outside the conditional. |
| apps/global_full/CMakeLists.txt | Applies launcher overrides + host-only define to the main executable when enabled. |

Comment thread utilities/clangcuda++
Comment on lines +75 to +110
# Sanity check
if [[ "$cuda_host_only" == "1" && "$cuda_device" == "1" ]]; then
  has_explicit_device=0
  for arg in "$@"; do
    if [[ "$arg" == "-DFOUR_C_CLANGCUDA_DEVICE_COMPILE" ]]; then
      has_explicit_device=1
      break
    fi
  done

  if [[ "$has_explicit_device" == "1" ]]; then
    echo "clangcuda++ wrapper error: both FOUR_C_CLANGCUDA_HOST_ONLY and FOUR_C_CLANGCUDA_DEVICE_COMPILE were set" >&2
    exit 1
  fi
fi

if [[ "$compile" == "1" && "$cuda_host_only" == "1" ]]; then
  final=(
    "$clang"
    -x cuda
    --cuda-host-only
    --cuda-path="$cuda_path"
    --cuda-gpu-arch="$arch"
    -Wno-unknown-cuda-version
    "${args[@]}"
  )
elif [[ "$compile" == "1" && "$cuda_device" == "1" ]]; then
  final=(
    "$clang"
    -x cuda
    --cuda-path="$cuda_path"
    --cuda-gpu-arch="$arch"
    -Wno-unknown-cuda-version
    "${args[@]}"
  )
else

Member Author

Fair point, I took the .cu idea from kokkos_launch_compiler, but I guess we really don't need it.

Comment on lines +16 to 18
#undef __noinline__
#endif
#include <boost/graph/graphviz.hpp>
Comment thread cmake/setup_global_options.cmake Outdated
four_c_process_global_option(
FOUR_C_CLANGCUDA
DESCRIPTION
"Enable the relevant CMake compile definitions needed to use utilities/ClangCuda++ as the compiler. This is currently necessary to use the CUDA backend of Kokkos, e.g. along with MIRCO."
Comment on lines +84 to +86

get_property(_global_rule GLOBAL PROPERTY RULE_LAUNCH_COMPILE)
get_property(_dir_rule DIRECTORY PROPERTY RULE_LAUNCH_COMPILE)
@maxfirmbach

maxfirmbach commented May 7, 2026

Contributor

> @maxfirmbach Thanks for the feedback. I tried to keep it as isolated as possible, but yes it does change some global CMake files (though everything is conditional upon the global option I introduced). We can definitely discuss this in the community meeting (or even earlier).
>
> Regarding your second comment, the constraint here is really that GCC cannot compile CUDA-related code but the kokkos_launch_compiler or nvcc_wrapper use NVCC which cannot compile some of our code--hence, Clang is required to compile this. My solution really does work, but I would be happy about any suggestions to integrate it more cleanly.
>
> I do have a general question, @isteinbrecher and @maxfirmbach, to clarify the purpose of my PR. It was my understanding that nobody ever really used device-enabled Kokkos (CUDA or HIP etc.) with 4C. But, that was I guess an assumption, so I should clearly ask: have you ever used device-enabled Kokkos with 4C, or do you know of any instance of that? I had to introduce the changes in this PR to make it work (for the reasons I stated), but obviously my solution came from a lot of trial and error and maybe someone else has done it better (but not pushed it to 4C?).

@PhilipOesterlePekrun Understood! I also don't think that someone has tried this so far, because there was frankly no reason for it ... all of the code in 4C itself is host only.

@PhilipOesterlePekrun

Member Author

@maxfirmbach A build test is possible; however, it requires the following:

  • To change the core Dockerfile of 4C to apt-get the relevant cuda-related packages like nvcc, cusolver, cublas, etc. Then the docker image would have to be updated of course. I see the last update was in 2024, so I suspect this is something we don't want to do too often.
  • Add a second dependencies/current/trilinos/install.sh, which installs Trilinos with CUDA enabled, and potentially another installation if we want to test with Kokkos' OpenMP backend.

Again, the current changes are isolated with FOUR_C_CLANGCUDA, so I might suggest that we wait on introducing a build test until other parts of the code implement Kokkos device-side code. But, I'll make a build test if that's what we want. You can see an example of the changes to make such a build test in this MIRCO PR imcs-compsim/MIRCO#146.

By the way, an actual run test is of course not possible unless we have GPU test runners for the workflow/action.

Signed-off-by: Philip Oesterle-Pekrun <philipoesterlepekrun@gmail.com>
@mayrmt

mayrmt commented May 16, 2026

Member

@PhilipOesterlePekrun Updating the Dockerfile should be fine. Especially due to the specialized nature of the Kokkos-Cuda integration of MIRCO into 4C, where the number of users and experts is limited to less than a handful, proper testing is really important. So, I'd support to change the docker base containers and provide Cuda with them.

@ppraegla

Member

> @PhilipOesterlePekrun Updating the Dockerfile should be fine. Especially due to the specialized nature of the Kokkos-Cuda integration of MIRCO into 4C, where the number of users and experts is limited to less than a handful, proper testing is really important. So, I'd support to change the docker base containers and provide Cuda with them.

Just a heads up, it might not be possible to install the cuda packages into the normal Docker image we use for testing due to the size of cuda. In the past, we had problems that there was not enough space left on the runner (14 GB disk space) to build 4C because our docker image is so big. So, we need to be careful what we add to the dependency image.

Doing sudo apt install nvidia-cuda-toolkit in the docker container shows that the installation will be 7 GB, which is definitely too large for the runner. This would only leave 2 GB to clone and build 4C.

Maybe there is a way to install only a minimal version of cuda that is sufficient for the pipeline. Or you can try to create a separate docker image that only has the minimum dependencies to build 4C with Cuda, e.g., the 4c-minimal image is 5 GB smaller than the 4c image by removing doxygen, clang, ...

@PhilipOesterlePekrun

Member Author

@ppraegla Yes, a separate Dockerfile like the one for trilinos_develop is the way to go. I do have one with the minimal CUDA toolkit requirements, which include cusolver, but it still adds a few GB, so I don't want to affect the main 4C Docker image.

@davidrudlstorfer

Contributor

We also briefly discussed this after the meeting, and I still strongly oppose adding a lot of new functionality without testing it.

My main question would be: why is the Docker container even created in the first place if it's not used for anything afterwards?

If you use this workflow daily in the IMCS, you will probably not download the Docker container but install everything on your local system. So why create the Docker container when it won't be tested at all?

@PhilipOesterlePekrun @4C-multiphysics/maintainer

@PhilipOesterlePekrun

PhilipOesterlePekrun commented May 27, 2026

Member Author

@davidrudlstorfer The compilation is tested, but it's true that this may not be necessary until we have Kokkos device code in some other parts of 4C, as the compilation of MIRCO itself is already tested in MIRCO's repository. So, if we agree that it is not needed currently, I can keep those changes in a branch of my fork in case we do want it later, that is totally fine too. The docker would have just been for the build test, yes.

@PhilipOesterlePekrun

PhilipOesterlePekrun commented May 28, 2026

Member Author

@mayrmt
I accidentally clicked close PR. Regarding nvcc and the need for clangcuda++, I've compiled with both nvcc_wrapper and kokkos_launch_compiler. The latter automatically uses nvcc on everything with -DKOKKOS_DEPENDENCE, which ends up being all or almost all 4C targets because they transitively link to Kokkos due to Xpetra being used in core/linalg. I think nvcc_wrapper and kokkos_launch_compiler are functionally the same then, and everything is compiled with nvcc as device code (even though it isn't). And nvcc has some issues.

I've attached the full build.log which I made with ninja -j64 -k 0 2>&1 | tee build.log, as well as a filtered version which is an attempt to show the errors uniquely (there are 15 different headers which are root issues for many TUs/targets). It is identical whether nvcc_wrapper or kokkos_launch_compiler is used. Basically, fixing this for nvcc might require changing multiple different parts of the code, whereas clangcuda++ is at least isolated (in this PR, I did have to conditionally undefine one macro for an #include, but that is a CUDA thing in general). Clang is simply more compliant with C++ than nvcc.

build.log.gz (compressed due to GitHub file size limit)
unique-error-files.txt
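
A filtered list of this kind could be produced roughly as follows. This is only a sketch of one possible approach and not necessarily how unique-error-files.txt was generated: the function name unique_error_files is hypothetical, and the code assumes diagnostics of the form `<path>:<line>:<col>: error: ...` or `<path>(<line>): error: ...`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print each file that is the location of a compiler
# error exactly once. The two diagnostic formats matched here are assumptions.
unique_error_files() {
  local log="$1"
  # Keep only error lines, strip everything from the line number onwards,
  # then sort and deduplicate the remaining paths.
  grep -E '(:[0-9]+:[0-9]+: |\([0-9]+\): )(fatal )?error' "$log" \
    | sed -E 's/(:[0-9]+:[0-9]+: |\([0-9]+\): )(fatal )?error.*$//' \
    | sort -u
}
```

Calling `unique_error_files build.log` after the ninja run would then print one path per line, which makes it easier to spot headers that are the root cause for many translation units.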

@PhilipOesterlePekrun PhilipOesterlePekrun force-pushed the mirco/KokkosCudaCleanedUp branch from 3d2f036 to 6da5580 Compare May 28, 2026 14:37
Signed-off-by: Philip Oesterle-Pekrun <philipoesterlepekrun@gmail.com>
@PhilipOesterlePekrun PhilipOesterlePekrun force-pushed the mirco/KokkosCudaCleanedUp branch from 6da5580 to b261b1c Compare May 29, 2026 09:38
@jeremylt

jeremylt commented Jun 2, 2026

Contributor

Commenting to e-meet on the GitHub side of things. I'm Jeremy, joining Helmholtz-Hereon on 1 Jul to start working on GPU for 4C as well. Kokkos seems like the approach that makes sense for me for 4C from what I've seen so far.

@PhilipOesterlePekrun

Member Author

@jeremylt Awesome! This is more or less the first time we are really trying shared memory parallelism in 4C, so it will be great to have another person who knows their way around it. Kokkos works with both OpenMP and GPUs, so we don't have to write everything twice (or more, for more GPU vendors), and it is already baked into Trilinos. It takes some effort to make it work correctly, but it will hopefully be worth it in the end :)

@georghammerl

Member

> @jeremylt Awesome! This is more or less the first time we are really trying shared memory parallelism in 4C, so it will be great to have another person who knows their way around it. Kokkos works with both OpenMP and GPUs, so we don't have to write everything twice (or more, for more GPU vendors), and it is already baked into Trilinos. It takes some effort to make it work correctly, but it will hopefully be worth it in the end :)

Let us have a kick-off meeting for further discussion going beyond this PR, please fill this poll: #2070

Signed-off-by: Philip Oesterle-Pekrun <philipoesterlepekrun@gmail.com>
Comment on lines +24 to +25
# Location of script to apply patches later
SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"

Member

Why is this needed?

@mayrmt

mayrmt commented Jun 15, 2026

Member

@PhilipOesterlePekrun What is needed to move forward with this PR?

message(STATUS "Trilinos packages: ${Trilinos_PACKAGE_LIST}")

if(FOUR_C_CLANGCUDA)
set(CMAKE_CXX_COMPILER_LAUNCHER

Member

Please do not silently mess with these global user-defined variables. What you should do instead is report a combination of variables that is impossible with a clear error. So I would check that all these variables are indeed empty in the case of FOUR_C_CLANGCUDA and otherwise abort.

if(FOUR_C_CLANGCUDA)
set_target_properties(
${_target}_objs
PROPERTIES CXX_COMPILER_LAUNCHER ""

Member

Not sure, but this seems duplicated with the global var checks in configure_Trilinos.

RULE_LAUNCH_COMPILE ""
RULE_LAUNCH_LINK ""
)
target_compile_definitions(${_target}_objs PRIVATE CLANGCUDA_MODE_HOST)

Member

I think it might be better to add this to our four_c_private_compile_interface which is used for everything we build.

"Enabling both FOUR_C_CLANGCUDA and FOUR_C_WITH_ARBORX is not advised. This requires using an external CUDA-enabled ArborX installation and has not been tested."
)
endif()

Member

Move the check for FOUR_C_CLANGCUDA here please (and add the flags to four_c_private_compile_interface here).

},
{
"name": "docker_kokkosopenmp",
"displayName": "Release build forOpenMP-enabled Kokkos",

Member

Suggested change
- "displayName": "Release build forOpenMP-enabled Kokkos",
+ "displayName": "Release build for OpenMP-enabled Kokkos",

Twice

See the `MIRCO repository <https://github.com/imcs-compsim/MIRCO>`_ for details and downloads.

- Building |FOURC| with MIRCO enabled automatically fetches the repository during the configure stage and later builds the library as dependency.
+ Building |FOURC| with MIRCO enabled automatically fetches the repository during the configure stage and later builds the library as dependency. Alternatively, one can specify an external MIRCO installation. In either case, MIRCO can make use of shared memory parallelism through Kokkos :ref:`when enabled <build4Cwithkokkoscuda>` in |FOURC|. Note that 4C and MIRCO must depend on the same Kokkos installation. In case using Kokkos with CUDA enabled, MIRCO must be built with `CMAKE_POSITION_INDEPENDENT_CODE=ON`.

Member

Writing down details about TPLs is bound to be outdated quickly.

)
four_c_set_up_executable(${FOUR_C_EXECUTABLE_NAME})

if(FOUR_C_CLANGCUDA)

Member

Should be obsolete if you use four_c_private_compile_interface

@sebproell

Member

@4C-multiphysics/maintainer The offer still stands to ping me for CMake reviews :)
