Enable using Kokkos with CUDA backend for MIRCO and in general#2012
Enable using Kokkos with CUDA backend for MIRCO and in general#2012PhilipOesterlePekrun wants to merge 4 commits into
Conversation
Signed-off-by: Philip Oesterle-Pekrun <philipoesterlepekrun@gmail.com>
da5207d to
575c3b3
Compare
| #else | ||
|
|
||
| using ExecutionSpace = Kokkos::DefaultExecutionSpace; | ||
| using ExecutionSpace = Kokkos::DefaultHostExecutionSpace; |
There was a problem hiding this comment.
Why is this changed to host?
There was a problem hiding this comment.
Sorry for the late reply @isteinbrecher.
This was causing an error related to ArborX when compiling (ArborX_compileFailure.log), but I just realized that this is because I let 4C fetch ArborX and then ArborX's CUDA support is off by default (ArborX also uses CUDA through Kokkos, but I think you need to enable it explictly in ArborX as well). You can build ArborX with CUDA and then it should work. However, I'm not sure whether this might lead to oversubscription of the GPU with our typical use case of MPI (multiple MPI ranks per compute node). I'll look into this further, but of course for my use case I can also turn ArborX off because it is not exactly related, so you're right, I'll amend that.
There was a problem hiding this comment.
For now, I'd be fine with switching to host execution space here. This preserves prior behavior, so everything remains exactly as it was for ArborX.
There was a problem hiding this comment.
I think nothing in the typical MIRCO execution path uses the geometric search, so I can actually just undo the change and turn off ArborX.
There was a problem hiding this comment.
Yes, MIRCO-based simulations do not use ArborX at the moment. Yet, I actually support the switch to the host execution space just to be clear, that ArborX is working on host only so far.
@maxfirmbach Any thoughts on this?
There was a problem hiding this comment.
We have decided to keep it as DefaultExecutionSpace and throw a warning when compiling with FOUR_C_CLANGCUDA and FOUR_C_WITH_ARBORX.
|
@PhilipOesterlePekrun Thanks for the work! I really like the efforts, yet I also have some concerns. The current PR comes with quite a few assumptions and restrictions we should clearly state and discuss (maybe something for the next community meeting?). |
|
Concerning your questions:
For my personal taste, the current state has too many constraints to get things to work. Are there parts we can work on separately to make the use of CUDA easier in the near future? |
|
@maxfirmbach Thanks for the feedback. I tried to keep it as isolated as possible, but yes it does change some global CMake files (though everything is conditional upon the global option I introduced). We can definitely discuss this in the community meeting (or even earlier). Regarding your second comment, the constraint here is really that GCC cannot compile CUDA-related code but the I do have a general question, @isteinbrecher and @maxfirmbach, to clarify the purpose of my PR. |
|
@PhilipOesterlePekrun To my knowledge, nobody ever ran 4C on device so far. |
There was a problem hiding this comment.
Pull request overview
This PR adds build-system support to compile 4C with a CUDA-enabled Kokkos installation by introducing a clangcuda++ compiler wrapper and a FOUR_C_CLANGCUDA CMake option that adjusts compile definitions/launchers for clang-based CUDA host/device compilation. It also updates ArborX/Kokkos usage to explicitly run on the host when Kokkos’ default execution space may be CUDA, and refreshes MIRCO integration settings.
Changes:
- Added
utilities/clangcuda++wrapper to drive clang CUDA host-only/device compilation based on compile definitions and file extensions. - Introduced
FOUR_C_CLANGCUDAglobal option and applied related target/global launcher settings +FOUR_C_CLANGCUDA_HOST_ONLYcompile definition in key build targets. - Switched geometric search ArborX execution/memory spaces to
Kokkos::DefaultHostExecutionSpaceto avoid unintended CUDA default execution space usage; updated MIRCO fetch configuration and git tag.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| utilities/clangcuda++ | New clang CUDA compiler wrapper handling host-only/device compilation modes. |
| src/cut/4C_cut_pointgraph.cpp | Adds a clang-CUDA-related preprocessor workaround before including Boost graphviz. |
| src/core/geometric_search/src/4C_geometric_search_distributed_tree.cpp | Forces ArborX distributed tree build/query to use host execution space. |
| src/core/geometric_search/src/4C_geometric_search_bvh.hpp | Forces BVH execution/memory space selection to host execution space. |
| cmake/setup_global_options.cmake | Adds FOUR_C_CLANGCUDA global option. |
| cmake/functions/four_c_auto_define_module.cmake | Applies launcher overrides + host-only define to module object libraries when enabled. |
| cmake/configure/configure_Trilinos.cmake | Disables compiler/rule launchers when FOUR_C_CLANGCUDA is enabled. |
| cmake/configure/configure_MIRCO.cmake | Updates MIRCO fetch configuration and moves dependency registration outside the conditional. |
| apps/global_full/CMakeLists.txt | Applies launcher overrides + host-only define to the main executable when enabled. |
| # Sanity check | ||
| if [[ "$cuda_host_only" == "1" && "$cuda_device" == "1" ]]; then | ||
| has_explicit_device=0 | ||
| for arg in "$@"; do | ||
| if [[ "$arg" == "-DFOUR_C_CLANGCUDA_DEVICE_COMPILE" ]]; then | ||
| has_explicit_device=1 | ||
| break | ||
| fi | ||
| done | ||
|
|
||
| if [[ "$has_explicit_device" == "1" ]]; then | ||
| echo "clangcuda++ wrapper error: both FOUR_C_CLANGCUDA_HOST_ONLY and FOUR_C_CLANGCUDA_DEVICE_COMPILE were set" >&2 | ||
| exit 1 | ||
| fi | ||
| fi | ||
|
|
||
| if [[ "$compile" == "1" && "$cuda_host_only" == "1" ]]; then | ||
| final=( | ||
| "$clang" | ||
| -x cuda | ||
| --cuda-host-only | ||
| --cuda-path="$cuda_path" | ||
| --cuda-gpu-arch="$arch" | ||
| -Wno-unknown-cuda-version | ||
| "${args[@]}" | ||
| ) | ||
| elif [[ "$compile" == "1" && "$cuda_device" == "1" ]]; then | ||
| final=( | ||
| "$clang" | ||
| -x cuda | ||
| --cuda-path="$cuda_path" | ||
| --cuda-gpu-arch="$arch" | ||
| -Wno-unknown-cuda-version | ||
| "${args[@]}" | ||
| ) | ||
| else |
There was a problem hiding this comment.
Fair point, I took the .cu idea from kokkos_launch_compiler, but I guess we really don't need it.
| #undef __noinline__ | ||
| #endif | ||
| #include <boost/graph/graphviz.hpp> |
| four_c_process_global_option( | ||
| FOUR_C_CLANGCUDA | ||
| DESCRIPTION | ||
| "Enable the relevant CMake compile definitions needed to use utilities/ClangCuda++ as the compiler. This is currently necessary to use the CUDA backend of Kokkos, e.g. along with MIRCO." |
|
|
||
| get_property(_global_rule GLOBAL PROPERTY RULE_LAUNCH_COMPILE) | ||
| get_property(_dir_rule DIRECTORY PROPERTY RULE_LAUNCH_COMPILE) |
@PhilipOesterlePekrun Understood! I also don't think that someone has tried this so far, because there was frankly no reason for it ... all of the code in |
5332d8c to
3d2f036
Compare
|
@maxfirmbach A built test is possible, however this requires the following:
Again, the current changes are isolated with By the way, an actual run test is of course not possible unless we have GPU test runners for the workflow/action. |
Signed-off-by: Philip Oesterle-Pekrun <philipoesterlepekrun@gmail.com>
|
@PhilipOesterlePekrun Updating the Dockerfile should be fine. Especially due to the specialized nature of the Kokkos-Cuda integration of MIRCO into 4C, where the number of users and experts is limited to less than a handful, proper testing is really important. So, I'd support to change the docker base containers and provide Cuda with them. |
Just a heads up, it might not be possible to install the cuda packages into the normal Docker image we use for testing due to the size of cuda. In the past, we had problems that there was not enough space left on the runner (14 GB disk space) to build 4C because our docker image is so big. So, we need to be careful what we add to the dependency image. Doing Maybe there is a way to install only a minimal version of cuda that is sufficient for the pipeline. Or you can try to create a separate docker image that only has the minimum dependencies to build 4C with Cuda, e.g., the |
|
@ppraegla Yes, a separate DockerFile like the one for trilinos_develop is the way to go. I do have it with the minimal cuda toolkit requirements, which includes cusolver, but it still adds a few GB so I don't want to affect the main 4C docker image. |
|
We now also briefly discussed after the meeting, and I (in my opinion) still strongly oppose adding a lot of new functionality without testing it. My main question would be: why is then the docker container even created in the first place if it's not used for anything afterwards? If you use this workflow daily in the IMCS you will probably not download the docker container but install it on your local system. So why creating the docker container when it won't be tested at all? @PhilipOesterlePekrun @4C-multiphysics/maintainer |
|
@davidrudlstorfer The compilation is tested, but it's true that this may not be necessary until we have Kokkos device code in some other parts of 4C, as the compilation of MIRCO itself is already tested in MIRCO's repository. So, if we agree that it is not needed currently, I can keep those changes in a branch of my fork in case we do want it later, that is totally fine too. The docker would have just been for the build test, yes. |
|
@mayrmt I've attached the full build.log which I made with build.log.gz (compressed due to GitHub file size limit) |
3d2f036 to
6da5580
Compare
Signed-off-by: Philip Oesterle-Pekrun <philipoesterlepekrun@gmail.com>
6da5580 to
b261b1c
Compare
|
Commenting to e-meet on the GitHub side of things. I'm Jeremy, joining Helmholz-Hereon on 1 Jul to start working on GPU for 4C as well. Kokkos seems like the approach that makes sense for me for 4C from what I've seen so far. |
|
@jeremylt Awesome! This is more or less the first time we are really trying shared memory parallelism in 4C, so it will be great to have another person who knows their way around it. Kokkos works with both OpenMP and GPUs, so we don't have to write everything twice (or more, for more GPU vendors), and it is already baked into Trilinos. It takes some effort to make it work correctly, but it will hopefully be worth it in the end :) |
Let us have a kick-off meeting for further discussion going beyond this PR, please fill this poll: #2070 |
Signed-off-by: Philip Oesterle-Pekrun <philipoesterlepekrun@gmail.com>
| # Location of script to apply patches later | ||
| SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )" |
|
@PhilipOesterlePekrun What is needed to move forward with this PR? |
| message(STATUS "Trilinos packages: ${Trilinos_PACKAGE_LIST}") | ||
|
|
||
| if(FOUR_C_CLANGCUDA) | ||
| set(CMAKE_CXX_COMPILER_LAUNCHER |
There was a problem hiding this comment.
Please do not silently mess with these global user-defined variables. What you should do instead is report a combination of variables that is impossible with a clear error. So I would check that all these variables are indeed empty in the case of FOUR_C_CLANGCUDA and otherwise abort.
| if(FOUR_C_CLANGCUDA) | ||
| set_target_properties( | ||
| ${_target}_objs | ||
| PROPERTIES CXX_COMPILER_LAUNCHER "" |
There was a problem hiding this comment.
Not sure, but this seems duplicated with the global var checks in configure_Trilinos.
| RULE_LAUNCH_COMPILE "" | ||
| RULE_LAUNCH_LINK "" | ||
| ) | ||
| target_compile_definitions(${_target}_objs PRIVATE CLANGCUDA_MODE_HOST) |
There was a problem hiding this comment.
I think it might be better to add this to our four_c_private_compile_interface which is used for everything we build.
| "Enabling both FOUR_C_CLANGCUDA and FOUR_C_WITH_ARBORX is not advised. This requires using an external CUDA-enabled ArborX installation and has not been tested." | ||
| ) | ||
| endif() | ||
|
|
There was a problem hiding this comment.
Move the check for FOUR_C_CUDACLANG here please (and add the flags to four_c_private_compile_interface here).
| }, | ||
| { | ||
| "name": "docker_kokkosopenmp", | ||
| "displayName": "Release build forOpenMP-enabled Kokkos", |
There was a problem hiding this comment.
| "displayName": "Release build forOpenMP-enabled Kokkos", | |
| "displayName": "Release build for OpenMP-enabled Kokkos", |
Twice
| See the `MIRCO repository <https://github.com/imcs-compsim/MIRCO>`_ for details and downloads. | ||
|
|
||
| Building |FOURC| with MIRCO enabled automatically fetches the repository during the configure stage and later builds the library as dependency. | ||
| Building |FOURC| with MIRCO enabled automatically fetches the repository during the configure stage and later builds the library as dependency. Alternatively, one can specify an external MIRCO installation. In either case, MIRCO can make use of shared memory parallelism through Kokkos :ref:`when enabled <build4Cwithkokkoscuda>` in |FOURC|. Note that 4C and MIRCO must depend on the same Kokkos installation. In case using Kokkos with CUDA enabled, MIRCO must be built with `CMAKE_POSITION_INDEPENDENT_CODE=ON`. |
There was a problem hiding this comment.
Writing down details about TPLs is bound to be outdated quickly.
| ) | ||
| four_c_set_up_executable(${FOUR_C_EXECUTABLE_NAME}) | ||
|
|
||
| if(FOUR_C_CLANGCUDA) |
There was a problem hiding this comment.
Should be obsolete if you use four_c_private_compile_interface
|
@4C-multiphysics/maintainer The offer still stands to ping me for CMake reviews :) |
@mayrmt
Description and Context
This PR adds build-system support for compiling 4C with CUDA-enabled Kokkos. For this purpose, this PR introduces a compiler wrapper,
clangcuda++, which should be used as theCMAKE_CXX_COMPILER(or, in the case of MPI, theOMPI_CXXbackend), as well as some CMake additions which are optionally enabled.This change is based on my attempts to get 4C to compile while using MIRCO (library) with Kokkos' CUDA backend for GPU offloading. Several 4C translation units fail to compile with the
kokkos_launch_compilerornvcc_wrapperprovided by Kokkos because NVCC does not seem to be able to compile some more "advanced" C++ code, which we have a fair bit of in 4C since it is a large project. Clang happens to compile CUDA or CUDA-related code much better than Nvidia's own compiler (don't ask me why), but still has some issues by itself. Using theclangcuda++wrapper as theCMAKE_CXX_COMPILER(or, in the case of MPI, theOMPI_CXXbackend), along with the corresponding target properties and compile definitions by settingFOUR_C_CLANGCUDA=ON, allows compiling all of 4C with CUDA-enabled Kokkos.The current implementation distinguishes between CUDA host-side compilation and CUDA device compilation. At the moment, 4C itself only requires CUDA host-side compilation, since it does not yet contain raw Kokkos device kernels such as
Kokkos::parallel_for,KOKKOS_LAMBDA, etc. in its own sources. But, this is now possible by simply marking a target or source withFOUR_C_CLANGCUDA_DEVICE_COMPILE, allowing GPU offloading anywhere in 4C. As long as Trilinos is built withTPETRA_INST_CUDA=OFF, this also does not conflict with any existing MPI parallelism (NOwill be KokkosSerial).The changes were verified by compiling 4C successfully with the relevant Kokkos CUDA-enabled setup and I also tested small Kokkos kernel examples to confirm that host-side and device CUDA compilation are distinguished correctly. I have documented how Trilinos, MIRCO, and 4C should be compiled for this combination to work. I'll just attach that to this PR for now: DocumentKokkosCuda_4C_Trilinos_MIRCO.zip
Once imcs-compsim/MIRCO#146 is merged, FetchContent will work without issue for that MIRCO state.
My questions:
Related Issues and Pull Requests
Blocked by imcs-compsim/MIRCO#146
Docker and Workflow tests
I have added a Dockerfile, Trilinos installation scripts, and a workflow .yml which are similar to the trilinos_develop concept that we have, simply because it is also using (two) special Trilinos installations. I'm not sure whether we should always test what I called trilinos_kokkosparallel on each pull request or not. I also made it so it pushes to the same docker as trilinos_develop with a different tag. These are design choices, so please tell me if you have an opinion on it.