Update scph.cpp - Reducing memory usage in SCPH V4 kernels (kpoint/band) via temporary reuse and in-place MPI reduction #305
Open
andersonprizzi wants to merge 2 commits into ttadano:develop
Conversation
Optimize memory usage in `compute_V4_elements_mpi_over_kpoint` and `compute_V4_elements_mpi_over_band` by reusing temporaries and using in-place MPI reduction. This refactor preserves the same contractions, keeping results unchanged within floating-point roundoff.
This pull request reduces peak memory usage in the SCPH quartic matrix-element routines:
- `Scph::compute_V4_elements_mpi_over_kpoint`
- `Scph::compute_V4_elements_mpi_over_band`

The refactor reuses temporary buffers and removes an extra per-rank MPI staging tensor. The algebraic contractions and index transforms are preserved, and results remain unchanged within floating-point roundoff.
Changes:
In both functions:
- The intermediate `v4_mpi` buffer has been removed. Each MPI rank writes its local contributions directly into `v4_out`, and the final accumulation is performed with `MPI_Allreduce(MPI_IN_PLACE, &v4_out[0][0][0], ...)`. This avoids keeping two full copies of the V4 tensor per rank. A minimal sketch of the pattern follows below.
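A minimal sketch of the in-place reduction, assuming hypothetical extent names (`nk2`, `ns2`) and a standard MPI datatype constant; the PR text only shows the `MPI_Allreduce(MPI_IN_PLACE, ...)` call itself:

```cpp
#include <mpi.h>
#include <complex>
#include <cstddef>

// Sketch only: before this change, each rank filled a separate v4_mpi
// staging tensor and reduced it into v4_out. Now each rank zero-fills
// v4_out, writes only its own contributions, and all ranks sum their
// partial tensors in place -- no second full copy of V4 per rank.
void reduce_v4_in_place(std::complex<double> ***v4_out,
                        const std::size_t nk2, const std::size_t ns2)
{
    // Assumes v4_out was allocated as one contiguous nk2 x ns2 x ns2
    // block, so &v4_out[0][0][0] addresses the whole tensor. The actual
    // code may use a different datatype constant; a count above INT_MAX
    // would need chunking.
    MPI_Allreduce(MPI_IN_PLACE, &v4_out[0][0][0],
                  static_cast<int>(nk2 * ns2 * ns2),
                  MPI_CXX_DOUBLE_COMPLEX, MPI_SUM, MPI_COMM_WORLD);
}
```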
In `compute_V4_elements_mpi_over_kpoint`:

- The original implementation allocated several large temporary buffers at once, even though only two are needed at any given step. This refactor therefore reuses two complex `ns2 x ns2` buffers across the successive index transformations, as sketched below.
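The two-buffer "ping-pong" idea, sketched under assumed names. `transform_stage` here is a generic stand-in contraction (the real kernel contracts one phonon index at a time, with eigenvector matrices that differ per index and k-point; one matrix keeps the sketch short):

```cpp
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cvec = std::vector<std::complex<double>>; // flattened ns2 x ns2 matrix

// Stand-in for one index transformation: dst = evec^H * src, contracting
// the leading index of src against the eigenvector matrix.
static void transform_stage(const cvec &evec, const cvec &src, cvec &dst,
                            const std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            std::complex<double> acc{0.0, 0.0};
            for (std::size_t k = 0; k < n; ++k)
                acc += std::conj(evec[k * n + i]) * src[k * n + j];
            dst[i * n + j] = acc;
        }
    }
}

// Two buffers cover all four stages: each stage reads *src, writes *dst,
// then the pointers swap so this stage's output feeds the next stage.
// Previously, four ns2 x ns2 temporaries coexisted.
void apply_all_stages(const cvec &evec, cvec &buf_a, cvec &buf_b,
                      const std::size_t ns2)
{
    cvec *src = &buf_a; // buf_a initially holds the raw V4 block
    cvec *dst = &buf_b;
    for (int stage = 0; stage < 4; ++stage) {
        transform_stage(evec, *src, *dst, ns2);
        std::swap(src, dst);
    }
    // After an even number of swaps, the final result sits back in buf_a.
}
```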
In `compute_V4_elements_mpi_over_band`:

- Memory usage is reduced by replacing `v4_tmp0` with a compact sparse representation of the non-zero φ4 elements, stored as (row, col, value) entries in `phi4_array` along with a `col_ptr` offset array used to iterate efficiently over the non-zeros of each column during the first-index transformation (preserving the original access pattern). This can significantly reduce memory usage when the number of non-zero entries is much smaller than `ns2 x ns2`; see the sketch after this list.
- In addition, the temporary workspace is reduced by reusing only `v4_tmp1` and `v4_tmp2` in a ping-pong manner, instead of allocating `v4_tmp1`, `v4_tmp2`, `v4_tmp3`, and `v4_tmp4` simultaneously.
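A sketch of the sparse layout and the first-index loop. The PR names `phi4_array` and `col_ptr`; everything else here is hypothetical, and the contraction is a generic stand-in rather than the scph.cpp kernel:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// (row, col, value) storage for the non-zero phi4 elements. Entries are
// sorted by column, and col_ptr holds ns2 + 1 offsets so that column c's
// non-zeros occupy the half-open range [col_ptr[c], col_ptr[c + 1]).
// Memory is O(nnz) instead of the dense ns2 * ns2 of v4_tmp0.
struct Phi4Entry {
    std::size_t row, col;
    std::complex<double> val;
};

// First-index transformation touching only the stored non-zeros,
// column by column (preserving the original access pattern):
//   dst(i, c) += evec(r, i) * phi4(r, c)  for each non-zero (r, c).
// dst is assumed zero-initialized, flattened ns2 x ns2.
void transform_first_index(const std::vector<Phi4Entry> &phi4_array,
                           const std::vector<std::size_t> &col_ptr,
                           const std::vector<std::complex<double>> &evec,
                           std::vector<std::complex<double>> &dst,
                           const std::size_t ns2)
{
    for (std::size_t c = 0; c < ns2; ++c) {
        for (std::size_t idx = col_ptr[c]; idx < col_ptr[c + 1]; ++idx) {
            const std::size_t r = phi4_array[idx].row;
            for (std::size_t i = 0; i < ns2; ++i)
                dst[i * ns2 + c] += evec[r * ns2 + i] * phi4_array[idx].val;
        }
    }
}
```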
These changes were motivated by the fact that, for large `nk` and/or `ns`, the V4 kernels can dominate memory usage due to multiple `ns2 x ns2` temporary buffers and duplicated tensors per MPI rank, which can limit the maximum feasible system size.

Tests were performed by running an SCPH calculation with a representative input deck (a test case including the IFCs and the associated strain/force dataset). The baseline version and this PR were run with the same MPI/OMP configuration, and the resulting outputs showed no differences.