Update scph.cpp - Reducing memory usage in SCPH V4 kernels (kpoint/band) via temporary reuse and in-place MPI reduction #305

Open
andersonprizzi wants to merge 2 commits into ttadano:develop from andersonprizzi:patch-1

andersonprizzi commented Jan 5, 2026

This pull request reduces peak memory usage in the SCPH quartic matrix-element routines:

  • Scph::compute_V4_elements_mpi_over_kpoint
  • Scph::compute_V4_elements_mpi_over_band

The refactor reuses temporary buffers and removes an extra per-rank MPI staging tensor. The algebraic contractions and index transforms are preserved, and results remain unchanged within floating-point roundoff.

Changes:

  1. In both functions:
    The intermediate v4_mpi buffer has been removed. Each MPI rank writes its local contributions directly into v4_out. The final accumulation is performed with MPI_Allreduce(MPI_IN_PLACE, &v4_out[0][0][0], ...). This avoids keeping two full copies of the V4 tensor per rank.

  2. In the compute_V4_elements_mpi_over_kpoint function:
    The original implementation allocated several large temporary buffers at once, even though only two are needed at any given step. This refactor therefore reuses two complex ns2 x ns2 buffers across the successive index transformations.

  3. In the compute_V4_elements_mpi_over_band function:
    Memory usage is reduced by replacing v4_tmp0 with a compact sparse representation of the non-zero φ4 elements. These are stored as (row, col, value) entries in phi4_array, together with a col_ptr offset array that allows efficient iteration over the non-zeros of each column during the first-index transformation (preserving the original access pattern). This can significantly reduce memory usage when the number of non-zero entries is much smaller than ns2 x ns2. In addition, the temporary workspace is reduced by reusing only v4_tmp1 and v4_tmp2 in a ping-pong manner instead of allocating v4_tmp1, v4_tmp2, v4_tmp3 and v4_tmp4 simultaneously.
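
For readers unfamiliar with the column-offset layout described in item 3, here is a minimal standalone sketch of the idea. It is not the code from this PR: only the names phi4_array and col_ptr come from the description above, while Phi4Entry, SparseColumns, compress_columns and left_multiply are hypothetical names chosen for illustration. A simple vector-matrix product stands in for the actual first-index transformation.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical entry type: one stored non-zero as (row, col, value).
struct Phi4Entry {
    std::size_t row;
    std::size_t col;
    std::complex<double> value;
};

// Non-zeros grouped by column. The entries of column c occupy the
// half-open index range [col_ptr[c], col_ptr[c + 1]) of phi4_array.
struct SparseColumns {
    std::vector<Phi4Entry> phi4_array;
    std::vector<std::size_t> col_ptr; // size = number of columns + 1
};

// Build the sparse form from a dense row-major matrix, keeping only
// entries whose magnitude exceeds tol.
SparseColumns compress_columns(const std::vector<std::complex<double>> &dense,
                               std::size_t nrows, std::size_t ncols,
                               double tol = 0.0)
{
    assert(dense.size() == nrows * ncols);
    SparseColumns s;
    s.col_ptr.assign(ncols + 1, 0);
    for (std::size_t c = 0; c < ncols; ++c) {
        s.col_ptr[c] = s.phi4_array.size(); // start of column c
        for (std::size_t r = 0; r < nrows; ++r) {
            const std::complex<double> v = dense[r * ncols + c];
            if (std::abs(v) > tol) {
                s.phi4_array.push_back(Phi4Entry{r, c, v});
            }
        }
    }
    s.col_ptr[ncols] = s.phi4_array.size(); // end of the last column
    return s;
}

// Stand-in for a one-index transformation: out[c] = sum_r x[r] * A(r, c),
// visiting only the stored non-zeros of each column.
std::vector<std::complex<double>>
left_multiply(const SparseColumns &s, const std::vector<std::complex<double>> &x)
{
    const std::size_t ncols = s.col_ptr.size() - 1;
    std::vector<std::complex<double>> out(ncols); // value-initialized to zero
    for (std::size_t c = 0; c < ncols; ++c) {
        for (std::size_t p = s.col_ptr[c]; p < s.col_ptr[c + 1]; ++p) {
            const Phi4Entry &e = s.phi4_array[p];
            assert(e.row < x.size());
            out[c] += x[e.row] * e.value;
        }
    }
    return out;
}
```

With this layout the storage scales with the number of non-zeros plus one offset per column, rather than with the full dense size, and the inner loop still walks each column in order.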

These changes were motivated by the fact that, for large nk and/or ns, the V4 kernels can dominate memory usage due to multiple ns2 x ns2 temporary buffers and duplicated tensors per MPI rank. This can limit the maximum feasible system size.
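
To give a feel for the sizes involved, the following back-of-the-envelope helpers estimate the dense buffer footprint. This is only a rough sketch under stated assumptions: entries are std::complex<double> (16 bytes on common platforms), ns2 = ns * ns, and the V4 tensor is treated as n_slices dense ns2 x ns2 blocks. The function names, n_slices and the example values below are hypothetical and are not taken from the code or from a measurement.

```cpp
#include <complex>
#include <cstdint>

// Bytes in one dense ns2 x ns2 complex buffer, where ns2 = ns * ns.
inline std::uint64_t buffer_bytes(std::uint64_t ns)
{
    const std::uint64_t ns2 = ns * ns;
    return ns2 * ns2 * sizeof(std::complex<double>);
}

// Bytes held by n_buffers such temporaries allocated at the same time.
inline std::uint64_t workspace_bytes(std::uint64_t ns, std::uint64_t n_buffers)
{
    return n_buffers * buffer_bytes(ns);
}

// Bytes held by n_copies copies of a tensor made of n_slices dense blocks
// (n_slices is a hypothetical leading dimension, e.g. a count of k-point pairs).
inline std::uint64_t tensor_bytes(std::uint64_t ns, std::uint64_t n_slices,
                                  std::uint64_t n_copies)
{
    return n_copies * n_slices * buffer_bytes(ns);
}
```

For example, with ns = 30 one buffer is 900 x 900 x 16 = 12,960,000 bytes (about 13 MB), so going from four simultaneous temporaries to two saves about 26 MB per rank. With n_slices = 100, a single tensor copy is about 1.3 GB, and keeping a second staging copy per rank doubles that to about 2.6 GB.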

Tests were performed by running an SCPH calculation using a representative input deck (a test case including the IFCs and the associated strain/force dataset). The baseline version and this PR were executed with the same MPI/OMP configuration. The resulting outputs were compared, and no differences were observed.

Optimize memory usage in compute_V4_elements_mpi_over_kpoint and compute_V4_elements_mpi_over_band by reusing temporaries and using in-place MPI reduction. This refactor preserves the same contractions, keeping results unchanged within floating-point roundoff.
andersonprizzi changed the title from "Update scph.cpp - Reduce memory footprint in SCPH V4 kernels (kpoint/band) via temporary reuse and in-place MPI reduction" to "Update scph.cpp - Reduce memory in SCPH V4 kernels (kpoint/band) via temporary reuse and in-place MPI reduction" on Jan 5, 2026
andersonprizzi changed the title from "Update scph.cpp - Reduce memory in SCPH V4 kernels (kpoint/band) via temporary reuse and in-place MPI reduction" to "Update scph.cpp - Reducing memory usage in SCPH V4 kernels (kpoint/band) via temporary reuse and in-place MPI reduction" on Jan 6, 2026