From e35c9619b14f7b815be3c565df8b12412b1ddbcd Mon Sep 17 00:00:00 2001 From: Jacob Moore Date: Thu, 7 May 2026 14:35:53 -0500 Subject: [PATCH] ENH: Correcting minor errors in LLM context file --- MATAR_LLM_CONTEXT.md | 551 +++++++++++++++++++++++++------------------ 1 file changed, 315 insertions(+), 236 deletions(-) diff --git a/MATAR_LLM_CONTEXT.md b/MATAR_LLM_CONTEXT.md index 4d6d5201..d66a31fd 100644 --- a/MATAR_LLM_CONTEXT.md +++ b/MATAR_LLM_CONTEXT.md @@ -1,6 +1,6 @@ # MATAR Code Conversion Reference -This document provides comprehensive context for converting existing C/C++ (and Fortran-interop) code to use the MATAR library. MATAR (Multi-dimensional Arrays, Tensors, And Ragged data structures) is a header-only C++ library that provides performance-portable data structures and parallel loop macros. It abstracts over Kokkos to support CPU serial, OpenMP, Pthreads, CUDA, and HIP backends with a single source code. +This document provides comprehensive context for converting existing C/C++(and Fortran-interop) code to use the MATAR library. MATAR (Multi-dimensional Arrays, Tensors, And Ragged data structures) is a header-only C++ library that provides performance-portable data structures and parallel loop macros. It abstracts over Kokkos to support CPU serial, OpenMP, Pthreads, CUDA, and HIP backends with a single source code. --- @@ -49,11 +49,35 @@ int main(int argc, char* argv[]) } ``` -| Macro | Behavior with Kokkos | Behavior without Kokkos | -|-------|----------------------|------------------------| -| `MATAR_INITIALIZE(...)` | `Kokkos::initialize(...)` | No-op | -| `MATAR_FINALIZE()` | `Kokkos::finalize()` | No-op | -| `MATAR_FENCE()` | `Kokkos::fence()` | No-op | +If building with MPI, the main function also requires: + +```cpp +#include +#include +using namespace mtr; + +int main(int argc, char* argv[]) +{ + MPI_Init(&argc, &argv); + MATAR_INITIALIZE(argc, argv); // or MATAR_INITIALIZE() without args + { + // All MATAR data structures and parallel work go here. + // The scoping braces ensure all MATAR objects are destroyed + // before MATAR_FINALIZE is called. + } + MATAR_FINALIZE(); + MPI_Finalize(); + return 0; +} +``` + + +| Macro | Behavior with Kokkos | Behavior without Kokkos | +| ----------------------- | ------------------------- | ----------------------- | +| `MATAR_INITIALIZE(...)` | `Kokkos::initialize(...)` | No-op | +| `MATAR_FINALIZE()` | `Kokkos::finalize()` | No-op | +| `MATAR_FENCE()` | `Kokkos::fence()` | No-op | + `MATAR_FENCE()` is a synchronization barrier. It ensures all device operations have completed before the CPU proceeds. See the [fence placement rules](#fence-placement-rules) below for when fences are required vs. redundant. @@ -61,15 +85,17 @@ int main(int argc, char* argv[]) A fence is **needed** between two `FOR_ALL` blocks only if the second reads data written by the first. Over-fencing kills performance; under-fencing causes correctness bugs. Apply these rules: -| Situation | Fence Needed? | Why | -|-----------|:---:|-----| -| `FOR_ALL` writes to `A`, next `FOR_ALL` reads `A` | **Yes** | Data dependency between kernels | -| `FOR_ALL` writes to `A`, next `FOR_ALL` writes/reads only `B` | **No** | Independent data, no dependency | -| `FOR_ALL` writes to `A`, then host code reads `A.host(i)` | **Yes** | Host reads device-written data | -| `FOR_REDUCE_SUM(..., total)`, then host uses `total` | **No** | Reduction macros implicitly fence on the result variable | -| `FOR_ALL` followed by `update_host()` | **Yes** | Must ensure kernel completes before sync | -| `FOR_ALL` followed by timing (`std::chrono`) | **Yes** | Timer must see completed work | -| Two `FOR_ALL` blocks, both only read (no writes) | **No** | Read-only kernels have no conflict | + +| Situation | Fence Needed? | Why | +| ------------------------------------------------------------- | ------------- | -------------------------------------------------------- | +| `FOR_ALL` writes to `A`, next `FOR_ALL` reads `A` | **Yes** | Data dependency between kernels | +| `FOR_ALL` writes to `A`, next `FOR_ALL` writes/reads only `B` | **No** | Independent data, no dependency | +| `FOR_ALL` writes to `A`, then host code reads `A.host(i)` | **Yes** | Host reads device-written data | +| `FOR_REDUCE_SUM(..., total)`, then host uses `total` | **No** | Reduction macros implicitly fence on the result variable | +| `FOR_ALL` followed by `update_host()` | **Yes** | Must ensure kernel completes before sync | +| `FOR_ALL` followed by timing (`std::chrono`) | **Yes** | Timer must see completed work | +| Two `FOR_ALL` blocks, both only read (no writes) | **No** | Read-only kernels have no conflict | + ```cpp // NO fence needed: independent data @@ -95,86 +121,98 @@ MATAR data structures are organized along four axes: ### Axis 1: Memory Layout -| Prefix | Layout | Index Convention | Best Loop Order | -|--------|--------|-----------------|-----------------| -| **C** | Row-major (C-style) | Last index varies fastest in memory | Outermost loop = first index | -| **F** | Column-major (Fortran-style) | First index varies fastest in memory | Outermost loop = last index | + +| Prefix | Layout | Index Convention | Best Loop Order | +| ------ | ---------------------------- | ------------------------------------ | ---------------------------- | +| **C** | Row-major (C-style) | Last index varies fastest in memory | Outermost loop = first index | +| **F** | Column-major (Fortran-style) | First index varies fastest in memory | Outermost loop = last index | + **GPU coalescing rule:** CUDA and HIP achieve coalesced memory access when adjacent threads access adjacent memory addresses. In `FOR_ALL(i, 0, N, j, 0, M, {...})`, the **last** index (`j`) is mapped to thread IDs. Therefore: -- **`CArrayDevice` (last index fastest):** `FOR_ALL(i, ..., j, ..., { a(i,j) })` gives coalesced access because adjacent threads (adjacent `j`) access adjacent memory locations. -- **`FArrayDevice` (first index fastest):** The same `FOR_ALL` ordering does **not** give coalesced access because adjacent threads (adjacent `j`) hit non-adjacent memory. +- `**CArrayDevice` (last index fastest):** `FOR_ALL(i, ..., j, ..., { a(i,j) })` gives coalesced access because adjacent threads (adjacent `j`) access adjacent memory locations. +- `**FArrayDevice` (first index fastest):** The same `FOR_ALL` ordering does **not** give coalesced access because adjacent threads (adjacent `j`) hit non-adjacent memory. **Optimization rule for GPU targets:** Prefer `CArrayDevice` with loop nests ordered so the innermost `FOR_ALL` index corresponds to the last array dimension. When correctness of an initial translation from Fortran is the priority and the access pattern is complex, use `FArrayDevice` to preserve Fortran semantics and optimize later. The `FArray` types give a correct first-pass translation from Fortran (preserving the original memory layout and loop ordering), but they may not give optimal GPU performance without reordering. ### Axis 2: Index Base -| Suffix | Index Base | Range | -|--------|-----------|-------| -| **Array** | 0-based | `[0, N)` | -| **Matrix** | 1-based | `[1, N]` | + +| Suffix | Index Base | Range | +| ---------- | ---------- | -------- | +| **Array** | 0-based | `[0, N)` | +| **Matrix** | 1-based | `[1, N]` | + ### Axis 3: Memory Residence -| Suffix | Resides On | Usage | -|--------|-----------|-------| -| **Host** | CPU only | Serial code, I/O, initialization | -| **Device** | Device only (GPU or CPU depending on build) | Parallel kernels | -| **Dual** | Both host and device, with explicit sync | Mixed host/device workflows | + +| Suffix | Resides On | Usage | +| ---------- | ------------------------------------------- | -------------------------------- | +| **Host** | CPU only | Serial code, I/O, initialization | +| **Device** | Device only (GPU or CPU depending on build) | Parallel kernels | +| **Dual** | Both host and device, with explicit sync | Mixed host/device workflows | + ### Axis 4: Ownership -| Prefix | Owns Memory? | Construction | -|--------|-------------|--------------| -| (none) | Yes — allocates and manages storage | `CArrayDevice(10, 10)` | -| **View** | No — wraps an existing pointer | `ViewCArrayDevice(ptr, 10, 10)` | + +| Prefix | Owns Memory? | Construction | +| -------- | ----------------------------------- | ------------------------------------ | +| (none) | Yes — allocates and manages storage | `CArrayDevice(10, 10)` | +| **View** | No — wraps an existing pointer | `ViewCArrayDevice(ptr, 10, 10)` | + ### Complete Naming Map (via `mtr::` aliases) -| Alias (preferred) | Underlying Class | Layout | Index | Residence | -|---|---|---|---|---| -| `CArrayHost` | `CArray` | Row-major | 0-based | Host | -| `CMatrixHost` | `CMatrix` | Row-major | 1-based | Host | -| `FArrayHost` | `FArray` | Column-major | 0-based | Host | -| `FMatrixHost` | `FMatrix` | Column-major | 1-based | Host | -| `ViewCArrayHost` | `ViewCArray` | Row-major | 0-based | Host (non-owning) | -| `ViewCMatrixHost` | `ViewCMatrix` | Row-major | 1-based | Host (non-owning) | -| `ViewFArrayHost` | `ViewFArray` | Column-major | 0-based | Host (non-owning) | -| `ViewFMatrixHost` | `ViewFMatrix` | Column-major | 1-based | Host (non-owning) | -| `CArrayDevice` | `CArrayKokkos` | Row-major | 0-based | Device | -| `CMatrixDevice` | `CMatrixKokkos` | Row-major | 1-based | Device | -| `FArrayDevice` | `FArrayKokkos` | Column-major | 0-based | Device | -| `FMatrixDevice` | `FMatrixKokkos` | Column-major | 1-based | Device | -| `ViewCArrayDevice` | `ViewCArrayKokkos` | Row-major | 0-based | Device (non-owning) | -| `ViewCMatrixDevice` | `ViewCMatrixKokkos` | Row-major | 1-based | Device (non-owning) | -| `ViewFArrayDevice` | `ViewFArrayKokkos` | Column-major | 0-based | Device (non-owning) | -| `ViewFMatrixDevice` | `ViewFMatrixKokkos` | Column-major | 1-based | Device (non-owning) | -| `CArrayDual` | `DCArrayKokkos` | Row-major | 0-based | Dual | -| `CMatrixDual` | `DCMatrixKokkos` | Row-major | 1-based | Dual | -| `FArrayDual` | `DFArrayKokkos` | Column-major | 0-based | Dual | -| `FMatrixDual` | `DFMatrixKokkos` | Column-major | 1-based | Dual | -| `ViewCArrayDual` | `DViewCArrayKokkos` | Row-major | 0-based | Dual (non-owning) | -| `ViewCMatrixDual` | `DViewCMatrixKokkos` | Row-major | 1-based | Dual (non-owning) | -| `ViewFArrayDual` | `DViewFArrayKokkos` | Column-major | 0-based | Dual (non-owning) | -| `ViewFMatrixDual` | `DViewFMatrixKokkos` | Column-major | 1-based | Dual (non-owning) | + +| Alias (preferred) | Underlying Class | Layout | Index | Residence | +| ---------------------- | ----------------------- | ------------ | ------- | ------------------- | +| `CArrayHost` | `CArray` | Row-major | 0-based | Host | +| `CMatrixHost` | `CMatrix` | Row-major | 1-based | Host | +| `FArrayHost` | `FArray` | Column-major | 0-based | Host | +| `FMatrixHost` | `FMatrix` | Column-major | 1-based | Host | +| `ViewCArrayHost` | `ViewCArray` | Row-major | 0-based | Host (non-owning) | +| `ViewCMatrixHost` | `ViewCMatrix` | Row-major | 1-based | Host (non-owning) | +| `ViewFArrayHost` | `ViewFArray` | Column-major | 0-based | Host (non-owning) | +| `ViewFMatrixHost` | `ViewFMatrix` | Column-major | 1-based | Host (non-owning) | +| `CArrayDevice` | `CArrayKokkos` | Row-major | 0-based | Device | +| `CMatrixDevice` | `CMatrixKokkos` | Row-major | 1-based | Device | +| `FArrayDevice` | `FArrayKokkos` | Column-major | 0-based | Device | +| `FMatrixDevice` | `FMatrixKokkos` | Column-major | 1-based | Device | +| `ViewCArrayDevice` | `ViewCArrayKokkos` | Row-major | 0-based | Device (non-owning) | +| `ViewCMatrixDevice` | `ViewCMatrixKokkos` | Row-major | 1-based | Device (non-owning) | +| `ViewFArrayDevice` | `ViewFArrayKokkos` | Column-major | 0-based | Device (non-owning) | +| `ViewFMatrixDevice` | `ViewFMatrixKokkos` | Column-major | 1-based | Device (non-owning) | +| `CArrayDual` | `DCArrayKokkos` | Row-major | 0-based | Dual | +| `CMatrixDual` | `DCMatrixKokkos` | Row-major | 1-based | Dual | +| `FArrayDual` | `DFArrayKokkos` | Column-major | 0-based | Dual | +| `FMatrixDual` | `DFMatrixKokkos` | Column-major | 1-based | Dual | +| `ViewCArrayDual` | `DViewCArrayKokkos` | Row-major | 0-based | Dual (non-owning) | +| `ViewCMatrixDual` | `DViewCMatrixKokkos` | Row-major | 1-based | Dual (non-owning) | +| `ViewFArrayDual` | `DViewFArrayKokkos` | Column-major | 0-based | Dual (non-owning) | +| `ViewFMatrixDual` | `DViewFMatrixKokkos` | Column-major | 1-based | Dual (non-owning) | + ### Sparse/Ragged Type Aliases -| Alias | Underlying Class | Description | -|---|---|---| -| `RaggedCArrayHost` | `RaggedRightArray` | Variable-length rows, host | -| `RaggedFArrayHost` | `RaggedDownArray` | Variable-length columns, host | -| `DynamicRaggedCArrayHost` | `DynamicRaggedRightArray` | Dynamic row lengths, host | -| `DynamicRaggedFArrayHost` | `DynamicRaggedDownArray` | Dynamic column lengths, host | -| `CSRArrayHost` | `CSRArray` | Compressed Sparse Row, host | -| `CSCArrayHost` | `CSCArray` | Compressed Sparse Column, host | -| `RaggedCArrayDevice` | `RaggedRightArrayKokkos` | Variable-length rows, device | -| `RaggedCArrayDual` | `DRaggedRightArrayKokkos` | Variable-length rows, dual | -| `RaggedFArrayDevice` | `RaggedDownArrayKokkos` | Variable-length columns, device | -| `DynamicRaggedCArrayDevice` | `DynamicRaggedRightArrayKokkos` | Dynamic rows, device | -| `DynamicRaggedFArrayDevice` | `DynamicRaggedDownArrayKokkos` | Dynamic columns, device | -| `CSRArrayDevice` | `CSRArrayKokkos` | Compressed Sparse Row, device | -| `CSCArrayDevice` | `CSCArrayKokkos` | Compressed Sparse Column, device | + +| Alias | Underlying Class | Description | +| ------------------------------ | ---------------------------------- | -------------------------------- | +| `RaggedCArrayHost` | `RaggedRightArray` | Variable-length rows, host | +| `RaggedFArrayHost` | `RaggedDownArray` | Variable-length columns, host | +| `DynamicRaggedCArrayHost` | `DynamicRaggedRightArray` | Dynamic row lengths, host | +| `DynamicRaggedFArrayHost` | `DynamicRaggedDownArray` | Dynamic column lengths, host | +| `CSRArrayHost` | `CSRArray` | Compressed Sparse Row, host | +| `CSCArrayHost` | `CSCArray` | Compressed Sparse Column, host | +| `RaggedCArrayDevice` | `RaggedRightArrayKokkos` | Variable-length rows, device | +| `RaggedCArrayDual` | `DRaggedRightArrayKokkos` | Variable-length rows, dual | +| `RaggedFArrayDevice` | `RaggedDownArrayKokkos` | Variable-length columns, device | +| `DynamicRaggedCArrayDevice` | `DynamicRaggedRightArrayKokkos` | Dynamic rows, device | +| `DynamicRaggedFArrayDevice` | `DynamicRaggedDownArrayKokkos` | Dynamic columns, device | +| `CSRArrayDevice` | `CSRArrayKokkos` | Compressed Sparse Row, device | +| `CSCArrayDevice` | `CSCArrayKokkos` | Compressed Sparse Column, device | + --- @@ -191,12 +229,12 @@ CArrayDevice b(10, 20); // 2D, 10x20 CArrayDevice c(10, 20, 30); // 3D CArrayDual d(10, 20, "my_label"); // Dual with optional debug label -// View types — wrap existing pointer, no allocation -ViewCArrayDevice v(ptr, 10, 20); // Wraps raw pointer with 2D shape +// View types — wrap existing data, no allocation +ViewCArrayDevice v(arr[0], 10, 20); // Wraps raw data with 2D shape // Host-only types CArray h(100, 100); // Host-only, owning -ViewCArray hv(raw_ptr, 100, 100); // Host-only, non-owning +ViewCArray hv(h(0,0), 100, 100); // Host-only, non-owning ``` ### Element Access @@ -233,7 +271,7 @@ View types can wrap a pointer into the **middle** of an existing array, reinterp CArrayDual B(num_blocks, rows, cols); // Create a 2D view into block 1: starts at &B(1, 0, 0), shape is (rows, cols) -ViewCArray block1(B.host_pointer() + 1 * rows * cols, rows, cols); +ViewCArray block1(&B(1,0,0), rows, cols); // block1(i, j) now aliases B(1, i, j) — no data copy for (int i = 0; i < rows; i++) @@ -241,9 +279,11 @@ for (int i = 0; i < rows; i++) block1(i, j) = some_value; // Device-side slicing for passing subarrays to functions -ViewCArrayDevice dev_slice(A.pointer() + offset, slice_len); -FOR_ALL(i, 0, slice_len, { - dev_slice(i) *= 2.0; +CArrayDual B(num_blocks, rows, cols); +FOR_ALL(i, 0, rows, + j, 0, cols { + ViewCArrayDevice B_slice(&B(1, 0, 0), rows, cols); + B_slice(i,j) = (double)i + (double)j; }); ``` @@ -251,20 +291,22 @@ This pattern naturally handles legacy codes that pass array slices to subroutine ### Common Methods -| Method | Description | -|--------|-------------| -| `size()` | Total number of elements | -| `extent()` | Same as `size()` | -| `dims(n)` | Size along dimension `n` (0-indexed for Arrays, 1-indexed for Matrices) | -| `order()` | Number of dimensions | -| `pointer()` | Raw pointer to underlying data. Available on Device and Host types only. **Dual types do NOT have `pointer()`** — use `device_pointer()` or `host_pointer()` instead. | -| `device_pointer()` | Raw pointer to device-side memory (Dual types only) | -| `host_pointer()` | Raw pointer to host-side memory (Dual types only) | -| `set_values(val)` | Set all elements to `val` via `parallel_for` on the device. **Does NOT fence** — follow with `MATAR_FENCE()` if host code depends on the result, or `update_host()` if you need the values on the host. | -| `update_host()` | Copy device data to host (Dual types only) | -| `update_device()` | Copy host data to device (Dual types only) | -| `get_kokkos_view()` | Access the underlying `Kokkos::View` (Device types) | -| `get_kokkos_dual_view()` | Access the underlying `Kokkos::DualView` (Dual types) | + +| Method | Description | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `size()` | Total number of elements | +| `extent()` | Same as `size()` | +| `dims(n)` | Size along dimension `n` (0-indexed for Arrays, 1-indexed for Matrices) | +| `order()` | Number of dimensions | +| `pointer()` | Raw pointer to underlying data. Available on Device and Host types only. **Dual types do NOT have `pointer()`** — use `device_pointer()` or `host_pointer()` instead. | +| `device_pointer()` | Raw pointer to device-side memory (Dual types only) | +| `host_pointer()` | Raw pointer to host-side memory (Dual types only) | +| `set_values(val)` | Set all elements to `val` via `parallel_for` on the device. **Does NOT fence** — follow with `MATAR_FENCE()` if host code depends on the result, or `update_host()` if you need the values on the host. | +| `update_host()` | Copy device data to host (Dual types only) | +| `update_device()` | Copy host data to device (Dual types only) | +| `get_kokkos_view()` | Access the underlying `Kokkos::View` (Device types) | +| `get_kokkos_dual_view()` | Access the underlying `Kokkos::DualView` (Dual types) | + --- @@ -348,15 +390,17 @@ RUN({ The macros use variadic dispatch based on argument count. This table shows the exact signatures: -| Macro | 1D args (4) | 2D args (7) | 3D args (10) | -|-------|------------|------------|-------------| -| `FOR_ALL` | `(i, lo, hi, {body})` | `(i, lo, hi, j, lo, hi, {body})` | `(i, lo, hi, j, lo, hi, k, lo, hi, {body})` | -| `DO_ALL` | same pattern | same pattern | same pattern | -| `FOR_REDUCE_SUM` | `(i, lo, hi, var, {body}, result)` (6 args) | `(i, lo, hi, j, lo, hi, var, {body}, result)` (9 args) | `(i, lo, hi, j, lo, hi, k, lo, hi, var, {body}, result)` (12 args) | -| `FOR_REDUCE_MAX` | same as SUM | same as SUM | same as SUM | -| `FOR_REDUCE_MIN` | same as SUM | same as SUM | same as SUM | + +| Macro | 1D args (4) | 2D args (7) | 3D args (10) | +| -------------------- | ------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------ | +| `FOR_ALL` | `(i, lo, hi, {body})` | `(i, lo, hi, j, lo, hi, {body})` | `(i, lo, hi, j, lo, hi, k, lo, hi, {body})` | +| `DO_ALL` | same pattern | same pattern | same pattern | +| `FOR_REDUCE_SUM` | `(i, lo, hi, var, {body}, result)` (6 args) | `(i, lo, hi, j, lo, hi, var, {body}, result)` (9 args) | `(i, lo, hi, j, lo, hi, k, lo, hi, var, {body}, result)` (12 args) | +| `FOR_REDUCE_MAX` | same as SUM | same as SUM | same as SUM | +| `FOR_REDUCE_MIN` | same as SUM | same as SUM | same as SUM | | `FOR_REDUCE_PRODUCT` | `(i, lo, hi, var, {body}, result)` (6 args) | `(i, lo, hi, j, lo, hi, var, {body}, result)` (9 args) | `(i, lo, hi, j, lo, hi, k, lo, hi, var, {body}, result)` (12 args) | + The `DO_REDUCE_*` variants follow the same argument patterns as `FOR_REDUCE_*` but with inclusive upper bounds **for SUM/MAX/MIN**. There is no `DO_REDUCE_PRODUCT` macro in the current `macros.h`. ### `FOR_ALL` — C-style half-open range `[start, end)` @@ -403,6 +447,7 @@ FOR_ALL(i, 0, N, j, 0, M, { For finer-grained control over nested parallelism (e.g., tiled algorithms), use the hierarchical `FOR_FIRST`/`FOR_SECOND`/`FOR_THIRD` macros described in Section 7. **Loop ordering:** The inner loop varies the fastest and the outer loop varies the slowest. For optimal cache performance, match the loop index order to the data type's memory layout: + - `CArray` / `CArrayDevice`: last index fastest → `FOR_ALL(i, ..., j, ..., { a(i,j) })` is correct - `FArray` / `FArrayDevice`: first index fastest → `FOR_ALL(i, ..., j, ..., { a(j,i) })` maps the fast-varying loop index to the fast-varying array index @@ -427,9 +472,10 @@ DO_ALL(j, 1, M, ``` **When to use which:** + - `FOR_ALL` with `CArray`/`CArrayDevice`: C-style code, 0-based indexing, `[0, N)` - `DO_ALL` with `FMatrix`/`FMatrixDevice`: Fortran-style code, 1-based indexing, `[1, N]` -- The same `FOR_ALL` vs `DO_ALL` distinction applies to all reduction variants (`FOR_REDUCE_*` vs `DO_REDUCE_*`) +- The same `FOR_ALL` vs `DO_ALL` distinction applies to all reduction variants (`FOR_REDUCE_`* vs `DO_REDUCE_*`) ### `RUN` — Execute Once on Device @@ -592,6 +638,7 @@ FOR_FIRST(i, 0, N, { ``` Available hierarchical reductions: + - `FOR_REDUCE_SUM_SECOND`, `FOR_REDUCE_SUM_THIRD` - `DO_REDUCE_SUM_SECOND`, `DO_REDUCE_SUM_THIRD` - `FOR_REDUCE_MAX_SECOND`, `DO_REDUCE_MAX_THIRD` @@ -626,6 +673,7 @@ class MySimulation { ``` Available `_CLASS` variants: + - `FOR_ALL_CLASS` - `FOR_REDUCE_SUM_CLASS` - `FOR_REDUCE_MAX_CLASS` @@ -667,6 +715,7 @@ for (int i = 0; i < N; i++) ``` **Key rules:** + - Inside `FOR_ALL`/`RUN` blocks: use `data(i, j)` — this is the device accessor. - Outside parallel regions on the CPU: use `data.host(i, j)`. - Call `update_device()` after host writes, before device reads. @@ -679,36 +728,42 @@ for (int i = 0; i < N; i++) ### Step 1: Replace Data Structures -| Original C++ | MATAR Replacement | When to Use | -|---|---|---| -| `double a[N]` | `CArrayDevice a(N)` | Device-only computation | -| `double a[N][M]` | `CArrayDevice a(N, M)` | Device-only computation | -| `double a[N][M][P]` | `CArrayDevice a(N, M, P)` | Device-only computation | -| `double a[N]` | `CArrayDual a(N)` | Need host and device access | -| `double* a = new double[N]` | `CArrayDevice a(N)` | Dynamic allocation → MATAR | -| `std::vector a(N)` | `CArrayDual a(N)` | Need both host and device | -| `double a[N]` (used as function arg) | `ViewCArrayDevice a(ptr, N)` | Wrap existing memory | -| Fortran-style 1-based `A(i,j)` | `FMatrixDevice A(N, M)` | Fortran interop | + +| Original C++ | MATAR Replacement | When to Use | +| ------------------------------------ | ------------------------------------ | --------------------------- | +| `double a[N]` | `CArrayDevice a(N)` | Device-only computation | +| `double a[N][M]` | `CArrayDevice a(N, M)` | Device-only computation | +| `double a[N][M][P]` | `CArrayDevice a(N, M, P)` | Device-only computation | +| `double a[N]` | `CArrayDual a(N)` | Need host and device access | +| `double* a = new double[N]` | `CArrayDevice a(N)` | Dynamic allocation → MATAR | +| `std::vector a(N)` | `CArrayDual a(N)` | Need both host and device | +| `double a[N]` (used as function arg) | `ViewCArrayDevice a(ptr, N)` | Wrap existing memory | +| Fortran-style 1-based `A(i,j)` | `FMatrixDevice A(N, M)` | Fortran interop | + ### Step 2: Replace Index Access -| Original | MATAR | -|----------|-------| -| `a[i]` | `a(i)` | -| `a[i][j]` | `a(i, j)` | -| `a[i][j][k]` | `a(i, j, k)` | -| Pointer arithmetic `*(a + i*M + j)` | `a(i, j)` | + +| Original | MATAR | +| ----------------------------------- | ------------ | +| `a[i]` | `a(i)` | +| `a[i][j]` | `a(i, j)` | +| `a[i][j][k]` | `a(i, j, k)` | +| Pointer arithmetic `*(a + i*M + j)` | `a(i, j)` | + ### Step 3: Replace Loops with Macros -| Original | MATAR | Notes | -|----------|-------|-------| -| `for (int i = 0; i < N; i++) { ... }` | `FOR_ALL(i, 0, N, { ... });` | Half-open range | -| `for (int i = 0; i < N; i++) for (int j = 0; j < M; j++) { ... }` | `FOR_ALL(i, 0, N, j, 0, M, { ... });` | 2D parallel | -| `for (int i = 1; i <= N; i++) { ... }` | `DO_ALL(i, 1, N, { ... });` | Inclusive range | -| Serial `for` with accumulator `sum += ...` | `FOR_REDUCE_SUM(i, 0, N, loc, { loc += ...; }, sum);` | Parallel reduction | -| Serial `for` tracking max | `FOR_REDUCE_MAX(i, 0, N, loc, { if(x > loc) loc = x; }, max_val);` | Max reduction | -| Serial `for` tracking min | `FOR_REDUCE_MIN(i, 0, N, loc, { if(x < loc) loc = x; }, min_val);` | Min reduction | + +| Original | MATAR | Notes | +| ----------------------------------------------------------------- | ------------------------------------------------------------------ | ------------------ | +| `for (int i = 0; i < N; i++) { ... }` | `FOR_ALL(i, 0, N, { ... });` | Half-open range | +| `for (int i = 0; i < N; i++) for (int j = 0; j < M; j++) { ... }` | `FOR_ALL(i, 0, N, j, 0, M, { ... });` | 2D parallel | +| `for (int i = 1; i <= N; i++) { ... }` | `DO_ALL(i, 1, N, { ... });` | Inclusive range | +| Serial `for` with accumulator `sum += ...` | `FOR_REDUCE_SUM(i, 0, N, loc, { loc += ...; }, sum);` | Parallel reduction | +| Serial `for` tracking max | `FOR_REDUCE_MAX(i, 0, N, loc, { if(x > loc) loc = x; }, max_val);` | Max reduction | +| Serial `for` tracking min | `FOR_REDUCE_MIN(i, 0, N, loc, { if(x < loc) loc = x; }, min_val);` | Min reduction | + ### Step 4: Handle Dependencies and Synchronization @@ -722,17 +777,22 @@ When multiple parallel iterations write to the same output location: ```cpp // WRONG: race condition — multiple (i,j,k) combos update the same C(i,j) -FOR_ALL(i, 0, N, j, 0, N, k, 0, N, { +FOR_ALL(i, 0, N, + j, 0, N, + k, 0, N, { C(i, j) += A(i, k) * B(k, j); }); // FIX OPTION 1: Atomics -FOR_ALL(i, 0, N, j, 0, N, k, 0, N, { +FOR_ALL(i, 0, N, + j, 0, N, + k, 0, N, { Kokkos::atomic_add(&C(i, j), A(i, k) * B(k, j)); }); // FIX OPTION 2: Inner serial loop (preferred for matmul) -FOR_ALL(i, 0, N, j, 0, N, { +FOR_ALL(i, 0, N, + j, 0, N, { double local_sum = 0.0; for (int k = 0; k < N; k++) { local_sum += A(i, k) * B(k, j); @@ -787,11 +847,12 @@ MATAR_INITIALIZE(argc, argv); CArrayDual temperature(height + 2, width + 2); CArrayDual temperature_previous(height + 2, width + 2); - // Initialize + // Initialize on device temperature_previous.set_values(0.0); FOR_ALL(i, 0, height + 2, { temperature_previous(i, width + 1) = (1000.0 / height) * i; }); + MATAR_FENCE(); temperature_previous.update_host(); while (worst_dt > tolerance) { @@ -845,7 +906,7 @@ for (int i = 0; i < 10; i++) sum += A[i] * A[i]; ```cpp CArrayDevice A(10); FOR_ALL(i, 0, 10, { - A(i) = 314; + A(i) = 314; // OR, initialize using A.set_values(314); }); int loc_sum = 0, sum = 0; @@ -870,8 +931,13 @@ for (int i = 0; i < N; i++) **After:** `PSEUDOCODE_PATTERN` ```cpp -CArrayDevice A(N, N), B(N, N), C(N, N); +CArrayDevice A(N, N); +CArrayDevice B(N, N); +CArrayDevice C(N, N); // ... initialize A, B ... +A.set_values(1.0); +B.set_values(1.0); +C.set_values(0.0); FOR_ALL(i, 0, N, j, 0, N, { double local_sum = 0.0; @@ -906,7 +972,8 @@ for (int i = 0; i < 3; i++) ViewCArrayDual view(&some_array[0], 3, 3); // Now usable in parallel regions on device -FOR_ALL(i, 0, 3, j, 0, 3, { +FOR_ALL(i, 0, 3, + j, 0, 3, { view(i, j) *= 2; }); MATAR_FENCE(); @@ -941,12 +1008,14 @@ FOR_ALL(i, 0, num_shapes, { ```cpp // BAD: inner loop on i with CArray (row-major) — poor cache behavior -FOR_ALL(j, 0, M, i, 0, N, { +FOR_ALL(j, 0, M, + i, 0, N, { carr(i, j) = ...; // i changes in inner loop but is the slow index in CArray }); // GOOD: inner loop on j matches CArray row-major layout -FOR_ALL(i, 0, N, j, 0, M, { +FOR_ALL(i, 0, N, + j, 0, M, { carr(i, j) = ...; // j changes in inner loop = fast index = good cache use }); ``` @@ -955,23 +1024,35 @@ FOR_ALL(i, 0, N, j, 0, M, { ```cpp // BAD: immediately reading on host without fence -FOR_ALL(i, 0, N, { a(i) = compute(i); }); -double val = a.host(0); // may not reflect device computation +FOR_ALL(i, 0, N, { + a(i) = compute(i); +}); +double val = a.host(0); // will not reflect device computation without a fence and a.update_host() // GOOD: fence then sync to host -FOR_ALL(i, 0, N, { a(i) = compute(i); }); +FOR_ALL(i, 0, N, { + a(i) = compute(i); +}); MATAR_FENCE(); a.update_host(); double val = a.host(0); // correct // BAD: unnecessary fence between independent kernels (kills performance) -FOR_ALL(i, 0, N, { A(i) = i; }); +FOR_ALL(i, 0, N, { + A(i) = i; +}); MATAR_FENCE(); // wasteful — B doesn't depend on A -FOR_ALL(i, 0, N, { B(i) = i * 2; }); +FOR_ALL(i, 0, N, { + B(i) = i * 2; +}); // GOOD: no fence needed for independent data -FOR_ALL(i, 0, N, { A(i) = i; }); -FOR_ALL(i, 0, N, { B(i) = i * 2; }); +FOR_ALL(i, 0, N, { + A(i) = i; +}); +FOR_ALL(i, 0, N, { + B(i) = i * 2; +}); // NOTE: reduction results are immediately available (implicit fence) double loc = 0.0, total = 0.0; @@ -987,7 +1068,7 @@ See the [Fence Placement Rules](#fence-placement-rules) in Section 1 for the com CArrayDual data(N); for (int i = 0; i < N; i++) data.host(i) = i; // writes to host copy -// BAD: device copy is stale +// BAD: device copy is stale (not copied to device) FOR_ALL(i, 0, N, { data(i) *= 2.0; }); // reads stale device data // GOOD: @@ -1105,21 +1186,18 @@ using u_int = unsigned int; Use this decision tree: 1. **Do you need host and device access?** - - Yes → Use a **Dual** type - - No, device only → Use a **Device** type - - No, host only → Use a **Host** type - + - Yes → Use a **Dual** type + - No, device only → Use a **Device** type + - No, host only → Use a **Host** type 2. **Do you own the memory?** - - Yes → Use an owning type (no `View` prefix) - - No, wrapping existing pointer → Use a **View** type - + - Yes → Use an owning type (no `View` prefix) + - No, wrapping existing pointer → Use a **View** type 3. **What index convention?** - - 0-based `[0, N)` → Use an **Array** type - - 1-based `[1, N]` → Use a **Matrix** type - + - 0-based `[0, N)` → Use an **Array** type + - 1-based `[1, N]` → Use a **Matrix** type 4. **What memory layout?** - - Row-major (C-style, last index fastest) → Use a **C** prefix - - Column-major (Fortran-style, first index fastest) → Use an **F** prefix + - Row-major (C-style, last index fastest) → Use a **C** prefix + - Column-major (Fortran-style, first index fastest) → Use an **F** prefix **Example decision:** "I need a 2D row-major 0-based array that lives on both host and device" → `CArrayDual a(N, M);` @@ -1135,16 +1213,18 @@ Code inside `FOR_ALL`, `DO_ALL`, `RUN`, and reduction macros runs on the device ### What You Cannot Do Inside Device Kernels -| Forbidden | Why | Alternative | -|-----------|-----|-------------| -| `std::cout`, `std::cerr` | Not available on GPU | Use `printf` | -| `new` / `delete` / `malloc` / `free` | Host memory allocators | Use `Kokkos::kokkos_malloc` / placement `new` | -| `std::vector`, `std::map`, etc. | STL containers are host-only | Use MATAR arrays | -| `throw` / `try` / `catch` | Exceptions not supported on GPU | Use error codes or assertions | -| File I/O (`fopen`, `fstream`) | No filesystem on GPU | Do I/O on host, sync with `update_host()` | -| Virtual function calls | Only work via placement `new` on device memory | See Example E in Section 11 | -| `std::string` | Host-only | Use `const char*` or integer codes | -| Recursive function calls | Limited/unsupported on some GPU architectures | Rewrite iteratively | + +| Forbidden | Why | Alternative | +| ------------------------------------ | ---------------------------------------------- | --------------------------------------------- | +| `std::cout`, `std::cerr` | Not available on GPU | Use `printf` | +| `new` / `delete` / `malloc` / `free` | Host memory allocators | Use `Kokkos::kokkos_malloc` / placement `new` | +| `std::vector`, `std::map`, etc. | STL containers are host-only | Use MATAR arrays | +| `throw` / `try` / `catch` | Exceptions not supported on GPU | Use error codes or assertions | +| File I/O (`fopen`, `fstream`) | No filesystem on GPU | Do I/O on host, sync with `update_host()` | +| Virtual function calls | Only work via placement `new` on device memory | See Example E in Section 11 | +| `std::string` | Host-only | Use `const char`* or integer codes | +| Recursive function calls | Limited/unsupported on some GPU architectures | Rewrite iteratively | + ### Lambda Capture Rules @@ -1160,6 +1240,7 @@ FOR_ALL(i, 0, N, { ``` You **cannot** capture: + - Host-only objects (`std::vector`, `std::string`, etc.) - References to local variables that won't exist on the device - Large stack objects (copies are expensive) @@ -1247,6 +1328,7 @@ end program ``` Key rules: + - Use `ViewFMatrixDual` to wrap Fortran arrays (column-major, 1-based). - Use `DO_ALL` with 1-based ranges for natural Fortran indexing. - C++ functions callable from Fortran must use `extern "C"` and trailing `_` in the name. @@ -1265,15 +1347,17 @@ target_link_libraries(my_target matar) ### Preprocessor Defines -| Define | Effect | -|--------|--------| -| `HAVE_KOKKOS` | Enables Kokkos-backed device/dual types and parallel macros | -| `HAVE_CUDA` | Kokkos execution space = CUDA; layout defaults to `LayoutLeft` | -| `HAVE_HIP` | Kokkos execution space = HIP | -| `HAVE_OPENMP` | Kokkos execution space = OpenMP; layout = `LayoutRight` | -| `HAVE_THREADS` | Kokkos execution space = Threads | -| `HAVE_MPI` | Enables MPI types (`MPICArrayKokkos`, `PartitionMap`, etc.) | -| `TRILINOS_INTERFACE` | Enables Tpetra wrapper types | + +| Define | Effect | +| -------------------- | -------------------------------------------------------------- | +| `HAVE_KOKKOS` | Enables Kokkos-backed device/dual types and parallel macros | +| `HAVE_CUDA` | Kokkos execution space = CUDA; layout defaults to `LayoutLeft` | +| `HAVE_HIP` | Kokkos execution space = HIP | +| `HAVE_OPENMP` | Kokkos execution space = OpenMP; layout = `LayoutRight` | +| `HAVE_THREADS` | Kokkos execution space = Threads | +| `HAVE_MPI` | Enables MPI types (`MPICArrayKokkos`, `PartitionMap`, etc.) | +| `TRILINOS_INTERFACE` | Enables Tpetra wrapper types | + When no GPU backend is specified with Kokkos, the default execution space and layout from Kokkos are used. @@ -1302,19 +1386,22 @@ MATAR provides MPI-aware data types and a communication plan abstraction for dis ### When to Use MPICArrayKokkos vs. Plain DCArrayKokkos -| Type | Use Case | -|------|----------| -| `DCArrayKokkos` (`CArrayDual`) | Single-rank data. No MPI communication needed. Host/device sync only. | -| `MPICArrayKokkos` | Distributed data with ghost/halo regions. Wraps `DCArrayKokkos` internally and adds MPI send/recv buffer management, a `CommunicationPlan`, and a `communicate()` method that handles the full pack → exchange → unpack cycle. | + +| Type | Use Case | +| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `DCArrayKokkos` (`CArrayDual`) | Single-rank data. No MPI communication needed. Host/device sync only. | +| `MPICArrayKokkos` | Distributed data with ghost/halo regions. Wraps `DCArrayKokkos` internally and adds MPI send/recv buffer management, a `CommunicationPlan`, and a `communicate()` method that handles the full pack → exchange → unpack cycle. | + Use `MPICArrayKokkos` when a data array is partitioned across MPI ranks and neighboring ranks need to exchange boundary (ghost/halo) data. Use plain `DCArrayKokkos` for rank-local data that never crosses MPI boundaries. ### MPICArrayKokkos Overview `MPICArrayKokkos` is a template class that wraps a `DCArrayKokkos` with: + - **Send and receive buffers** (`DCArrayKokkos`) for packing/unpacking halo data - A **stride** value computed from trailing dimensions (for multi-dimensional arrays, each first-index element is a contiguous block of `dim1 * dim2 * ... * dimN` values) -- A pointer to a shared **`CommunicationPlan`** that defines the neighbor topology +- A pointer to a shared `**CommunicationPlan`** that defines the neighbor topology - A `host` member (`ViewCArray`) for convenient host-side access ```cpp @@ -1362,19 +1449,21 @@ comm_plan.setup_send_recv(rank_send_ids, rank_recv_ids); #### CommunicationPlan Members -| Member | Type | Description | -|--------|------|-------------| -| `mpi_comm_world` | `MPI_Comm` | The base MPI communicator | -| `mpi_comm_graph` | `MPI_Comm` | Distributed graph communicator for neighbor collectives | -| `num_send_ranks` | `int` | Number of ranks this process sends to (out-degree) | -| `num_recv_ranks` | `int` | Number of ranks this process receives from (in-degree) | -| `send_rank_ids` | `DCArrayKokkos` | Destination rank IDs | -| `recv_rank_ids` | `DCArrayKokkos` | Source rank IDs | -| `send_indices_` | `DRaggedRightArrayKokkos` | Per-neighbor indices of elements to send | -| `recv_indices_` | `DRaggedRightArrayKokkos` | Per-neighbor indices of elements to receive | -| `send_counts_` / `recv_counts_` | `DCArrayKokkos` | Number of elements per neighbor | -| `send_displs_` / `recv_displs_` | `DCArrayKokkos` | Displacement offsets for packing | -| `total_send_count` / `total_recv_count` | `int` | Total elements across all neighbors | + +| Member | Type | Description | +| --------------------------------------- | ------------------------------ | ------------------------------------------------------- | +| `mpi_comm_world` | `MPI_Comm` | The base MPI communicator | +| `mpi_comm_graph` | `MPI_Comm` | Distributed graph communicator for neighbor collectives | +| `num_send_ranks` | `int` | Number of ranks this process sends to (out-degree) | +| `num_recv_ranks` | `int` | Number of ranks this process receives from (in-degree) | +| `send_rank_ids` | `DCArrayKokkos` | Destination rank IDs | +| `recv_rank_ids` | `DCArrayKokkos` | Source rank IDs | +| `send_indices_` | `DRaggedRightArrayKokkos` | Per-neighbor indices of elements to send | +| `recv_indices_` | `DRaggedRightArrayKokkos` | Per-neighbor indices of elements to receive | +| `send_counts_` / `recv_counts_` | `DCArrayKokkos` | Number of elements per neighbor | +| `send_displs_` / `recv_displs_` | `DCArrayKokkos` | Displacement offsets for packing | +| `total_send_count` / `total_recv_count` | `int` | Total elements across all neighbors | + ### Connecting MPICArrayKokkos to a CommunicationPlan @@ -1392,13 +1481,10 @@ field.initialize_comm_plan(comm_plan); Calling `field.communicate()` performs the full halo exchange: -1. **`fill_send_buffer()`**: Copies `this_array_` to host (`update_host`), then packs elements listed in `send_indices_` into the contiguous `send_buffer_` on the host. For multi-dimensional arrays, each element index packs `stride_` contiguous values. - -2. **`MPI_Neighbor_alltoallv()`**: Exchanges data with all neighbors using the graph communicator. Send/recv counts and displacements are pre-computed and scaled by stride. - -3. **`copy_recv_buffer()`**: Unpacks `recv_buffer_` into the ghost positions of `this_array_` using `recv_indices_`. - -4. **`update_device()`**: Syncs the updated array (with fresh ghost data) back to the device. +1. `**fill_send_buffer()**`: Copies `this_array_` to host (`update_host`), then packs elements listed in `send_indices_` into the contiguous `send_buffer_` on the host. For multi-dimensional arrays, each element index packs `stride_` contiguous values. +2. `**MPI_Neighbor_alltoallv()**`: Exchanges data with all neighbors using the graph communicator. Send/recv counts and displacements are pre-computed and scaled by stride. +3. `**copy_recv_buffer()**`: Unpacks `recv_buffer_` into the ghost positions of `this_array_` using `recv_indices_`. +4. `**update_device()**`: Syncs the updated array (with fresh ghost data) back to the device. ```cpp // COMPILES_AS_IS (assuming variables are declared) @@ -1419,18 +1505,21 @@ FOR_ALL(i, 0, num_owned + num_ghost, { ### Interaction Between update_host/update_device and MPI The `communicate()` method internally calls `update_host()` before packing and `update_device()` after unpacking. This means: + - Before calling `communicate()`, ensure device data is current (place a `MATAR_FENCE()` after any kernel that wrote to the array). - After `communicate()` returns, the device copy already has fresh ghost data — no additional `update_device()` is needed. - If you need to inspect the data on the host after communication, call `update_host()` explicitly. ### MPI Convenience Macros -| Macro | Expands To | -|-------|-----------| -| `MATAR_MPI_INIT` | `MPI_Init(&argc, &argv)` | -| `MATAR_MPI_FINALIZE` | `MPI_Finalize()` | -| `MATAR_MPI_TIME` | `MPI_Wtime()` | -| `MATAR_MPI_BARRIER` | `MPI_Barrier(MPI_COMM_WORLD)` | + +| Macro | Expands To | +| -------------------- | ----------------------------- | +| `MATAR_MPI_INIT` | `MPI_Init(&argc, &argv)` | +| `MATAR_MPI_FINALIZE` | `MPI_Finalize()` | +| `MATAR_MPI_TIME` | `MPI_Wtime()` | +| `MATAR_MPI_BARRIER` | `MPI_Barrier(MPI_COMM_WORLD)` | + ### Complete MPI Example Pattern @@ -1503,33 +1592,27 @@ int main(int argc, char* argv[]) { Use these constraints as hard rules when generating MATAR code: 1. **Use only macros that exist in `MATAR/src/include/macros.h`.** - - **Flat parallel loops:** `FOR_ALL`, `DO_ALL`, `RUN` - - **Serial convenience loops:** `FOR_LOOP`, `DO_LOOP` (and stride variants `FOR_LOOP_DIM` / `DO_LOOP_DIM` for loops with a delta/increment) - - **Reductions (half-open):** `FOR_REDUCE_SUM`, `FOR_REDUCE_MAX`, `FOR_REDUCE_MIN`, `FOR_REDUCE_PRODUCT` - - **Reductions (inclusive):** `DO_REDUCE_SUM`, `DO_REDUCE_MAX`, `DO_REDUCE_MIN` — **`DO_REDUCE_PRODUCT` does NOT exist** - - **Class-member variants** (use `KOKKOS_CLASS_LAMBDA`): `FOR_ALL_CLASS`, `RUN_CLASS`, `FOR_REDUCE_SUM_CLASS`, `FOR_REDUCE_MAX_CLASS`, `FOR_REDUCE_MIN_CLASS` - - **Hierarchical (team/thread/vector):** `FOR_FIRST`, `FOR_SECOND`, `FOR_THIRD`, `DO_FIRST`, `DO_SECOND`, `DO_THIRD` - - **Hierarchical reductions:** `FOR_REDUCE_SUM_SECOND`, `FOR_REDUCE_SUM_THIRD`, `DO_REDUCE_SUM_SECOND`, `DO_REDUCE_SUM_THIRD`, `FOR_REDUCE_MAX_SECOND`, `DO_REDUCE_MAX_THIRD`, `FOR_REDUCE_MIN_SECOND`, `DO_REDUCE_MIN_THIRD` - -2. **`FOR_ALL` and `DO_ALL` parallelize all listed dimensions.** - - If a dimension must remain sequential, keep it as a plain inner `for` loop inside the macro body, or use hierarchical macros where appropriate. - + - **Flat parallel loops:** `FOR_ALL`, `DO_ALL`, `RUN` + - **Serial convenience loops:** `FOR_LOOP`, `DO_LOOP` (and stride variants `FOR_LOOP_DIM` / `DO_LOOP_DIM` for loops with a delta/increment) + - **Reductions (half-open):** `FOR_REDUCE_SUM`, `FOR_REDUCE_MAX`, `FOR_REDUCE_MIN`, `FOR_REDUCE_PRODUCT` + - **Reductions (inclusive):** `DO_REDUCE_SUM`, `DO_REDUCE_MAX`, `DO_REDUCE_MIN` — `**DO_REDUCE_PRODUCT` does NOT exist** + - **Class-member variants** (use `KOKKOS_CLASS_LAMBDA`): `FOR_ALL_CLASS`, `RUN_CLASS`, `FOR_REDUCE_SUM_CLASS`, `FOR_REDUCE_MAX_CLASS`, `FOR_REDUCE_MIN_CLASS` + - **Hierarchical (team/thread/vector):** `FOR_FIRST`, `FOR_SECOND`, `FOR_THIRD`, `DO_FIRST`, `DO_SECOND`, `DO_THIRD` + - **Hierarchical reductions:** `FOR_REDUCE_SUM_SECOND`, `FOR_REDUCE_SUM_THIRD`, `DO_REDUCE_SUM_SECOND`, `DO_REDUCE_SUM_THIRD`, `FOR_REDUCE_MAX_SECOND`, `DO_REDUCE_MAX_THIRD`, `FOR_REDUCE_MIN_SECOND`, `DO_REDUCE_MIN_THIRD` +2. `**FOR_ALL` and `DO_ALL` parallelize all listed dimensions.** + - If a dimension must remain sequential, keep it as a plain inner `for` loop inside the macro body, or use hierarchical macros where appropriate. 3. **Indexing is always `()` for MATAR arrays.** - - Never emit `[]` indexing for MATAR types. - + - Never emit `[]` indexing for MATAR types. 4. **For dual types, synchronize explicitly.** - - Host writes require `update_device()` before device reads. - - Device writes require `update_host()` before host reads. - + - Host writes require `update_device()` before device reads. + - Device writes require `update_host()` before host reads. 5. **Fence only when needed.** - - Required at dependency boundaries or before host/timing use. - - Avoid unnecessary fences between independent kernels. - + - Required at dependency boundaries or before host/timing use. + - Avoid unnecessary fences between independent kernels. 6. **Prefer aliases in `mtr::` namespace.** - - Example: `CArrayDual` instead of direct underlying class names unless low-level control is needed. - + - Example: `CArrayDual` instead of direct underlying class names unless low-level control is needed. 7. **MPI-aware distributed arrays use `MPICArrayKokkos`.** - - Plain `DCArrayKokkos` / `CArrayDual` are rank-local and do not perform communication. + - Plain `DCArrayKokkos` / `CArrayDual` are rank-local and do not perform communication. --- @@ -1541,29 +1624,25 @@ When producing MATAR conversions, output must satisfy this checklist: - Include `#include ` - Use `using namespace mtr;` - Wrap compute region between `MATAR_INITIALIZE(...)` and `MATAR_FINALIZE()` - - **Data structure mapping** - - Choose `C*` vs `F*` based on target layout/coalescing goals + - Choose `C`* vs `F*` based on target layout/coalescing goals - Choose `Array` vs `Matrix` based on 0-based vs 1-based indexing - Choose `Device` vs `Dual` vs `Host` based on execution/sync needs - Use `View*` when wrapping existing pointers or slices - - **Loop mapping** - Replace parallelizable loops with `FOR_ALL` / `DO_ALL` - - Replace reductions with `FOR_REDUCE_*` / `DO_REDUCE_*` + - Replace reductions with `FOR_REDUCE_`* / `DO_REDUCE_*` - Keep dependency-carrying inner loops serial inside macro bodies - - **Correctness and synchronization** - Add `update_device()` / `update_host()` where required - Add `MATAR_FENCE()` only at true synchronization boundaries - Avoid race conditions (use reductions, atomics, or serial inner loops) - - **MPI (if distributed)** - Build `CommunicationPlan` - Attach with `initialize_comm_plan(...)` - Use `communicate()` at halo exchange points - - **Output quality** - Label examples as either: - `COMPILES_AS_IS` (fully concrete except obvious variable declarations), or - `PSEUDOCODE_PATTERN` (contains app-specific placeholders) +