From e35c9619b14f7b815be3c565df8b12412b1ddbcd Mon Sep 17 00:00:00 2001
From: Jacob Moore <jacoblinleymoore@gmail.com>
Date: Thu, 7 May 2026 14:35:53 -0500
Subject: [PATCH] ENH: Correcting minor errors in LLM context file

---
 MATAR_LLM_CONTEXT.md | 551 +++++++++++++++++++++++++------------------
 1 file changed, 315 insertions(+), 236 deletions(-)

diff --git a/MATAR_LLM_CONTEXT.md b/MATAR_LLM_CONTEXT.md
index 4d6d5201..d66a31fd 100644
--- a/MATAR_LLM_CONTEXT.md
+++ b/MATAR_LLM_CONTEXT.md
@@ -1,6 +1,6 @@
 # MATAR Code Conversion Reference
 
-This document provides comprehensive context for converting existing C/C++ (and Fortran-interop) code to use the MATAR library. MATAR (Multi-dimensional Arrays, Tensors, And Ragged data structures) is a header-only C++ library that provides performance-portable data structures and parallel loop macros. It abstracts over Kokkos to support CPU serial, OpenMP, Pthreads, CUDA, and HIP backends with a single source code.
+This document provides comprehensive context for converting existing C/C++(and Fortran-interop) code to use the MATAR library. MATAR (Multi-dimensional Arrays, Tensors, And Ragged data structures) is a header-only C++ library that provides performance-portable data structures and parallel loop macros. It abstracts over Kokkos to support CPU serial, OpenMP, Pthreads, CUDA, and HIP backends with a single source code.
 
 ---
 
@@ -49,11 +49,35 @@ int main(int argc, char* argv[])
 }
 ```
 
-| Macro | Behavior with Kokkos | Behavior without Kokkos |
-|-------|----------------------|------------------------|
-| `MATAR_INITIALIZE(...)` | `Kokkos::initialize(...)` | No-op |
-| `MATAR_FINALIZE()` | `Kokkos::finalize()` | No-op |
-| `MATAR_FENCE()` | `Kokkos::fence()` | No-op |
+If building with MPI, the main function also requires:
+
+```cpp
+#include <mpi.h>
+#include <matar.h>
+using namespace mtr;
+
+int main(int argc, char* argv[])
+{
+    MPI_Init(&argc, &argv);
+    MATAR_INITIALIZE(argc, argv);  // or MATAR_INITIALIZE() without args
+    {
+        // All MATAR data structures and parallel work go here.
+        // The scoping braces ensure all MATAR objects are destroyed
+        // before MATAR_FINALIZE is called.
+    }
+    MATAR_FINALIZE();
+    MPI_Finalize();
+    return 0;
+}
+```
+
+
+| Macro                   | Behavior with Kokkos      | Behavior without Kokkos |
+| ----------------------- | ------------------------- | ----------------------- |
+| `MATAR_INITIALIZE(...)` | `Kokkos::initialize(...)` | No-op                   |
+| `MATAR_FINALIZE()`      | `Kokkos::finalize()`      | No-op                   |
+| `MATAR_FENCE()`         | `Kokkos::fence()`         | No-op                   |
+
 
 `MATAR_FENCE()` is a synchronization barrier. It ensures all device operations have completed before the CPU proceeds. See the [fence placement rules](#fence-placement-rules) below for when fences are required vs. redundant.
 
@@ -61,15 +85,17 @@ int main(int argc, char* argv[])
 
 A fence is **needed** between two `FOR_ALL` blocks only if the second reads data written by the first. Over-fencing kills performance; under-fencing causes correctness bugs. Apply these rules:
 
-| Situation | Fence Needed? | Why |
-|-----------|:---:|-----|
-| `FOR_ALL` writes to `A`, next `FOR_ALL` reads `A` | **Yes** | Data dependency between kernels |
-| `FOR_ALL` writes to `A`, next `FOR_ALL` writes/reads only `B` | **No** | Independent data, no dependency |
-| `FOR_ALL` writes to `A`, then host code reads `A.host(i)` | **Yes** | Host reads device-written data |
-| `FOR_REDUCE_SUM(..., total)`, then host uses `total` | **No** | Reduction macros implicitly fence on the result variable |
-| `FOR_ALL` followed by `update_host()` | **Yes** | Must ensure kernel completes before sync |
-| `FOR_ALL` followed by timing (`std::chrono`) | **Yes** | Timer must see completed work |
-| Two `FOR_ALL` blocks, both only read (no writes) | **No** | Read-only kernels have no conflict |
+
+| Situation                                                     | Fence Needed? | Why                                                      |
+| ------------------------------------------------------------- | ------------- | -------------------------------------------------------- |
+| `FOR_ALL` writes to `A`, next `FOR_ALL` reads `A`             | **Yes**       | Data dependency between kernels                          |
+| `FOR_ALL` writes to `A`, next `FOR_ALL` writes/reads only `B` | **No**        | Independent data, no dependency                          |
+| `FOR_ALL` writes to `A`, then host code reads `A.host(i)`     | **Yes**       | Host reads device-written data                           |
+| `FOR_REDUCE_SUM(..., total)`, then host uses `total`          | **No**        | Reduction macros implicitly fence on the result variable |
+| `FOR_ALL` followed by `update_host()`                         | **Yes**       | Must ensure kernel completes before sync                 |
+| `FOR_ALL` followed by timing (`std::chrono`)                  | **Yes**       | Timer must see completed work                            |
+| Two `FOR_ALL` blocks, both only read (no writes)              | **No**        | Read-only kernels have no conflict                       |
+
 
 ```cpp
 // NO fence needed: independent data
@@ -95,86 +121,98 @@ MATAR data structures are organized along four axes:
 
 ### Axis 1: Memory Layout
 
-| Prefix | Layout | Index Convention | Best Loop Order |
-|--------|--------|-----------------|-----------------|
-| **C** | Row-major (C-style) | Last index varies fastest in memory | Outermost loop = first index |
-| **F** | Column-major (Fortran-style) | First index varies fastest in memory | Outermost loop = last index |
+
+| Prefix | Layout                       | Index Convention                     | Best Loop Order              |
+| ------ | ---------------------------- | ------------------------------------ | ---------------------------- |
+| **C**  | Row-major (C-style)          | Last index varies fastest in memory  | Outermost loop = first index |
+| **F**  | Column-major (Fortran-style) | First index varies fastest in memory | Outermost loop = last index  |
+
 
 **GPU coalescing rule:** CUDA and HIP achieve coalesced memory access when adjacent threads access adjacent memory addresses. In `FOR_ALL(i, 0, N, j, 0, M, {...})`, the **last** index (`j`) is mapped to thread IDs. Therefore:
 
-- **`CArrayDevice` (last index fastest):** `FOR_ALL(i, ..., j, ..., { a(i,j) })` gives coalesced access because adjacent threads (adjacent `j`) access adjacent memory locations.
-- **`FArrayDevice` (first index fastest):** The same `FOR_ALL` ordering does **not** give coalesced access because adjacent threads (adjacent `j`) hit non-adjacent memory.
+- `**CArrayDevice` (last index fastest):** `FOR_ALL(i, ..., j, ..., { a(i,j) })` gives coalesced access because adjacent threads (adjacent `j`) access adjacent memory locations.
+- `**FArrayDevice` (first index fastest):** The same `FOR_ALL` ordering does **not** give coalesced access because adjacent threads (adjacent `j`) hit non-adjacent memory.
 
 **Optimization rule for GPU targets:** Prefer `CArrayDevice` with loop nests ordered so the innermost `FOR_ALL` index corresponds to the last array dimension. When correctness of an initial translation from Fortran is the priority and the access pattern is complex, use `FArrayDevice` to preserve Fortran semantics and optimize later. The `FArray` types give a correct first-pass translation from Fortran (preserving the original memory layout and loop ordering), but they may not give optimal GPU performance without reordering.
 
 ### Axis 2: Index Base
 
-| Suffix | Index Base | Range |
-|--------|-----------|-------|
-| **Array** | 0-based | `[0, N)` |
-| **Matrix** | 1-based | `[1, N]` |
+
+| Suffix     | Index Base | Range    |
+| ---------- | ---------- | -------- |
+| **Array**  | 0-based    | `[0, N)` |
+| **Matrix** | 1-based    | `[1, N]` |
+
 
 ### Axis 3: Memory Residence
 
-| Suffix | Resides On | Usage |
-|--------|-----------|-------|
-| **Host** | CPU only | Serial code, I/O, initialization |
-| **Device** | Device only (GPU or CPU depending on build) | Parallel kernels |
-| **Dual** | Both host and device, with explicit sync | Mixed host/device workflows |
+
+| Suffix     | Resides On                                  | Usage                            |
+| ---------- | ------------------------------------------- | -------------------------------- |
+| **Host**   | CPU only                                    | Serial code, I/O, initialization |
+| **Device** | Device only (GPU or CPU depending on build) | Parallel kernels                 |
+| **Dual**   | Both host and device, with explicit sync    | Mixed host/device workflows      |
+
 
 ### Axis 4: Ownership
 
-| Prefix | Owns Memory? | Construction |
-|--------|-------------|--------------|
-| (none) | Yes — allocates and manages storage | `CArrayDevice<int>(10, 10)` |
-| **View** | No — wraps an existing pointer | `ViewCArrayDevice<int>(ptr, 10, 10)` |
+
+| Prefix   | Owns Memory?                        | Construction                         |
+| -------- | ----------------------------------- | ------------------------------------ |
+| (none)   | Yes — allocates and manages storage | `CArrayDevice<int>(10, 10)`          |
+| **View** | No — wraps an existing pointer      | `ViewCArrayDevice<int>(ptr, 10, 10)` |
+
 
 ### Complete Naming Map (via `mtr::` aliases)
 
-| Alias (preferred) | Underlying Class | Layout | Index | Residence |
-|---|---|---|---|---|
-| `CArrayHost<T>` | `CArray<T>` | Row-major | 0-based | Host |
-| `CMatrixHost<T>` | `CMatrix<T>` | Row-major | 1-based | Host |
-| `FArrayHost<T>` | `FArray<T>` | Column-major | 0-based | Host |
-| `FMatrixHost<T>` | `FMatrix<T>` | Column-major | 1-based | Host |
-| `ViewCArrayHost<T>` | `ViewCArray<T>` | Row-major | 0-based | Host (non-owning) |
-| `ViewCMatrixHost<T>` | `ViewCMatrix<T>` | Row-major | 1-based | Host (non-owning) |
-| `ViewFArrayHost<T>` | `ViewFArray<T>` | Column-major | 0-based | Host (non-owning) |
-| `ViewFMatrixHost<T>` | `ViewFMatrix<T>` | Column-major | 1-based | Host (non-owning) |
-| `CArrayDevice<T>` | `CArrayKokkos<T>` | Row-major | 0-based | Device |
-| `CMatrixDevice<T>` | `CMatrixKokkos<T>` | Row-major | 1-based | Device |
-| `FArrayDevice<T>` | `FArrayKokkos<T>` | Column-major | 0-based | Device |
-| `FMatrixDevice<T>` | `FMatrixKokkos<T>` | Column-major | 1-based | Device |
-| `ViewCArrayDevice<T>` | `ViewCArrayKokkos<T>` | Row-major | 0-based | Device (non-owning) |
-| `ViewCMatrixDevice<T>` | `ViewCMatrixKokkos<T>` | Row-major | 1-based | Device (non-owning) |
-| `ViewFArrayDevice<T>` | `ViewFArrayKokkos<T>` | Column-major | 0-based | Device (non-owning) |
-| `ViewFMatrixDevice<T>` | `ViewFMatrixKokkos<T>` | Column-major | 1-based | Device (non-owning) |
-| `CArrayDual<T>` | `DCArrayKokkos<T>` | Row-major | 0-based | Dual |
-| `CMatrixDual<T>` | `DCMatrixKokkos<T>` | Row-major | 1-based | Dual |
-| `FArrayDual<T>` | `DFArrayKokkos<T>` | Column-major | 0-based | Dual |
-| `FMatrixDual<T>` | `DFMatrixKokkos<T>` | Column-major | 1-based | Dual |
-| `ViewCArrayDual<T>` | `DViewCArrayKokkos<T>` | Row-major | 0-based | Dual (non-owning) |
-| `ViewCMatrixDual<T>` | `DViewCMatrixKokkos<T>` | Row-major | 1-based | Dual (non-owning) |
-| `ViewFArrayDual<T>` | `DViewFArrayKokkos<T>` | Column-major | 0-based | Dual (non-owning) |
-| `ViewFMatrixDual<T>` | `DViewFMatrixKokkos<T>` | Column-major | 1-based | Dual (non-owning) |
+
+| Alias (preferred)      | Underlying Class        | Layout       | Index   | Residence           |
+| ---------------------- | ----------------------- | ------------ | ------- | ------------------- |
+| `CArrayHost<T>`        | `CArray<T>`             | Row-major    | 0-based | Host                |
+| `CMatrixHost<T>`       | `CMatrix<T>`            | Row-major    | 1-based | Host                |
+| `FArrayHost<T>`        | `FArray<T>`             | Column-major | 0-based | Host                |
+| `FMatrixHost<T>`       | `FMatrix<T>`            | Column-major | 1-based | Host                |
+| `ViewCArrayHost<T>`    | `ViewCArray<T>`         | Row-major    | 0-based | Host (non-owning)   |
+| `ViewCMatrixHost<T>`   | `ViewCMatrix<T>`        | Row-major    | 1-based | Host (non-owning)   |
+| `ViewFArrayHost<T>`    | `ViewFArray<T>`         | Column-major | 0-based | Host (non-owning)   |
+| `ViewFMatrixHost<T>`   | `ViewFMatrix<T>`        | Column-major | 1-based | Host (non-owning)   |
+| `CArrayDevice<T>`      | `CArrayKokkos<T>`       | Row-major    | 0-based | Device              |
+| `CMatrixDevice<T>`     | `CMatrixKokkos<T>`      | Row-major    | 1-based | Device              |
+| `FArrayDevice<T>`      | `FArrayKokkos<T>`       | Column-major | 0-based | Device              |
+| `FMatrixDevice<T>`     | `FMatrixKokkos<T>`      | Column-major | 1-based | Device              |
+| `ViewCArrayDevice<T>`  | `ViewCArrayKokkos<T>`   | Row-major    | 0-based | Device (non-owning) |
+| `ViewCMatrixDevice<T>` | `ViewCMatrixKokkos<T>`  | Row-major    | 1-based | Device (non-owning) |
+| `ViewFArrayDevice<T>`  | `ViewFArrayKokkos<T>`   | Column-major | 0-based | Device (non-owning) |
+| `ViewFMatrixDevice<T>` | `ViewFMatrixKokkos<T>`  | Column-major | 1-based | Device (non-owning) |
+| `CArrayDual<T>`        | `DCArrayKokkos<T>`      | Row-major    | 0-based | Dual                |
+| `CMatrixDual<T>`       | `DCMatrixKokkos<T>`     | Row-major    | 1-based | Dual                |
+| `FArrayDual<T>`        | `DFArrayKokkos<T>`      | Column-major | 0-based | Dual                |
+| `FMatrixDual<T>`       | `DFMatrixKokkos<T>`     | Column-major | 1-based | Dual                |
+| `ViewCArrayDual<T>`    | `DViewCArrayKokkos<T>`  | Row-major    | 0-based | Dual (non-owning)   |
+| `ViewCMatrixDual<T>`   | `DViewCMatrixKokkos<T>` | Row-major    | 1-based | Dual (non-owning)   |
+| `ViewFArrayDual<T>`    | `DViewFArrayKokkos<T>`  | Column-major | 0-based | Dual (non-owning)   |
+| `ViewFMatrixDual<T>`   | `DViewFMatrixKokkos<T>` | Column-major | 1-based | Dual (non-owning)   |
+
 
 ### Sparse/Ragged Type Aliases
 
-| Alias | Underlying Class | Description |
-|---|---|---|
-| `RaggedCArrayHost<T>` | `RaggedRightArray<T>` | Variable-length rows, host |
-| `RaggedFArrayHost<T>` | `RaggedDownArray<T>` | Variable-length columns, host |
-| `DynamicRaggedCArrayHost<T>` | `DynamicRaggedRightArray<T>` | Dynamic row lengths, host |
-| `DynamicRaggedFArrayHost<T>` | `DynamicRaggedDownArray<T>` | Dynamic column lengths, host |
-| `CSRArrayHost<T>` | `CSRArray<T>` | Compressed Sparse Row, host |
-| `CSCArrayHost<T>` | `CSCArray<T>` | Compressed Sparse Column, host |
-| `RaggedCArrayDevice<T>` | `RaggedRightArrayKokkos<T>` | Variable-length rows, device |
-| `RaggedCArrayDual<T>` | `DRaggedRightArrayKokkos<T>` | Variable-length rows, dual |
-| `RaggedFArrayDevice<T>` | `RaggedDownArrayKokkos<T>` | Variable-length columns, device |
-| `DynamicRaggedCArrayDevice<T>` | `DynamicRaggedRightArrayKokkos<T>` | Dynamic rows, device |
-| `DynamicRaggedFArrayDevice<T>` | `DynamicRaggedDownArrayKokkos<T>` | Dynamic columns, device |
-| `CSRArrayDevice<T>` | `CSRArrayKokkos<T>` | Compressed Sparse Row, device |
-| `CSCArrayDevice<T>` | `CSCArrayKokkos<T>` | Compressed Sparse Column, device |
+
+| Alias                          | Underlying Class                   | Description                      |
+| ------------------------------ | ---------------------------------- | -------------------------------- |
+| `RaggedCArrayHost<T>`          | `RaggedRightArray<T>`              | Variable-length rows, host       |
+| `RaggedFArrayHost<T>`          | `RaggedDownArray<T>`               | Variable-length columns, host    |
+| `DynamicRaggedCArrayHost<T>`   | `DynamicRaggedRightArray<T>`       | Dynamic row lengths, host        |
+| `DynamicRaggedFArrayHost<T>`   | `DynamicRaggedDownArray<T>`        | Dynamic column lengths, host     |
+| `CSRArrayHost<T>`              | `CSRArray<T>`                      | Compressed Sparse Row, host      |
+| `CSCArrayHost<T>`              | `CSCArray<T>`                      | Compressed Sparse Column, host   |
+| `RaggedCArrayDevice<T>`        | `RaggedRightArrayKokkos<T>`        | Variable-length rows, device     |
+| `RaggedCArrayDual<T>`          | `DRaggedRightArrayKokkos<T>`       | Variable-length rows, dual       |
+| `RaggedFArrayDevice<T>`        | `RaggedDownArrayKokkos<T>`         | Variable-length columns, device  |
+| `DynamicRaggedCArrayDevice<T>` | `DynamicRaggedRightArrayKokkos<T>` | Dynamic rows, device             |
+| `DynamicRaggedFArrayDevice<T>` | `DynamicRaggedDownArrayKokkos<T>`  | Dynamic columns, device          |
+| `CSRArrayDevice<T>`            | `CSRArrayKokkos<T>`                | Compressed Sparse Row, device    |
+| `CSCArrayDevice<T>`            | `CSCArrayKokkos<T>`                | Compressed Sparse Column, device |
+
 
 ---
 
@@ -191,12 +229,12 @@ CArrayDevice<double> b(10, 20);              // 2D, 10x20
 CArrayDevice<double> c(10, 20, 30);          // 3D
 CArrayDual<double> d(10, 20, "my_label");    // Dual with optional debug label
 
-// View types — wrap existing pointer, no allocation
-ViewCArrayDevice<int> v(ptr, 10, 20);        // Wraps raw pointer with 2D shape
+// View types — wrap existing data, no allocation
+ViewCArrayDevice<int> v(arr[0], 10, 20);        // Wraps raw data with 2D shape
 
 // Host-only types
 CArray<double> h(100, 100);                  // Host-only, owning
-ViewCArray<double> hv(raw_ptr, 100, 100);    // Host-only, non-owning
+ViewCArray<double> hv(h(0,0), 100, 100);     // Host-only, non-owning
 ```
 
 ### Element Access
@@ -233,7 +271,7 @@ View types can wrap a pointer into the **middle** of an existing array, reinterp
 CArrayDual<double> B(num_blocks, rows, cols);
 
 // Create a 2D view into block 1: starts at &B(1, 0, 0), shape is (rows, cols)
-ViewCArray<double> block1(B.host_pointer() + 1 * rows * cols, rows, cols);
+ViewCArray<double> block1(&B(1,0,0), rows, cols);
 
 // block1(i, j) now aliases B(1, i, j) — no data copy
 for (int i = 0; i < rows; i++)
@@ -241,9 +279,11 @@ for (int i = 0; i < rows; i++)
         block1(i, j) = some_value;
 
 // Device-side slicing for passing subarrays to functions
-ViewCArrayDevice<double> dev_slice(A.pointer() + offset, slice_len);
-FOR_ALL(i, 0, slice_len, {
-    dev_slice(i) *= 2.0;
+CArrayDual<double> B(num_blocks, rows, cols);
+FOR_ALL(i, 0, rows,
+        j, 0, cols {
+    ViewCArrayDevice<double> B_slice(&B(1, 0, 0), rows, cols);
+    B_slice(i,j) = (double)i + (double)j;
 });
 ```
 
@@ -251,20 +291,22 @@ This pattern naturally handles legacy codes that pass array slices to subroutine
 
 ### Common Methods
 
-| Method | Description |
-|--------|-------------|
-| `size()` | Total number of elements |
-| `extent()` | Same as `size()` |
-| `dims(n)` | Size along dimension `n` (0-indexed for Arrays, 1-indexed for Matrices) |
-| `order()` | Number of dimensions |
-| `pointer()` | Raw pointer to underlying data. Available on Device and Host types only. **Dual types do NOT have `pointer()`** — use `device_pointer()` or `host_pointer()` instead. |
-| `device_pointer()` | Raw pointer to device-side memory (Dual types only) |
-| `host_pointer()` | Raw pointer to host-side memory (Dual types only) |
-| `set_values(val)` | Set all elements to `val` via `parallel_for` on the device. **Does NOT fence** — follow with `MATAR_FENCE()` if host code depends on the result, or `update_host()` if you need the values on the host. |
-| `update_host()` | Copy device data to host (Dual types only) |
-| `update_device()` | Copy host data to device (Dual types only) |
-| `get_kokkos_view()` | Access the underlying `Kokkos::View` (Device types) |
-| `get_kokkos_dual_view()` | Access the underlying `Kokkos::DualView` (Dual types) |
+
+| Method                   | Description                                                                                                                                                                                             |
+| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `size()`                 | Total number of elements                                                                                                                                                                                |
+| `extent()`               | Same as `size()`                                                                                                                                                                                        |
+| `dims(n)`                | Size along dimension `n` (0-indexed for Arrays, 1-indexed for Matrices)                                                                                                                                 |
+| `order()`                | Number of dimensions                                                                                                                                                                                    |
+| `pointer()`              | Raw pointer to underlying data. Available on Device and Host types only. **Dual types do NOT have `pointer()`** — use `device_pointer()` or `host_pointer()` instead.                                   |
+| `device_pointer()`       | Raw pointer to device-side memory (Dual types only)                                                                                                                                                     |
+| `host_pointer()`         | Raw pointer to host-side memory (Dual types only)                                                                                                                                                       |
+| `set_values(val)`        | Set all elements to `val` via `parallel_for` on the device. **Does NOT fence** — follow with `MATAR_FENCE()` if host code depends on the result, or `update_host()` if you need the values on the host. |
+| `update_host()`          | Copy device data to host (Dual types only)                                                                                                                                                              |
+| `update_device()`        | Copy host data to device (Dual types only)                                                                                                                                                              |
+| `get_kokkos_view()`      | Access the underlying `Kokkos::View` (Device types)                                                                                                                                                     |
+| `get_kokkos_dual_view()` | Access the underlying `Kokkos::DualView` (Dual types)                                                                                                                                                   |
+
 
 ---
 
@@ -348,15 +390,17 @@ RUN({
 
 The macros use variadic dispatch based on argument count. This table shows the exact signatures:
 
-| Macro | 1D args (4) | 2D args (7) | 3D args (10) |
-|-------|------------|------------|-------------|
-| `FOR_ALL` | `(i, lo, hi, {body})` | `(i, lo, hi, j, lo, hi, {body})` | `(i, lo, hi, j, lo, hi, k, lo, hi, {body})` |
-| `DO_ALL` | same pattern | same pattern | same pattern |
-| `FOR_REDUCE_SUM` | `(i, lo, hi, var, {body}, result)` (6 args) | `(i, lo, hi, j, lo, hi, var, {body}, result)` (9 args) | `(i, lo, hi, j, lo, hi, k, lo, hi, var, {body}, result)` (12 args) |
-| `FOR_REDUCE_MAX` | same as SUM | same as SUM | same as SUM |
-| `FOR_REDUCE_MIN` | same as SUM | same as SUM | same as SUM |
+
+| Macro                | 1D args (4)                                 | 2D args (7)                                            | 3D args (10)                                                       |
+| -------------------- | ------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------ |
+| `FOR_ALL`            | `(i, lo, hi, {body})`                       | `(i, lo, hi, j, lo, hi, {body})`                       | `(i, lo, hi, j, lo, hi, k, lo, hi, {body})`                        |
+| `DO_ALL`             | same pattern                                | same pattern                                           | same pattern                                                       |
+| `FOR_REDUCE_SUM`     | `(i, lo, hi, var, {body}, result)` (6 args) | `(i, lo, hi, j, lo, hi, var, {body}, result)` (9 args) | `(i, lo, hi, j, lo, hi, k, lo, hi, var, {body}, result)` (12 args) |
+| `FOR_REDUCE_MAX`     | same as SUM                                 | same as SUM                                            | same as SUM                                                        |
+| `FOR_REDUCE_MIN`     | same as SUM                                 | same as SUM                                            | same as SUM                                                        |
 | `FOR_REDUCE_PRODUCT` | `(i, lo, hi, var, {body}, result)` (6 args) | `(i, lo, hi, j, lo, hi, var, {body}, result)` (9 args) | `(i, lo, hi, j, lo, hi, k, lo, hi, var, {body}, result)` (12 args) |
 
+
 The `DO_REDUCE_*` variants follow the same argument patterns as `FOR_REDUCE_*` but with inclusive upper bounds **for SUM/MAX/MIN**. There is no `DO_REDUCE_PRODUCT` macro in the current `macros.h`.
 
 ### `FOR_ALL` — C-style half-open range `[start, end)`
@@ -403,6 +447,7 @@ FOR_ALL(i, 0, N, j, 0, M, {
 For finer-grained control over nested parallelism (e.g., tiled algorithms), use the hierarchical `FOR_FIRST`/`FOR_SECOND`/`FOR_THIRD` macros described in Section 7.
 
 **Loop ordering:** The inner loop varies the fastest and the outer loop varies the slowest. For optimal cache performance, match the loop index order to the data type's memory layout:
+
 - `CArray` / `CArrayDevice`: last index fastest → `FOR_ALL(i, ..., j, ..., { a(i,j) })` is correct
 - `FArray` / `FArrayDevice`: first index fastest → `FOR_ALL(i, ..., j, ..., { a(j,i) })` maps the fast-varying loop index to the fast-varying array index
 
@@ -427,9 +472,10 @@ DO_ALL(j, 1, M,
 ```
 
 **When to use which:**
+
 - `FOR_ALL` with `CArray`/`CArrayDevice`: C-style code, 0-based indexing, `[0, N)`
 - `DO_ALL` with `FMatrix`/`FMatrixDevice`: Fortran-style code, 1-based indexing, `[1, N]`
-- The same `FOR_ALL` vs `DO_ALL` distinction applies to all reduction variants (`FOR_REDUCE_*` vs `DO_REDUCE_*`)
+- The same `FOR_ALL` vs `DO_ALL` distinction applies to all reduction variants (`FOR_REDUCE_`* vs `DO_REDUCE_*`)
 
 ### `RUN` — Execute Once on Device
 
@@ -592,6 +638,7 @@ FOR_FIRST(i, 0, N, {
 ```
 
 Available hierarchical reductions:
+
 - `FOR_REDUCE_SUM_SECOND`, `FOR_REDUCE_SUM_THIRD`
 - `DO_REDUCE_SUM_SECOND`, `DO_REDUCE_SUM_THIRD`
 - `FOR_REDUCE_MAX_SECOND`, `DO_REDUCE_MAX_THIRD`
@@ -626,6 +673,7 @@ class MySimulation {
 ```
 
 Available `_CLASS` variants:
+
 - `FOR_ALL_CLASS`
 - `FOR_REDUCE_SUM_CLASS`
 - `FOR_REDUCE_MAX_CLASS`
@@ -667,6 +715,7 @@ for (int i = 0; i < N; i++)
 ```
 
 **Key rules:**
+
 - Inside `FOR_ALL`/`RUN` blocks: use `data(i, j)` — this is the device accessor.
 - Outside parallel regions on the CPU: use `data.host(i, j)`.
 - Call `update_device()` after host writes, before device reads.
@@ -679,36 +728,42 @@ for (int i = 0; i < N; i++)
 
 ### Step 1: Replace Data Structures
 
-| Original C++ | MATAR Replacement | When to Use |
-|---|---|---|
-| `double a[N]` | `CArrayDevice<double> a(N)` | Device-only computation |
-| `double a[N][M]` | `CArrayDevice<double> a(N, M)` | Device-only computation |
-| `double a[N][M][P]` | `CArrayDevice<double> a(N, M, P)` | Device-only computation |
-| `double a[N]` | `CArrayDual<double> a(N)` | Need host and device access |
-| `double* a = new double[N]` | `CArrayDevice<double> a(N)` | Dynamic allocation → MATAR |
-| `std::vector<double> a(N)` | `CArrayDual<double> a(N)` | Need both host and device |
-| `double a[N]` (used as function arg) | `ViewCArrayDevice<double> a(ptr, N)` | Wrap existing memory |
-| Fortran-style 1-based `A(i,j)` | `FMatrixDevice<double> A(N, M)` | Fortran interop |
+
+| Original C++                         | MATAR Replacement                    | When to Use                 |
+| ------------------------------------ | ------------------------------------ | --------------------------- |
+| `double a[N]`                        | `CArrayDevice<double> a(N)`          | Device-only computation     |
+| `double a[N][M]`                     | `CArrayDevice<double> a(N, M)`       | Device-only computation     |
+| `double a[N][M][P]`                  | `CArrayDevice<double> a(N, M, P)`    | Device-only computation     |
+| `double a[N]`                        | `CArrayDual<double> a(N)`            | Need host and device access |
+| `double* a = new double[N]`          | `CArrayDevice<double> a(N)`          | Dynamic allocation → MATAR  |
+| `std::vector<double> a(N)`           | `CArrayDual<double> a(N)`            | Need both host and device   |
+| `double a[N]` (used as function arg) | `ViewCArrayDevice<double> a(ptr, N)` | Wrap existing memory        |
+| Fortran-style 1-based `A(i,j)`       | `FMatrixDevice<double> A(N, M)`      | Fortran interop             |
+
 
 ### Step 2: Replace Index Access
 
-| Original | MATAR |
-|----------|-------|
-| `a[i]` | `a(i)` |
-| `a[i][j]` | `a(i, j)` |
-| `a[i][j][k]` | `a(i, j, k)` |
-| Pointer arithmetic `*(a + i*M + j)` | `a(i, j)` |
+
+| Original                            | MATAR        |
+| ----------------------------------- | ------------ |
+| `a[i]`                              | `a(i)`       |
+| `a[i][j]`                           | `a(i, j)`    |
+| `a[i][j][k]`                        | `a(i, j, k)` |
+| Pointer arithmetic `*(a + i*M + j)` | `a(i, j)`    |
+
 
 ### Step 3: Replace Loops with Macros
 
-| Original | MATAR | Notes |
-|----------|-------|-------|
-| `for (int i = 0; i < N; i++) { ... }` | `FOR_ALL(i, 0, N, { ... });` | Half-open range |
-| `for (int i = 0; i < N; i++) for (int j = 0; j < M; j++) { ... }` | `FOR_ALL(i, 0, N, j, 0, M, { ... });` | 2D parallel |
-| `for (int i = 1; i <= N; i++) { ... }` | `DO_ALL(i, 1, N, { ... });` | Inclusive range |
-| Serial `for` with accumulator `sum += ...` | `FOR_REDUCE_SUM(i, 0, N, loc, { loc += ...; }, sum);` | Parallel reduction |
-| Serial `for` tracking max | `FOR_REDUCE_MAX(i, 0, N, loc, { if(x > loc) loc = x; }, max_val);` | Max reduction |
-| Serial `for` tracking min | `FOR_REDUCE_MIN(i, 0, N, loc, { if(x < loc) loc = x; }, min_val);` | Min reduction |
+
+| Original                                                          | MATAR                                                              | Notes              |
+| ----------------------------------------------------------------- | ------------------------------------------------------------------ | ------------------ |
+| `for (int i = 0; i < N; i++) { ... }`                             | `FOR_ALL(i, 0, N, { ... });`                                       | Half-open range    |
+| `for (int i = 0; i < N; i++) for (int j = 0; j < M; j++) { ... }` | `FOR_ALL(i, 0, N, j, 0, M, { ... });`                              | 2D parallel        |
+| `for (int i = 1; i <= N; i++) { ... }`                            | `DO_ALL(i, 1, N, { ... });`                                        | Inclusive range    |
+| Serial `for` with accumulator `sum += ...`                        | `FOR_REDUCE_SUM(i, 0, N, loc, { loc += ...; }, sum);`              | Parallel reduction |
+| Serial `for` tracking max                                         | `FOR_REDUCE_MAX(i, 0, N, loc, { if(x > loc) loc = x; }, max_val);` | Max reduction      |
+| Serial `for` tracking min                                         | `FOR_REDUCE_MIN(i, 0, N, loc, { if(x < loc) loc = x; }, min_val);` | Min reduction      |
+
 
 ### Step 4: Handle Dependencies and Synchronization
 
@@ -722,17 +777,22 @@ When multiple parallel iterations write to the same output location:
 
 ```cpp
 // WRONG: race condition — multiple (i,j,k) combos update the same C(i,j)
-FOR_ALL(i, 0, N, j, 0, N, k, 0, N, {
+FOR_ALL(i, 0, N, 
+        j, 0, N, 
+        k, 0, N, {
     C(i, j) += A(i, k) * B(k, j);
 });
 
 // FIX OPTION 1: Atomics
-FOR_ALL(i, 0, N, j, 0, N, k, 0, N, {
+FOR_ALL(i, 0, N, 
+        j, 0, N, 
+        k, 0, N, {
     Kokkos::atomic_add(&C(i, j), A(i, k) * B(k, j));
 });
 
 // FIX OPTION 2: Inner serial loop (preferred for matmul)
-FOR_ALL(i, 0, N, j, 0, N, {
+FOR_ALL(i, 0, N, 
+        j, 0, N, {
     double local_sum = 0.0;
     for (int k = 0; k < N; k++) {
         local_sum += A(i, k) * B(k, j);
@@ -787,11 +847,12 @@ MATAR_INITIALIZE(argc, argv);
     CArrayDual<double> temperature(height + 2, width + 2);
     CArrayDual<double> temperature_previous(height + 2, width + 2);
 
-    // Initialize
+    // Initialize on device
     temperature_previous.set_values(0.0);
     FOR_ALL(i, 0, height + 2, {
         temperature_previous(i, width + 1) = (1000.0 / height) * i;
     });
+    MATAR_FENCE();
     temperature_previous.update_host();
 
     while (worst_dt > tolerance) {
@@ -845,7 +906,7 @@ for (int i = 0; i < 10; i++) sum += A[i] * A[i];
 ```cpp
 CArrayDevice<int> A(10);
 FOR_ALL(i, 0, 10, {
-    A(i) = 314;
+    A(i) = 314; // OR, initialize using A.set_values(314);
 });
 
 int loc_sum = 0, sum = 0;
@@ -870,8 +931,13 @@ for (int i = 0; i < N; i++)
 **After:** `PSEUDOCODE_PATTERN`
 
 ```cpp
-CArrayDevice<double> A(N, N), B(N, N), C(N, N);
+CArrayDevice<double> A(N, N);
+CArrayDevice<double> B(N, N);
+CArrayDevice<double> C(N, N);
 // ... initialize A, B ...
+A.set_values(1.0);
+B.set_values(1.0);
+C.set_values(0.0);
 
 FOR_ALL(i, 0, N, j, 0, N, {
     double local_sum = 0.0;
@@ -906,7 +972,8 @@ for (int i = 0; i < 3; i++)
 ViewCArrayDual<int> view(&some_array[0], 3, 3);
 
 // Now usable in parallel regions on device
-FOR_ALL(i, 0, 3, j, 0, 3, {
+FOR_ALL(i, 0, 3, 
+        j, 0, 3, {
     view(i, j) *= 2;
 });
 MATAR_FENCE();
@@ -941,12 +1008,14 @@ FOR_ALL(i, 0, num_shapes, {
 
 ```cpp
 // BAD: inner loop on i with CArray (row-major) — poor cache behavior
-FOR_ALL(j, 0, M, i, 0, N, {
+FOR_ALL(j, 0, M, 
+        i, 0, N, {
     carr(i, j) = ...;  // i changes in inner loop but is the slow index in CArray
 });
 
 // GOOD: inner loop on j matches CArray row-major layout
-FOR_ALL(i, 0, N, j, 0, M, {
+FOR_ALL(i, 0, N, 
+        j, 0, M, {
     carr(i, j) = ...;  // j changes in inner loop = fast index = good cache use
 });
 ```
@@ -955,23 +1024,35 @@ FOR_ALL(i, 0, N, j, 0, M, {
 
 ```cpp
 // BAD: immediately reading on host without fence
-FOR_ALL(i, 0, N, { a(i) = compute(i); });
-double val = a.host(0);  // may not reflect device computation
+FOR_ALL(i, 0, N, { 
+    a(i) = compute(i); 
+});
+double val = a.host(0);  // will not reflect device computation without a fence and a.update_host()
 
 // GOOD: fence then sync to host
-FOR_ALL(i, 0, N, { a(i) = compute(i); });
+FOR_ALL(i, 0, N, { 
+    a(i) = compute(i); 
+});
 MATAR_FENCE();
 a.update_host();
 double val = a.host(0);  // correct
 
 // BAD: unnecessary fence between independent kernels (kills performance)
-FOR_ALL(i, 0, N, { A(i) = i; });
+FOR_ALL(i, 0, N, { 
+    A(i) = i;
+});
 MATAR_FENCE();  // wasteful — B doesn't depend on A
-FOR_ALL(i, 0, N, { B(i) = i * 2; });
+FOR_ALL(i, 0, N, { 
+    B(i) = i * 2; 
+});
 
 // GOOD: no fence needed for independent data
-FOR_ALL(i, 0, N, { A(i) = i; });
-FOR_ALL(i, 0, N, { B(i) = i * 2; });
+FOR_ALL(i, 0, N, { 
+    A(i) = i; 
+});
+FOR_ALL(i, 0, N, { 
+    B(i) = i * 2; 
+});
 
 // NOTE: reduction results are immediately available (implicit fence)
 double loc = 0.0, total = 0.0;
@@ -987,7 +1068,7 @@ See the [Fence Placement Rules](#fence-placement-rules) in Section 1 for the com
 CArrayDual<double> data(N);
 for (int i = 0; i < N; i++) data.host(i) = i;  // writes to host copy
 
-// BAD: device copy is stale
+// BAD: device copy is stale (not copied to device)
 FOR_ALL(i, 0, N, { data(i) *= 2.0; });  // reads stale device data
 
 // GOOD:
@@ -1105,21 +1186,18 @@ using u_int  = unsigned int;
 Use this decision tree:
 
 1. **Do you need host and device access?**
-   - Yes → Use a **Dual** type
-   - No, device only → Use a **Device** type
-   - No, host only → Use a **Host** type
-
+  - Yes → Use a **Dual** type
+  - No, device only → Use a **Device** type
+  - No, host only → Use a **Host** type
 2. **Do you own the memory?**
-   - Yes → Use an owning type (no `View` prefix)
-   - No, wrapping existing pointer → Use a **View** type
-
+  - Yes → Use an owning type (no `View` prefix)
+  - No, wrapping existing pointer → Use a **View** type
 3. **What index convention?**
-   - 0-based `[0, N)` → Use an **Array** type
-   - 1-based `[1, N]` → Use a **Matrix** type
-
+  - 0-based `[0, N)` → Use an **Array** type
+  - 1-based `[1, N]` → Use a **Matrix** type
 4. **What memory layout?**
-   - Row-major (C-style, last index fastest) → Use a **C** prefix
-   - Column-major (Fortran-style, first index fastest) → Use an **F** prefix
+  - Row-major (C-style, last index fastest) → Use a **C** prefix
+  - Column-major (Fortran-style, first index fastest) → Use an **F** prefix
 
 **Example decision:** "I need a 2D row-major 0-based array that lives on both host and device"
 → `CArrayDual<double> a(N, M);`
@@ -1135,16 +1213,18 @@ Code inside `FOR_ALL`, `DO_ALL`, `RUN`, and reduction macros runs on the device
 
 ### What You Cannot Do Inside Device Kernels
 
-| Forbidden | Why | Alternative |
-|-----------|-----|-------------|
-| `std::cout`, `std::cerr` | Not available on GPU | Use `printf` |
-| `new` / `delete` / `malloc` / `free` | Host memory allocators | Use `Kokkos::kokkos_malloc` / placement `new` |
-| `std::vector`, `std::map`, etc. | STL containers are host-only | Use MATAR arrays |
-| `throw` / `try` / `catch` | Exceptions not supported on GPU | Use error codes or assertions |
-| File I/O (`fopen`, `fstream`) | No filesystem on GPU | Do I/O on host, sync with `update_host()` |
-| Virtual function calls | Only work via placement `new` on device memory | See Example E in Section 11 |
-| `std::string` | Host-only | Use `const char*` or integer codes |
-| Recursive function calls | Limited/unsupported on some GPU architectures | Rewrite iteratively |
+
+| Forbidden                            | Why                                            | Alternative                                   |
+| ------------------------------------ | ---------------------------------------------- | --------------------------------------------- |
+| `std::cout`, `std::cerr`             | Not available on GPU                           | Use `printf`                                  |
+| `new` / `delete` / `malloc` / `free` | Host memory allocators                         | Use `Kokkos::kokkos_malloc` / placement `new` |
+| `std::vector`, `std::map`, etc.      | STL containers are host-only                   | Use MATAR arrays                              |
+| `throw` / `try` / `catch`            | Exceptions not supported on GPU                | Use error codes or assertions                 |
+| File I/O (`fopen`, `fstream`)        | No filesystem on GPU                           | Do I/O on host, sync with `update_host()`     |
+| Virtual function calls               | Only work via placement `new` on device memory | See Example E in Section 11                   |
+| `std::string`                        | Host-only                                      | Use `const char`* or integer codes            |
+| Recursive function calls             | Limited/unsupported on some GPU architectures  | Rewrite iteratively                           |
+
 
 ### Lambda Capture Rules
 
@@ -1160,6 +1240,7 @@ FOR_ALL(i, 0, N, {
 ```
 
 You **cannot** capture:
+
 - Host-only objects (`std::vector`, `std::string`, etc.)
 - References to local variables that won't exist on the device
 - Large stack objects (copies are expensive)
@@ -1247,6 +1328,7 @@ end program
 ```
 
 Key rules:
+
 - Use `ViewFMatrixDual` to wrap Fortran arrays (column-major, 1-based).
 - Use `DO_ALL` with 1-based ranges for natural Fortran indexing.
 - C++ functions callable from Fortran must use `extern "C"` and trailing `_` in the name.
@@ -1265,15 +1347,17 @@ target_link_libraries(my_target matar)
 
 ### Preprocessor Defines
 
-| Define | Effect |
-|--------|--------|
-| `HAVE_KOKKOS` | Enables Kokkos-backed device/dual types and parallel macros |
-| `HAVE_CUDA` | Kokkos execution space = CUDA; layout defaults to `LayoutLeft` |
-| `HAVE_HIP` | Kokkos execution space = HIP |
-| `HAVE_OPENMP` | Kokkos execution space = OpenMP; layout = `LayoutRight` |
-| `HAVE_THREADS` | Kokkos execution space = Threads |
-| `HAVE_MPI` | Enables MPI types (`MPICArrayKokkos`, `PartitionMap`, etc.) |
-| `TRILINOS_INTERFACE` | Enables Tpetra wrapper types |
+
+| Define               | Effect                                                         |
+| -------------------- | -------------------------------------------------------------- |
+| `HAVE_KOKKOS`        | Enables Kokkos-backed device/dual types and parallel macros    |
+| `HAVE_CUDA`          | Kokkos execution space = CUDA; layout defaults to `LayoutLeft` |
+| `HAVE_HIP`           | Kokkos execution space = HIP                                   |
+| `HAVE_OPENMP`        | Kokkos execution space = OpenMP; layout = `LayoutRight`        |
+| `HAVE_THREADS`       | Kokkos execution space = Threads                               |
+| `HAVE_MPI`           | Enables MPI types (`MPICArrayKokkos`, `PartitionMap`, etc.)    |
+| `TRILINOS_INTERFACE` | Enables Tpetra wrapper types                                   |
+
 
 When no GPU backend is specified with Kokkos, the default execution space and layout from Kokkos are used.
 
@@ -1302,19 +1386,22 @@ MATAR provides MPI-aware data types and a communication plan abstraction for dis
 
 ### When to Use MPICArrayKokkos vs. Plain DCArrayKokkos
 
-| Type | Use Case |
-|------|----------|
-| `DCArrayKokkos<T>` (`CArrayDual<T>`) | Single-rank data. No MPI communication needed. Host/device sync only. |
-| `MPICArrayKokkos<T>` | Distributed data with ghost/halo regions. Wraps `DCArrayKokkos` internally and adds MPI send/recv buffer management, a `CommunicationPlan`, and a `communicate()` method that handles the full pack → exchange → unpack cycle. |
+
+| Type                                 | Use Case                                                                                                                                                                                                                       |
+| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `DCArrayKokkos<T>` (`CArrayDual<T>`) | Single-rank data. No MPI communication needed. Host/device sync only.                                                                                                                                                          |
+| `MPICArrayKokkos<T>`                 | Distributed data with ghost/halo regions. Wraps `DCArrayKokkos` internally and adds MPI send/recv buffer management, a `CommunicationPlan`, and a `communicate()` method that handles the full pack → exchange → unpack cycle. |
+
 
 Use `MPICArrayKokkos` when a data array is partitioned across MPI ranks and neighboring ranks need to exchange boundary (ghost/halo) data. Use plain `DCArrayKokkos` for rank-local data that never crosses MPI boundaries.
 
 ### MPICArrayKokkos Overview
 
 `MPICArrayKokkos<T>` is a template class that wraps a `DCArrayKokkos<T>` with:
+
 - **Send and receive buffers** (`DCArrayKokkos<T>`) for packing/unpacking halo data
 - A **stride** value computed from trailing dimensions (for multi-dimensional arrays, each first-index element is a contiguous block of `dim1 * dim2 * ... * dimN` values)
-- A pointer to a shared **`CommunicationPlan`** that defines the neighbor topology
+- A pointer to a shared `**CommunicationPlan`** that defines the neighbor topology
 - A `host` member (`ViewCArray<T>`) for convenient host-side access
 
 ```cpp
@@ -1362,19 +1449,21 @@ comm_plan.setup_send_recv(rank_send_ids, rank_recv_ids);
 
 #### CommunicationPlan Members
 
-| Member | Type | Description |
-|--------|------|-------------|
-| `mpi_comm_world` | `MPI_Comm` | The base MPI communicator |
-| `mpi_comm_graph` | `MPI_Comm` | Distributed graph communicator for neighbor collectives |
-| `num_send_ranks` | `int` | Number of ranks this process sends to (out-degree) |
-| `num_recv_ranks` | `int` | Number of ranks this process receives from (in-degree) |
-| `send_rank_ids` | `DCArrayKokkos<int>` | Destination rank IDs |
-| `recv_rank_ids` | `DCArrayKokkos<int>` | Source rank IDs |
-| `send_indices_` | `DRaggedRightArrayKokkos<int>` | Per-neighbor indices of elements to send |
-| `recv_indices_` | `DRaggedRightArrayKokkos<int>` | Per-neighbor indices of elements to receive |
-| `send_counts_` / `recv_counts_` | `DCArrayKokkos<int>` | Number of elements per neighbor |
-| `send_displs_` / `recv_displs_` | `DCArrayKokkos<int>` | Displacement offsets for packing |
-| `total_send_count` / `total_recv_count` | `int` | Total elements across all neighbors |
+
+| Member                                  | Type                           | Description                                             |
+| --------------------------------------- | ------------------------------ | ------------------------------------------------------- |
+| `mpi_comm_world`                        | `MPI_Comm`                     | The base MPI communicator                               |
+| `mpi_comm_graph`                        | `MPI_Comm`                     | Distributed graph communicator for neighbor collectives |
+| `num_send_ranks`                        | `int`                          | Number of ranks this process sends to (out-degree)      |
+| `num_recv_ranks`                        | `int`                          | Number of ranks this process receives from (in-degree)  |
+| `send_rank_ids`                         | `DCArrayKokkos<int>`           | Destination rank IDs                                    |
+| `recv_rank_ids`                         | `DCArrayKokkos<int>`           | Source rank IDs                                         |
+| `send_indices_`                         | `DRaggedRightArrayKokkos<int>` | Per-neighbor indices of elements to send                |
+| `recv_indices_`                         | `DRaggedRightArrayKokkos<int>` | Per-neighbor indices of elements to receive             |
+| `send_counts_` / `recv_counts_`         | `DCArrayKokkos<int>`           | Number of elements per neighbor                         |
+| `send_displs_` / `recv_displs_`         | `DCArrayKokkos<int>`           | Displacement offsets for packing                        |
+| `total_send_count` / `total_recv_count` | `int`                          | Total elements across all neighbors                     |
+
 
 ### Connecting MPICArrayKokkos to a CommunicationPlan
 
@@ -1392,13 +1481,10 @@ field.initialize_comm_plan(comm_plan);
 
 Calling `field.communicate()` performs the full halo exchange:
 
-1. **`fill_send_buffer()`**: Copies `this_array_` to host (`update_host`), then packs elements listed in `send_indices_` into the contiguous `send_buffer_` on the host. For multi-dimensional arrays, each element index packs `stride_` contiguous values.
-
-2. **`MPI_Neighbor_alltoallv()`**: Exchanges data with all neighbors using the graph communicator. Send/recv counts and displacements are pre-computed and scaled by stride.
-
-3. **`copy_recv_buffer()`**: Unpacks `recv_buffer_` into the ghost positions of `this_array_` using `recv_indices_`.
-
-4. **`update_device()`**: Syncs the updated array (with fresh ghost data) back to the device.
+1. `**fill_send_buffer()**`: Copies `this_array_` to host (`update_host`), then packs elements listed in `send_indices_` into the contiguous `send_buffer_` on the host. For multi-dimensional arrays, each element index packs `stride_` contiguous values.
+2. `**MPI_Neighbor_alltoallv()**`: Exchanges data with all neighbors using the graph communicator. Send/recv counts and displacements are pre-computed and scaled by stride.
+3. `**copy_recv_buffer()**`: Unpacks `recv_buffer_` into the ghost positions of `this_array_` using `recv_indices_`.
+4. `**update_device()**`: Syncs the updated array (with fresh ghost data) back to the device.
 
 ```cpp
 // COMPILES_AS_IS (assuming variables are declared)
@@ -1419,18 +1505,21 @@ FOR_ALL(i, 0, num_owned + num_ghost, {
 ### Interaction Between update_host/update_device and MPI
 
 The `communicate()` method internally calls `update_host()` before packing and `update_device()` after unpacking. This means:
+
 - Before calling `communicate()`, ensure device data is current (place a `MATAR_FENCE()` after any kernel that wrote to the array).
 - After `communicate()` returns, the device copy already has fresh ghost data — no additional `update_device()` is needed.
 - If you need to inspect the data on the host after communication, call `update_host()` explicitly.
 
 ### MPI Convenience Macros
 
-| Macro | Expands To |
-|-------|-----------|
-| `MATAR_MPI_INIT` | `MPI_Init(&argc, &argv)` |
-| `MATAR_MPI_FINALIZE` | `MPI_Finalize()` |
-| `MATAR_MPI_TIME` | `MPI_Wtime()` |
-| `MATAR_MPI_BARRIER` | `MPI_Barrier(MPI_COMM_WORLD)` |
+
+| Macro                | Expands To                    |
+| -------------------- | ----------------------------- |
+| `MATAR_MPI_INIT`     | `MPI_Init(&argc, &argv)`      |
+| `MATAR_MPI_FINALIZE` | `MPI_Finalize()`              |
+| `MATAR_MPI_TIME`     | `MPI_Wtime()`                 |
+| `MATAR_MPI_BARRIER`  | `MPI_Barrier(MPI_COMM_WORLD)` |
+
 
 ### Complete MPI Example Pattern
 
@@ -1503,33 +1592,27 @@ int main(int argc, char* argv[]) {
 Use these constraints as hard rules when generating MATAR code:
 
 1. **Use only macros that exist in `MATAR/src/include/macros.h`.**
-   - **Flat parallel loops:** `FOR_ALL`, `DO_ALL`, `RUN`
-   - **Serial convenience loops:** `FOR_LOOP`, `DO_LOOP` (and stride variants `FOR_LOOP_DIM` / `DO_LOOP_DIM` for loops with a delta/increment)
-   - **Reductions (half-open):** `FOR_REDUCE_SUM`, `FOR_REDUCE_MAX`, `FOR_REDUCE_MIN`, `FOR_REDUCE_PRODUCT`
-   - **Reductions (inclusive):** `DO_REDUCE_SUM`, `DO_REDUCE_MAX`, `DO_REDUCE_MIN` — **`DO_REDUCE_PRODUCT` does NOT exist**
-   - **Class-member variants** (use `KOKKOS_CLASS_LAMBDA`): `FOR_ALL_CLASS`, `RUN_CLASS`, `FOR_REDUCE_SUM_CLASS`, `FOR_REDUCE_MAX_CLASS`, `FOR_REDUCE_MIN_CLASS`
-   - **Hierarchical (team/thread/vector):** `FOR_FIRST`, `FOR_SECOND`, `FOR_THIRD`, `DO_FIRST`, `DO_SECOND`, `DO_THIRD`
-   - **Hierarchical reductions:** `FOR_REDUCE_SUM_SECOND`, `FOR_REDUCE_SUM_THIRD`, `DO_REDUCE_SUM_SECOND`, `DO_REDUCE_SUM_THIRD`, `FOR_REDUCE_MAX_SECOND`, `DO_REDUCE_MAX_THIRD`, `FOR_REDUCE_MIN_SECOND`, `DO_REDUCE_MIN_THIRD`
-
-2. **`FOR_ALL` and `DO_ALL` parallelize all listed dimensions.**
-   - If a dimension must remain sequential, keep it as a plain inner `for` loop inside the macro body, or use hierarchical macros where appropriate.
-
+  - **Flat parallel loops:** `FOR_ALL`, `DO_ALL`, `RUN`
+  - **Serial convenience loops:** `FOR_LOOP`, `DO_LOOP` (and stride variants `FOR_LOOP_DIM` / `DO_LOOP_DIM` for loops with a delta/increment)
+  - **Reductions (half-open):** `FOR_REDUCE_SUM`, `FOR_REDUCE_MAX`, `FOR_REDUCE_MIN`, `FOR_REDUCE_PRODUCT`
+  - **Reductions (inclusive):** `DO_REDUCE_SUM`, `DO_REDUCE_MAX`, `DO_REDUCE_MIN` — `**DO_REDUCE_PRODUCT` does NOT exist**
+  - **Class-member variants** (use `KOKKOS_CLASS_LAMBDA`): `FOR_ALL_CLASS`, `RUN_CLASS`, `FOR_REDUCE_SUM_CLASS`, `FOR_REDUCE_MAX_CLASS`, `FOR_REDUCE_MIN_CLASS`
+  - **Hierarchical (team/thread/vector):** `FOR_FIRST`, `FOR_SECOND`, `FOR_THIRD`, `DO_FIRST`, `DO_SECOND`, `DO_THIRD`
+  - **Hierarchical reductions:** `FOR_REDUCE_SUM_SECOND`, `FOR_REDUCE_SUM_THIRD`, `DO_REDUCE_SUM_SECOND`, `DO_REDUCE_SUM_THIRD`, `FOR_REDUCE_MAX_SECOND`, `DO_REDUCE_MAX_THIRD`, `FOR_REDUCE_MIN_SECOND`, `DO_REDUCE_MIN_THIRD`
+2. `**FOR_ALL` and `DO_ALL` parallelize all listed dimensions.**
+  - If a dimension must remain sequential, keep it as a plain inner `for` loop inside the macro body, or use hierarchical macros where appropriate.
 3. **Indexing is always `()` for MATAR arrays.**
-   - Never emit `[]` indexing for MATAR types.
-
+  - Never emit `[]` indexing for MATAR types.
 4. **For dual types, synchronize explicitly.**
-   - Host writes require `update_device()` before device reads.
-   - Device writes require `update_host()` before host reads.
-
+  - Host writes require `update_device()` before device reads.
+  - Device writes require `update_host()` before host reads.
 5. **Fence only when needed.**
-   - Required at dependency boundaries or before host/timing use.
-   - Avoid unnecessary fences between independent kernels.
-
+  - Required at dependency boundaries or before host/timing use.
+  - Avoid unnecessary fences between independent kernels.
 6. **Prefer aliases in `mtr::` namespace.**
-   - Example: `CArrayDual<T>` instead of direct underlying class names unless low-level control is needed.
-
+  - Example: `CArrayDual<T>` instead of direct underlying class names unless low-level control is needed.
 7. **MPI-aware distributed arrays use `MPICArrayKokkos`.**
-   - Plain `DCArrayKokkos` / `CArrayDual` are rank-local and do not perform communication.
+  - Plain `DCArrayKokkos` / `CArrayDual` are rank-local and do not perform communication.
 
 ---
 
@@ -1541,29 +1624,25 @@ When producing MATAR conversions, output must satisfy this checklist:
   - Include `#include <matar.h>`
   - Use `using namespace mtr;`
   - Wrap compute region between `MATAR_INITIALIZE(...)` and `MATAR_FINALIZE()`
-
 - **Data structure mapping**
-  - Choose `C*` vs `F*` based on target layout/coalescing goals
+  - Choose `C`* vs `F*` based on target layout/coalescing goals
   - Choose `Array` vs `Matrix` based on 0-based vs 1-based indexing
   - Choose `Device` vs `Dual` vs `Host` based on execution/sync needs
   - Use `View*` when wrapping existing pointers or slices
-
 - **Loop mapping**
   - Replace parallelizable loops with `FOR_ALL` / `DO_ALL`
-  - Replace reductions with `FOR_REDUCE_*` / `DO_REDUCE_*`
+  - Replace reductions with `FOR_REDUCE_`* / `DO_REDUCE_*`
   - Keep dependency-carrying inner loops serial inside macro bodies
-
 - **Correctness and synchronization**
   - Add `update_device()` / `update_host()` where required
   - Add `MATAR_FENCE()` only at true synchronization boundaries
   - Avoid race conditions (use reductions, atomics, or serial inner loops)
-
 - **MPI (if distributed)**
   - Build `CommunicationPlan`
   - Attach with `initialize_comm_plan(...)`
   - Use `communicate()` at halo exchange points
-
 - **Output quality**
   - Label examples as either:
     - `COMPILES_AS_IS` (fully concrete except obvious variable declarations), or
     - `PSEUDOCODE_PATTERN` (contains app-specific placeholders)
+