
[Bug] CUDA Error 700 (Illegal Memory Access) in Myelin shapeChangeHelper During Prefill After Optimization Profile Switch #50

@chunyuetang

Description


llm_inference.log

Describe the bug

Impact: Blocker — This bug makes it impossible to use CUDA graph capture for the generation (decode) profile and then switch back to the prefill profile within the same IExecutionContext. It blocks the primary inference pipeline for hybrid linear-attention LLM models on DRIVE Orin.

IExecutionContext::enqueueV3() fails with CUDA error 700 (cudaErrorIllegalAddress) inside TensorRT's internal Myelin runtime (shapeChangeHelper → async copy) during the first prefill execution. The crash occurs when switching the optimization profile from profile 1 (generation) back to profile 0 (prefill) via setOptimizationProfileAsync(), in a multi-stream engine scenario.

Error message:

[18:31:00.337] [ERROR] [TensorRT] IExecutionContext::enqueueV3: Error Code 1: Myelin
  ([copy.cpp:exec:180] CUDA error 700 enqueueing async copy.
   In shapeChangeHelper at runtime/myelin/runner.cpp:460)

Followed by a cascading failure during cleanup:

terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA runtime error in cudaFreeHost(data): an illegal memory access was encountered

Root cause analysis:

The error originates inside TensorRT's internal Myelin runtime (not in any user plugin), specifically in shapeChangeHelper at runtime/myelin/runner.cpp:460, performing an async memory copy.

We believe the root cause is a TensorRT internal state corruption when switching optimization profiles after CUDA graph capture in a multi-stream engine. Specifically:

  1. During CUDA graph capture on profile 1, TensorRT's Myelin runtime initializes internal state (buffer addresses, shape metadata) for its auxiliary worker streams.
  2. When setOptimizationProfileAsync(0, stream) is called to switch back to profile 0, Myelin's internal state on the auxiliary worker streams is not fully reinitialized — even if a cudaStreamSynchronize(stream) is issued on the main stream.
  3. When enqueueV3() is subsequently called on profile 0, Myelin's shapeChangeHelper operates on stale internal state (e.g., buffer addresses/shapes from profile 1), causing the async copy to access invalid memory.

This is NOT a simple race condition. Adding cudaStreamSynchronize(stream) between setOptimizationProfileAsync() and enqueueV3() does not resolve the crash. The main stream is fully quiesced before enqueueV3(), yet the crash still occurs. This rules out async profile switch timing as the cause and points to Myelin's auxiliary stream state not being properly updated during profile switches.

Key evidence:

  • The TensorRT log confirms the profile switch: "Switching optimization profile from: 1 to 0" (at 18:30:59.915)
  • The crash happens 422ms later during enqueueV3 (at 18:31:00.337)
  • cudaStreamSynchronize(stream) was added between setOptimizationProfileAsync and enqueueV3 — the crash still occurs, 100% reproducibly
  • The engine uses multi-stream (Number of total worker streams is 2)
  • The error is in Myelin's internal copy.cpp:exec:180, not in any user-provided plugin
  • The crash only occurs after CUDA graph capture has been performed on profile 1; if CUDA graph capture is skipped entirely, the profile switch works correctly

Workarounds attempted (all failed):

  1. Warm up profile 0 before profile 1 — Still crashes. The Myelin internal state established during the warm-up is overwritten during CUDA graph capture on profile 1.
  2. Remove all cudaMemcpyAsync in custom LinearAttentionPlugin — Still crashes. The error is in Myelin's internal copy, not any plugin-side copy.
  3. Add cudaStreamSynchronize(stream) after setOptimizationProfileAsync — Still crashes. The stream sync completes successfully, confirming the profile switch itself finishes without error, yet the subsequent enqueueV3() still hits CUDA 700.
  4. Add cudaDeviceSynchronize() after setOptimizationProfileAsync — Still crashes. This synchronizes ALL streams on the device. Confirms the issue is not a synchronization/timing problem, but a state management bug in Myelin's profile switching logic.
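
Workarounds 3 and 4 correspond to code along these lines (a simplified sketch; `mContext`, `mStream`, `devInputsEmbeds`, `inputLen`, and `hiddenSize` are placeholder names, not the runner's actual members; the tensor name `inputs_embeds` is taken from the engine's binding list):

```cpp
// Sketch of the failed synchronization workarounds (3) and (4).
// mContext: the shared nvinfer1::IExecutionContext*; mStream: the main cudaStream_t.
mContext->setOptimizationProfileAsync(0, mStream);  // switch generation -> prefill

// Workaround 3: quiesce the main stream before the next enqueue.
cudaStreamSynchronize(mStream);                     // returns cudaSuccess

// Workaround 4: quiesce every stream on the device, including Myelin's aux streams.
cudaDeviceSynchronize();                            // also returns cudaSuccess

// Rebind tensors for profile 0, then enqueue.
mContext->setInputShape("inputs_embeds", nvinfer1::Dims3{1, inputLen, hiddenSize});
mContext->setTensorAddress("inputs_embeds", devInputsEmbeds);
mContext->enqueueV3(mStream);  // still fails: CUDA 700 in Myelin shapeChangeHelper
```

Both synchronization calls complete without error, which is why we conclude the problem is stale Myelin state rather than a timing race.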

Questions for NVIDIA:

  1. Is switching optimization profiles supported after CUDA graph capture on a multi-stream engine? If not, this should be documented.
  2. Does CUDA graph capture on one optimization profile permanently lock Myelin's internal auxiliary stream state?
  3. What is the recommended pattern for using multiple optimization profiles with CUDA graph capture in a multi-stream engine? (e.g., separate IExecutionContext per profile?)
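
For question 3, the pattern we suspect may be required (and would like confirmed) is one IExecutionContext per profile, so that a profile switch never happens on the graph-captured context. A hypothetical sketch (names are illustrative, not from our codebase):

```cpp
// Hypothetical mitigation for question 3: dedicate one context per profile so
// the context used for CUDA graph capture never undergoes a profile switch.
nvinfer1::IExecutionContext* prefillCtx = engine->createExecutionContext();
nvinfer1::IExecutionContext* decodeCtx  = engine->createExecutionContext();

prefillCtx->setOptimizationProfileAsync(0, stream);  // prefill profile, set once
decodeCtx->setOptimizationProfileAsync(1, stream);   // generation profile, set once

// CUDA graph capture would happen only on decodeCtx; prefillCtx is never
// switched, so its Myelin auxiliary-stream state is never invalidated.
```

We have not yet verified this avoids the crash, and it doubles activation memory (each context allocates its own), which matters on a 32 GB unified-memory device.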

Steps/Code to reproduce bug

Build configuration:

cmake .. -DCMAKE_BUILD_TYPE=Release \
    -DTRT_PACKAGE_DIR=/path/to/TensorRT \
    -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
    -DEMBEDDED_TARGET=drive-orin

Runtime command used:

# Step 1: Build the engine with 2 optimization profiles (profile 0: prefill, profile 1: generation)
./build/examples/llm/llm_build --onnxDir ./onnx_models/qwen3.5-0.8b \
    --engineDir ./engines/qwen3.5-0.8b \
    --maxBatchSize 1 --maxInputLen 1024 --maxKVCacheCapacity 4096

# Step 2: Run inference — crashes on first prefill after CUDA graph capture
./build/examples/llm/llm_inference --engineDir ./engines/qwen3.5-0.8b \
    --inputFile input.json --outputFile output.json

Execution flow that triggers the crash (simplified):

① Engine deserialization (profile defaults to 0)
② captureDecodingCUDAGraph()
   └─ captureVanillaDecodingCudaGraph():
      ├─ setOptimizationProfileAsync(1, stream)  // switch to generation profile
      ├─ ... set tensor addresses & shapes ...
      ├─ enqueueV3(stream)                        // warm-up run
      ├─ cudaStreamSynchronize(stream)
      ├─ cudaStreamBeginCapture(...)
      ├─ enqueueV3(stream)                        // capture
      └─ cudaStreamEndCapture(...)
   // ⚠ Function returns with profile still at 1

③ handleRequest() → executePrefillStep():
   ├─ setOptimizationProfileAsync(0, stream)      // switch back to prefill profile
   ├─ cudaStreamSynchronize(stream)               // ← explicit sync added; does NOT prevent crash
   ├─ setTensorAddress(...)                        // bind tensors for profile 0
   ├─ setInputShape(...)
   └─ enqueueV3(stream)                           // 💥 CUDA 700 in Myelin shapeChangeHelper
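
Step ② above, expressed as concrete TensorRT/CUDA calls (a sketch with error checking and tensor binding elided; the capture mode and variable names are illustrative; the three-argument cudaGraphInstantiate overload is not available on CUDA 11.4, hence the five-argument form):

```cpp
// Step 2: capture the decode step on profile 1, then instantiate the graph.
context->setOptimizationProfileAsync(1, stream);  // switch to generation profile
/* ... bind profile-1 tensor addresses and shapes ... */
context->enqueueV3(stream);                       // warm-up run
cudaStreamSynchronize(stream);

cudaGraph_t graph = nullptr;
cudaGraphExec_t graphExec = nullptr;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
context->enqueueV3(stream);                       // recorded into the graph, not executed
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Later decode steps replay via cudaGraphLaunch(graphExec, stream), so
// executeVanillaDecodingStep() never re-enqueues through Myelin.
// NOTE: the context is left on profile 1 when this function returns; the first
// prefill afterwards (step 3) must switch back to profile 0 and then crashes.
```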

Relevant source code is in cpp/runtime/llmEngineRunner.cpp:

  • executePrefillStep() — line ~1009: where the profile switch + enqueueV3 crash occurs
  • captureVanillaDecodingCudaGraph() — line ~1530: CUDA graph capture that leaves context on profile 1
  • executeVanillaDecodingStep() — line ~1225: generation path (bypasses profile switching via CUDA graph, so does not crash)

All setOptimizationProfileAsync call sites and their synchronization status:

| Location | Line | Profile | cudaStreamSynchronize after? |
| --- | --- | --- | --- |
| executePrefillStep | 1014 | 0 (prefill) | Yes (added; does NOT prevent crash) |
| executeVanillaDecodingStep | 1229 | 1 (generation) | No |
| captureVanillaDecodingCudaGraph | 1535 | 1 (generation) | No (sync is 100+ lines later, after the enqueueV3 warm-up) |
| captureEagleBaseTreeDecodingCudaGraph | 1687 | 1 (generation) | No |

Model / Engine Details:

  • Architecture: Qwen3.5-0.8B hybrid model — 24 layers total: [L, L, L, F] × 6 (18 linear attention + 6 full attention)
  • Full Attention config: 8 q_heads, 2 kv_heads, head_dim=256, GQA 4:1
  • Linear Attention config: GatedDeltaNet, 16 heads, key/value_dim=128, conv_kernel=4
  • Precision: INT4 AWQ
  • Optimization profiles: Profile 0 (prefill, dynamic sequence length), Profile 1 (generation, single token)
  • Engine multi-stream: Number of total worker streams is 2, Number of aux streams is 1
  • Engine size: 1224 MiB
  • Engine bindings include:
    • Standard: inputs_embeds, context_lengths, last_token_ids, rope_cos_sin, logits
    • KV Cache: past_key_values_{3,7,11,15,19,23} (6 full attention layers)
    • Linear Attention State: conv_state_{0,1,2,4,5,6,...}, recurrent_state_{0,1,2,4,5,6,...} (18 layers)
  • Plugins: Int4GroupwiseGemmPlugin (IPluginV3), LinearAttentionPlugin (IPluginV3), AttentionPlugin (IPluginV2DynamicExt)

Relevant log (trimmed):

[18:30:57.511] [INFO] [TensorRT] [MS] Running engine with multi stream info
[18:30:57.511] [INFO] [TensorRT] [MS] Number of aux streams is 1
[18:30:57.511] [INFO] [TensorRT] [MS] Number of total worker streams is 2
[18:30:57.511] [INFO] [TensorRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
...
[18:30:59.523] [INFO] [TensorRT] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[18:30:59.915] [DEBUG] captureVanillaDecodingCudaGraph(): CUDA graph captured successfully for input shape [1, 1, 1024]
[18:30:59.915] [INFO] LLMInferenceRuntime(): Successfully captured the decoding CUDA graph for all execution batch sizes and LoRA weights.
[18:30:59.915] [INFO] [TensorRT] Switching optimization profile from: 1 to 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[18:31:00.337] [ERROR] [TensorRT] IExecutionContext::enqueueV3: Error Code 1: Myelin ([copy.cpp:exec:180] CUDA error 700 enqueueing async copy. In shapeChangeHelper at runtime/myelin/runner.cpp:460)
[18:31:00.337] [ERROR] executePrefill(): Failed on TensorRT prefill stage enqueueV3() call.

Expected behavior

setOptimizationProfileAsync(profile, stream) followed by proper tensor binding and enqueueV3(stream) on the same stream should work correctly after CUDA graph capture on a different profile. TensorRT's internal state management should ensure that Myelin's shapeChangeHelper and all auxiliary worker streams properly reinitialize their state when the optimization profile changes, regardless of whether CUDA graph capture was previously performed.

Actual behavior: After CUDA graph capture on profile 1, switching back to profile 0 via setOptimizationProfileAsync(0, stream) (even with an explicit cudaStreamSynchronize(stream) or cudaDeviceSynchronize()) leaves Myelin's internal state inconsistent, and the subsequent enqueueV3(stream) triggers CUDA error 700 (illegal memory access) in Myelin's shapeChangeHelper async copy. The failure is 100% reproducible on the first prefill enqueueV3 after capture. If CUDA graph capture is skipped entirely, profile switching between 0 and 1 with enqueueV3 works correctly, confirming the issue is specific to the interaction between CUDA graph capture and subsequent profile switches in multi-stream Myelin engines.

System information (Edge Device)

  • Platform: NVIDIA DRIVE Orin
  • Software release: NVIDIA DRIVE OS (bundled with DriveOS)
  • CPU architecture: aarch64
  • GPU compute capability: SM87 (Ampere)
  • Total device memory: 32 GB (unified memory)
  • Build type: Release
  • Library versions:
    • TensorRT Edge-LLM version or commit hash: v0.5.0
    • CUDA: 11.4
    • TensorRT: 10.13.1 (bundled with DriveOS)
    • C++ compiler: GCC 11.4 (cross-compilation toolchain)
  • CMake options used:
    • CMAKE_TOOLCHAIN_FILE: cmake/aarch64_linux_toolchain.cmake
    • EMBEDDED_TARGET: drive-orin
    • TRT_PACKAGE_DIR: /path/to/TensorRT
  • Any other details that may help:
    • Engine uses multi-stream execution: Number of total worker streams is 2, Number of aux streams is 1
    • Engine size: 1224 MiB
    • Model: Qwen3.5-0.8B hybrid (18 linear attention + 6 full attention layers, INT4 AWQ)
    • Plugins used: Int4GroupwiseGemmPlugin (IPluginV3), LinearAttentionPlugin (IPluginV3), AttentionPlugin (IPluginV2DynamicExt)

Metadata

Labels: bug (Something isn't working)