Feature/yolo26 p4 optimizations by BoumedineBillal · Pull Request #310 · espressif/esp-dl

BoumedineBillal · 2026-06-12T01:13:40Z

feat: ESP32-P4 PIE SIMD kernels + Neural Morphing for YOLO26n: 1,088ms @ 512×512, 0.361 mAP (1.90× vs #286 baseline)

Executive Summary

This PR adds 4 custom PIE SIMD assembly kernels and a Neural Morphing Engine for ESP32-P4, reducing YOLO26n inference from 2,072ms to 1,088ms (1.90× speedup) at 512×512 resolution while maintaining 0.361 mAP50-95 (−0.004 from baseline). It builds on the baseline YOLO26n deployment merged in #286.

These optimizations are not YOLO26n-specific. The PIE kernels accelerate any model containing matching operator patterns (Conv+Activation, Transpose, INT16 LUT, SiLU/HardSiLU), and the Neural Morphing Engine provides a reusable strategy framework for automated graph transformation with quality-gated distillation, applicable to any model deployed via esp-ppq.

Scope: ESP32-P4 only. All assembly kernels use ESP32-P4 PIE SIMD instructions.

Model	Configuration	Resolution	mAP50-95	Inference	vs #286	vs YOLOv11n
YOLO26n (This PR)	T1 Morph + 4 PIE kernels	512×512	0.361	1,088ms	1.90× faster	2.54× faster
YOLO26n (PR #286)	PTQ + TQT + LUT	512×512	0.365	2,072ms	baseline	1.33× faster
YOLOv11n (Stock)	Official ESP-DL	640×640	0.360	2,764ms		baseline

Performance Breakdown

Savings are attributed per kernel, with no overlap between kernels:

Kernel	Target Operators	Speedup	Latency Saved	Mechanism
TransposePIE	Attention transposes (PSA)	89× avg, 917× peak	303ms	3-stage `vzip` butterfly in PIE registers
TiledConvBlock	Conv+Act layers (spatial + 1×1)	13.3× on largest layer	379ms	H-tiling in L2 cache, fused in-place activation
NN-LUT INT16	INT16 Swish activations	4.2×	156ms	NN lookup via XOR + `vmul.u16` HALF_EVEN rounding, 4-way interleaved
HardSiluPie8	SiLU→HardSiLU8 (morphed)	6.5×	131ms	12-instruction piecewise-linear kernel with learned scale
			969ms attributed (~984ms measured end to end)	2,072ms → 1,088ms

TiledConvBlock detail: 326ms from spatial H-tiling (Conv 3×3, 5×5) + 53ms from output channel tiling (c_tile) on 1×1 Conv layers. The largest single-layer improvement is model.7/conv (DW 5×5): 305ms → 23ms (13.3×).

Architecture: Three Optimization Layers

The optimizations are organized into three independent layers, each contributing to the final result:

Layer 1: Neural Morphing Engine (Model-Level)
    yolo26n.pt → T1: SiLU→HardSiLU8 (45/66 accepted) → morphed_hsilu.native
                                    ↓
Layer 3: Python Pipeline (Quantization + Emulation)
    morphed_hsilu.native → calibrate → TQT → custom_ops_patch → export
                                    ↓
    yolo26n_512_s8_p4_tpie.espdl (contains HardSiluPie8/TransposePIE/TiledConvBlock ops)
                                    ↓
Layer 2: PIE SIMD Kernels (Hardware-Level)
    dl_module_creator.hpp deserializes → instantiates C++ modules → calls PIE assembly
                                    ↓
    1,088ms inference, bit-exact with Python emulation

Operator-level graph transformation (Netron visualization: baseline left, optimized right).

Three operator replacements: Conv+Swish → TiledConvBlock (fused with L2-cache tiling), Swish → HardSiluPie8 (via Neural Morphing T1), Transpose → TransposePIE (PIE SIMD dispatch).

Layer 1: Neural Morphing Engine

A general-purpose framework for automated, quality-gated graph transformations on PPQ .native graphs. The engine implements a strategy pattern: each transformation is a self-contained BaseReplacementStrategy that the engine orchestrates through a 5-phase pipeline:

A. Select targets → B. Build replacements → C. Block-wise distillation → D. Quality gate → E. Accept/Rollback

The core technique, block-wise knowledge distillation, is inspired by esp-ppq's TQT implementation, which combines TQT scale learning (per-tensor, symmetric, power-of-two constraints) with the block-wise reconstruction strategy from BRECQ (Li et al., ICLR 2021). The Neural Morphing Engine extends this approach: instead of only optimizing quantization scales, it uses block-wise distillation to optimize architectural replacements (e.g., SiLU → HardSiLU8 with learned scale). The reconstruction objective is strategy-defined: T1 uses Huber+Cosine loss (robust to activation outliers), while the engine itself is loss-agnostic.
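To make the T1 objective concrete, here is a minimal pure-Python sketch of a combined Huber + cosine loss over two flattened activation vectors. The `delta` and `cos_weight` values and the way the two terms are summed are assumptions for illustration; the PR does not state how T1 weights them.

```python
import math

def huber(err, delta=1.0):
    """Huber penalty: quadratic near zero, linear for large errors."""
    a = abs(err)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def huber_cosine_loss(student, teacher, delta=1.0, cos_weight=1.0):
    """Mean Huber error plus a weighted (1 - cosine similarity) term.

    delta and cos_weight are hypothetical defaults, not values from the PR.
    """
    if len(student) != len(teacher) or not student:
        raise ValueError("inputs must be non-empty and of equal length")
    huber_term = sum(huber(s - t, delta) for s, t in zip(student, teacher)) / len(student)
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    cos = dot / max(norm_s * norm_t, 1e-12)  # guard against zero vectors
    return huber_term + cos_weight * (1.0 - cos)
```

The Huber term limits the influence of a few large activation outliers, while the cosine term penalizes a change in direction of the activation vector regardless of magnitude.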

T1: SiLU → HardSiLU8 (Deployed)

Replaces standard SiLU activations with HardSiLU8, a piecewise-linear approximation that maps directly to the HardSiluPie8 PIE kernel. Each replacement includes a learnable scale factor (clamp-quantized to scale_int ∈ [0, 256] via clamp(0,1) × 256 → round), trained via block-wise Huber+Cosine distillation.

Metric	Value
Candidates scanned	66 SiLU ops
Accepted	45 (68.2%)
Rejected	21 (PSA attention cosine < threshold)
Quality gate	cos ≥ 0.9908, scale ratio ∈ [0.985, 1.015]
FP32 mAP impact	0.375 → 0.370 (−0.005, 98.6% preserved)
Post-quantization mAP	0.365 → 0.361 (−0.004)
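The scale quantization and the quality gate above can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, Python's `round` (ties to even) stands in for the unspecified rounding mode, and "scale ratio" is passed in as an opaque number because the PR does not define how it is computed.

```python
COS_THRESHOLD = 0.9908           # from the quality gate row above
SCALE_RATIO_RANGE = (0.985, 1.015)

def quantize_scale(scale):
    """Map a learned float scale to scale_int in [0, 256]."""
    clamped = min(max(scale, 0.0), 1.0)
    return int(round(clamped * 256))

def passes_quality_gate(cosine, scale_ratio):
    """Accept a replacement only if both gate conditions hold."""
    lo, hi = SCALE_RATIO_RANGE
    return cosine >= COS_THRESHOLD and lo <= scale_ratio <= hi
```

For example, a learned scale of 0.5 maps to `scale_int = 128`, and a block with cosine 0.99 is rejected even when its scale ratio is exactly 1.0.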

T2 & T3 (Ablation Only, Not Deployed)

Two additional strategies were evaluated but not included in the deployed model:

T2 (Conv Decomposition): Decomposes 3×3 Conv into residual DW+PW with SVD initialization. FP32 recovery was good (0.367 mAP), but quantized performance dropped to 0.323 because the residual skip path creates a quantization-sensitive addition node.
T3 (Channel Pruning): L1-norm pointwise pruning at 20% ratio. FP32 recovery reached 0.362, but interaction with T2's degraded baseline made the combined result unacceptable.

These strategies remain in the repository as documented ablation studies and starting points for future work.

Extensibility

New strategies are added by implementing BaseReplacementStrategy, which defines the full distillation contract:

Method / Property	What the Strategy Controls
`select_target()`	Which ops to replace
`build_replacement()`	How to construct the replacement subgraph
`evaluate_validation()`	Quality gate logic (accept/reject decision)
`get_criterion()`	Distillation loss function (e.g., HuberCosine, MSE)
`get_scheduler()`	LR schedule (e.g., CosineAnnealingWarmRestarts)
`calculate_samples()`	How many batches to cache per block
`learning_rate`, `weight_decay`, `steps`, `patience`	Optimizer hyperparameters

Optional hooks: on_step_end() (per-step constraints), requires_predecessor (include predecessor Conv in distillation block for activation strategies), compensate_on_reject (fine-tune original block after rejection to correct upstream drift).

The engine handles the orchestration: block iteration, subgraph splicing, calibration data caching, gradient loop execution, rollback on rejection, and genealogy metadata tracking.
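As a rough picture of how a strategy and the engine interact, the sketch below mirrors part of the contract from the table with a toy abstract base class and a simplified A to E loop. Everything here is hypothetical: the class name, the method signatures, the op dictionaries, and the `distill` callback are stand-ins, not the actual `neural_morphing` API.

```python
from abc import ABC, abstractmethod

class ReplacementStrategySketch(ABC):
    """Toy stand-in for the strategy contract (signatures are assumptions)."""
    learning_rate = 1e-3
    steps = 100
    requires_predecessor = False

    @abstractmethod
    def select_target(self, ops):
        """Return the ops this strategy wants to replace."""

    @abstractmethod
    def build_replacement(self, op):
        """Return the replacement for one op."""

    @abstractmethod
    def evaluate_validation(self, metrics):
        """Quality gate: True to accept, False to roll back."""

class SiluToHardSilu(ReplacementStrategySketch):
    def select_target(self, ops):
        return [op for op in ops if op["type"] == "SiLU"]

    def build_replacement(self, op):
        return {"name": op["name"], "type": "HardSiLU8"}

    def evaluate_validation(self, metrics):
        return metrics["cos"] >= 0.9908

def run_engine(ops, strategy, distill):
    """Simplified pipeline: select, build, distill, gate, accept or roll back."""
    accepted, rejected = [], []
    for op in strategy.select_target(ops):            # A. select targets
        candidate = strategy.build_replacement(op)    # B. build replacement
        metrics = distill(op, candidate, strategy)    # C. block-wise distillation
        if strategy.evaluate_validation(metrics):     # D. quality gate
            accepted.append(candidate)                # E. accept
        else:
            rejected.append(op)                       # E. roll back to original
    return accepted, rejected
```

A usage example with a stubbed distillation step: `run_engine(ops, SiluToHardSilu(), lambda op, cand, s: {"cos": 0.995})` accepts every SiLU op in `ops` and leaves other op types untouched.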

Layer 2: PIE SIMD Kernels

K1: HardSiluPie8 (`dl_esp32p4_s8_hard_silu8.S`)

12 computation PIE instructions (16 total including load, store, and 2× SAR setup) process 16 INT8 elements per iteration. The kernel computes:

y = (x_scaled × gate) >> SAR_total
where gate = clamp(x + offset, 0, max_gate)
      x_scaled = x × scale_int

All arithmetic stays in INT16 (sign-extended from INT8). The scale_int parameter is deserialized from the .espdl model; it encodes the per-layer learned scale factor from T1 distillation.
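The formula above can be traced with a scalar Python sketch. It only illustrates the order of operations: the parameter values in the example, the use of a flooring arithmetic shift, and the final INT8 saturation are assumptions, and the sketch uses Python's unbounded integers rather than modelling INT16 lanes.

```python
def hard_silu8(x, scale_int, offset, max_gate, sar_total):
    """Scalar sketch of y = (x_scaled * gate) >> SAR_total for one INT8 input.

    offset, max_gate and sar_total are hypothetical per-layer constants.
    """
    gate = min(max(x + offset, 0), max_gate)  # clamp(x + offset, 0, max_gate)
    x_scaled = x * scale_int                  # learned scale from T1
    y = (x_scaled * gate) >> sar_total        # Python >> floors, like an arithmetic shift
    return min(max(y, -128), 127)             # saturate to INT8 (assumed)
```

With the made-up constants `scale_int=256, offset=48, max_gate=96, sar_total=15`, an input of 100 saturates the gate and yields 75, while any input at or below −48 yields 0 because the gate clamps to zero.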

6.5× faster than scalar SiLU. Saves 131ms across 45 morphed activations.

K2: TransposePIE (`dl_esp32p4_s8_transpose.S`)

Contains two kernels, selected at runtime by a 6-step dispatch algorithm in dl_module_transpose_pie.hpp. The K1/K2 labels below are local to this file and are distinct from the K1–K4 section numbering:

Byte-zip kernel (K1): 3-stage vzip.8 → vzip.16 → vzip.32 butterfly pattern. Transposes an 8×16 tile using only PIE register shuffles, with zero memory scratch space. 14.6× average on small attention tensors.
Block-copy kernel (K2): For large tensors exceeding register capacity. 16-byte aligned vld/vst block moves. 30× average, up to 917× peak on large (80×1024) transposes.

The dispatch algorithm auto-selects the optimal kernel based on tensor dimensions and alignment. 89× average speedup across all transpose ops. Saves 303ms.
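The byte-zip butterfly can be modelled in plain Python to show why three interleave stages are enough to transpose an 8×16 byte tile. This is a data-movement sketch only: it ignores the 128-bit register boundaries and the actual operand order of the PIE `vzip` instructions, and the helper names are hypothetical.

```python
def zip_units(a, b, width):
    """Interleave two sequences in units of `width` elements."""
    out = []
    for i in range(0, len(a), width):
        out.extend(a[i:i + width])
        out.extend(b[i:i + width])
    return out

def transpose_8x16(rows):
    """Transpose an 8x16 tile with three zip stages (widths 1, 2, 4)."""
    if len(rows) != 8 or any(len(r) != 16 for r in rows):
        raise ValueError("expected 8 rows of 16 elements")
    # Stage 1 (byte zip): pair rows (0,1), (2,3), (4,5), (6,7).
    s1 = [zip_units(rows[2 * k], rows[2 * k + 1], 1) for k in range(4)]
    # Stage 2 (16-bit zip): merge pairs into groups of 4 rows per column.
    s2 = [zip_units(s1[2 * k], s1[2 * k + 1], 2) for k in range(2)]
    # Stage 3 (32-bit zip): merge into groups of 8 rows per column.
    s3 = zip_units(s2[0], s2[1], 4)
    # Each run of 8 elements is now one column of the input tile.
    return [s3[8 * c:8 * c + 8] for c in range(16)]
```

After stage 3 the 128 bytes are laid out column by column, so the 16 output rows of 8 bytes can be stored directly without a scratch buffer.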

K3: TiledConvBlock (`dl_module_tiled_conv_block.hpp`)

Fuses Conv + Activation into a single tiled operation that exploits the ESP32-P4's 768KB L2 cache (SRAM), with two independent tiling dimensions:

Spatial H-tiling: Splits the output height into tiles sized to fit in L2. Each tile's Conv output stays in cache for the fused activation, with no PSRAM round-trip between Conv and Act. Saves 326ms across Conv 3×3 and 5×5 layers.
Output channel tiling (c_tile): For layers where the full filter working set (kH × kW × C_in × C_out) exceeds the L2 budget, splits the convolution into chunks of c_tile output channels. Each chunk's filter slice fits in cache, and the output channels are processed sequentially. c_tile is aligned down to vector width (16 for INT8, 8 for INT16). Saves 53ms across 1×1 Conv layers with large channel counts.
In-place activation: HardSiLU8 or NN-LUT applied directly on the L2-resident tile after each spatial or channel chunk, with no intermediate buffer.

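The tiling parameters (tile_h, c_tile) are computed at export time from the layer dimensions and the available cache budget of 32KB per tile (given as L2 / 8, which implies a 256KB L2 cache configuration rather than the full 768KB of L2 memory).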
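A minimal sketch of how such tiling parameters could be derived from a byte budget is shown below. The sizing formulas are assumptions for illustration, not the exporter's actual logic; only the vector-width alignment (16 for INT8, 8 for INT16) and the convention that `c_tile = 0` disables channel tiling come from this PR.

```python
def align_down(value, multiple):
    """Round value down to the nearest multiple."""
    return (value // multiple) * multiple

def choose_tile_h(out_h, out_w, out_c, elem_bytes, budget_bytes):
    """Largest number of output rows whose tile fits in the budget (at least 1)."""
    row_bytes = out_w * out_c * elem_bytes
    return max(1, min(out_h, budget_bytes // row_bytes))

def choose_c_tile(k_h, k_w, c_in, c_out, elem_bytes, budget_bytes):
    """Output-channel chunk size, or 0 when the whole filter already fits."""
    vec = 16 if elem_bytes == 1 else 8           # INT8 vs INT16 vector width
    per_channel = k_h * k_w * c_in * elem_bytes  # filter bytes per output channel
    if per_channel * c_out <= budget_bytes:
        return 0                                 # channel tiling disabled
    return max(vec, align_down(budget_bytes // per_channel, vec))
```

For a hypothetical INT8 1×1 Conv with 256 input and 256 output channels and a 32KB budget, the filter needs 64KB, so this sketch picks `c_tile = 128` and the layer runs as two channel chunks.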

13.3× on model.7/conv (the largest DW 5×5 layer: 305ms → 23ms). Saves 379ms total (326ms spatial + 53ms channel).

K4: NN-LUT INT16 (`dl_esp32p4_s16_lut_nearest_neighbor.S`)

Replaces the linear-interpolation LUT from PR #286 with a nearest-neighbor lookup. Since the step size is a power of 2 (step=32), the SIMD index computation uses XOR sign-flip + vmul.u16 with hardware SAR:

esp.xorq     q0, q0, q7    // signed → unsigned: XOR with 0x8000 (+32768)
esp.vmul.u16 q0, q0, q5    // q0 = (q0 × 1) >> SAR  (HALF_EVEN rounding)

The vmul.u16 multiplies by broadcast(1) with SAR = log2(step), giving HALF_EVEN (banker's) rounding to the nearest table slot, which is more accurate than plain truncation. Each element is then extracted, table-looked-up via scalar lh, and re-inserted using 4-way interleaved extract→address→load→insert waves for instruction-level parallelism.

This eliminates the per-element modulo, multiply, and divide of the original linear interpolation. The table remains 2,049 entries (same as PR #286); only the lookup strategy changes.

4.2× faster than interpolated LUT. Saves 156ms across 21 INT16 Swish activations.
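The index computation can be emulated in a few lines of Python, which also shows where the 2,049-entry table size comes from. This sketch assumes the rounding is exactly round-half-to-even on the shifted-out bits, as described above; the function names are hypothetical.

```python
STEP_SHIFT = 5                            # log2(step), step = 32
TABLE_SIZE = (65536 >> STEP_SHIFT) + 1    # 2048 slots plus one for the top edge

def nn_index(x_int16):
    """Nearest table slot for a signed 16-bit input, ties to even."""
    u = (x_int16 & 0xFFFF) ^ 0x8000       # signed -> unsigned (adds 32768)
    q = u >> STEP_SHIFT
    rem = u & ((1 << STEP_SHIFT) - 1)
    half = 1 << (STEP_SHIFT - 1)
    if rem > half or (rem == half and (q & 1)):
        q += 1                            # round up; exact ties go to the even slot
    return q

def lut_lookup(table, x_int16):
    """Direct nearest-neighbor lookup, no interpolation."""
    return table[nn_index(x_int16)]
```

The largest input, 32767, maps to unsigned 65535 and rounds up to slot 2048, which is why the table needs one entry beyond 65536 / 32 = 2048.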

Layer 3: Python Quantization Pipeline

Custom Ops Registration (`custom_ops_patch.py`)

Registers 3 custom operation types in esp-ppq so the graph correctly represents the P4-optimized operators:

Op Type	Forward Emulator	Backward	What It Does
`HardSiluPie8`	INT16 piecewise-linear with learned scale	STE	Bit-exact emulation of `dl_esp32p4_s8_hard_silu8.S` arithmetic
`TransposePIE`	Identity (auto-renames INT8 Transpose ops)	Pass-through	Ensures correct op type in exported `.espdl`
`TiledConvBlock`	None (export-only graph surgery)		Fuses Conv+Act pairs into single node at export time

The HardSiluPie8 emulator is a torch.autograd.Function that replicates the exact integer arithmetic of the PIE kernel (clamp, multiply, SAR shift, scale multiplication) with STE backward for TQT compatibility.

NN-LUT Emulator (`emulator_nearest_neighbor.py`)

Extends esp_ppq_lut with a nearest-neighbor simulation mode. When custom_ops_patch.py is imported, it replaces the linear-interpolation emulator with the NN variant matching the dl_esp32p4_s16_lut_nearest_neighbor.S rounding behavior.

P4 Notebook (`quantize_yolo26_coco_p4.ipynb`)

The notebook loads the T1-morphed graph (morphed_hsilu.native) and applies the P4-specific pipeline:

Load morphed_hsilu.native → Apply INT16 layers → Calibrate (percentile)
→ TQT → custom_ops_patch (register HardSiluPie8/TransposePIE/TiledConvBlock)
→ LUT fusion (NN mode) → Graph surgery (Box/Class split) → Export .espdl

Bit-Exact Validation

The optimized model produces identical detections between the Python notebook and the ESP32-P4 hardware:

Python Notebook                           ESP32-P4 Firmware (Raw RGB mode)
─────────────────────────────────         ──────────────────────────────────
person   conf=0.85  [86,186,177,429]      person   conf=0.85  [86,186,177,429]
bus      conf=0.83  [68,109,448,349]      bus      conf=0.83  [68,109,448,349]
person   conf=0.76  [169,193,229,406]     person   conf=0.76  [169,193,229,406]
person   conf=0.49  [380,186,449,416]     person   conf=0.49  [380,186,449,416]
person   conf=0.47  [62,262,97,415]       person   conf=0.47  [62,262,97,415]

Result: 5/5 detections, 100% match (class, confidence, bounding box)

This parity requires a 6-link chain; breaking any single link introduces mismatches:

#	Component	What It Eliminates	Location
1	Raw RGB input bypass	JPEG decoder drift (even 1-bit amplifies through 100+ quantized layers)	`generate_raw_rgb.py` + `USE_RAW_RGB` CMake define
2	`espdl_preprocess()`	OpenCV resize drift; replicates ESP-DL's C++ `resize_nn` coordinate truncation: `m_x[i] = int(i * inv_scale_x)`	`notebook_helpers.py`
3	HardSiluPie8 emulator	Wrong activation; a `torch.autograd.Function` replicating the exact integer arithmetic (clamp → multiply → SAR shift → scale) with STE backward	`custom_ops_patch.py`
4	NN-LUT emulator	Wrong rounding; simulates `vmul.u16` HALF_EVEN rounding (not C-truncation, not Python floor-division)	`emulator_nearest_neighbor.py`
5	FP64 Conv toggle	float32 mantissa drift; ESP32-P4 uses 64-bit integer accumulators for INT16 Conv, while PyTorch float32 has only a 24-bit mantissa and truncates large accumulations in the INT16 detection head	`enable_fp64_conv()` in `notebook_helpers.py`
6	SIMULATION mode	Wrong LUT mode; forces hardware-faithful NN-LUT (rounded index → direct lookup) instead of ideal-math float Swish	`set_simulation_mode(SimulationMode.SIMULATION)`

Raw RGB → espdl_preprocess → FP64 Conv → HardSiluPie8 emulator → NN-LUT emulator → SIMULATION mode
   ↓            ↓                ↓               ↓                      ↓                ↓
JPEG drift  OpenCV drift   float32 drift   wrong activation      wrong rounding    wrong LUT mode
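Two of these links are easy to demonstrate in isolation. The sketch below builds the truncating nearest-neighbor index map from link 2 and shows the float32 precision limit behind link 5. It assumes `inv_scale_x` is the source width divided by the destination width, and it uses Python doubles where the C++ code may use float32; the helper names are hypothetical.

```python
import struct

def resize_nn_index_map(src_len, dst_len):
    """Source index for each destination pixel: m_x[i] = int(i * inv_scale_x)."""
    inv_scale = src_len / dst_len        # assumed definition of inv_scale_x
    return [int(i * inv_scale) for i in range(dst_len)]

def to_float32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]
```

Upscaling 4 pixels to 8 gives the map `[0, 0, 1, 1, 2, 2, 3, 3]`, with coordinates truncated rather than rounded. On the precision side, `to_float32(16777217.0)` returns `16777216.0`: integers above 2^24 are no longer exactly representable in float32, so a large integer accumulation emulated in float32 can drift from the hardware result.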

Ecosystem Impact

Reusable PIE Kernels

The 4 kernels contributed in this PR are not YOLO26n-specific; they accelerate any ESP32-P4 model containing matching operators:

Kernel	Benefits Any Model With...
TransposePIE	Attention layers (PSA, MHSA, any `Transpose` op)
TiledConvBlock	Conv + Activation patterns (Conv+SiLU, Conv+ReLU, Conv+HardSiLU)
NN-LUT INT16	INT16 non-linear activations (Swish, Sigmoid, Tanh)
HardSiluPie8	HardSiLU8 activations (after Neural Morphing T1)

Applicability to `esp-detection`

The reusable kernels (TransposePIE, TiledConvBlock, NN-LUT) can benefit models deployed via esp-detection on ESP32-P4, since they accelerate common operator patterns (Conv+Activation, Transpose, INT16 LUT) regardless of the specific model architecture.

Neural Morphing Engine Extensibility

The strategy-pattern architecture makes it straightforward to add new transformations:

New activation replacements: Implement the full BaseReplacementStrategy contract (target selection, replacement building, quality gate, loss function, hyperparameters) to swap any activation for a hardware-friendly alternative.
New decompositions: T2 (Conv decomposition) is already implemented as a reference; future work could target depthwise-separable or grouped convolutions.
New pruning criteria: T3 (channel pruning) demonstrates L1-norm ranking; alternatives like Taylor expansion or gradient-based importance can be plugged in.

The engine handles the orchestration: block iteration, subgraph splicing, calibration data caching, gradient loop execution, rollback via genealogy metadata, and drift compensation. Each strategy controls its own loss function, quality gate, and hyperparameters (see Extensibility section above).

Development Methodology

The PIE SIMD kernels were developed using esp32-p4-jit, a JIT compilation tool that compiles C/ASM on the host PC and executes natively on the ESP32-P4 via USB in 1–2 seconds (vs 30–60s for a full firmware rebuild). This enabled rapid iteration on cycle-level optimizations, which was particularly critical for the 12-instruction HardSiluPie8 kernel and the 3-stage transpose butterfly, where each instruction ordering change required immediate hardware validation.

Deliverables

Layer 1: PIE SIMD Assembly Kernels

Path: dl/base/isa/esp32p4/

File	Status	Description
`dl_esp32p4_s8_hard_silu8.S`	[NEW]	HardSiLU8 kernel (12 PIE instr/16 elements, 6.5×)
`dl_esp32p4_s8_transpose.S`	[MOD]	K1 byte-zip (14.6×) + K2 block-copy (30×) transpose
`dl_esp32p4_s16_lut_nearest_neighbor.S`	[NEW]	NN INT16 LUT (4.2×, replaces linear interpolation)
`dl_esp32p4_block_transpose.S`	[DEL]	K2 consolidated into `dl_esp32p4_s8_transpose.S`
`dl_esp32p4_s8_hard_silu_pie8.S`	[DEL]	Renamed to `dl_esp32p4_s8_hard_silu8.S`
`dl_esp32p4_s16_lut_pie8.S`	[DEL]	Renamed to `dl_esp32p4_s16_lut_nearest_neighbor.S`

Also modified: dl_base_esp32p4.h, which adds the extern "C" declaration for dl_esp32p4_s8_hard_silu8

Layer 2: C++ ESP-DL Module Headers

Path: dl/module/include/

File	Status	Description
`dl_module_hard_silu8.hpp`	[NEW]	Runtime wrapper; deserializes `scale_int` from `.espdl`, precomputes constants
`dl_module_transpose_pie.hpp`	[NEW]	6-step dispatch algorithm, auto-selects K1/K2/scalar per tensor
`dl_module_tiled_conv_block.hpp`	[NEW]	H-tiling + fused activation (HardSiLU8 or NN-LUT in-place on L2)
`dl_module_lut.hpp`	[MOD]	Added NN-LUT SIMD dispatch (power-of-2 step → SIMD path)
`dl_module_creator.hpp`	[MOD]	Registers `HardSiluPie8`, `TransposePIE`, `TiledConvBlock` deserializers
`dl_module_hard_silu_pie8.hpp`	[DEL]	Renamed to `dl_module_hard_silu8.hpp`
`dl_module_tiled_conv_block copy.hpp`	[DEL]	Stale copy removed

Also modified: CMakeLists.txt, which adds the new .S files to the build

Layer 3: Python Quantization Pipeline

Path: examples/tutorial/how_to_quantize_model/quantize_yolo26/

File	Status	Description
`quantize_yolo26_coco_p4.ipynb`	[NEW]	P4-optimized notebook (loads T1 graph, exports TPIE model)
`scripts/custom_ops_patch.py`	[NEW]	Registers HardSiluPie8, TransposePIE, TiledConvBlock in esp-ppq
`scripts/notebook_helpers.py`	[MOD]	Added FP64 Conv toggle, `espdl_preprocess`, `eval_espdl_model`
`esp_ppq_lut/emulator_nearest_neighbor.py`	[NEW]	NN-LUT emulator (replaces linear interpolation emulator)
`esp_ppq_lut/__init__.py`	[MOD]	Registers `SimulationMode` switch
`esp_ppq_lut/exporter.py`	[MOD]	Exports NN-LUT step attribute

Layer 4: Neural Morphing Engine

Path: examples/tutorial/how_to_quantize_model/quantize_yolo26/

File/Directory	Status	Description
`neural_morphing/__init__.py`	[NEW]	Exports engine + interface
`neural_morphing/engine.py`	[NEW]	5-phase pipeline: calibrate → transform → distill → evaluate → decide
`neural_morphing/interface.py`	[NEW]	`BaseReplacementStrategy` ABC (strategy pattern)
`neural_morphing/README.md`	[NEW]	Full documentation
`integrations/yolo26n/Transformation1_SiLU_to_HardSiLU/`	[NEW]	T1 strategy + output checkpoint
`integrations/yolo26n/Transformation2_Conv_to_DWPW/`	[NEW]	T2 strategy (ablation)
`integrations/yolo26n/Transformation3_Conv1x1_Prune/`	[NEW]	T3 strategy (ablation)

Layer 5: Model + Firmware Updates

Path: models/yolo26/ and examples/yolo26_detect/

File	Status	Description
`models/yolo26/models/p4/yolo26n_512_s8_p4_tpie.espdl`	[NEW]	Optimized model (2.93 MB, 1,088ms)
`models/yolo26/README.md`	[MOD]	Added TPIE benchmark row
`examples/yolo26_detect/README.md`	[MOD]	Added P4-optimized model entry
`examples/yolo26_detect/main/CMakeLists.txt`	[MOD]	Added `USE_RAW_RGB`, TPIE model selection
`examples/yolo26_detect/main/app_main.cpp`	[MOD]	Raw RGB input path for bit-exact validation
`examples/yolo26_detect/main/images/generate_raw_rgb.py`	[NEW]	Generates `raw_rgb_bus.h` from `bus.jpg`
`examples/yolo26_detect/main/images/raw_rgb_bus.h`	[NEW]	Pre-baked RGB pixels (bypasses JPEG decoder)

Verification Checklist

Bit-exact validation: 5/5 detections match between Python notebook and ESP32-P4 firmware (raw RGB input mode)
mAP validation: 0.361 mAP50-95 on COCO val2017 (5,000 images) at 512×512
Latency measurement: 1,088ms inference on ESP32-P4 (esp_timer_get_time(), microsecond precision)
Per-kernel profiling: Each kernel's speedup independently measured and attributed
Regression check: PR #286 ("feat: Add YOLO26n support for ESP32-P4/S3 (NMS-Free) (AIV-805)") baseline model (yolo26n_512_s8_p4.espdl) still produces identical results; no breaking changes
Build verification: idf.py build succeeds with all new .S and .hpp files for ESP32-P4
Custom ops export: custom_ops_patch.py correctly emits HardSiluPie8, TransposePIE, TiledConvBlock ops in .espdl
Neural Morphing reproducibility: T1 strategy produces consistent 45/66 acceptance rate across runs
Documentation: All READMEs updated with TPIE benchmarks and usage instructions

Known Issue: NN-LUT Auto-Dispatch in Shared Module

⚠️ The NN-LUT SIMD path added to dl_module_lut.hpp (lines 76-88) auto-dispatches based on runtime conditions (power-of-2 step, alignment, size % 8 == 0). This silently replaces linear interpolation with nearest-neighbor lookup for any model meeting these conditions, including existing models trained with linear interpolation.

For the TPIE model (yolo26n_512_s8_p4_tpie.espdl) this is correct: the Python emulator was trained with NN rounding. For existing models (e.g., PR #286 baseline), this changes numerical output.

Proposed solutions (open for discussion):

Model attribute gate: Add a lut_mode field to .espdl format. Only dispatch to NN-SIMD when lut_mode == NEAREST_NEIGHBOR. Requires format change.
Separate module: Keep dl_module_lut.hpp unchanged. Create dl_module_lut_nn.hpp used only by TPIE models via dl_module_creator.hpp dispatch.
Compile-time flag: Gate behind #define DL_LUT_ENABLE_NN_SIMD.
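To make the first proposal concrete, the sketch below expresses the gated dispatch decision as plain Python logic. The `LutMode` enum, the `lut_mode` field, and the returned path names are hypothetical; the runtime conditions are the ones listed in the warning above.

```python
from enum import Enum

class LutMode(Enum):
    LINEAR_INTERP = 0
    NEAREST_NEIGHBOR = 1

def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0

def select_lut_path(lut_mode, step, size, aligned):
    """Pick the NN-SIMD path only when the model explicitly asks for it."""
    simd_ok = is_power_of_two(step) and aligned and size % 8 == 0
    if lut_mode is LutMode.NEAREST_NEIGHBOR and simd_ok:
        return "nn_simd"
    return "linear_interp"   # existing path, unchanged for older models
```

Under this gate, a model exported without the attribute (treated here as `LINEAR_INTERP`) keeps its current numerical output even when its LUT happens to satisfy the SIMD conditions.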

Reproducibility: How to Run the Full Pipeline

Step 1: Neural Morphing T1

cd examples/tutorial/how_to_quantize_model/quantize_yolo26/integrations/yolo26n/Transformation1_SiLU_to_HardSiLU/
python run_transform.py
# Output: output/T1_silu_hsilu_512_p4/morphed_hsilu.native

Step 2: P4-Optimized Quantization

Open quantize_yolo26_coco_p4.ipynb and run all cells sequentially:

Cells 1–3: Load morphed_hsilu.native, configure INT16 layers, calibrate (percentile)
Cell 4: TQT optimization
Cell 5: import custom_ops_patch (registers HardSiluPie8, TransposePIE, TiledConvBlock)
Cells 6–8: LUT fusion (NN mode), graph surgery (Box/Class split)
Cell 9: Export → yolo26n_512_s8_p4_tpie.espdl
Cell 10: mAP evaluation (uses ideal math, reports 0.361)
Cell 12: Bit-exact validation (eval_espdl_model() with SIMULATION mode + FP64 Conv)

Step 3: Flash and Verify on ESP32-P4

cd examples/yolo26_detect
# Set model in main/CMakeLists.txt:
#   set(MODEL_FILENAME "yolo26n_512_s8_p4_tpie.espdl")
#   set(USE_RAW_RGB ON)   # for bit-exact validation
idf.py set-target esp32p4
idf.py build flash monitor

Expected output: 5 detections matching the notebook exactly (see Bit-Exact Validation section).

Switching Between Models

CMake setting	Model	Expected latency
`yolo26n_512_s8_p4.espdl`	PR #286 baseline	2,072ms
`yolo26n_512_s8_p4_tpie.espdl`	This PR (optimized)	1,088ms

…tion - Add m_c_tile member and c_tile attribute (from PPQ export) - Inner channel tile loop slices filter via pointer arithmetic using (N/16)HWC16 SIMD-interleaved layout - Strided output writes into shared tile_buf via output_x/y_offset override - Per-channel quantization factors correctly offset by c_start - Backward compatible: c_tile=0 (default) disables channel tiling Result: 1482ms -> 1156ms inference (22% speedup, bit-exact output)

BoumedineBillal · 2026-06-12T01:18:53Z

Note: The results in this PR were obtained with ESP-PPQ < 1.2.10 (per-tensor quantization for Conv/Gemm). The recent per-channel quantization upgrade ([2026/4/20] ESP-PPQ ≥ 1.2.10, ESP-DL ≥ 3.3.1) is not reflected in these benchmarks. A follow-up PR could evaluate the impact of per-channel quantization on mAP and latency for both the baseline and TPIE models.

BoumedineBillal added 7 commits May 17, 2026 07:18

TiledConvBlock v1: per-forward SRAM alloc, exponent fix, NHWC tile_h,…

ff71bc2

… debug flag

TiledConvBlock: remove tile_buf, write directly to L2-cached PSRAM ou…

6602020

…tput, Espressif code style

TiledConvBlock: fix INT16 bias pointer for channel tiling (int32_t* -…

56398f2

…> int64_t*)

HardSiluPie8 module + PIE8 NN LUT kernel: bit-exact validated at 1096ms

f5c7f10

HardSiluPie8 module + PIE8 NN LUT kernel: bit-exact validated at 1096ms

8be6842

feat: ESP32-P4 PIE SIMD optimizations for YOLO26n (2067ms -> 1088ms, …

400dba5

…1.90x speedup)

BoumedineBillal wants to merge 7 commits into espressif:master from BoumedineBillal:feature/yolo26-p4-optimizations
