Feature/yolo26 p4 optimizations #310
Open
BoumedineBillal wants to merge 7 commits into
…tion
- Add m_c_tile member and c_tile attribute (from PPQ export)
- Inner channel tile loop slices filter via pointer arithmetic using (N/16)HWC16 SIMD-interleaved layout
- Strided output writes into shared tile_buf via output_x/y_offset override
- Per-channel quantization factors correctly offset by c_start
- Backward compatible: c_tile=0 (default) disables channel tiling
Result: 1482ms -> 1156ms inference (22% speedup, bit-exact output)
…tput, Espressif code style
feat: ESP32-P4 PIE SIMD kernels + Neural Morphing for YOLO26n: 1,088ms @ 512×512, 0.361 mAP (1.90× vs #286 baseline)
Executive Summary
This PR adds 4 custom PIE SIMD assembly kernels and a Neural Morphing Engine for ESP32-P4, reducing YOLO26n inference from 2,072ms → 1,088ms (1.90× speedup) at 512×512 resolution while maintaining 0.361 mAP50-95 (−0.004 from baseline). This builds on the baseline YOLO26n deployment merged in #286.
These optimizations are not YOLO26n-specific. The PIE kernels accelerate any model containing matching operator patterns (Conv+Activation, Transpose, INT16 LUT, SiLU/HardSiLU), and the Neural Morphing Engine provides a reusable strategy framework for automated graph transformation with quality-gated distillation applicable to any model deployed via esp-ppq.
Performance Breakdown
Every millisecond saved is attributed to a specific kernel, with no overlap:
- HardSiluPie8: saves 131ms
- TransposePIE (`vzip` butterfly in PIE registers): saves 303ms
- TiledConvBlock: saves 379ms
- NN-LUT INT16 (`vmul.u16` HALF_EVEN rounding, 4-way interleaved): saves 156ms

TiledConvBlock detail: 326ms from spatial H-tiling (Conv 3×3, 5×5) + 53ms from output channel tiling (`c_tile`) on 1×1 Conv layers. The largest single-layer improvement is `model.7/conv` (DW 5×5): 305ms → 23ms (13.3×).

Architecture: Three Optimization Layers
The optimizations are organized into three independent layers, each contributing to the final result:
Operator-level graph transformation (Netron visualization: baseline left, optimized right):
Three operator replacements: Conv+Swish → TiledConvBlock (fused with L2-cache tiling), Swish → HardSiluPie8 (via Neural Morphing T1), Transpose → TransposePIE (PIE SIMD dispatch).
Layer 1: Neural Morphing Engine
A general-purpose framework for automated, quality-gated graph transformations on PPQ `.native` graphs. The engine implements a strategy pattern: each transformation is a self-contained `BaseReplacementStrategy` that the engine orchestrates through a 5-phase pipeline:
A. Select targets → B. Build replacements → C. Block-wise distillation → D. Quality gate → E. Accept/Rollback
The core technique, block-wise knowledge distillation, is inspired by esp-ppq's TQT implementation, which combines TQT scale learning (per-tensor, symmetric, power-of-two constraints) with the block-wise reconstruction strategy from BRECQ (Li et al., ICLR 2021). The Neural Morphing Engine extends this approach: instead of only optimizing quantization scales, it uses block-wise distillation to optimize architectural replacements (e.g., SiLU → HardSiLU8 with learned scale). The reconstruction objective is strategy-defined: T1 uses Huber+Cosine loss (robust to activation outliers), while the engine itself is loss-agnostic.
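To make the Huber+Cosine objective concrete, here is a minimal pure-Python sketch of such a combined loss over two flat activation vectors. The function names, the `delta` threshold, and the `cos_weight` mixing factor are illustrative assumptions; the PR does not state how T1 weights the two terms.

```python
import math


def huber(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


def cosine_distance(pred, target, eps=1e-12):
    """1 - cosine similarity; 0 when the vectors point the same way."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / max(norm_p * norm_t, eps)


def huber_cosine_loss(pred, target, delta=1.0, cos_weight=1.0):
    # ASSUMPTION: a plain weighted sum; the actual T1 weighting is not given.
    return huber(pred, target, delta) + cos_weight * cosine_distance(pred, target)
```

The Huber term grows only linearly for large errors, which is what makes it tolerant of activation outliers, while the cosine term penalizes a change in direction of the block output independent of its magnitude.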
T1: SiLU → HardSiLU8 (Deployed)
Replaces standard SiLU activations with `HardSiLU8`, a piecewise-linear approximation that maps directly to the `HardSiluPie8` PIE kernel. Each replacement includes a learnable scale factor (clamp-quantized to `scale_int ∈ [0, 256]` via `clamp(0,1) × 256 → round`) trained via block-wise Huber+Cosine distillation.
T2 & T3 (Ablation Only, Not Deployed)
Two additional strategies were evaluated but not included in the deployed model:
These strategies remain in the repository as documented ablation studies and starting points for future work.
Extensibility
New strategies are added by implementing `BaseReplacementStrategy`, which defines the full distillation contract:
- `select_target()`
- `build_replacement()`
- `evaluate_validation()`
- `get_criterion()`
- `get_scheduler()`
- `calculate_samples()`
- `learning_rate`, `weight_decay`, `steps`, `patience`

Optional hooks: `on_step_end()` (per-step constraints), `requires_predecessor` (include predecessor Conv in distillation block for activation strategies), `compensate_on_reject` (fine-tune original block after rejection to correct upstream drift).

The engine handles the orchestration: block iteration, subgraph splicing, calibration data caching, gradient loop execution, rollback on rejection, and genealogy metadata tracking.
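The contract and the A-E orchestration can be pictured with a trimmed-down sketch. The method and hyperparameter names are taken from the list above, but their signatures, the placeholder values, the toy `SiluToHardSilu` strategy, and the `run_engine` helper are all hypothetical; the real engine works on PPQ graphs rather than a list of op names, and the distillation phase is stubbed out here.

```python
from abc import ABC, abstractmethod


class BaseReplacementStrategy(ABC):
    """Hypothetical, reduced contract; the real signatures may differ."""

    # Hyperparameter names come from the PR; the values are placeholders.
    learning_rate = 1e-3
    weight_decay = 0.0
    steps = 100
    patience = 10

    @abstractmethod
    def select_target(self, op):
        """Phase A: return True if this op should be replaced."""

    @abstractmethod
    def build_replacement(self, op):
        """Phase B: return the candidate replacement op."""

    @abstractmethod
    def evaluate_validation(self, original, candidate):
        """Phase D: return True if the candidate passes the quality gate."""


class SiluToHardSilu(BaseReplacementStrategy):
    """Toy strategy: swaps op names and accepts at most `budget` swaps."""

    def __init__(self, budget):
        self.budget = budget

    def select_target(self, op):
        return op == "SiLU"

    def build_replacement(self, op):
        return "HardSiLU8"

    def evaluate_validation(self, original, candidate):
        if self.budget > 0:
            self.budget -= 1
            return True
        return False


def run_engine(graph, strategy, distill=lambda graph, index, strategy: None):
    """Runs phases A-E over a list of op names; returns accepted indices."""
    accepted = []
    for index, op in enumerate(graph):
        if not strategy.select_target(op):                  # A. select
            continue
        graph[index] = strategy.build_replacement(op)       # B. build
        distill(graph, index, strategy)                     # C. distill (stub)
        if strategy.evaluate_validation(op, graph[index]):  # D. quality gate
            accepted.append(index)                          # E. accept
        else:
            graph[index] = op                               # E. rollback
    return accepted
```

Keeping the quality gate inside the strategy while the rollback stays in the engine mirrors the split described above: the strategy decides what counts as acceptable, and the engine guarantees the graph is restored when it is not.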
Layer 2: PIE SIMD Kernels
K1: HardSiluPie8 (`dl_esp32p4_s8_hard_silu8.S`)
12 computation PIE instructions (16 total including load, store, and 2× SAR setup) process 16 INT8 elements per iteration. The kernel computes:
All arithmetic stays in INT16 (sign-extended from INT8). The `scale_int` parameter is deserialized from the `.espdl` model; it encodes the per-layer learned scale factor from T1 distillation.
6.5× faster than scalar SiLU. Saves 131ms across 45 morphed activations.
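As an illustration of how a learned floating-point scale becomes the integer `scale_int`, the sketch below applies the `clamp(0,1) × 256 → round` rule stated for T1. The helper names are hypothetical, and `apply_scale` is only an assumption about how a fixed-point scale with 8 fractional bits could be applied; it is not taken from the kernel source.

```python
def quantize_scale(scale):
    """Clamp a learned scale to [0, 1] and map it to an integer in [0, 256]."""
    clamped = min(max(scale, 0.0), 1.0)
    # Python's round() uses round-half-to-even; the PR only says "round".
    return int(round(clamped * 256))


def apply_scale(value, scale_int):
    """ASSUMPTION: multiply, then arithmetic right shift by 8 (divide by 256)."""
    return (value * scale_int) >> 8
```

With this encoding, `scale_int = 256` represents a scale of exactly 1.0 and `scale_int = 128` represents 0.5, so the learned factor can be applied with one integer multiply and one shift.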
K2: TransposePIE (`dl_esp32p4_s8_transpose.S`)
Contains two kernels selected at runtime by a 6-step dispatch algorithm in `dl_module_transpose_pie.hpp`:
- Butterfly kernel: `vzip.8` → `vzip.16` → `vzip.32` butterfly pattern. Transposes an 8×16 tile using only PIE register shuffles, with zero memory scratch space. 14.6× average on small attention tensors.
- Block-move kernel: `vld`/`vst` block moves. 30× average, up to 917× peak on large (80×1024) transposes.

The dispatch algorithm auto-selects the optimal kernel based on tensor dimensions and alignment. 89× average speedup across all transpose ops. Saves 303ms.
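The idea that a few interleave (zip) stages can transpose a tile without scratch memory can be checked with a small pure-Python model. This is a conceptual sketch only: it treats the tile as one flat list and applies a perfect shuffle per stage, and it does not model the actual `vzip.8`/`vzip.16`/`vzip.32` semantics or the PIE register layout. The function names are hypothetical.

```python
def perfect_shuffle(flat):
    """Interleave the first and second halves: a0 b0 a1 b1 ..."""
    half = len(flat) // 2
    out = [None] * len(flat)
    out[0::2] = flat[:half]
    out[1::2] = flat[half:]
    return out


def zip_transpose(tile):
    """Transpose a rows x cols tile (both powers of two) using only shuffles."""
    rows, cols = len(tile), len(tile[0])
    assert rows & (rows - 1) == 0 and cols & (cols - 1) == 0
    flat = [value for row in tile for value in row]
    # Each shuffle rotates the flat index left by one bit, so log2(rows)
    # stages move the row bits below the column bits: index r*cols + c
    # ends up at c*rows + r, which is the transposed position.
    for _ in range(rows.bit_length() - 1):
        flat = perfect_shuffle(flat)
    return [flat[c * rows:(c + 1) * rows] for c in range(cols)]
```

For an 8×16 tile this takes log2(8) = 3 stages, the same number of stages as the butterfly described above.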
K3: TiledConvBlock (`dl_module_tiled_conv_block.hpp`)
Fuses Conv + Activation into a single tiled operation that exploits the ESP32-P4's 768KB L2 cache (SRAM), with two independent tiling dimensions:
- Spatial H-tiling (`tile_h`): Saves 326ms on Conv 3×3 and 5×5 layers.
- Output channel tiling (`c_tile`): For layers where the full filter working set (kH × kW × C_in × C_out) exceeds the L2 budget, splits the convolution into chunks of `c_tile` output channels. Each chunk's filter slice fits in cache, and the output channels are processed sequentially. `c_tile` is aligned down to vector width (16 for INT8, 8 for INT16). Saves 53ms across 1×1 Conv layers with large channel counts.

The tiling parameters (`tile_h`, `c_tile`) are computed at export time from the layer dimensions and available cache budget (L2 / 8 = 32KB per tile).
13.3× on `model.7/conv` (the largest DW 5×5 layer: 305ms → 23ms). Saves 379ms total (326ms spatial + 53ms channel).
K4: NN-LUT INT16 (`dl_esp32p4_s16_lut_nearest_neighbor.S`)
Replaces the linear-interpolation LUT from PR #286 with a nearest-neighbor lookup. Since the step size is a power of 2 (`step=32`), the SIMD index computation uses XOR sign-flip + `vmul.u16` with hardware SAR:
The `vmul.u16` multiplies by `broadcast(1)` with `SAR = log2(step)`, giving HALF_EVEN (banker's) rounding to the nearest table slot, which is more accurate than plain truncation. Each element is then extracted, table-looked-up via scalar `lh`, and re-inserted using 4-way interleaved extract→address→load→insert waves for instruction-level parallelism.
This eliminates the per-element modulo, multiply, and divide of the original linear interpolation. The table remains 2,049 entries (same as PR #286); only the lookup strategy changes.
4.2× faster than interpolated LUT. Saves 156ms across 21 INT16 Swish activations.
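The index computation described for K4 can be modelled in scalar form. The sketch below follows the description (XOR sign-flip to get an unsigned offset, then a shift by `log2(step)` with round-half-to-even), but it is an assumption-based illustration, not a transcription of the assembly; `nn_lut_index` and `nn_lut_lookup` are hypothetical names.

```python
STEP = 32                      # table step from the PR (power of two)
SHIFT = STEP.bit_length() - 1  # log2(step) = 5
TABLE_SIZE = 2049              # 65536 / 32 + 1 entries


def nn_lut_index(x):
    """Map a signed INT16 value to the nearest table slot (ties to even)."""
    unsigned = (x & 0xFFFF) ^ 0x8000      # sign flip: -32768 -> 0, 32767 -> 65535
    slot = unsigned >> SHIFT              # truncated slot
    remainder = unsigned & (STEP - 1)     # distance past that slot
    half = STEP // 2
    if remainder > half or (remainder == half and slot & 1):
        slot += 1                         # round up; exact ties go to the even slot
    return slot


def nn_lut_lookup(x, table):
    """Nearest-neighbor lookup: one index computation, one table read."""
    assert len(table) == TABLE_SIZE
    return table[nn_lut_index(x)]
```

Because the largest input (32767) rounds up to slot 2048, the table needs 2,049 entries rather than 2,048, which matches the table size quoted above.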
Layer 3: Python Quantization Pipeline
Custom Ops Registration (`custom_ops_patch.py`)
Registers 3 custom operation types in esp-ppq so the graph correctly represents the P4-optimized operators:
- `HardSiluPie8`: `dl_esp32p4_s8_hard_silu8.S` arithmetic
- `TransposePIE`: `.espdl`
- `TiledConvBlock`

The `HardSiluPie8` emulator is a `torch.autograd.Function` that replicates the exact integer arithmetic of the PIE kernel (clamp, multiply, SAR shift, scale multiplication) with STE backward for TQT compatibility.
NN-LUT Emulator (`emulator_nearest_neighbor.py`)
Extends `esp_ppq_lut` with a nearest-neighbor simulation mode. When `custom_ops_patch.py` is imported, it replaces the linear-interpolation emulator with the NN variant matching the `dl_esp32p4_s16_lut_nearest_neighbor.S` rounding behavior.
P4 Notebook (`quantize_yolo26_coco_p4.ipynb`)
The notebook loads the T1-morphed graph (`morphed_hsilu.native`) and applies the P4-specific pipeline:
Bit-Exact Validation
The optimized model produces identical detections between the Python notebook and the ESP32-P4 hardware:
This parity requires a 6-link chain; breaking any single link introduces mismatches:
1. `generate_raw_rgb.py` + `USE_RAW_RGB` CMake define
2. `espdl_preprocess()`: `resize_nn` coordinate truncation, `m_x[i] = int(i * inv_scale_x)` (`notebook_helpers.py`)
3. `torch.autograd.Function` replicating exact integer arithmetic (clamp → multiply → SAR shift → scale) with STE backward (`custom_ops_patch.py`)
4. `vmul.u16` HALF_EVEN rounding (not C-truncation, not Python floor-division) (`emulator_nearest_neighbor.py`)
5. `enable_fp64_conv()` in `notebook_helpers.py`
6. `set_simulation_mode(SimulationMode.SIMULATION)`

Ecosystem Impact
Reusable PIE Kernels
The 4 kernels contributed in this PR are not YOLO26n-specific; they accelerate any ESP32-P4 model containing matching operators.
Applicability to `esp-detection`
The reusable kernels (TransposePIE, TiledConvBlock, NN-LUT) can benefit models deployed via esp-detection on ESP32-P4, since they accelerate common operator patterns (Conv+Activation, Transpose, INT16 LUT) regardless of the specific model architecture.
Neural Morphing Engine Extensibility
The strategy-pattern architecture makes it straightforward to add new transformations:
- Implement the `BaseReplacementStrategy` contract (target selection, replacement building, quality gate, loss function, hyperparameters) to swap any activation for a hardware-friendly alternative.

The engine handles the orchestration: block iteration, subgraph splicing, calibration data caching, gradient loop execution, rollback via genealogy metadata, and drift compensation. Each strategy controls its own loss function, quality gate, and hyperparameters (see Extensibility section above).
Development Methodology
The PIE SIMD kernels were developed using `esp32-p4-jit`, a JIT compilation tool that compiles C/ASM on the host PC and executes natively on the ESP32-P4 via USB in 1–2 seconds (vs 30–60s for full firmware rebuild). This enabled rapid iteration on cycle-level optimizations, which was particularly critical for the 12-instruction `HardSiluPie8` kernel and the 3-stage transpose butterfly, where each instruction ordering change required immediate hardware validation.
Deliverables
Layer 1: PIE SIMD Assembly Kernels
Path: `dl/base/isa/esp32p4/`
- `dl_esp32p4_s8_hard_silu8.S`
- `dl_esp32p4_s8_transpose.S`
- `dl_esp32p4_s16_lut_nearest_neighbor.S`

- `dl_esp32p4_block_transpose.S`, `dl_esp32p4_s8_transpose.S`
- `dl_esp32p4_s8_hard_silu_pie8.S`, `dl_esp32p4_s8_hard_silu8.S`
- `dl_esp32p4_s16_lut_pie8.S`, `dl_esp32p4_s16_lut_nearest_neighbor.S`

Also modified: `dl_base_esp32p4.h` adds `extern "C"` declaration for `dl_esp32p4_s8_hard_silu8`
Layer 2: C++ ESP-DL Module Headers
Path: `dl/module/include/`
- `dl_module_hard_silu8.hpp`: `scale_int` from `.espdl`, precomputes constants
- `dl_module_transpose_pie.hpp`
- `dl_module_tiled_conv_block.hpp`
- `dl_module_lut.hpp`
- `dl_module_creator.hpp`: `HardSiluPie8`, `TransposePIE`, `TiledConvBlock` deserializers

- `dl_module_hard_silu_pie8.hpp`, `dl_module_hard_silu8.hpp`
- `dl_module_tiled_conv_block copy.hpp`

Also modified: `CMakeLists.txt` adds new `.S` files to the build
Layer 3: Python Quantization Pipeline
Path: `examples/tutorial/how_to_quantize_model/quantize_yolo26/`
- `quantize_yolo26_coco_p4.ipynb`
- `scripts/custom_ops_patch.py`
- `scripts/notebook_helpers.py`: `espdl_preprocess`, `eval_espdl_model`
- `esp_ppq_lut/emulator_nearest_neighbor.py`
- `esp_ppq_lut/__init__.py`: `SimulationMode` switch
- `esp_ppq_lut/exporter.py`

Layer 4: Neural Morphing Engine
Path: `examples/tutorial/how_to_quantize_model/quantize_yolo26/`
- `neural_morphing/__init__.py`
- `neural_morphing/engine.py`
- `neural_morphing/interface.py`: `BaseReplacementStrategy` ABC (strategy pattern)
- `neural_morphing/README.md`
- `integrations/yolo26n/Transformation1_SiLU_to_HardSiLU/`
- `integrations/yolo26n/Transformation2_Conv_to_DWPW/`
- `integrations/yolo26n/Transformation3_Conv1x1_Prune/`

Layer 5: Model + Firmware Updates
Path: `models/yolo26/` and `examples/yolo26_detect/`
- `models/yolo26/models/p4/yolo26n_512_s8_p4_tpie.espdl`
- `models/yolo26/README.md`
- `examples/yolo26_detect/README.md`
- `examples/yolo26_detect/main/CMakeLists.txt`: `USE_RAW_RGB`, TPIE model selection
- `examples/yolo26_detect/main/app_main.cpp`
- `examples/yolo26_detect/main/images/generate_raw_rgb.py`: `raw_rgb_bus.h` from `bus.jpg`
- `examples/yolo26_detect/main/images/raw_rgb_bus.h`

Verification Checklist
- … (`esp_timer_get_time()`, microsecond precision)
- … (`yolo26n_512_s8_p4.espdl`) still produces identical results; no breaking changes
- `idf.py build` succeeds with all new `.S` and `.hpp` files for ESP32-P4
- `custom_ops_patch.py` correctly emits `HardSiluPie8`, `TransposePIE`, `TiledConvBlock` ops in `.espdl`

Known Issue: NN-LUT Auto-Dispatch in Shared Module
For the TPIE model (`yolo26n_512_s8_p4_tpie.espdl`) this is correct: the Python emulator was trained with NN rounding. For existing models (e.g., PR #286 baseline), this changes numerical output.
Proposed solutions (open for discussion):
1. Add a `lut_mode` field to the `.espdl` format. Only dispatch to NN-SIMD when `lut_mode == NEAREST_NEIGHBOR`. Requires format change.
2. Keep `dl_module_lut.hpp` unchanged. Create `dl_module_lut_nn.hpp` used only by TPIE models via `dl_module_creator.hpp` dispatch.
3. `#define DL_LUT_ENABLE_NN_SIMD`.

Reproducibility: How to Run the Full Pipeline
Step 1: Neural Morphing T1
Step 2: P4-Optimized Quantization
Open `quantize_yolo26_coco_p4.ipynb` and run all cells sequentially:
1. Load `morphed_hsilu.native`, configure INT16 layers, calibrate (percentile)
2. `import custom_ops_patch` (registers HardSiluPie8, TransposePIE, TiledConvBlock)
3. Export `yolo26n_512_s8_p4_tpie.espdl`
4. … (`eval_espdl_model()` with SIMULATION mode + FP64 Conv)

Step 3: Flash and Verify on ESP32-P4
Expected output: 5 detections matching the notebook exactly (see Bit-Exact Validation section).
Switching Between Models
- `yolo26n_512_s8_p4.espdl`
- `yolo26n_512_s8_p4_tpie.espdl`