Feature/yolo26 p4 optimizations#310

Open
BoumedineBillal wants to merge 7 commits into espressif:master from BoumedineBillal:feature/yolo26-p4-optimizations

Conversation

@BoumedineBillal

Contributor

feat: ESP32-P4 PIE SIMD kernels + Neural Morphing for YOLO26n: 1,088ms @ 512×512, 0.361 mAP (1.90× vs #286 baseline)

Executive Summary

This PR adds 4 custom PIE SIMD assembly kernels and a Neural Morphing Engine for ESP32-P4, reducing YOLO26n inference from 2,072ms → 1,088ms (1.90× speedup) at 512×512 resolution while maintaining 0.361 mAP50-95 (−0.004 from baseline). This builds on the baseline YOLO26n deployment merged in #286.

These optimizations are not YOLO26n-specific. The PIE kernels accelerate any model containing matching operator patterns (Conv+Activation, Transpose, INT16 LUT, SiLU/HardSiLU), and the Neural Morphing Engine provides a reusable strategy framework for automated graph transformation with quality-gated distillation applicable to any model deployed via esp-ppq.

Scope: ESP32-P4 only. All assembly kernels use ESP32-P4 PIE SIMD instructions.

| Model | Configuration | Resolution | mAP50-95 | Inference | vs #286 | vs YOLOv11n |
|---|---|---|---|---|---|---|
| YOLO26n (This PR) | T1 Morph + 4 PIE kernels | 512×512 | 0.361 | 1,088ms | 1.90× faster | 2.54× faster |
| YOLO26n (PR #286) | PTQ + TQT + LUT | 512×512 | 0.365 | 2,072ms | baseline | 1.33× faster |
| YOLOv11n (Stock) | Official ESP-DL | 640×640 | 0.360 | 2,764ms | | baseline |

Performance Breakdown

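Latency savings are attributed per kernel, with no overlap between rows. The four itemized rows sum to 969ms; the remaining ~15ms of the 984ms end-to-end reduction is not broken down here: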

| Kernel | Target Operators | Speedup | Latency Saved | Mechanism |
|---|---|---|---|---|
| TransposePIE | Attention transposes (PSA) | 89× avg, 917× peak | 303ms | 3-stage vzip butterfly in PIE registers |
| TiledConvBlock | Conv+Act layers (spatial + 1×1) | 13.3× on largest layer | 379ms | H-tiling in L2 cache, fused in-place activation |
| NN-LUT INT16 | INT16 Swish activations | 4.2× | 156ms | NN lookup via XOR + vmul.u16 HALF_EVEN rounding, 4-way interleaved |
| HardSiluPie8 | SiLU→HardSiLU8 (morphed) | 6.5× | 131ms | 12-instruction piecewise-linear kernel with learned scale |
| Total | | | ~984ms | 2,072ms → 1,088ms |

TiledConvBlock detail: 326ms from spatial H-tiling (Conv 3×3, 5×5) + 53ms from output channel tiling (c_tile) on 1×1 Conv layers. The largest single-layer improvement is model.7/conv (DW 5×5): 305ms → 23ms (13.3×).


Architecture: Three Optimization Layers

The optimizations are organized into three independent layers, each contributing to the final result:

Layer 1: Neural Morphing Engine (Model-Level)
    yolo26n.pt → T1: SiLU→HardSiLU8 (45/66 accepted) → morphed_hsilu.native
                                    ↓
Layer 3: Python Pipeline (Quantization + Emulation)
    morphed_hsilu.native → calibrate → TQT → custom_ops_patch → export
                                    ↓
    yolo26n_512_s8_p4_tpie.espdl (contains HardSiluPie8/TransposePIE/TiledConvBlock ops)
                                    ↓
Layer 2: PIE SIMD Kernels (Hardware-Level)
    dl_module_creator.hpp deserializes → instantiates C++ modules → calls PIE assembly
                                    ↓
    1,088ms inference, bit-exact with Python emulation

Operator-level graph transformation (Netron visualization: baseline left, optimized right):

[Image: ch4_netron_comparison]

Three operator replacements: Conv+Swish → TiledConvBlock (fused with L2-cache tiling), Swish → HardSiluPie8 (via Neural Morphing T1), Transpose → TransposePIE (PIE SIMD dispatch).


Layer 1: Neural Morphing Engine

A general-purpose framework for automated, quality-gated graph transformations on PPQ .native graphs. The engine implements a strategy pattern: each transformation is a self-contained BaseReplacementStrategy that the engine orchestrates through a 5-phase pipeline:

A. Select targets → B. Build replacements → C. Block-wise distillation → D. Quality gate → E. Accept/Rollback

The core technique, block-wise knowledge distillation, is inspired by esp-ppq's TQT implementation, which combines TQT scale learning (per-tensor, symmetric, power-of-two constraints) with the block-wise reconstruction strategy from BRECQ (Li et al., ICLR 2021). The Neural Morphing Engine extends this approach: instead of only optimizing quantization scales, it uses block-wise distillation to optimize architectural replacements (e.g., SiLU → HardSiLU8 with learned scale). The reconstruction objective is strategy-defined: T1 uses Huber+Cosine loss (robust to activation outliers), while the engine itself is loss-agnostic.
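
The five phases can be sketched as a small accept/rollback loop. This is an illustrative toy, not the engine's code: the graph is a plain dict, and `run_morphing`, `build`, `distill`, and `gate` are hypothetical names standing in for the strategy hooks. The distillation step is stubbed with precomputed scores.

```python
from typing import Callable, Dict, List


def run_morphing(
    graph: Dict[str, str],
    candidates: List[str],
    build: Callable[[str], str],
    distill: Callable[[str, str], float],
    gate: Callable[[float], bool],
) -> Dict[str, bool]:
    """Toy A-E loop: splice each replacement in, keep it only if the gate passes."""
    decisions: Dict[str, bool] = {}
    for name in candidates:                  # A. selected targets
        original = graph[name]
        replacement = build(original)        # B. build the replacement
        graph[name] = replacement            #    splice it into the graph
        score = distill(name, replacement)   # C. block-wise distillation (stubbed)
        if gate(score):                      # D. quality gate
            decisions[name] = True           # E. accept
        else:
            graph[name] = original           # E. rollback to the original op
            decisions[name] = False
    return decisions


# Hypothetical three-op graph with made-up post-distillation cosine scores.
graph = {"act0": "SiLU", "act1": "SiLU", "conv0": "Conv"}
scores = {"act0": 0.995, "act1": 0.97}

decisions = run_morphing(
    graph,
    candidates=["act0", "act1"],
    build=lambda op: "HardSiLU8",
    distill=lambda name, replacement: scores[name],
    gate=lambda score: score >= 0.9908,
)
```

Here `act0` is accepted and `act1` is rolled back, so the graph ends up with one morphed and one original activation, mirroring the partial acceptance described for T1.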

T1: SiLU → HardSiLU8 (Deployed)

Replaces standard SiLU activations with HardSiLU8, a piecewise-linear approximation that maps directly to the HardSiluPie8 PIE kernel. Each replacement includes a learnable scale factor (clamp-quantized to scale_int ∈ [0, 256] via clamp(0,1) × 256 → round) trained via block-wise Huber+Cosine distillation.

| Metric | Value |
|---|---|
| Candidates scanned | 66 SiLU ops |
| Accepted | 45 (68.2%) |
| Rejected | 21 (PSA attention cosine < threshold) |
| Quality gate | cos ≥ 0.9908, scale ratio ∈ [0.985, 1.015] |
| FP32 mAP impact | 0.375 → 0.370 (−0.005, 98.6% preserved) |
| Post-quantization mAP | 0.365 → 0.361 (−0.004) |
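
A minimal sketch of the scale quantization and the cosine part of the quality gate, assuming the learned scale is a float and that "round" means round-to-nearest (Python's `round` resolves ties to even, which may differ from the actual implementation). The function names are hypothetical, and the scale-ratio check is omitted because its exact definition is not given here.

```python
import math
from typing import Sequence


def quantize_scale(scale: float) -> int:
    """clamp(0, 1) x 256 -> round, giving scale_int in [0, 256]."""
    clamped = min(max(scale, 0.0), 1.0)
    return round(clamped * 256)  # tie-breaking mode is an assumption


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_cosine_gate(reference: Sequence[float],
                       candidate: Sequence[float],
                       threshold: float = 0.9908) -> bool:
    """Accept the replacement only if its output stays close to the reference."""
    return cosine_similarity(reference, candidate) >= threshold
```

For example, `quantize_scale(0.5)` yields 128, and any scale above 1.0 saturates at 256.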

T2 & T3 (Ablation Only, Not Deployed)

Two additional strategies were evaluated but not included in the deployed model:

  • T2 (Conv Decomposition): Decomposes 3×3 Conv into residual DW+PW with SVD initialization. FP32 recovery was good (0.367 mAP), but quantized performance dropped to 0.323 because the residual skip path creates a quantization-sensitive addition node.
  • T3 (Channel Pruning): L1-norm pointwise pruning at 20% ratio. FP32 recovery reached 0.362, but interaction with T2's degraded baseline made the combined result unacceptable.

These strategies remain in the repository as documented ablation studies and starting points for future work.

Extensibility

New strategies are added by implementing BaseReplacementStrategy, which defines the full distillation contract:

| Method / Property | What the Strategy Controls |
|---|---|
| select_target() | Which ops to replace |
| build_replacement() | How to construct the replacement subgraph |
| evaluate_validation() | Quality gate logic (accept/reject decision) |
| get_criterion() | Distillation loss function (e.g., HuberCosine, MSE) |
| get_scheduler() | LR schedule (e.g., CosineAnnealingWarmRestarts) |
| calculate_samples() | How many batches to cache per block |
| learning_rate, weight_decay, steps, patience | Optimizer hyperparameters |

Optional hooks: on_step_end() (per-step constraints), requires_predecessor (include predecessor Conv in distillation block for activation strategies), compensate_on_reject (fine-tune original block after rejection to correct upstream drift).

The engine handles the orchestration: block iteration, subgraph splicing, calibration data caching, gradient loop execution, rollback on rejection, and genealogy metadata tracking.
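
To make the contract concrete, here is a hedged sketch of what such an abstract base class and a toy strategy could look like. The method and property names come from the table above, but every signature, default value, and the `SiluToHardSilu` class are assumptions for illustration; the real definitions live in neural_morphing/interface.py and may differ. The criterion is a plain MSE stand-in, not the Huber+Cosine loss used by T1.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Sequence


class BaseReplacementStrategy(ABC):
    """Hypothetical mirror of the strategy contract; real signatures may differ."""

    # Optimizer hyperparameters (placeholder values).
    learning_rate: float = 1e-3
    weight_decay: float = 0.0
    steps: int = 100
    patience: int = 10
    # Optional behaviour flags.
    requires_predecessor: bool = False
    compensate_on_reject: bool = False

    @abstractmethod
    def select_target(self, ops: List[str]) -> List[str]:
        """Return the ops this strategy wants to replace."""

    @abstractmethod
    def build_replacement(self, op: str) -> str:
        """Construct the replacement for one op."""

    @abstractmethod
    def evaluate_validation(self, metrics: Dict[str, float]) -> bool:
        """Quality gate: True to accept, False to roll back."""

    @abstractmethod
    def get_criterion(self) -> Callable[[Sequence[float], Sequence[float]], float]:
        """Distillation loss."""

    def get_scheduler(self, optimizer: Any) -> Any:
        return None  # no LR schedule by default

    def calculate_samples(self, block_size: int) -> int:
        return 32  # batches cached per block (placeholder)

    def on_step_end(self, step: int) -> None:
        pass  # per-step constraint hook


class SiluToHardSilu(BaseReplacementStrategy):
    """Toy activation-replacement strategy operating on op-name strings."""

    requires_predecessor = True

    def select_target(self, ops):
        return [op for op in ops if op.startswith("SiLU")]

    def build_replacement(self, op):
        return op.replace("SiLU", "HardSiLU8", 1)

    def evaluate_validation(self, metrics):
        return metrics["cos"] >= 0.9908

    def get_criterion(self):
        # MSE stand-in for the real distillation loss.
        return lambda ref, out: sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref)


strategy = SiluToHardSilu()
targets = strategy.select_target(["SiLU_0", "Conv_1", "SiLU_2"])
```

Keeping the loss, gate, and hyperparameters on the strategy is what lets the engine stay loss-agnostic: it only calls these hooks and never inspects what a transformation does.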


Layer 2: PIE SIMD Kernels

K1: HardSiluPie8 (dl_esp32p4_s8_hard_silu8.S)

12 computation PIE instructions (16 total including load, store, and 2× SAR setup) process 16 INT8 elements per iteration. The kernel computes:

y = (x_scaled × gate) >> SAR_total
where gate = clamp(x + offset, 0, max_gate)
      x_scaled = x × scale_int

All arithmetic stays in INT16 (sign-extended from INT8). The scale_int parameter is deserialized from the .espdl model; it encodes the per-layer learned scale factor from T1 distillation.

6.5× faster than scalar SiLU. Saves 131ms across 45 morphed activations.
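
A scalar reference for the formula above can help when reasoning about the kernel. This sketch is not the kernel's arithmetic: the constants `offset=32`, `max_gate=64`, and `sar_total=14` are hypothetical values chosen so the example is easy to check, the shift floors toward negative infinity, and the final INT8 saturation is an assumption. Python integers are unbounded, so the INT16 intermediate width is not modelled.

```python
def hard_silu8_ref(x: int, scale_int: int = 256, offset: int = 32,
                   max_gate: int = 64, sar_total: int = 14) -> int:
    """Scalar sketch of y = (x * scale_int * gate) >> SAR_total."""
    gate = min(max(x + offset, 0), max_gate)  # gate = clamp(x + offset, 0, max_gate)
    x_scaled = x * scale_int                  # apply the learned per-layer scale
    y = (x_scaled * gate) >> sar_total        # arithmetic right shift (floors)
    return max(-128, min(127, y))             # INT8 saturation (assumed)


# With these constants the function is the identity for x >= 32,
# zero for x <= -32, and a parabola in between.
samples = {x: hard_silu8_ref(x) for x in (-100, -16, 0, 100)}
```

With a smaller `scale_int`, the same input produces a proportionally smaller output, which is how the learned scale from T1 enters the computation.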

K2: TransposePIE (dl_esp32p4_s8_transpose.S)

Contains two kernels selected at runtime by a 6-step dispatch algorithm in dl_module_transpose_pie.hpp:

  • Byte-zip kernel (K1): 3-stage vzip.8 → vzip.16 → vzip.32 butterfly pattern. Transposes an 8×16 tile using only PIE register shuffles, with zero memory scratch space. 14.6× average on small attention tensors.
  • Block-copy kernel (K2): For large tensors exceeding register capacity. 16-byte aligned vld/vst block moves. 30× average, up to 917× peak on large (80×1024) transposes.

The dispatch algorithm auto-selects the optimal kernel based on tensor dimensions and alignment. 89× average speedup across all transpose ops. Saves 303ms.
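
The data movement behind a three-stage zip butterfly can be modelled in a few lines. This is a sketch of the general technique on Python lists, assuming each stage interleaves two vectors at 1-, 2-, and 4-element granularity; it ignores the 128-bit register width and the lo/hi output halves of the real instructions, and `zip_units` and `transpose_8xn` are hypothetical helper names.

```python
from typing import List


def zip_units(a: List[int], b: List[int], unit: int) -> List[int]:
    """Interleave a and b in chunks of `unit` elements (models one zip stage)."""
    out: List[int] = []
    for i in range(0, len(a), unit):
        out.extend(a[i:i + unit])
        out.extend(b[i:i + unit])
    return out


def transpose_8xn(rows: List[List[int]]) -> List[List[int]]:
    """Transpose an 8 x N tile with three interleave stages."""
    assert len(rows) == 8
    n = len(rows[0])
    # Stage 1: byte-granularity zip of adjacent row pairs.
    t = [zip_units(rows[2 * i], rows[2 * i + 1], 1) for i in range(4)]
    # Stage 2: 16-bit granularity (2 bytes per unit).
    u = [zip_units(t[0], t[1], 2), zip_units(t[2], t[3], 2)]
    # Stage 3: 32-bit granularity (4 bytes per unit).
    v = zip_units(u[0], u[1], 4)
    # v now holds column 0 (8 values), then column 1, and so on.
    return [v[8 * k: 8 * k + 8] for k in range(n)]


tile = [[r * 16 + c for c in range(16)] for r in range(8)]
transposed = transpose_8xn(tile)
```

After stage 3, position `8k + j` of the flat result holds `rows[j][k]`, which is exactly the column-major order of the input tile.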

K3: TiledConvBlock (dl_module_tiled_conv_block.hpp)

Fuses Conv + Activation into a single tiled operation that exploits the ESP32-P4's 768KB L2 cache (SRAM). It combines two independent tiling dimensions with an in-place activation step:

  1. Spatial H-tiling: Splits the output height into tiles sized to fit in L2. Each tile's Conv output stays in cache for the fused activation, so there is no PSRAM round-trip between Conv and Act. Saves 326ms across Conv 3×3 and 5×5 layers.
  2. Output channel tiling (c_tile): For layers where the full filter working set (kH × kW × C_in × C_out) exceeds the L2 budget, splits the convolution into chunks of c_tile output channels. Each chunk's filter slice fits in cache, and the output channels are processed sequentially. c_tile is aligned down to vector width (16 for INT8, 8 for INT16). Saves 53ms across 1×1 Conv layers with large channel counts.
  3. In-place activation: HardSiLU8 or NN-LUT applied directly on the L2-resident tile after each spatial or channel chunk, with no intermediate buffer.

The tiling parameters (tile_h, c_tile) are computed at export time from the layer dimensions and available cache budget (L2 / 8 = 32KB per tile).

13.3× on model.7/conv (the largest DW 5×5 layer: 305ms → 23ms). Saves 379ms total (326ms spatial + 53ms channel).
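
As a rough illustration of how export-time tiling parameters might be derived, the sketch below picks `tile_h` and `c_tile` from a 32KB budget. The formulas are assumptions, not the exporter's actual logic: it sizes `tile_h` from the output row footprint and `c_tile` from the per-output-channel filter footprint, then aligns `c_tile` down to the vector width. Returning 0 to disable channel tiling follows the `c_tile=0` default mentioned in the commit message on this PR.

```python
def align_down(value: int, multiple: int) -> int:
    """Round value down to the nearest multiple."""
    return (value // multiple) * multiple


def choose_tile_h(out_h: int, out_w: int, out_c: int,
                  elem_bytes: int = 1, budget_bytes: int = 32 * 1024) -> int:
    """Largest number of output rows whose Conv result fits in the budget."""
    row_bytes = out_w * out_c * elem_bytes
    return max(1, min(out_h, budget_bytes // row_bytes))


def choose_c_tile(k_h: int, k_w: int, c_in: int, c_out: int,
                  elem_bytes: int = 1, budget_bytes: int = 32 * 1024) -> int:
    """Output-channel chunk size, or 0 when the whole filter already fits."""
    vec_width = 16 if elem_bytes == 1 else 8     # 16 lanes for INT8, 8 for INT16
    per_channel = k_h * k_w * c_in * elem_bytes  # filter bytes per output channel
    if per_channel * c_out <= budget_bytes:
        return 0                                 # channel tiling disabled
    return max(vec_width, align_down(budget_bytes // per_channel, vec_width))
```

For a 1×1 Conv with 300 input and 256 output channels in INT8, the budget holds 109 filter slices, which aligns down to a `c_tile` of 96.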

K4: NN-LUT INT16 (dl_esp32p4_s16_lut_nearest_neighbor.S)

Replaces the linear-interpolation LUT from PR #286 with a nearest-neighbor lookup. Since the step size is a power of 2 (step=32), the SIMD index computation uses XOR sign-flip + vmul.u16 with hardware SAR:

esp.xorq     q0, q0, q7    // signed → unsigned: XOR with 0x8000 (+32768)
esp.vmul.u16 q0, q0, q5    // q0 = (q0 × 1) >> SAR  (HALF_EVEN rounding)

The vmul.u16 multiplies by broadcast(1) with SAR = log2(step), giving HALF_EVEN (banker's) rounding to the nearest table slot, which is more accurate than plain truncation. Each element is then extracted, table-looked-up via scalar lh, and re-inserted using 4-way interleaved extract→address→load→insert waves for instruction-level parallelism.

This eliminates the per-element modulo, multiply, and divide of the original linear interpolation. The table remains 2,049 entries (same as PR #286); only the lookup strategy changes.

4.2× faster than interpolated LUT. Saves 156ms across 21 INT16 Swish activations.
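
The index computation can be checked against a scalar model. The sketch below assumes step = 32 (SAR = 5) and that the hardware shift rounds half to even, as stated above; `nn_lut_index` is a hypothetical helper, not a function from the codebase.

```python
def nn_lut_index(x: int, step_log2: int = 5) -> int:
    """Map a signed INT16 value to its nearest table slot (ties to even)."""
    u = (x & 0xFFFF) ^ 0x8000            # signed -> unsigned, same as x + 32768
    q = u >> step_log2                   # truncated quotient
    r = u & ((1 << step_log2) - 1)       # remainder within the step
    half = 1 << (step_log2 - 1)
    if r > half or (r == half and (q & 1)):
        q += 1                           # round up; exact ties go to the even slot
    return q


lowest = nn_lut_index(-32768)   # first slot
middle = nn_lut_index(0)        # centre of the table
highest = nn_lut_index(32767)   # last slot
```

The highest input rounds up to slot 2048, which is why the table needs 2,049 entries rather than 2,048.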


Layer 3: Python Quantization Pipeline

Custom Ops Registration (custom_ops_patch.py)

Registers 3 custom operation types in esp-ppq so the graph correctly represents the P4-optimized operators:

| Op Type | Forward Emulator | Backward | What It Does |
|---|---|---|---|
| HardSiluPie8 | INT16 piecewise-linear with learned scale | STE | Bit-exact emulation of dl_esp32p4_s8_hard_silu8.S arithmetic |
| TransposePIE | Identity (auto-renames INT8 Transpose ops) | Pass-through | Ensures correct op type in exported .espdl |
| TiledConvBlock | None (export-only graph surgery) | | Fuses Conv+Act pairs into single node at export time |

The HardSiluPie8 emulator is a torch.autograd.Function that replicates the exact integer arithmetic of the PIE kernel (clamp, multiply, SAR shift, scale multiplication) with STE backward for TQT compatibility.

NN-LUT Emulator (emulator_nearest_neighbor.py)

Extends esp_ppq_lut with a nearest-neighbor simulation mode. When custom_ops_patch.py is imported, it replaces the linear-interpolation emulator with the NN variant matching the dl_esp32p4_s16_lut_nearest_neighbor.S rounding behavior.

P4 Notebook (quantize_yolo26_coco_p4.ipynb)

The notebook loads the T1-morphed graph (morphed_hsilu.native) and applies the P4-specific pipeline:

Load morphed_hsilu.native → Apply INT16 layers → Calibrate (percentile)
→ TQT → custom_ops_patch (register HardSiluPie8/TransposePIE/TiledConvBlock)
→ LUT fusion (NN mode) → Graph surgery (Box/Class split) → Export .espdl

Bit-Exact Validation

The optimized model produces identical detections between the Python notebook and the ESP32-P4 hardware:

Python Notebook                           ESP32-P4 Firmware (Raw RGB mode)
─────────────────────────────────         ──────────────────────────────────
person   conf=0.85  [86,186,177,429]      person   conf=0.85  [86,186,177,429]
bus      conf=0.83  [68,109,448,349]      bus      conf=0.83  [68,109,448,349]
person   conf=0.76  [169,193,229,406]     person   conf=0.76  [169,193,229,406]
person   conf=0.49  [380,186,449,416]     person   conf=0.49  [380,186,449,416]
person   conf=0.47  [62,262,97,415]       person   conf=0.47  [62,262,97,415]

Result: 5/5 detections, 100% match (class, confidence, bounding box)

This parity requires a 6-link chain; breaking any single link introduces mismatches:

| # | Component | What It Eliminates | Location |
|---|---|---|---|
| 1 | Raw RGB input bypass | JPEG decoder drift (even 1-bit amplifies through 100+ quantized layers) | generate_raw_rgb.py + USE_RAW_RGB CMake define |
| 2 | espdl_preprocess() | OpenCV resize drift: replicates ESP-DL's C++ resize_nn coordinate truncation: m_x[i] = int(i * inv_scale_x) | notebook_helpers.py |
| 3 | HardSiluPie8 emulator | Wrong activation: torch.autograd.Function replicating exact integer arithmetic (clamp → multiply → SAR shift → scale) with STE backward | custom_ops_patch.py |
| 4 | NN-LUT emulator | Wrong rounding: simulates vmul.u16 HALF_EVEN rounding (not C-truncation, not Python floor-division) | emulator_nearest_neighbor.py |
| 5 | FP64 Conv toggle | float32 mantissa drift: ESP32-P4 uses 64-bit integer accumulators for INT16 Conv; PyTorch float32 has only 24-bit mantissa, truncating large accumulations in the INT16 detection head | enable_fp64_conv() in notebook_helpers.py |
| 6 | SIMULATION mode | Wrong LUT mode: forces hardware-faithful NN-LUT (rounded index → direct lookup) instead of ideal-math float Swish | set_simulation_mode(SimulationMode.SIMULATION) |

Raw RGB → espdl_preprocess → FP64 Conv → HardSiluPie8 emulator → NN-LUT emulator → SIMULATION mode
   ↓            ↓                ↓               ↓                      ↓                ↓
JPEG drift  OpenCV drift   float32 drift   wrong activation      wrong rounding    wrong LUT mode
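
Link 2 is the easiest to get subtly wrong, so here is a small sketch of nearest-neighbor resizing with truncated source coordinates, following the m_x[i] = int(i * inv_scale_x) rule quoted above. It assumes inv_scale is source size divided by destination size, adds a defensive clamp that may not exist in the original, and uses Python floats (64-bit) where the C++ code may use 32-bit floats. The function names are hypothetical and are not the actual espdl_preprocess() API.

```python
from typing import List


def nn_source_indices(dst_size: int, src_size: int) -> List[int]:
    """Source index for each destination index, using truncation (not rounding)."""
    inv_scale = src_size / dst_size  # assumed definition of inv_scale
    return [min(int(i * inv_scale), src_size - 1) for i in range(dst_size)]


def resize_nn(image: List[List[int]], dst_h: int, dst_w: int) -> List[List[int]]:
    """Nearest-neighbor resize of a 2D list using truncated coordinates."""
    ys = nn_source_indices(dst_h, len(image))
    xs = nn_source_indices(dst_w, len(image[0]))
    return [[image[y][x] for x in xs] for y in ys]


upscaled = resize_nn([[1, 2], [3, 4]], 4, 4)
```

Resizers that sample at pixel centers or round instead of truncating pick different source pixels for some coordinates, which is enough to break bit-exact parity downstream.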

Ecosystem Impact

Reusable PIE Kernels

The 4 kernels contributed in this PR are not YOLO26n-specific: they accelerate any ESP32-P4 model containing matching operators:

| Kernel | Benefits Any Model With... |
|---|---|
| TransposePIE | Attention layers (PSA, MHSA, any Transpose op) |
| TiledConvBlock | Conv + Activation patterns (Conv+SiLU, Conv+ReLU, Conv+HardSiLU) |
| NN-LUT INT16 | INT16 non-linear activations (Swish, Sigmoid, Tanh) |
| HardSiluPie8 | HardSiLU8 activations (after Neural Morphing T1) |

Applicability to esp-detection

The reusable kernels (TransposePIE, TiledConvBlock, NN-LUT) can benefit models deployed via esp-detection on ESP32-P4, since they accelerate common operator patterns (Conv+Activation, Transpose, INT16 LUT) regardless of the specific model architecture.

Neural Morphing Engine Extensibility

The strategy-pattern architecture makes it straightforward to add new transformations:

  • New activation replacements: Implement the full BaseReplacementStrategy contract (target selection, replacement building, quality gate, loss function, hyperparameters) to swap any activation for a hardware-friendly alternative.
  • New decompositions: T2 (Conv decomposition) is already implemented as a reference; future work could target depthwise-separable or grouped convolutions.
  • New pruning criteria: T3 (channel pruning) demonstrates L1-norm ranking; alternatives like Taylor expansion or gradient-based importance can be plugged in.

The engine handles the orchestration: block iteration, subgraph splicing, calibration data caching, gradient loop execution, rollback via genealogy metadata, and drift compensation. Each strategy controls its own loss function, quality gate, and hyperparameters (see Extensibility section above).

Development Methodology

The PIE SIMD kernels were developed using esp32-p4-jit, a JIT compilation tool that compiles C/ASM on the host PC and executes natively on the ESP32-P4 via USB in 1–2 seconds (vs 30–60s for full firmware rebuild). This enabled rapid iteration on cycle-level optimizations, which was particularly critical for the 12-instruction HardSiluPie8 kernel and the 3-stage transpose butterfly, where each instruction ordering change required immediate hardware validation.


Deliverables

Layer 1: PIE SIMD Assembly Kernels

Path: dl/base/isa/esp32p4/

| File | Status | Description |
|---|---|---|
| dl_esp32p4_s8_hard_silu8.S | [NEW] | HardSiLU8 kernel (12 PIE instr/16 elements, 6.5×) |
| dl_esp32p4_s8_transpose.S | [MOD] | K1 byte-zip (14.6×) + K2 block-copy (30×) transpose |
| dl_esp32p4_s16_lut_nearest_neighbor.S | [NEW] | NN INT16 LUT (4.2×, replaces linear interpolation) |
| dl_esp32p4_block_transpose.S | [DEL] | K2 consolidated into dl_esp32p4_s8_transpose.S |
| dl_esp32p4_s8_hard_silu_pie8.S | [DEL] | Renamed to dl_esp32p4_s8_hard_silu8.S |
| dl_esp32p4_s16_lut_pie8.S | [DEL] | Renamed to dl_esp32p4_s16_lut_nearest_neighbor.S |

Also modified: dl_base_esp32p4.h (adds extern "C" declaration for dl_esp32p4_s8_hard_silu8)

Layer 2: C++ ESP-DL Module Headers

Path: dl/module/include/

| File | Status | Description |
|---|---|---|
| dl_module_hard_silu8.hpp | [NEW] | Runtime wrapper: deserializes scale_int from .espdl, precomputes constants |
| dl_module_transpose_pie.hpp | [NEW] | 6-step dispatch algorithm, auto-selects K1/K2/scalar per tensor |
| dl_module_tiled_conv_block.hpp | [NEW] | H-tiling + fused activation (HardSiLU8 or NN-LUT in-place on L2) |
| dl_module_lut.hpp | [MOD] | Added NN-LUT SIMD dispatch (power-of-2 step → SIMD path) |
| dl_module_creator.hpp | [MOD] | Registers HardSiluPie8, TransposePIE, TiledConvBlock deserializers |
| dl_module_hard_silu_pie8.hpp | [DEL] | Renamed to dl_module_hard_silu8.hpp |
| dl_module_tiled_conv_block copy.hpp | [DEL] | Stale copy removed |

Also modified: CMakeLists.txt (adds new .S files to the build)

Layer 3: Python Quantization Pipeline

Path: examples/tutorial/how_to_quantize_model/quantize_yolo26/

| File | Status | Description |
|---|---|---|
| quantize_yolo26_coco_p4.ipynb | [NEW] | P4-optimized notebook (loads T1 graph, exports TPIE model) |
| scripts/custom_ops_patch.py | [NEW] | Registers HardSiluPie8, TransposePIE, TiledConvBlock in esp-ppq |
| scripts/notebook_helpers.py | [MOD] | Added FP64 Conv toggle, espdl_preprocess, eval_espdl_model |
| esp_ppq_lut/emulator_nearest_neighbor.py | [NEW] | NN-LUT emulator (replaces linear interpolation emulator) |
| esp_ppq_lut/__init__.py | [MOD] | Registers SimulationMode switch |
| esp_ppq_lut/exporter.py | [MOD] | Exports NN-LUT step attribute |

Layer 4: Neural Morphing Engine

Path: examples/tutorial/how_to_quantize_model/quantize_yolo26/

| File/Directory | Status | Description |
|---|---|---|
| neural_morphing/__init__.py | [NEW] | Exports engine + interface |
| neural_morphing/engine.py | [NEW] | 5-phase pipeline: calibrate → transform → distill → evaluate → decide |
| neural_morphing/interface.py | [NEW] | BaseReplacementStrategy ABC (strategy pattern) |
| neural_morphing/README.md | [NEW] | Full documentation |
| integrations/yolo26n/Transformation1_SiLU_to_HardSiLU/ | [NEW] | T1 strategy + output checkpoint |
| integrations/yolo26n/Transformation2_Conv_to_DWPW/ | [NEW] | T2 strategy (ablation) |
| integrations/yolo26n/Transformation3_Conv1x1_Prune/ | [NEW] | T3 strategy (ablation) |

Layer 5: Model + Firmware Updates

Path: models/yolo26/ and examples/yolo26_detect/

| File | Status | Description |
|---|---|---|
| models/yolo26/models/p4/yolo26n_512_s8_p4_tpie.espdl | [NEW] | Optimized model (2.93 MB, 1,088ms) |
| models/yolo26/README.md | [MOD] | Added TPIE benchmark row |
| examples/yolo26_detect/README.md | [MOD] | Added P4-optimized model entry |
| examples/yolo26_detect/main/CMakeLists.txt | [MOD] | Added USE_RAW_RGB, TPIE model selection |
| examples/yolo26_detect/main/app_main.cpp | [MOD] | Raw RGB input path for bit-exact validation |
| examples/yolo26_detect/main/images/generate_raw_rgb.py | [NEW] | Generates raw_rgb_bus.h from bus.jpg |
| examples/yolo26_detect/main/images/raw_rgb_bus.h | [NEW] | Pre-baked RGB pixels (bypasses JPEG decoder) |

Verification Checklist

  • Bit-exact validation: 5/5 detections match between Python notebook and ESP32-P4 firmware (raw RGB input mode)
  • mAP validation: 0.361 mAP50-95 on COCO val2017 (5,000 images) at 512×512
  • Latency measurement: 1,088ms inference on ESP32-P4 (esp_timer_get_time(), microsecond precision)
  • Per-kernel profiling: Each kernel's speedup independently measured and attributed
  • Regression check: PR #286 baseline model (yolo26n_512_s8_p4.espdl) still produces identical results, with no breaking changes
  • Build verification: idf.py build succeeds with all new .S and .hpp files for ESP32-P4
  • Custom ops export: custom_ops_patch.py correctly emits HardSiluPie8, TransposePIE, TiledConvBlock ops in .espdl
  • Neural Morphing reproducibility: T1 strategy produces consistent 45/66 acceptance rate across runs
  • Documentation: All READMEs updated with TPIE benchmarks and usage instructions

Known Issue: NN-LUT Auto-Dispatch in Shared Module

⚠️ The NN-LUT SIMD path added to dl_module_lut.hpp (lines 76-88) auto-dispatches based on runtime conditions (power-of-2 step, alignment, size % 8 == 0). This silently replaces linear interpolation with nearest-neighbor lookup for any model meeting these conditions, including existing models trained with linear interpolation.

For the TPIE model (yolo26n_512_s8_p4_tpie.espdl) this is correct, because the Python emulator was trained with NN rounding. For existing models (e.g., PR #286 baseline), this changes numerical output.

Proposed solutions (open for discussion):

  1. Model attribute gate: Add a lut_mode field to .espdl format. Only dispatch to NN-SIMD when lut_mode == NEAREST_NEIGHBOR. Requires format change.
  2. Separate module: Keep dl_module_lut.hpp unchanged. Create dl_module_lut_nn.hpp used only by TPIE models via dl_module_creator.hpp dispatch.
  3. Compile-time flag: Gate behind #define DL_LUT_ENABLE_NN_SIMD.
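
To illustrate option 1, the decision logic could look like the sketch below. It is written in Python for brevity, although the real dispatch lives in C++; `LutMode`, `use_nn_simd`, and the argument names are hypothetical, and the lut_mode field does not exist in the .espdl format today.

```python
from enum import Enum


class LutMode(Enum):
    """Hypothetical model attribute selecting the lookup strategy."""
    LINEAR = 0
    NEAREST_NEIGHBOR = 1


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def use_nn_simd(step: int, size: int, aligned: bool, lut_mode: LutMode) -> bool:
    """Take the NN-SIMD path only when the hardware conditions hold AND the model opts in."""
    hardware_ok = is_power_of_two(step) and aligned and size % 8 == 0
    return hardware_ok and lut_mode is LutMode.NEAREST_NEIGHBOR
```

With this gate, a model exported with linear interpolation keeps its original numerics even when its LUT happens to satisfy every runtime condition.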

Reproducibility: How to Run the Full Pipeline

Step 1: Neural Morphing T1

cd examples/tutorial/how_to_quantize_model/quantize_yolo26/integrations/yolo26n/Transformation1_SiLU_to_HardSiLU/
python run_transform.py
# Output: output/T1_silu_hsilu_512_p4/morphed_hsilu.native

Step 2: P4-Optimized Quantization

Open quantize_yolo26_coco_p4.ipynb and run all cells sequentially:

  • Cells 1–3: Load morphed_hsilu.native, configure INT16 layers, calibrate (percentile)
  • Cell 4: TQT optimization
  • Cell 5: import custom_ops_patch (registers HardSiluPie8, TransposePIE, TiledConvBlock)
  • Cells 6–8: LUT fusion (NN mode), graph surgery (Box/Class split)
  • Cell 9: Export → yolo26n_512_s8_p4_tpie.espdl
  • Cell 10: mAP evaluation (uses ideal math, reports 0.361)
  • Cell 12: Bit-exact validation (eval_espdl_model() with SIMULATION mode + FP64 Conv)

Step 3: Flash and Verify on ESP32-P4

cd examples/yolo26_detect
# Set model in main/CMakeLists.txt:
#   set(MODEL_FILENAME "yolo26n_512_s8_p4_tpie.espdl")
#   set(USE_RAW_RGB ON)   # for bit-exact validation
idf.py set-target esp32p4
idf.py build flash monitor

Expected output: 5 detections matching the notebook exactly (see Bit-Exact Validation section).

Switching Between Models

| CMake setting | Model | Expected latency |
|---|---|---|
| yolo26n_512_s8_p4.espdl | PR #286 baseline | 2,072ms |
| yolo26n_512_s8_p4_tpie.espdl | This PR (optimized) | 1,088ms |

…tion

- Add m_c_tile member and c_tile attribute (from PPQ export)
- Inner channel tile loop slices filter via pointer arithmetic
  using (N/16)HWC16 SIMD-interleaved layout
- Strided output writes into shared tile_buf via output_x/y_offset override
- Per-channel quantization factors correctly offset by c_start
- Backward compatible: c_tile=0 (default) disables channel tiling

Result: 1482ms -> 1156ms inference (22% speedup, bit-exact output)
@BoumedineBillal

Contributor Author

Note: The results in this PR were obtained with ESP-PPQ < 1.2.10 (per-tensor quantization for Conv/Gemm). The recent per-channel quantization upgrade ([2026/4/20] ESP-PPQ ≥ 1.2.10, ESP-DL ≥ 3.3.1) is not reflected in these benchmarks. A follow-up PR could evaluate the impact of per-channel quantization on mAP and latency for both the baseline and TPIE models.
