Model Conversion & Deployment

Mathis edited this page Sep 18, 2025 · 12 revisions

This page explains the steps for converting the fine-tuned YOLO11n model to the hardware-compatible .espdl format as well as its deployment.

Model Conversion

As the XIAO ESP32-S3 Sense MCU cannot run PyTorch models natively, the model has to be converted to the vendor-specific .espdl format. However, .espdl is not the only model format that can be deployed on this MCU, so we also investigated the compatibility and feasibility of a TensorFlow Lite deployment. We found that the TensorFlow Lite format provides far less flexibility than .espdl: .tflite models consume more memory and reserve most of the available PSRAM, which makes it infeasible to allocate memory for other tasks such as capturing and temporarily storing image data. .espdl models use zero-copy deserialization, which allows for efficient memory management and leaves more memory available for other tasks. Furthermore, .espdl models are specialized for Espressif MCUs and can exploit their hardware mechanisms (e.g., dual-core scheduling), which improves on-device inference speed as well as power and memory consumption.

For these reasons, we decided to use the .espdl format. To deploy a YOLO11n model on this MCU, the model needs to be quantized. For this purpose, Espressif provides ESP-PPQ, a fork of PPQ from OpenPPL that is tailored to model deployment on Espressif MCUs.

Currently, ESP-PPQ supports ONNX, PyTorch, and TensorFlow models. During the quantization process, PyTorch and TensorFlow models are first converted to ONNX, so a model can safely be converted to ONNX before quantization without the risk of changing its architecture.

Convert YOLO11n to ONNX

We used the official deployment guide from Espressif for model conversion, quantization, and deployment of YOLO11n and adjusted some steps to meet our requirements.

YOLO11n inference results on the PC side may differ from on-device inference results on the MCU. The authors of the guide therefore modified the YOLO11n PyTorch model by replacing the forward methods of the attention and detect layers. They justify this step with the following advantages:

  • "Faster inference. Compared to the original yolo11n model, operations related to decoding bounding boxes in Detect head are moved from the inference pass to the post-processing phase, resulting in a significant reduction in inference latency. On one hand, operations like Conv, Transpose, Slice, Split and Concat are time-consuming when applied during inference pass. On the other hand, the inference outputs are first filtered using a score threshold before decoding the boxes in the post-processing pass, which significantly reduces the number of calculations, thereby accelerating the overall inference speed."

  • "Lower quantization Error. The Concat and Add operators adopt joint quantization in ESP-PPQ. To reduce quantization errors, the box and score are output by separate branches, rather than being concatenated, due to the significant difference in their ranges. Similarly, since the ranges of the two inputs of Add and Sub differ significantly, the calculations are performed in the post-processing phase to avoid quantization errors."
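
The first quoted point (filter by score, then decode only the surviving cells) can be illustrated with a minimal float sketch. It assumes the usual YOLO11 head layout, in which each box side is predicted as a distribution over `REG_MAX` bins and decoded as the softmax-weighted mean bin index times the stride. The function names and the value of `REG_MAX` are assumptions made for illustration; the on-device post-processor works on quantized integers and may be organized differently.

```python
import math

REG_MAX = 16  # assumed number of distribution bins per box side


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_distance(logits):
    """Softmax-weighted mean bin index for one box side."""
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    return sum(i * w for i, w in enumerate(weights)) / sum(weights)


def decode_cell(box_logits, class_logits, cx, cy, stride, score_thr):
    """Return (x1, y1, x2, y2, score, class_id), or None if the cell is rejected."""
    class_id = max(range(len(class_logits)), key=class_logits.__getitem__)
    score = sigmoid(class_logits[class_id])
    if score < score_thr:
        return None  # cheap rejection: weak cells never reach the box decoding
    left, top, right, bottom = (
        expected_distance(box_logits[i * REG_MAX:(i + 1) * REG_MAX]) * stride
        for i in range(4)
    )
    return (cx - left, cy - top, cx + right, cy + bottom, score, class_id)
```

Because most cells fall below the score threshold, the softmax over the box logits is evaluated for only a small fraction of them, which is the saving the quote describes.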

More importantly, these architecture adjustments are necessary when using the YOLO11n Postprocessor pipeline from the C/C++ code base for on-device inference, as it specifies the output names for the final layer, which are accessed by the .espdl model during on-device inference.

To this end, they provide a custom script, which we integrated into our code base. We refactored this code to meet common Python coding standards (e.g., Ruff linting and MyPy typing) and encapsulated it in the YoloConverter class. The yolo2onnx CLI command performs the model conversion to ONNX.

At the time of writing, the conversion to ONNX only works with ONNX opset version 13, which is why it is the default setting of the CLI. A batch size of 1 is also required, as the subsequent quantization only works with this batch size. Dynamic axes for the ONNX export were tested as well but led to runtime errors during model quantization, which is why the dynamic parameter defaults to False.
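
The export constraints above can be captured in a small settings object that rejects unsupported combinations before any export is attempted. This is a hedged sketch, not the project's YoloConverter: the class name `OnnxExportSettings`, its method, and the `imgsz` default of 640 are assumptions made for illustration. The resulting keyword arguments match the names accepted by the Ultralytics `export` method, but the replacement of the attention and detect forward methods described earlier is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnnxExportSettings:
    """Hypothetical container for the ONNX export constraints described above."""

    opset: int = 13        # only opset 13 is reported to work
    batch: int = 1         # quantization only works with batch size 1
    dynamic: bool = False  # dynamic axes caused runtime errors during quantization
    imgsz: int = 640       # assumed input size

    def to_export_kwargs(self):
        if self.opset != 13:
            raise ValueError("ONNX export is only known to work with opset 13")
        if self.batch != 1:
            raise ValueError("quantization requires a batch size of 1")
        if self.dynamic:
            raise ValueError("dynamic axes are not supported by the quantization step")
        return {
            "format": "onnx",
            "opset": self.opset,
            "batch": self.batch,
            "dynamic": self.dynamic,
            "imgsz": self.imgsz,
        }


# Possible usage with Ultralytics (not executed here):
#   from ultralytics import YOLO
#   YOLO("yolo11n.pt").export(**OnnxExportSettings().to_export_kwargs())
```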

Conversion and Quantization of ONNX model to .espdl

We tested and evaluated Post-training Quantization (PTQ) and Quantization-aware Training (QAT) methods for model compression.

Post-training Quantization

For the conversion to .espdl, the ONNX model is quantized. Espressif provides a set of files for model conversion. We integrated this code into our code base and implemented the OnnxQuantizer for this purpose.

We used our full combined and preprocessed dataset as the calibration dataset, since up to 1000 images are recommended for calibration.

ESP-PPQ supports int8 and int16 quantization for the ESP32S3. However, quantizing the YOLO11n model to int16 produced a model that exceeded the memory available for deployment. Even mixed-precision quantization (see the corresponding Espressif guide) exceeded the on-device memory. Thus, int8 quantization was the only option for PTQ.
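
The trade-off between int8 and int16 can be illustrated with a minimal sketch of symmetric quantization. It assumes a power-of-two scale, so that a stored integer `q` represents the value `q * 2**exponent`. The function names are hypothetical, and ESP-PPQ derives its ranges from calibration data rather than from the values themselves, so this only shows the principle: int16 halves the step size several times over but doubles the storage per value.

```python
import math


def quantize_pow2(values, bits):
    """Symmetric quantization with a power-of-two scale: value ~ q * 2**exponent."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # Smallest power-of-two scale that represents max_abs without clipping.
    exponent = math.ceil(math.log2(max_abs / qmax)) if max_abs > 0 else 0
    scale = 2.0 ** exponent
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return quantized, exponent


def dequantize(quantized, exponent):
    return [q * 2.0 ** exponent for q in quantized]


def storage_bytes(num_values, bits):
    return num_values * bits // 8


def max_error(values, bits):
    quantized, exponent = quantize_pow2(values, bits)
    restored = dequantize(quantized, exponent)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.8134, -0.4021, 0.0517, -0.9876]  # made-up example values
for bits in (8, 16):
    print(bits, "bit:", storage_bytes(len(weights), bits), "bytes,",
          "max error", max_error(weights, bits))
```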

During the project timeframe, we encountered the following model performance issue: contrary to what the official YOLO11n int8 quantization guide suggests, passing QuantizationSettingFactory.espdl_setting() as the setting parameter of the espdl_quantize_onnx function caused the quantized model's performance to drop significantly, resulting in few to no detected objects on the test data after deploying the model on the chip. Reverting the ESP-PPQ dependency to commit 9445ddd436321189231c824f5b8e02b96e132bd0 (unfortunately, the maintainers provide no package versioning) and leaving the setting parameter of espdl_quantize_onnx unset restored the original model's performance. We assume that either the dependency updates or the QuantizationSettingFactory.espdl_setting() settings degraded the quantized model's performance.

Quantization-aware Training

As QAT is the most promising technique for preserving model performance according to the literature, we tested it as well. Espressif provides a guide on how to perform QAT for YOLO11n. Its authors refer to custom scripts that implement model training and the full QAT pipeline with quantization parameters. We integrated the provided code into our code base, which resulted in the qat.py and quantization.py files.

For QAT, a comprehensive YAML configuration file has to be created in order to start a training job. As with regular full-precision training, the YAML config must contain information about the dataset (data YAML config), the training arguments (QuantizationAwareTrainingArgs), and the quantization arguments (QuantizationArgs). The Pydantic models for these configurations can be found here. The following figure illustrates the dependencies and relationships within the QAT config:

[Figure: qat_config]
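
The nesting of the QAT config can also be sketched in code. The example below uses standard-library dataclasses instead of Pydantic so that it runs without dependencies. Only the class names QuantizationAwareTrainingArgs and QuantizationArgs come from this page; `QatConfig`, `load_qat_config`, and every field name and default value are hypothetical placeholders, not the project's actual schema.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class QuantizationArgs:
    # Hypothetical fields; the project's Pydantic model may differ.
    target: str = "esp32s3"
    num_of_bits: int = 8
    calib_steps: int = 32


@dataclass
class QuantizationAwareTrainingArgs:
    # Hypothetical fields; the project's Pydantic model may differ.
    epochs: int = 10
    batch_size: int = 16
    learning_rate: float = 1e-4


@dataclass
class QatConfig:
    """Hypothetical top-level config: dataset plus the two argument groups."""

    data: Path  # path to the dataset YAML config
    training: QuantizationAwareTrainingArgs = field(
        default_factory=QuantizationAwareTrainingArgs
    )
    quantization: QuantizationArgs = field(default_factory=QuantizationArgs)


def load_qat_config(raw):
    """Build a QatConfig from a dict, e.g. the result of parsing the YAML file."""
    return QatConfig(
        data=Path(raw["data"]),
        training=QuantizationAwareTrainingArgs(**raw.get("training", {})),
        quantization=QuantizationArgs(**raw.get("quantization", {})),
    )
```

Unknown keys raise a TypeError during construction, which gives a rough equivalent of the validation a Pydantic model would perform.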

The example provided by the authors, which uses toy data (i.e., COCO), works. However, as soon as we replaced the COCO dataset with our training dataset, the QAT pipeline failed with an internal error: during inference with the quantized model, the post-processing in the TorchExecutor forward function raised an out-of-bounds error when iterating over the raw model predictions (i.e., the feature maps of the bounding boxes). After extensive debugging, we could not find a fix and therefore decided to drop this approach, as int8 PTQ already yielded very good performance and we expected only negligible improvements from QAT. We assume that the quantized model trained on our dataset produces an output feature map that is incompatible with the hard-coded post-processing of the TorchExecutor.

Model Conversion and Deployment in practice

The following provides a guide to the functions implemented and the steps required to convert and deploy a model onto an ESP32S3.

Conversion Requirements

  • Install requirements with uv:

    uv sync
    
  • Ensure you have pulled the dvc dataset

    dvc pull datasets
    
  • Put a yolo11n.pt model in the model-deployment/coco_detect/models/ folder

  • Optional: Change parameters of paths and constants in model-deployment/model_conversion/core/

Conversion

Prepare the data for model conversion

This script is the first step in the pipeline and is responsible for preparing the dataset for model conversion and evaluation. It begins by creating all necessary directories for the project. The script then resizes the original images to the model's required input shape and cleans up the calibration directory to only include test images. Finally, it converts the YOLO .txt format ground truth labels into .csv files with absolute pixel coordinates, making them ready for performance evaluation.

  python -m model_conversion.prepare_data
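
The label conversion step can be sketched as follows. YOLO .txt labels store one object per line as `class cx cy w h`, with the box center and size normalized to the image dimensions. The sketch converts them to absolute corner coordinates in pixels; the CSV column names and the corner-based layout are assumptions, as the actual format written by the script is not documented here.

```python
import csv
import io


def yolo_line_to_pixels(line, img_w, img_h):
    """Convert 'class cx cy w h' (normalized) to (class, xmin, ymin, xmax, ymax) in pixels."""
    class_id, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    return (
        int(class_id),
        round(cx - w / 2),
        round(cy - h / 2),
        round(cx + w / 2),
        round(cy + h / 2),
    )


def yolo_txt_to_csv(txt, img_w, img_h):
    """Return CSV text for all labels of one image."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["class_id", "xmin", "ymin", "xmax", "ymax"])  # assumed columns
    for line in txt.splitlines():
        if line.strip():  # skip empty lines
            writer.writerow(yolo_line_to_pixels(line, img_w, img_h))
    return out.getvalue()


print(yolo_txt_to_csv("0 0.5 0.5 0.25 0.5\n", 640, 640))
```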

Evaluate model and generate .espdl file

This script orchestrates the core model conversion and evaluation process. It starts by generating baseline predictions using the original .pt model. Next, it converts the PyTorch model to the ONNX format. This ONNX model is then quantized to an 8-bit integer .espdl model, which is optimized for the ESP32S3 hardware. The script concludes by performing a comprehensive evaluation of both the original and the quantized models, comparing key performance metrics such as mean Average Precision (mAP), precision, and recall, and printing a detailed comparison.

To accurately evaluate the performance of the .espdl model, the workflow uses the TorchExecutor from the ppq library to run the quantized model. This executor takes the quantized model graph and performs inference, outputting the raw feature maps of the model. These feature maps do not directly provide bounding box coordinates but rather encoded information about potential objects. Therefore, pre-processing of the input image and post-processing of the output feature maps are crucial steps.

Before an image is fed into the model for inference, it must be pre-processed to match the ESP32S3 model's input requirements. This involves resizing the image to the correct dimensions, normalizing its pixel values by adjusting for mean and standard deviation, and finally converting it into a PyTorch tensor.
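
A dependency-free sketch of these pre-processing steps is shown below. The nearest-neighbour resize stands in for whatever interpolation the pipeline actually uses, and the mean of 0 and standard deviation of 255 per channel (plain scaling to the range 0 to 1) are assumed values, not confirmed settings. The function names are hypothetical.

```python
def resize_nearest(image_hwc, out_h, out_w):
    """Nearest-neighbour resize of an image given as rows of pixel tuples."""
    in_h, in_w = len(image_hwc), len(image_hwc[0])
    return [
        [image_hwc[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def normalize_to_chw(image_hwc, mean, std):
    """Apply (value - mean) / std per channel and reorder HWC to CHW."""
    return [
        [[(pixel[c] - mean[c]) / std[c] for pixel in row] for row in image_hwc]
        for c in range(len(mean))
    ]


image = [[(255, 0, 51), (0, 0, 0)], [(0, 255, 0), (51, 51, 51)]]  # tiny 2x2 RGB image
resized = resize_nearest(image, 4, 4)
chw = normalize_to_chw(resized, mean=(0, 0, 0), std=(255, 255, 255))

# The final step would wrap the result in a batched tensor (not executed here):
#   import torch
#   tensor = torch.tensor(chw).unsqueeze(0)  # shape: 1 x 3 x H x W
```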

After the TorchExecutor runs inference, it produces raw output feature maps, not direct bounding boxes. The post-processing function decodes these feature maps to calculate the coordinates of potential objects. These potential detections are then filtered by a confidence score to remove unlikely results. Finally, Non-Maximum Suppression is applied to eliminate redundant, overlapping boxes for the same object, resulting in a clean list of final detections.
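
The last two post-processing steps, confidence filtering and Non-Maximum Suppression, can be sketched in plain Python once boxes have been decoded. The function names are hypothetical, the default thresholds are borrowed from the on-device example values on this page (0.25, 0.7, 100), and suppressing only boxes of the same class is an assumption about the pipeline.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(detections, score_thr=0.25, iou_thr=0.7, max_det=100):
    """detections: iterable of (x1, y1, x2, y2, score, class_id) tuples."""
    candidates = sorted(
        (d for d in detections if d[4] >= score_thr),
        key=lambda d: d[4],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box unless a higher-scoring box of the same class overlaps too much.
        if all(det[5] != k[5] or iou(det, k) <= iou_thr for k in kept):
            kept.append(det)
            if len(kept) == max_det:
                break
    return kept
```

Sorting by score first ensures that, within each group of overlapping boxes, the most confident one is the one that survives.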

  python -m model_conversion.run_evaluation

Visualize performance of model on images based on predictions

This utility script provides a visual representation of the model's performance on the test images. It draws bounding boxes for ground truth, true positives (TP), and false positives (FP) on the images, which helps in qualitatively assessing the model's accuracy. The script can be run for all object classes or filtered to visualize the performance for a specific class. Additionally, it generates a summary of images with the most false positives and false negatives to aid in error analysis.

  python -m model_conversion.visualize_evaluation --class_name bicycle  
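
How predictions are split into true positives, false positives, and missed ground truth boxes can be sketched with a greedy IoU matching for a single class. The matching rule, the IoU threshold of 0.5, and the function names are assumptions; the script's actual criteria may differ.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(predictions, ground_truth, iou_thr=0.5):
    """predictions: (x1, y1, x2, y2, score); ground_truth: (x1, y1, x2, y2)."""
    unmatched = list(range(len(ground_truth)))
    tp, fp = [], []
    # Process confident predictions first so they get the first pick of ground truth boxes.
    for pred in sorted(predictions, key=lambda p: p[4], reverse=True):
        best = max(unmatched, key=lambda i: box_iou(pred, ground_truth[i]), default=None)
        if best is not None and box_iou(pred, ground_truth[best]) >= iou_thr:
            unmatched.remove(best)  # each ground truth box can be matched only once
            tp.append(pred)
        else:
            fp.append(pred)
    fn = [ground_truth[i] for i in unmatched]
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    precision = len(tp) / (len(tp) + len(fp)) if (tp or fp) else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if (tp or fn) else 0.0
    return precision, recall
```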

Deployment Requirements

  • Ensure that the generated .espdl file is present in the model-deployment/coco_detect/models folder. It is placed there automatically when the above steps are followed.

  • Ensure that the ESP-IDF environment is correctly set up and its virtual environment is active before deploying the program to the ESP. (See here for instructions)

  • Optional: Change the detection thresholds in model-deployment/coco_detect/coco_detect.cpp. There, the first three arguments after m_model are the confidence threshold, the IoU threshold, and the maximum number of detections, in this case 0.25 (25%), 0.7 (70%), and 100, respectively:

    new dl::detect::yolo11PostProcessor(m_model, 0.25, 0.7, 100, {{8, 8, 4, 4}, {16, 16, 8, 8}, {32, 32, 16, 16}});
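
The last constructor argument lists one entry per detection stage. The following sketch shows what these numbers imply for a 640x640 input, under the assumption that each entry means (stride_y, stride_x, offset_y, offset_x); this interpretation and the function names are not confirmed by this page.

```python
# One tuple per stage, copied from the constructor call above.
# Assumed meaning: (stride_y, stride_x, offset_y, offset_x).
STAGES = [(8, 8, 4, 4), (16, 16, 8, 8), (32, 32, 16, 16)]


def candidate_cells(input_h, input_w, stages=STAGES):
    """Number of grid cells (candidate detections) per stage."""
    return [(input_h // sy) * (input_w // sx) for sy, sx, _oy, _ox in stages]


def cell_center(row, col, stage):
    """Pixel coordinates (x, y) of the center of a grid cell."""
    sy, sx, oy, ox = stage
    return (col * sx + ox, row * sy + oy)


cells = candidate_cells(640, 640)
print(cells, sum(cells))  # per-stage counts and the total number of candidates
```

Under this reading, a smaller input resolution reduces the number of candidates quadratically, which is one reason smaller images are cheaper to process on the device.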
    

Deployment

This deployment was tested on the ESP32S3.

  • It will use the .espdl file from model-deployment/coco_detect/models/ automatically.

  • Change into ./yolo11_detect/ and use the following command to build, flash, and monitor the program:

    idf.py fullclean build flash monitor
    

Bugs before build

  • If the build crashes, the image may be too large. Reduce the size of the image. YOLO11n is trained on 640x640 images, but smaller resolutions also work.

  • Make sure you are in your ESP-IDF virtual environment with Python 3.10

  • Sometimes the IDF_TARGET has to be set again:

    unset IDF_TARGET
    idf.py set-target esp32s3
    

Prediction on a single image

To make a prediction on just a single image, use the official example from Espressif: https://github.com/espressif/esp-dl/tree/master/examples/yolo11_detect. There, copy an image with the correct name into the corresponding 'img' folder. Changes to some other variables may also be necessary. As our end product involves more than predicting a single image, we have not included this in our repository.