From 43fee71fd2a5065bf076ed541d62e4b3b7fbde54 Mon Sep 17 00:00:00 2001
From: Songhao Jia <gasoonjia@meta.com>
Date: Thu, 22 Jan 2026 13:00:14 -0800
Subject: [PATCH] tutorial for devtool debuggability (#16735)

Summary:

This diff introduces a tutorial for ExecuTorch devtools debuggability.

Reviewed By: GregoryComer

Differential Revision: D90809255
---
 docs/source/devtools-tutorial.md              |  21 +-
 docs/source/model-debugging.md                |   2 +
 docs/source/model-inspector.rst               |   2 +-
 .../devtools-debugging-tutorial.py            | 487 ++++++++++++++++++
 docs/source/using-executorch-faqs.md          |   6 +-
 .../using-executorch-troubleshooting.md       |   1 +
 6 files changed, 515 insertions(+), 4 deletions(-)
 create mode 100644 docs/source/tutorials_source/devtools-debugging-tutorial.py

diff --git a/docs/source/devtools-tutorial.md b/docs/source/devtools-tutorial.md
index 6d540dc7f35..192b249422e 100644
--- a/docs/source/devtools-tutorial.md
+++ b/docs/source/devtools-tutorial.md
@@ -1,3 +1,20 @@
-## Developer Tools Usage Tutorial
+## Developer Tools Usage Tutorials
 
-Please refer to the [Developer Tools tutorial](tutorials/devtools-integration-tutorial) <!-- @lint-ignore --> for a walkthrough on how to profile a model in ExecuTorch using the Developer Tools.
+The ExecuTorch Developer Tools provide capabilities for profiling and debugging your models. We provide step-by-step tutorials for common workflows:
+
+### Profiling Tutorial
+
+Please refer to the [Profiling Tutorial](tutorials/devtools-integration-tutorial) <!-- @lint-ignore --> for a walkthrough on how to profile a model in ExecuTorch using the Developer Tools. This tutorial covers:
+
+- Generating ETRecord and ETDump artifacts
+- Using the Inspector API to analyze performance data
+- Identifying slow operators and bottlenecks
+
+### Debugging Tutorial
+
+Please refer to the [Debugging Tutorial](tutorials/devtools-debugging-tutorial) <!-- @lint-ignore --> for a walkthrough on how to debug numerical discrepancies in ExecuTorch models. This tutorial covers:
+
+- Capturing intermediate outputs with debug buffers
+- Using ``calculate_numeric_gap`` to identify precision issues
+- Debugging delegated models (e.g., XNNPACK)
+- Comparing runtime outputs with eager model references
diff --git a/docs/source/model-debugging.md b/docs/source/model-debugging.md
index 5cf0d7633fc..c090753468f 100644
--- a/docs/source/model-debugging.md
+++ b/docs/source/model-debugging.md
@@ -2,6 +2,8 @@
 
 With the ExecuTorch Developer Tools, users can debug their models for numerical inaccurcies and extract model outputs from their device to do quality analysis (such as Signal-to-Noise, Mean square error etc.).
 
+For a complete step-by-step walkthrough, please refer to the [Debugging Tutorial](tutorials/devtools-debugging-tutorial). <!-- @lint-ignore -->
+
 Currently, ExecuTorch supports the following debugging flows:
 - Extraction of model level outputs via ETDump.
 - Extraction of intermediate outputs (outside of delegates) via ETDump:
diff --git a/docs/source/model-inspector.rst b/docs/source/model-inspector.rst
index 4cda6580189..1af765cc7e4 100644
--- a/docs/source/model-inspector.rst
+++ b/docs/source/model-inspector.rst
@@ -17,7 +17,7 @@ APIs:
 * By accessing the `public attributes <#inspector-attributes>`__ of the ``Inspector``, ``EventBlock``, and ``Event`` classes.
 * By using a `CLI <#cli>`__ tool for basic functionalities.
 
-Please refer to the `e2e use case doc <tutorials/devtools-integration-tutorial.html>`__ get an understanding of how to use these in a real world example.
+Please refer to the `e2e use case doc <tutorials/devtools-integration-tutorial.html>`__ to get an understanding of how to use these for profiling, or the `debugging tutorial <tutorials/devtools-debugging-tutorial.html>`__ for debugging numerical discrepancies.
 
 
 Inspector Methods
diff --git a/docs/source/tutorials_source/devtools-debugging-tutorial.py b/docs/source/tutorials_source/devtools-debugging-tutorial.py
new file mode 100644
index 00000000000..fb8af3ea3e2
--- /dev/null
+++ b/docs/source/tutorials_source/devtools-debugging-tutorial.py
@@ -0,0 +1,487 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""
+Using the ExecuTorch Developer Tools for Numerical Debugging
+============================================================
+"""
+
+######################################################################
+# The `ExecuTorch Developer Tools <../devtools-overview.html>`__ is a set of tools designed to
+# provide users with the ability to profile, debug, and visualize ExecuTorch
+# models.
+#
+# This tutorial will show a full end-to-end flow of how to utilize the Developer Tools to debug a model
+# by detecting numerical discrepancies between the original PyTorch model and the ExecuTorch model.
+#
+# The tutorial will show you how to:
+#
+# 1. Check if the lowered ExecuTorch model is numerically correct.
+# 2. Gain a deeper understanding of where the numerical discrepancy comes from using the Inspector API.
+#
+# This is particularly useful when working with delegated models (e.g., XNNPACK) where numerical
+# precision may differ. Specifically, it will:
+#
+# 1. Generate the artifacts consumed by the Developer Tools (`ETRecord <../etrecord.html>`__, `ETDump <../etdump.html>`__).
+# 2. Run the model and compare final outputs between eager model and runtime.
+# 3. If discrepancies exist, use the Inspector's `calculate_numeric_gap <../model-inspector.html#calculate-numeric-gap>`__ method to identify operator-level issues.
+#
+# .. note::
+#    Currently, operator-level debugging support is limited to ET-visible operators,
+#    and every delegate call is treated as a single operator.
+#    We are working on expanding this support to dive into delegate operators.
+#
+# We provide two example debugging pipelines for an XNNPACK-delegated Vision Transformer (ViT) model:
+#
+# - **Python Pipeline**: Export, run, and debug entirely in Python using the ExecuTorch Runtime API.
+# - **CMake Pipeline**: Export in Python, run with the CMake example runner, then analyze in Python.
+
+######################################################################
+# Prerequisites
+# -------------
+#
+# To run this tutorial, you'll first need to
+# `Set up your ExecuTorch environment <../getting-started-setup.html>`__.
+#
+# For the Python pipeline, you'll need the ExecuTorch Python runtime bindings.
+# For the CMake pipeline, follow `these instructions <../runtime-build-and-cross-compilation.html#configure-the-cmake-build>`__ to set up CMake.
+#
+
+######################################################################
+# Pipeline 1: Python Runtime
+# =========================================================
+#
+# This pipeline allows you to export, run, and debug your model entirely in Python,
+# making it ideal for rapid iteration during development.
+
+######################################################################
+# Step 1: Export Model and Generate ETRecord
+# ------------------------------------------
+#
+# First, we export the model and generate an ``ETRecord``. The ETRecord contains
+# model graphs and metadata for linking runtime results to the eager model.
+# We use ``to_edge_transform_and_lower`` with ``generate_etrecord=True`` to
+# automatically capture the ETRecord during the lowering process.
+
+import os
+import tempfile
+
+import torch
+
+from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
+from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config
+
+from executorch.exir import ExecutorchProgramManager, to_edge_transform_and_lower
+from torch.export import export, ExportedProgram
+from torchvision import models  # type: ignore[import-untyped]
+
+# Create Vision Transformer model
+vit = models.vision_transformer.vit_b_16(weights="IMAGENET1K_V1")
+model = vit.eval()
+model_inputs = (torch.randn(1, 3, 224, 224),)
+
+temp_dir = tempfile.mkdtemp()
+
+# Export and lower model to XNNPACK delegate
+aten_model: ExportedProgram = export(model, model_inputs, strict=True)
+edge_program_manager = to_edge_transform_and_lower(
+    aten_model,
+    partitioner=[XnnpackPartitioner()],
+    compile_config=get_xnnpack_edge_compile_config(),
+    generate_etrecord=True,
+)
+
+et_program_manager: ExecutorchProgramManager = edge_program_manager.to_executorch()
+
+# Save the .pte file
+pte_path = os.path.join(temp_dir, "model.pte")
+et_program_manager.save(pte_path)
+
+# Get and save ETRecord with representative inputs
+etrecord = et_program_manager.get_etrecord()
+etrecord.update_representative_inputs(model_inputs)
+etrecord_path = os.path.join(temp_dir, "etrecord.bin")
+etrecord.save(etrecord_path)
+
+# sphinx_gallery_start_ignore
+from unittest.mock import patch
+
+# sphinx_gallery_end_ignore
+
+######################################################################
+#
+# .. note::
+#    The ``update_representative_inputs`` method is crucial for debugging.
+#    It stores the inputs that will be used to compute reference outputs
+#    from the exported program, which are then compared against the runtime outputs.
+#
+
+######################################################################
+# Step 2: Run Model and Generate ETDump with Debug Buffer
+# -------------------------------------------------------
+#
+# Next, we run the model using the ExecuTorch Python Runtime API with debug
+# output enabled. The debug buffer captures intermediate outputs from the
+# runtime execution.
+#
+# .. code-block:: python
+#
+#    from executorch.runtime import Method, Program, Runtime, Verification
+#
+#    # Load and run the model with debug output enabled
+#    et_runtime: Runtime = Runtime.get()
+#    program: Program = et_runtime.load_program(
+#        pte_path,
+#        verification=Verification.Minimal,
+#        enable_etdump=True,
+#        debug_buffer_size=1024 * 1024 * 1024,  # 1GB buffer
+#    )
+#
+#    forward: Method = program.load_method("forward")
+#    runtime_outputs = forward.execute(*model_inputs)
+#
+#    # Save ETDump and debug buffer
+#    etdump_path = os.path.join(temp_dir, "etdump.etdp")
+#    debug_buffer_path = os.path.join(temp_dir, "debug_buffer.bin")
+#    program.write_etdump_result_to_file(etdump_path, debug_buffer_path)
+#
+# .. warning::
+#    The debug buffer size should be large enough to hold all intermediate
+#    outputs.
+#    If the buffer is too small, some intermediate outputs may be truncated or an error may be raised.
+#
+
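One rough way to pick a value for `debug_buffer_size` is to add up the byte sizes of the tensors you expect to be logged. The sketch below is a back-of-the-envelope estimate only: the helper `estimate_debug_buffer_bytes` and the list of shapes are hypothetical, the real buffer layout may add alignment or bookkeeping overhead, and the headroom factor is an assumption rather than a documented rule.

```python
from math import prod

# Bytes per element for a few common dtypes.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "int8": 1}


def estimate_debug_buffer_bytes(shapes, dtype="float32", headroom=2.0):
    """Sum the raw tensor sizes and apply a safety factor (assumed, not official)."""
    raw_bytes = sum(prod(shape) * BYTES_PER_ELEMENT[dtype] for shape in shapes)
    return int(raw_bytes * headroom)


# Hypothetical intermediate-output shapes, for illustration only.
example_shapes = [(1, 768, 14, 14), (1, 197, 768), (1, 197, 3072)]
print(estimate_debug_buffer_bytes(example_shapes))
```

If the estimate comes out close to the configured size, choosing a larger buffer is the safer option.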
+######################################################################
+# Step 3: Compare Final Outputs (Best Practice)
+# ---------------------------------------------
+#
+# **Best Practice**: Before diving into operator-level debugging, first compare
+# the final outputs between the eager model and the runtime model. This helps
+# you quickly determine if there are any numerical issues worth investigating.
+#
+# .. code-block:: python
+#
+#    # Get eager model output
+#    with torch.no_grad():
+#        eager_output = model(*model_inputs)
+#
+#    # Compare with runtime output
+#    if isinstance(runtime_outputs, (list, tuple)):
+#        runtime_output = runtime_outputs[0]
+#    else:
+#        runtime_output = runtime_outputs
+#
+#    # Calculate MSE between eager and runtime outputs
+#    mse = torch.mean((eager_output - runtime_output) ** 2).item()
+#    print(f"Final output MSE: {mse}")
+#
+#    # Check if outputs are close enough
+#    if torch.allclose(eager_output, runtime_output, rtol=1e-3, atol=1e-5):
+#        print("Outputs match within tolerance!")
+#    else:
+#        print("Outputs differ - proceeding with operator-level analysis...")
+#
+
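To make the tolerance check concrete, here is a plain-Python sketch of the two quantities used in Step 3. It works on flat lists of floats instead of tensors. `mse` and `all_close` are illustrative helpers, and `all_close` mirrors the element-wise rule documented for `torch.allclose`, namely `|a - b| <= atol + rtol * |b|`. The sample values are made up.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def all_close(a, b, rtol=1e-3, atol=1e-5):
    """True if every pair satisfies |a - b| <= atol + rtol * |b|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


eager = [0.5000, -1.2000, 3.0000]    # stand-in for the eager model output
runtime = [0.5002, -1.2005, 3.0010]  # stand-in for the runtime output

print(f"MSE: {mse(eager, runtime):.3e}")
print("match" if all_close(eager, runtime) else "differ")
```

Because the relative term scales with the magnitude of the second argument, larger output values tolerate proportionally larger absolute differences.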
+######################################################################
+# Step 4: Operator-Level Debugging with calculate_numeric_gap
+# -----------------------------------------------------------
+#
+# If the final outputs show discrepancies, use the Inspector's ``calculate_numeric_gap``
+# method to identify which operators are contributing to the numerical differences.
+#
+# .. code-block:: python
+#
+#    import pandas as pd
+#    from executorch.devtools import Inspector
+#
+#    inspector = Inspector(
+#        etdump_path=etdump_path,
+#        etrecord=etrecord_path,
+#        debug_buffer_path=debug_buffer_path,
+#    )
+#
+#    pd.set_option("display.width", 100000)
+#    pd.set_option("display.max_columns", None)
+#
+#    # Calculate numerical gap using Mean Squared Error
+#    df: pd.DataFrame = inspector.calculate_numeric_gap("MSE")
+#    print(df)
+#
+# The returned DataFrame contains columns for each operator including:
+#
+# - ``aot_ops``: The operators in the eager model graph
+# - ``aot_intermediate_output``: Intermediate outputs from eager model
+# - ``runtime_ops``: The operators executed at runtime (may show DELEGATE_CALL for delegated ops)
+# - ``runtime_intermediate_output``: Intermediate outputs from runtime
+# - ``gap``: The numerical gap (MSE) between eager and runtime outputs
+#
+# Example output:
+#
+# .. code-block:: text
+#
+#    |    | aot_ops                                                         | aot_intermediate_output                            | runtime_ops                                        | runtime_intermediate_output                        | gap                        |
+#    |----|----------------------------------------------------------------|----------------------------------------------------|----------------------------------------------------|----------------------------------------------------| ---------------------------|
+#    | 0  | [conv2d]                                                        | [[[tensor([-0.0130,  0.0075, -0.0334, -0.0122,...  | [DELEGATE_CALL]                                    | [[[tensor([-0.0130,  0.0075, -0.0334, -0.0122,...  | [3.2530690555343034e-15]   |
+#    | 1  | [permute, cat, add, dropout]                                    | [[[tensor(-0.0024), tensor(0.0054), tensor(0.0...  | [DELEGATE_CALL]                                    | [[[tensor(-0.0024), tensor(0.0054), tensor(0.0...  | [3.2488685838924244e-15]   |
+#    | 4  | [transpose, linear, unflatten, unsqueeze, tran...]              | [[[tensor(0.0045), tensor(-0.0084), tensor(0.0...  | [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...] | [[tensor(0.0045), tensor(-0.0084), tensor(0.00...  | [0.00010033142876115867]   |
+#    | 59 | [transpose_66, linear_44, unflatten_11, unsque...]              | [[[tensor(-0.3346), tensor(0.1540), tensor(-0....  | [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...] | [[tensor(-0.3346), tensor(0.1540), tensor(-0.0...  | [0.02629170972698486]      |
+#
+
+######################################################################
+# Step 5: Analyze and Identify Problematic Operators
+# --------------------------------------------------
+#
+# Once you have the numerical gaps, identify operators with significant
+# discrepancies for further investigation.
+#
+# .. code-block:: python
+#
+#    # Find operators with the largest discrepancies
+#    df_sorted = df.sort_values(by="gap", ascending=False, key=lambda x: x.apply(lambda y: y[0] if isinstance(y, list) else y))
+#
+#    print("Top 5 operators with largest numerical discrepancies:")
+#    print(df_sorted.head(5))
+#
+#    # Filter for operators with gap above a threshold
+#    threshold = 1e-4
+#    problematic_ops = df[df["gap"].apply(lambda x: x[0] > threshold if isinstance(x, list) else x > threshold)]
+#    print(f"\nOperators with MSE > {threshold}:")
+#    print(problematic_ops)
+#
+# Example output showing problematic operators in a ViT model:
+#
+# .. code-block:: text
+#
+#    Top 5 operators with largest numerical discrepancies:
+#                                                  aot_ops                            aot_intermediate_output                                        runtime_ops                        runtime_intermediate_output                     gap
+#    59  [transpose_66, linear_44, unflatten_11, unsque...  [[[tensor(-0.3346), tensor(0.1540), tensor(-0....  [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...  [[tensor(-0.3346), tensor(0.1540), tensor(-0.0...   [0.02629170972698486]
+#    24  [transpose_24, linear_16, unflatten_4, unsquee...  [[[tensor(0.0344), tensor(-0.0583), tensor(-0....  [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...  [[tensor(0.0344), tensor(-0.0583), tensor(-0.0...  [0.010045093258604096]
+#    29  [transpose_30, linear_20, unflatten_5, unsquee...  [[[tensor(0.0457), tensor(0.0266), tensor(-0.0...  [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...  [[tensor(0.0457), tensor(0.0266), tensor(-0.05...  [0.008497326594593926]
+#    34  [transpose_36, linear_24, unflatten_6, unsquee...  [[[tensor(-0.1336), tensor(-0.0154), tensor(-0...  [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...  [[tensor(-0.1336), tensor(-0.0154), tensor(-0....  [0.007672668965640913]
+#    19  [transpose_18, linear_12, unflatten_3, unsquee...  [[[tensor(-0.0801), tensor(0.0458), tensor(-0....  [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...  [[tensor(-0.0801), tensor(0.0458), tensor(-0.0...  [0.007446783635888463]
+#
+#    Operators with MSE > 0.0001:
+#                                                  aot_ops                            aot_intermediate_output                                        runtime_ops                        runtime_intermediate_output                       gap
+#    4   [transpose, linear, unflatten, unsqueeze, tran...  [[[tensor(0.0045), tensor(-0.0084), tensor(0.0...  [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...  [[tensor(0.0045), tensor(-0.0084), tensor(0.00...  [0.00010033142876115867]
+#    9   [transpose_6, linear_4, unflatten_1, unsqueeze...  [[[tensor(0.0113), tensor(-0.0737), tensor(-0....  [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...  [[tensor(0.0113), tensor(-0.0737), tensor(-0.0...   [0.0005611182577030275]
+#    14  [transpose_12, linear_8, unflatten_2, unsqueez...  [[[tensor(-0.0476), tensor(-0.0941), tensor(-0...  [DELEGATE_CALL, DELEGATE_CALL, DELEGATE_CALL, ...  [[tensor(-0.0476), tensor(-0.0941), tensor(-0....    [0.004658652508649068]
+#    ...
+#
+# In this example, we can see that the attention layers (transpose + linear + unflatten patterns)
+# show the largest numerical discrepancies, which is expected behavior for delegated operators
+# using different precision.
+
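The sorting and filtering logic in Step 5 does not depend on pandas. The sketch below reproduces it on plain dictionaries, using a few gap values copied from the example output above. `first_gap` is a hypothetical helper that unwraps the single-element lists found in the `gap` column.

```python
def first_gap(gap):
    """Return the scalar gap, unwrapping the one-element list used in the DataFrame."""
    return gap[0] if isinstance(gap, list) else gap


# Row index and gap pairs taken from the example output above.
rows = [
    {"index": 0, "gap": [3.2530690555343034e-15]},
    {"index": 4, "gap": [0.00010033142876115867]},
    {"index": 24, "gap": [0.010045093258604096]},
    {"index": 59, "gap": [0.02629170972698486]},
]

# Largest gaps first, matching the descending sort in Step 5.
ranked = sorted(rows, key=lambda row: first_gap(row["gap"]), reverse=True)

# Keep only the rows whose gap exceeds the threshold.
threshold = 1e-4
flagged = [row["index"] for row in rows if first_gap(row["gap"]) > threshold]

print([row["index"] for row in ranked])
print(flagged)
```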
+######################################################################
+# Pipeline 2: CMake Runtime
+# ==========================
+#
+# This pipeline is useful when you want to test your model with the native
+# C++ runtime or on platforms where Python bindings are not available.
+
+######################################################################
+# Step 1: Export Model and Generate ETRecord
+# ------------------------------------------
+#
+# This step is the same as Step 1 of Pipeline 1: we reuse the model and export artifacts
+# we already created. The CMake pipeline needs one more artifact, created in Step 2:
+#
+# - ``bundled_program.bpte``: The BundledProgram file contains the model and
+#   sample inputs/outputs for testing.
+#
+# If you're only using the CMake pipeline, run the same export code as in
+# Pipeline 1:
+#
+# .. code-block:: python
+#
+#    # Export and lower model (same as Pipeline 1)
+#    aten_model = export(model, model_inputs, strict=True)
+#    edge_program_manager = to_edge_transform_and_lower(
+#        aten_model,
+#        partitioner=[XnnpackPartitioner()],
+#        compile_config=get_xnnpack_edge_compile_config(),
+#        generate_etrecord=True,
+#    )
+#    et_program_manager = edge_program_manager.to_executorch()
+#
+#    # Save artifacts
+#    et_program_manager.save(pte_path)
+#    etrecord = et_program_manager.get_etrecord()
+#    etrecord.update_representative_inputs(model_inputs)
+#    etrecord.save(etrecord_path)
+#
+
+######################################################################
+# Step 2: Create BundledProgram
+# -----------------------------
+#
+# For the CMake pipeline, we create a ``BundledProgram`` that packages the model
+# with sample inputs and expected outputs for testing. We reuse the
+# ``et_program_manager`` from Step 1.
+#
+# .. code-block:: python
+#
+#    from executorch.devtools import BundledProgram
+#    from executorch.devtools.bundled_program.config import MethodTestCase, MethodTestSuite
+#    from executorch.devtools.bundled_program.serialize import (
+#        serialize_from_bundled_program_to_flatbuffer,
+#    )
+#
+#    # Construct Method Test Suites using the same model and inputs from Pipeline 1
+#    m_name = "forward"
+#    inputs = [model_inputs for _ in range(2)]
+#
+#    method_test_suites = [
+#        MethodTestSuite(
+#            method_name=m_name,
+#            test_cases=[
+#                MethodTestCase(inputs=inp, expected_outputs=model(*inp)) for inp in inputs
+#            ],
+#        )
+#    ]
+#
+#    # Generate BundledProgram using the existing et_program_manager
+#    bundled_program = BundledProgram(et_program_manager, method_test_suites)
+#
+#    # Serialize BundledProgram to flatbuffer
+#    serialized_bundled_program = serialize_from_bundled_program_to_flatbuffer(
+#        bundled_program
+#    )
+#    bundled_program_path = os.path.join(temp_dir, "bundled_program.bpte")
+#    with open(bundled_program_path, "wb") as f:
+#        f.write(serialized_bundled_program)
+#
+
+######################################################################
+# Step 3: Run with CMake Example Runner
+# -------------------------------------
+#
+# Build and run the example runner with output verification and debug output enabled::
+#
+#       cd executorch
+#       ./examples/devtools/build_example_runner.sh
+#       cmake-out/examples/devtools/example_runner \
+#           --bundled_program_path="bundled_program.bpte" \
+#           --output_verification=true \
+#           --dump_intermediate_outputs=true
+#
+# The key flags are:
+#
+# - ``--output_verification=true``: Compare runtime outputs against the expected
+#   outputs stored in the BundledProgram (uses rtol=1e-3, atol=1e-5)
+# - ``--dump_intermediate_outputs=true``: Capture intermediate outputs for
+#   operator-level debugging
+# - ``--debug_buffer_size=<bytes>``: Size of debug buffer (default: 256KB, increase
+#   for larger models)
+#
+# Example output on success:
+#
+# .. code-block:: text
+#
+#    I 00:00:00.123456 executorch:example_runner.cpp:135] Model file bundled_program.bpte is loaded.
+#    I 00:00:00.123456 executorch:example_runner.cpp:145] Running method forward
+#    I 00:00:00.234567 executorch:example_runner.cpp:250] Model executed successfully.
+#    I 00:00:00.234567 executorch:example_runner.cpp:287] Model verified successfully.
+#
+# If verification fails (outputs don't match within tolerance), you'll see an error:
+#
+# .. code-block:: text
+#
+#    E 00:00:00.234567 executorch:example_runner.cpp:287] Bundle verification failed with status 0x10
+#
+# This will also generate:
+#
+# - ``etdump.etdp``: The ETDump file containing the execution trace (default path, configurable via ``--etdump_path``)
+# - ``debug_output.bin``: The debug buffer containing intermediate outputs (default path, configurable via ``--debug_output_path``)
+
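If you prefer to drive the runner from Python, the invocation from Step 3 can be assembled as an argument list. This is only a sketch: `build_runner_command` is a hypothetical helper, the flag names are the ones listed above, the value syntax each flag accepts should be checked against the runner's own help output, and the 64 MiB buffer size is an arbitrary example value.

```python
def build_runner_command(bundled_program_path, debug_buffer_size=None,
                         runner="cmake-out/examples/devtools/example_runner"):
    """Assemble the example_runner argument list from the flags described above."""
    command = [
        runner,
        f"--bundled_program_path={bundled_program_path}",
        "--output_verification=true",
        "--dump_intermediate_outputs=true",
    ]
    if debug_buffer_size is not None:
        # Size in bytes; larger models need more than the default.
        command.append(f"--debug_buffer_size={debug_buffer_size}")
    return command


command = build_runner_command("bundled_program.bpte", debug_buffer_size=64 * 1024 * 1024)
print(" ".join(command))
# To execute it from the executorch checkout: subprocess.run(command, check=True)
```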
+######################################################################
+# Step 4: Analyze Results in Python
+# ---------------------------------
+#
+# After running the model with the CMake runner, load the generated artifacts
+# back into Python for analysis using the Inspector.
+
+from executorch.devtools import Inspector
+
+# sphinx_gallery_start_ignore
+inspector_patch = patch.object(Inspector, "__init__", return_value=None)
+inspector_patch.start()
+# sphinx_gallery_end_ignore
+etrecord_path = "etrecord.bin"
+etdump_path = "etdump.etdp"
+debug_buffer_path = "debug_output.bin"
+
+inspector = Inspector(
+    etdump_path=etdump_path,
+    etrecord=etrecord_path,
+    debug_buffer_path=debug_buffer_path,
+)
+
+# sphinx_gallery_start_ignore
+inspector_patch.stop()
+# sphinx_gallery_end_ignore
+
+######################################################################
+# Then use the same analysis techniques as in Pipeline 1:
+#
+# .. code-block:: python
+#
+#    import pandas as pd
+#
+#    # Calculate numerical gaps
+#    df = inspector.calculate_numeric_gap("MSE")
+#
+#    # Find problematic operators
+#    df_sorted = df.sort_values(by="gap", ascending=False,
+#        key=lambda x: x.apply(lambda y: y[0] if isinstance(y, list) else y))
+#    print("Top operators with largest gaps:")
+#    print(df_sorted.head(5))
+#
+
+######################################################################
+# Best Practices for Debugging
+# ============================
+#
+# 1. **Start with final outputs**: Always compare the final model output first
+#    before diving into operator-level analysis. This saves time if outputs match.
+#
+# 2. **Use appropriate thresholds**: Small numerical differences (< 1e-6) are
+#    typically acceptable. Focus on operators with gaps > 1e-4.
+#
+# 3. **Focus on delegated operators**: Numerical discrepancies are most common
+#    in delegated operators (shown as ``DELEGATE_CALL``) due to different
+#    precision handling in delegate backends.
+#
+# 4. **Check accumulation patterns**: In transformer models, attention layers
+#    often show larger gaps due to accumulated numerical differences across
+#    many operations.
+#
+# 5. **Use stack traces**: With ETRecord, you can trace operators back to the
+#    original PyTorch source code for easier debugging using
+#    ``event.stack_traces`` and ``event.module_hierarchy``.
+#
+
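The threshold guidance in item 2 can be turned into a small triage helper. The function `triage_gap` and its three labels are hypothetical. The cut-offs are the rule-of-thumb values quoted above and may need adjusting for your model and metric.

```python
def triage_gap(gap, acceptable=1e-6, investigate=1e-4):
    """Bucket an MSE gap using the rule-of-thumb thresholds from the list above."""
    if gap < acceptable:
        return "acceptable"
    if gap > investigate:
        return "investigate"
    return "borderline"


# The first and last values come from the example output; 5e-5 is a made-up value.
for gap in (3.2530690555343034e-15, 5e-5, 0.02629170972698486):
    print(f"{gap:.3e} -> {triage_gap(gap)}")
```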
+######################################################################
+# Conclusion
+# ----------
+#
+# In this tutorial, we learned how to use the ExecuTorch Developer Tools
+# to debug numerical discrepancies in models. The key workflow is:
+#
+# 1. Export the model with ETRecord generation enabled
+# 2. Run the model with debug buffer enabled (Python or CMake)
+# 3. **First** compare final outputs between eager and runtime models
+# 4. **If issues found**, use ``calculate_numeric_gap`` for operator-level analysis
+# 5. Identify and investigate operators with significant gaps
+#
+# Links Mentioned
+# ^^^^^^^^^^^^^^^
+#
+# - `ExecuTorch Developer Tools Overview <../devtools-overview.html>`__
+# - `ETRecord <../etrecord.html>`__
+# - `ETDump <../etdump.html>`__
+# - `Inspector <../model-inspector.html>`__
+# - `Model Debugging Guide <../model-debugging.html>`__
+# - `Profiling Tutorial <devtools-integration-tutorial.html>`__
diff --git a/docs/source/using-executorch-faqs.md b/docs/source/using-executorch-faqs.md
index c147403c9e8..ee11032d9bf 100644
--- a/docs/source/using-executorch-faqs.md
+++ b/docs/source/using-executorch-faqs.md
@@ -48,7 +48,11 @@ Thread count can be set with the following function. Ensure this is done prior t
 ::executorch::extension::threadpool::get_threadpool()->_unsafe_reset_threadpool(num_threads);
 ```
 
-For a deeper investigation into model performance, ExecuTorch supports operator-level performance profiling. See [Using the ExecuTorch Developer Tools to Profile a Model](devtools-integration-tutorial.md) for more information.
+For a deeper investigation into model performance, ExecuTorch supports operator-level performance profiling. See [Using the ExecuTorch Developer Tools to Profile a Model](tutorials/devtools-integration-tutorial) <!-- @lint-ignore --> for more information.
+
+### Numerical Accuracy Issues
+
+If you encounter numerical accuracy issues or unexpected model outputs, ExecuTorch provides debugging tools to identify numerical discrepancies. See [Using the ExecuTorch Developer Tools to Debug a Model](tutorials/devtools-debugging-tutorial) <!-- @lint-ignore --> for a step-by-step guide on debugging numerical issues in delegated models.
 
 ### Missing Logs
 
diff --git a/docs/source/using-executorch-troubleshooting.md b/docs/source/using-executorch-troubleshooting.md
index 75648dc5b46..ee28536e43b 100644
--- a/docs/source/using-executorch-troubleshooting.md
+++ b/docs/source/using-executorch-troubleshooting.md
@@ -17,4 +17,5 @@ The ExecuTorch developer tools, or devtools, are a collection of tooling for tro
 - [Frequently Asked Questions](using-executorch-faqs.md) for solutions to commonly encountered questions and issues.
 - [Introduction to the ExecuTorch Developer Tools](runtime-profiling.md) for a high-level introduction to available developer tooling.
 - [Using the ExecuTorch Developer Tools to Profile a Model](tutorials/devtools-integration-tutorial) <!-- @lint-ignore --> for information on runtime performance profiling.
+- [Using the ExecuTorch Developer Tools to Debug a Model](tutorials/devtools-debugging-tutorial) <!-- @lint-ignore --> for information on debugging numerical discrepancies.
 - [Inspector APIs](runtime-profiling.md) for reference material on trace inspector APIs.