feat: add TensorRT .engine support with auto-fallback and benchmarking by kashviporwal-byte · Pull Request #98 · Devnil434/Eagle

kashviporwal-byte · 2026-05-21T18:32:57Z

Description

This pull request introduces high-performance NVIDIA TensorRT .engine model support inside the Eagle Surveillance framework to significantly accelerate inference speed, lower latency, and optimize GPU memory boundaries.

Importantly, it adds a hardware-resilient, smart auto-fallback layer ensuring that the application remains fully functional out-of-the-box on any machine (even those without NVIDIA GPUs or TensorRT libraries) by gracefully redirecting to standard PyTorch (.pt) or ONNX (.onnx) models without crashing.

Key Enhancements Added

Format Routing & Smart Fallback (services/detection/detection.py):
- Refactored the core Detector to support .pt, .onnx, and .engine loaders.
- Automatically searches for a nearby .engine counterpart (e.g., yolov8n.engine for yolov8n.pt).
- Gracefully falls back to standard model formats if compiled engines are missing, incompatible, or if CUDA is not available.
Low-level TensorRT Inference Utilities (services/detection/trt_utils.py):
- Created a standalone TensorRTInference utility class utilizing the native tensorrt and pycuda API bindings.
- Manages custom pagelocked host and GPU device binding buffer allocations (DMA copies) and asynchronous stream executions.
- Built to be CPU-safe (safe import fallbacks) to prevent dependency crashes on non-GPU platforms.
Model Conversion Utility (scripts/export_tensorrt.py):
- Created a CLI tool that automates compiling standard YOLO models into TensorRT engines.
- Exposes configuration arguments for FP16 half-precision, INT8 quantization calibration, custom resolutions, and CUDA target devices.
Multi-Format Pipeline Benchmarking (benchmark.py):
- Upgraded the benchmarking suite to support comparative runs across .pt, .onnx, and .engine files.
- Generates unified markdown reports at docs/benchmarks/comparison_report.md comparing Throughput (FPS), Latency (ms), and peak VRAM/RAM memory metrics.
Technical Documentation (docs/tensorrt_conversion.md):
- Added a detailed guide covering CUDA Toolkit, cuDNN, TensorRT, and PyCUDA python wheel setups, conversion commands, and benchmark guides.
Comprehensive Unit Tests (tests/test_tensorrt_routing.py):
- Created a suite of mock tests verifying format routing, CUDA load successes, CPU fallback redirection, and driver load failure recoveries. (Tests are fully passing).

Why `.engine` Files are Not Included in this Repository

TensorRT engines are hardware-specific serialized binaries.

Unlike standard .pt or .onnx models (which are hardware-agnostic), a TensorRT .engine file compiles the model's layers directly into optimized GPU memory maps and CUDA kernels tailored exclusively to the host machine's exact GPU microarchitecture (e.g., CUDA Compute Capability, Tensor Core architecture, VRAM size) and the active CUDA/TensorRT driver versions.

Distributing pre-compiled .engine files in a repository is not feasible because they will fail to run on any computer with a different GPU card or driver configuration.

Instead, developers and deployment systems should generate their tailored .engine models locally in one click by running the newly introduced conversion utility:

python scripts/export_tensorrt.py --model yolov8n.pt --fp16

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added TensorRT optimization support for accelerated GPU inference
  * Implemented device-aware benchmarking with CPU/CUDA selection
  * Added cross-format model benchmarking with comparative reporting via new CLI
  * Enabled smart model format fallback (TensorRT → ONNX → PyTorch) for flexible model loading
  * Added TensorRT model export tool with FP16/INT8 optimization options

* **Documentation**
  * Added comprehensive TensorRT conversion and optimization guide with installation and usage instructions

<!-- review_stack_entry_start -->

[![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/Devnil434/Eagle/pull/98?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

coderabbitai · 2026-05-21T18:33:11Z

📝 Walkthrough

Walkthrough

This PR adds TensorRT GPU-accelerated model inference support with smart fallback format selection, model export tooling, device-aware benchmarking, and performance comparison infrastructure. The changes enable loading and running YOLO detection models in optimized .engine format on CUDA devices while automatically falling back to PyTorch or ONNX formats when TensorRT is unavailable.

Changes

TensorRT Integration

Layer / File(s)	Summary
TensorRT Low-Level Inference Utility `services/detection/trt_utils.py`	`TensorRTInference` class deserializes `.engine` files, allocates pagelocked/device buffers for all bindings, and exposes async CUDA operations (`infer` method with host-to-device/device-to-host transfers and `execute_async_v2` execution). Module safely imports `tensorrt` and `pycuda` and provides `is_tensorrt_supported()` availability check.
Model Loader Refactoring with TensorRT Fallback `services/detection/detection.py`	`Detector.__init__` now stores `model_path`, `conf`, `device` and delegates to `_load_model_with_fallback()`, which attempts TensorRT `.engine` on CUDA, then falls back to `.onnx` or `.pt` based on extension or sibling file discovery. New format-specific loaders (`load_tensorrt_model`, `load_onnx_model`, `load_pytorch_model`) centralize `YOLO` initialization per format.
Model Export to TensorRT Format `scripts/export_tensorrt.py`	New CLI script validates CUDA availability and model file format (`.pt` or `.onnx`), loads via Ultralytics `YOLO`, exports to `.engine` with configurable FP16/INT8 quantization, dynamic batch sizing, and device targeting, with error handling and installation guidance.
Device-Aware Benchmarking and Comparative Reporting `benchmark.py`	`PipelineBenchmark.run_full_pipeline_benchmark` accepts optional `device` parameter, selects CUDA vs CPU dynamically, resets metrics per run, warms up and infers on selected device. `generate_report()` returns summary dict (model, FPS, latencies, peak RAM). New `run_comparative_benchmark()` runs multiple formats and writes consolidated comparison table. Argparse CLI supports `--model`, `--compare`, `--frames` options.
TensorRT Conversion and Benchmarking Documentation `docs/tensorrt_conversion.md`	User-facing guide covering TensorRT benefits, installation steps, model conversion via `export_tensorrt.py`, smart fallback execution (prefers `.engine`, falls back to `.pt`/`.onnx`), and benchmarking commands with output report paths.
Model Loader Routing Tests `tests/test_tensorrt_routing.py`	Unit test suite for `Detector` model loader routing: format selection by file extension, device-aware routing (TensorRT on CUDA, fallback on CPU), and fallback on TensorRT load failures via mocked `Path.exists` and exception simulation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Devnil434/Eagle#94: This PR implements TensorRT .engine support (trt_utils module, Detector fallback loaders, export script, routing tests, and comparative benchmarking) directly addressing the feature requested in that issue.

Possibly related PRs

Devnil434/Eagle#70: Both PRs modify benchmark.py's PipelineBenchmark class—updating run_full_pipeline_benchmark and generate_report to change how benchmarking runs and how metrics/reports are produced.

Suggested labels

gssoc:approved

Poem

🐰 TensorRT engines roar with CUDA speed,
Smart fallbacks catch what GPU can't feed,
From .pt to .engine, formats bend,
Benchmarks compare where performances end,
Optimization blooms in every layer! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 70.37% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main objective of the PR, which is adding TensorRT .engine support with an auto-fallback mechanism and benchmarking capabilities.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

services/detection/detection.py (1)
59-70: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Default device="cpu" makes the TensorRT path unreachable by default.

_load_model_with_fallback only attempts the engine when "cuda" in self.device.lower() (line 92), but Detector defaults device to "cpu". The CLI (main() at line 245) constructs Detector(model_name=args.model, confidence_threshold=args.conf) without passing a device, and the PR objective claims the system "Auto-detects nearby .engine counterparts" — with these defaults it never will, even on a CUDA box where yolov8n.engine sits next to yolov8n.pt.

Either auto-detect (torch.cuda.is_available()) when device is unspecified, or expose --device on the CLI and document that engines require explicit device="cuda".
🛡️ Proposed default-device resolution
     def __init__(
         self,
         model_name: str = settings.detector_model,
         confidence_threshold: float = 0.45,
-        device: str = "cpu",
+        device: str | None = None,
     ) -> None:
         self.model_path = model_name
         self.conf = confidence_threshold
-        self.device = device
+        if device is None:
+            try:
+                import torch
+                device = "cuda" if torch.cuda.is_available() else "cpu"
+            except Exception:
+                device = "cpu"
+        self.device = device
And add --device to the argparse block around line 240.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/detection/detection.py` around lines 59 - 70, The Detector currently
defaults device="cpu" so _load_model_with_fallback never attempts TensorRT even
on GPU hosts; change Detector.__init__ to accept device: Optional[str]=None and
when device is None set self.device = "cuda" if torch.cuda.is_available() else
"cpu" (keeping existing behavior if caller explicitly passes a value), and
update main()'s argparse to add a --device flag and pass args.device into
Detector(...) so engine auto-detection can run in _load_model_with_fallback when
CUDA is available.

🧹 Nitpick comments (7)

docs/tensorrt_conversion.md (1)
78-81: 💤 Low value

Clarify scope of "Resilient Fallback".

The doc states the system falls back when the engine is "compiled on a different GPU" — TensorRT deserialization typically raises at load time in such cases, which the Detector does catch, so this is accurate. However, "corrupted" engines can also fail at first inference (not just load), and silent fallback at runtime is not described here. Consider clarifying that the fallback occurs at load time only, so users understand a corrupted-but-loadable engine could still crash mid-stream.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/tensorrt_conversion.md` around lines 78 - 81, Update the "Resilient
Fallback" text to explicitly state that the automatic fallback is triggered only
on engine load/deserialization errors (caught by Detector) and not on later
runtime/inference failures; mention that a corrupted-but-loadable .engine may
still fail during inference and therefore will not be silently replaced, and
suggest users enable runtime error handling or add inference-time checks if they
need automatic fallback from .engine to .pt/.onnx during inference.
services/detection/trt_utils.py (2)
137-138: ⚡ Quick win

execute_async_v2 is legacy; pair the API change with execute_async_v3.

If you migrate to the name-based engine API above, you must also switch from execute_async_v2(bindings=[...], stream_handle=...) to execute_async_v3(stream_handle=...) and bind addresses via context.set_tensor_address(name, int(device_mem)) (typically done once during buffer allocation). NVIDIA's 8→10 guide flags this as the corresponding execution-side change.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/detection/trt_utils.py` around lines 137 - 138, The code currently
calls self.context.execute_async_v2(bindings=self.bindings,
stream_handle=self.stream.handle); replace this with the newer name-based API by
switching to self.context.execute_async_v3(stream_handle=self.stream.handle) and
ensure tensor addresses are bound earlier using
self.context.set_tensor_address(tensor_name, int(device_mem)) when you allocate
device buffers (do the set_tensor_address calls once during buffer allocation
for each tensor instead of passing a bindings list at execute time).
155-161: ⚡ Quick win

__del__ does not release CUDA resources deterministically.

Clearing the Python lists drops references to pycuda.driver.DeviceAllocation objects, but self.context, self.engine, and self.stream survive — and the order of finalizer execution at interpreter shutdown is undefined, so device memory can be freed after the CUDA context has already been torn down (raising LogicError: cuMemFree failed: invalid context). Replace with an explicit close() plus context-manager protocol; keep __del__ as a best-effort fallback.
♻️ Proposed cleanup
-    def __del__(self) -> None:
-        """
-        Cleans up GPU bindings and device pointers when the class object is garbage collected.
-        """
-        self.bindings.clear()
-        self.inputs.clear()
-        self.outputs.clear()
+    def close(self) -> None:
+        """Release GPU buffers, context, engine, and stream in a safe order."""
+        for buf in (*self.inputs, *self.outputs):
+            try:
+                buf["device"].free()
+            except Exception:
+                pass
+        self.inputs.clear()
+        self.outputs.clear()
+        self.bindings.clear()
+        self.context = None
+        self.engine = None
+        self.stream = None
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc, tb):
+        self.close()
+
+    def __del__(self) -> None:
+        try:
+            self.close()
+        except Exception:
+            pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/detection/trt_utils.py` around lines 155 - 161, The destructor
__del__ currently only clears lists and can run after CUDA context teardown;
implement a deterministic close() method that releases GPU resources in a safe
order (set current context if needed, free DeviceAllocation objects referenced
by self.bindings/self.inputs/self.outputs, destroy/free self.stream, then
release self.context and self.engine), add context-manager support via __enter__
returning self and __exit__ calling close(), and make __del__ a best-effort
fallback that calls close() inside a try/except to avoid exceptions during
interpreter shutdown; reference the methods/attributes __del__, close,
__enter__, __exit__, self.context, self.engine, self.stream, self.bindings,
self.inputs, self.outputs when updating the class.
services/detection/detection.py (2)
128-141: 💤 Low value

Collapse redundant YOLO loader methods—Ultralytics already chooses backend from the file extension
load_tensorrt_model, load_onnx_model, and load_pytorch_model all do the same self.model = YOLO(model_path, task="detect"); Ultralytics picks the runtime via AutoBackend from suffixes like .pt/.onnx/.engine, while task="detect" controls the detection head/output. Three separate public methods add API surface area for no functional gain—consider a single _load_model(model_path, kind) (for logging) or inline at call sites.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/detection/detection.py` around lines 128 - 141, Collapse the three
redundant loaders into a single loader (e.g., replace load_tensorrt_model,
load_onnx_model, load_pytorch_model with one method like load_model or
_load_model) that centralizes logger usage and calls self.model =
YOLO(model_path, task="detect"); update all callers to use the new method and
remove the three old public methods to shrink the API surface; keep an optional
`kind` or derive backend from the file extension only for logging context (use
logger.info to include model_path and derived kind) so behavior relies on
Ultralytics' AutoBackend decision.
128-131: 🏗️ Heavy lift

Remove or integrate unused TensorRT utilities

services/detection/detection.py’s load_tensorrt_model() routes .engine files straight to Ultralytics via YOLO(model_path, task="detect") and never uses TensorRTInference/is_tensorrt_supported().

Repo-wide search shows TensorRTInference and is_tensorrt_supported() are only defined in services/detection/trt_utils.py and are not referenced anywhere else, so the low-level TensorRT path (execute_async_v2, buffer/DMA setup, etc.) is dead code on the runtime path.

Either delete services/detection/trt_utils.py, or wire it in (gate with is_tensorrt_supported() and use TensorRTInference in load_tensorrt_model()), or if Ultralytics’ TRT backend is the intended route, drop the low-level module.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/detection/detection.py` around lines 128 - 131, The
load_tensorrt_model function currently ignores the low-level TRT utilities;
either remove the dead module services/detection/trt_utils.py entirely, or
update load_tensorrt_model to detect and use those utilities: check
is_tensorrt_supported() and if true and the model_path endswith('.engine')
instantiate and use TensorRTInference (instead of YOLO(...)) for inference,
otherwise fall back to YOLO(model_path, task="detect"); ensure imports for
is_tensorrt_supported and TensorRTInference are added to
services/detection/detection.py and that the branch cleanly falls back to the
existing YOLO path.
tests/test_tensorrt_routing.py (1)
50-77: 💤 Low value

assert_called_with only inspects the last call — strengthen the fallback assertions.

In test_routing_engine_cpu_fallback and test_routing_engine_load_failure_fallback, the assertion passes as long as the final YOLO(...) invocation used yolov8n.pt. It does not verify that the .engine path was actually attempted (failure test) or short-circuited (CPU test), so a regression that silently skips the engine attempt would still go green. Consider asserting against mock_yolo.call_args_list to pin down the intended sequence.
♻️ Example for the load-failure case
-        detector = Detector(model_name="yolov8n.engine", device="cuda:0")
-        # Should fallback to yolov8n.pt
-        mock_yolo.assert_called_with("yolov8n.pt", task="detect")
+        detector = Detector(model_name="yolov8n.engine", device="cuda:0")
+        # Engine must be attempted first, then .pt as fallback
+        calls = [call.args[0] for call in mock_yolo.call_args_list]
+        assert calls == ["yolov8n.engine", "yolov8n.pt"], calls
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_tensorrt_routing.py` around lines 50 - 77, Update the two tests to
assert the full call sequence against mock_yolo.call_args_list: in
test_routing_engine_cpu_fallback ensure Detector(model_name="yolov8n.engine",
device="cpu") results in exactly one YOLO(...) invocation with "yolov8n.pt"
(assert len(call_args_list)==1 and call_args_list[0] matches "yolov8n.pt") to
verify the engine path was short‑circuited on CPU; in
test_routing_engine_load_failure_fallback ensure the failure path attempts the
engine first and then falls back by asserting call_args_list[0] was called with
"yolov8n.engine" and call_args_list[1] was called with "yolov8n.pt" (and that
there are at least two calls), referencing mock_yolo, Detector and Path.exists
to locate the behavior to change.
benchmark.py (1)
95-99: 💤 Low value

Prefer .to(device) over .cuda() to honor the requested device index.

fake_tensor.cuda() always lands on the current default CUDA device, so a device="cuda:1" request silently warms up on cuda:0. Using tensor.to(device) keeps the warmup tensor and the predict device aligned, which matters for multi-GPU benchmark runs and avoids hidden cross-device copies during model.predict(..., device=device).
♻️ Proposed diff
-            fake_tensor = torch.rand(1, 3, img_size, img_size)
-            if "cuda" in device.lower():
-                fake_tensor = fake_tensor.cuda()
+            fake_tensor = torch.rand(1, 3, img_size, img_size)
+            if "cuda" in device.lower():
+                fake_tensor = fake_tensor.to(device)
-                    frame_to_process = torch.rand(1, 3, img_size, img_size)
-                    if "cuda" in device.lower():
-                        frame_to_process = frame_to_process.cuda()
+                    frame_to_process = torch.rand(1, 3, img_size, img_size)
+                    if "cuda" in device.lower():
+                        frame_to_process = frame_to_process.to(device)
Also applies to: 131-132
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark.py` around lines 95 - 99, The warmup tensor creation uses
fake_tensor.cuda(), which ignores explicit device strings like "cuda:1"; change
the warmup code to move the tensor with fake_tensor.to(device) so the tensor and
model.predict(..., device=device) use the same device (update the fake_tensor
assignment near the initial warmup and the similar warmup block later that also
creates fake_tensor before calling model.predict). Ensure you use the existing
variables fake_tensor, device, img_size and keep the conditional check for CUDA
in device.lower() when deciding to move the tensor.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmark.py`:
- Around line 257-268: The device selection logic in benchmark.py sets
device="cuda" whenever path.endswith(".engine") is true, even if CUDA isn't
available; update the condition so CUDA is chosen only when the model is an
.engine file AND torch.cuda.is_available() is true (otherwise fall back to
"cpu") before calling benchrunner.run_full_pipeline_benchmark; locate the device
assignment near the loop over models and adjust the boolean expression that sets
device accordingly.
- Around line 309-334: The --compare flag is defined with store_true but
default=True so args.compare is always truthy, making the else branch
unreachable; update the argparse setup so --compare is opt-in (set
default=False) or use a mutually exclusive group between --model and --compare,
then adjust the control flow that calls run_comparative_benchmark(benchrunner,
candidate_models, num_frames=...) and
benchrunner.run_full_pipeline_benchmark(model_path=..., num_frames=...)
accordingly (references: parser.add_argument("--compare", ...), args.compare,
run_comparative_benchmark, benchrunner.run_full_pipeline_benchmark,
candidate_models).

In `@docs/tensorrt_conversion.md`:
- Around line 61-69: Update the docs to reflect the corrected CLI behavior for
the --fp16 flag: document that --fp16 remains enabled by default but can be
disabled with the generated --no-fp16 opt-out (mirror the BooleanOptionalAction
change made in scripts/export_tensorrt.py) and ensure the table lists `--fp16`
(bool, Default: True) and `--no-fp16` as the opt-out; also add a blank line
immediately before and after the Markdown table to satisfy MD058.

In `@scripts/export_tensorrt.py`:
- Around line 106-117: The parser arguments for "--fp16" and "--int8" use
action="store_true" with contradictory defaults (fp16 default=True) so users
cannot disable FP16; change these to use argparse.BooleanOptionalAction for both
flags (or provide an explicit "--no-fp16") and remove the contradictory default
values so the flag can be toggled from the CLI; update the parser.add_argument
calls for "--fp16" and "--int8" in scripts/export_tensorrt.py to use
BooleanOptionalAction and sensible defaults (e.g., default=False for fp16 if you
want FP16 off by default, or keep True but allow --no-fp16) and adjust help text
accordingly.
- Around line 72-93: The model.export call in scripts/export_tensorrt.py is
invoking INT8 conversion without passing a calibration dataset; update the CLI
to accept a --data argument, forward that value as data=<data> to
model.export(...), and add a pre-flight validation that raises/logs an error and
exits when --int8 is specified but --data is missing; additionally validate
precision flags (e.g., if int8 is True, ensure half/fp16 is disabled or warn and
force-consistent behavior) so you don’t simultaneously enable both fp16 and int8
in the export flow.

In `@services/detection/detection.py`:
- Around line 86-126: The extension checks are case-sensitive and brittle:
replace repeated self.model_path.endswith(...) checks with a single normalized
extension variable (e.g., ext = Path(self.model_path).suffix.lower()) and route
loading based on ext (compare to ".engine", ".onnx", ".pt"); keep the existing
TensorRT attempt using resolved_engine_path and load_tensorrt_model(self, ...),
and when an explicit ".engine" fails on CUDA let control fall through to the
generic fallback branch (add a short inline comment near the else to explain
this intentional flow) while using load_onnx_model and load_pytorch_model based
on the normalized ext or found counterparts.

In `@services/detection/trt_utils.py`:
- Around line 127-132: Validate the incoming input_data against the engine input
binding before calling np.copyto: retrieve the target host buffer and its
expected element count and dtype from self.inputs[0] (e.g., input_info =
self.inputs[0]; expected_size = input_info["host"].size; expected_dtype =
input_info["host"].dtype), check that input_data.size == expected_size (or
reshapeable to that size) and that input_data.dtype == expected_dtype; if
sizes/dtypes mismatch either raise a clear ValueError/TypeError naming the
binding, expected size/dtype and actual size/dtype, or explicitly convert with
input_data.astype(expected_dtype, copy=False) and reshape to match before
calling np.copyto(input_info["host"], ...). Ensure these checks/convert happen
just before the np.copyto call that currently uses input_data.ravel().
- Around line 23-29: The current module-level import of pycuda.autoinit can
raise non-ImportError exceptions and crash imports; remove the autoinit import
from the top-level try/except (keep only "import pycuda.driver as cuda" and set
CUDA_AVAILABLE based on that import), and move the "import pycuda.autoinit" into
a guarded, lazy import inside TensorRTInference.__init__; wrap that lazy import
in a try/except that catches pycuda._driver.Error (and a broad Exception
fallback) so driver/context-init failures are caught, set or clear
CUDA_AVAILABLE accordingly, and ensure TensorRTInference.__init__ handles the
case where autoinit failed without letting the module import fail.

---

Outside diff comments:
In `@services/detection/detection.py`:
- Around line 59-70: The Detector currently defaults device="cpu" so
_load_model_with_fallback never attempts TensorRT even on GPU hosts; change
Detector.__init__ to accept device: Optional[str]=None and when device is None
set self.device = "cuda" if torch.cuda.is_available() else "cpu" (keeping
existing behavior if caller explicitly passes a value), and update main()'s
argparse to add a --device flag and pass args.device into Detector(...) so
engine auto-detection can run in _load_model_with_fallback when CUDA is
available.

---

Nitpick comments:
In `@benchmark.py`:
- Around line 95-99: The warmup tensor creation uses fake_tensor.cuda(), which
ignores explicit device strings like "cuda:1"; change the warmup code to move
the tensor with fake_tensor.to(device) so the tensor and model.predict(...,
device=device) use the same device (update the fake_tensor assignment near the
initial warmup and the similar warmup block later that also creates fake_tensor
before calling model.predict). Ensure you use the existing variables
fake_tensor, device, img_size and keep the conditional check for CUDA in
device.lower() when deciding to move the tensor.

In `@docs/tensorrt_conversion.md`:
- Around line 78-81: Update the "Resilient Fallback" text to explicitly state
that the automatic fallback is triggered only on engine load/deserialization
errors (caught by Detector) and not on later runtime/inference failures; mention
that a corrupted-but-loadable .engine may still fail during inference and
therefore will not be silently replaced, and suggest users enable runtime error
handling or add inference-time checks if they need automatic fallback from
.engine to .pt/.onnx during inference.

In `@services/detection/detection.py`:
- Around line 128-141: Collapse the three redundant loaders into a single loader
(e.g., replace load_tensorrt_model, load_onnx_model, load_pytorch_model with one
method like load_model or _load_model) that centralizes logger usage and calls
self.model = YOLO(model_path, task="detect"); update all callers to use the new
method and remove the three old public methods to shrink the API surface; keep
an optional `kind` or derive backend from the file extension only for logging
context (use logger.info to include model_path and derived kind) so behavior
relies on Ultralytics' AutoBackend decision.
- Around line 128-131: The load_tensorrt_model function currently ignores the
low-level TRT utilities; either remove the dead module
services/detection/trt_utils.py entirely, or update load_tensorrt_model to
detect and use those utilities: check is_tensorrt_supported() and if true and
the model_path endswith('.engine') instantiate and use TensorRTInference
(instead of YOLO(...)) for inference, otherwise fall back to YOLO(model_path,
task="detect"); ensure imports for is_tensorrt_supported and TensorRTInference
are added to services/detection/detection.py and that the branch cleanly falls
back to the existing YOLO path.

In `@services/detection/trt_utils.py`:
- Around line 137-138: The code currently calls
self.context.execute_async_v2(bindings=self.bindings,
stream_handle=self.stream.handle); replace this with the newer name-based API by
switching to self.context.execute_async_v3(stream_handle=self.stream.handle) and
ensure tensor addresses are bound earlier using
self.context.set_tensor_address(tensor_name, int(device_mem)) when you allocate
device buffers (do the set_tensor_address calls once during buffer allocation
for each tensor instead of passing a bindings list at execute time).
- Around line 155-161: The destructor __del__ currently only clears lists and
can run after CUDA context teardown; implement a deterministic close() method
that releases GPU resources in a safe order (set current context if needed, free
DeviceAllocation objects referenced by self.bindings/self.inputs/self.outputs,
destroy/free self.stream, then release self.context and self.engine), add
context-manager support via __enter__ returning self and __exit__ calling
close(), and make __del__ a best-effort fallback that calls close() inside a
try/except to avoid exceptions during interpreter shutdown; reference the
methods/attributes __del__, close, __enter__, __exit__, self.context,
self.engine, self.stream, self.bindings, self.inputs, self.outputs when updating
the class.

In `@tests/test_tensorrt_routing.py`:
- Around line 50-77: Update the two tests to assert the full call sequence
against mock_yolo.call_args_list: in test_routing_engine_cpu_fallback ensure
Detector(model_name="yolov8n.engine", device="cpu") results in exactly one
YOLO(...) invocation with "yolov8n.pt" (assert len(call_args_list)==1 and
call_args_list[0] matches "yolov8n.pt") to verify the engine path was
short‑circuited on CPU; in test_routing_engine_load_failure_fallback ensure the
failure path attempts the engine first and then falls back by asserting
call_args_list[0] was called with "yolov8n.engine" and call_args_list[1] was
called with "yolov8n.pt" (and that there are at least two calls), referencing
mock_yolo, Detector and Path.exists to locate the behavior to change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 74207dee-bf66-48fa-a84b-a206af124e96

📥 Commits

Reviewing files that changed from the base of the PR and between f282de7 and 360ca28.

📒 Files selected for processing (6)

benchmark.py
docs/tensorrt_conversion.md
scripts/export_tensorrt.py
services/detection/detection.py
services/detection/trt_utils.py
tests/test_tensorrt_routing.py

coderabbitai · 2026-05-21T18:43:16Z

+    for label, path in models.items():
+        if os.path.exists(path):
+            try:
+                # Decide device automatically based on model suffix
+                device = "cuda" if path.endswith(".engine") or torch.cuda.is_available() else "cpu"
+                res = benchrunner.run_full_pipeline_benchmark(model_path=path, num_frames=num_frames, device=device)
+                res["format"] = label
+                results.append(res)
+            except Exception as e:
+                print(f"❌ Failed to run benchmark for {label} ({path}): {e}")
+        else:
+            print(f"⚠️ Skipping comparison for format '{label}' since file was not found at '{path}'.")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Device picker can request CUDA when CUDA is unavailable.

device = "cuda" if path.endswith(".engine") or torch.cuda.is_available() else "cpu"

Operator precedence is (path.endswith(".engine")) or (torch.cuda.is_available()), so any .engine path forces device="cuda" even on a CPU-only host. That conflicts with the PR's "graceful fallback" promise — YOLO("yolov8n.engine", task="detect").predict(..., device="cuda") will raise on a machine without CUDA rather than skip the format. Note that, unlike Detector, this comparative runner uses YOLO directly and does not benefit from the loader's fallback logic, so the guard needs to live here.

🛡️ Proposed fix

- # Decide device automatically based on model suffix - device = "cuda" if path.endswith(".engine") or torch.cuda.is_available() else "cpu" - res = benchrunner.run_full_pipeline_benchmark(model_path=path, num_frames=num_frames, device=device) + # Skip .engine on hosts without CUDA; otherwise pick CUDA when available + if path.endswith(".engine") and not torch.cuda.is_available(): + print(f"⚠️ Skipping '{label}' — '{path}' requires CUDA, which is unavailable on this host.") + continue + device = "cuda" if torch.cuda.is_available() else "cpu" + res = benchrunner.run_full_pipeline_benchmark(model_path=path, num_frames=num_frames, device=device)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmark.py` around lines 257 - 268, The device selection logic in benchmark.py sets device="cuda" whenever path.endswith(".engine") is true, even if CUDA isn't available; update the condition so CUDA is chosen only when the model is an .engine file AND torch.cuda.is_available() is true (otherwise fall back to "cpu") before calling benchrunner.run_full_pipeline_benchmark; locate the device assignment near the loop over models and adjust the boolean expression that sets device accordingly.

coderabbitai · 2026-05-21T18:43:17Z

 if __name__ == "__main__":
-    int8_path = "yolov8n_int8_openvino_model" 
+    import argparse
+    parser = argparse.ArgumentParser(description="Run Eagle performance benchmarks")
+    parser.add_argument("--model", type=str, default=None, help="Path to a specific model to benchmark")
+    parser.add_argument("--compare", action="store_true", default=True, help="Run comparisons across formats (.pt, .onnx, .engine)")
+    parser.add_argument("--frames", type=int, default=100, help="Number of frames to benchmark")
+    args = parser.parse_args()
+
    REDIS_ENV_URL = os.getenv("REDIS_URL", settings.REDIS_URL)
-
    benchrunner = PipelineBenchmark(redis_url=REDIS_ENV_URL)
-    benchrunner.run_full_pipeline_benchmark(model_path=int8_path, num_frames=100)
+
+    if args.model:
+        # Benchmark specific model
+        benchrunner.run_full_pipeline_benchmark(model_path=args.model, num_frames=args.frames)
+    elif args.compare:
+        # Cross-format comparison candidate paths
+        candidate_models = {
+            "PyTorch (.pt)": "yolov8n.pt",
+            "ONNX (.onnx)": "yolov8n.onnx",
+            "TensorRT (.engine)": "yolov8n.engine"
+        }
+        run_comparative_benchmark(benchrunner, candidate_models, num_frames=args.frames)
+    else:
+        # Default single run
+        int8_path = "yolov8n_int8_openvino_model"
+        benchrunner.run_full_pipeline_benchmark(model_path=int8_path, num_frames=args.frames)


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

--compare default makes the else branch unreachable and removes opt-out.

parser.add_argument("--compare", action="store_true", default=True, ...) with store_true and default=True means args.compare is always truthy regardless of whether the flag is passed. As a result:

The else branch at L331-333 (single OpenVINO run) is dead code.

Users have no way to disable comparative mode without supplying --model, which contradicts a typical opt-in --compare semantic.

Either default to False (idiomatic for store_true) or model the choices as a mutually exclusive group.

🐛 Proposed fix

- parser.add_argument("--compare", action="store_true", default=True, help="Run comparisons across formats (.pt, .onnx, .engine)") + parser.add_argument("--compare", action="store_true", help="Run comparisons across formats (.pt, .onnx, .engine)")

Or, if compare-by-default is the intent, drop the unreachable else branch and document that behavior explicitly.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmark.py` around lines 309 - 334, The --compare flag is defined with store_true but default=True so args.compare is always truthy, making the else branch unreachable; update the argparse setup so --compare is opt-in (set default=False) or use a mutually exclusive group between --model and --compare, then adjust the control flow that calls run_comparative_benchmark(benchrunner, candidate_models, num_frames=...) and benchrunner.run_full_pipeline_benchmark(model_path=..., num_frames=...) accordingly (references: parser.add_argument("--compare", ...), args.compare, run_comparative_benchmark, benchrunner.run_full_pipeline_benchmark, candidate_models).

coderabbitai · 2026-05-21T18:43:17Z

+### Command Parameters
+| Flag | Type | Default | Description |
+| :--- | :--- | :--- | :--- |
+| `--model` | `str` | `yolov8n.pt` | Path to the source `.pt` or `.onnx` model file to compile. |
+| `--fp16` | `bool` | `True` | Enables FP16 half-precision optimization (highly recommended). |
+| `--int8` | `bool` | `False` | Enables INT8 quantization (requires calibrating dataset). |
+| `--imgsz` | `int` | `640` | Resolution width/height of input frames (default: 640). |
+| `--device` | `str` | `cuda:0` | GPU device ID to execute compiling (default: `cuda:0`). |
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Docs reflect the misleading --fp16 default; update alongside CLI fix.

The table documents --fp16 as a bool defaulting to True, which matches the current (broken) CLI behavior where the flag cannot be disabled. Once the CLI is corrected to use BooleanOptionalAction (see review on scripts/export_tensorrt.py), please document the --no-fp16 opt-out here as well. Also, per markdownlint MD058, surround the table with blank lines (already mostly present — verify there is a blank line directly before Line 62 and after Line 68).

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 62-62: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tensorrt_conversion.md` around lines 61 - 69, Update the docs to reflect the corrected CLI behavior for the --fp16 flag: document that --fp16 remains enabled by default but can be disabled with the generated --no-fp16 opt-out (mirror the BooleanOptionalAction change made in scripts/export_tensorrt.py) and ensure the table lists `--fp16` (bool, Default: True) and `--no-fp16` as the opt-out; also add a blank line immediately before and after the Markdown table to satisfy MD058.

coderabbitai · 2026-05-21T18:43:17Z

+    try:
+        # Ultralytics natively wraps the ONNX -> TensorRT conversion process
+        exported_path = model.export(
+            format="engine",
+            half=fp16,
+            int8=int8,
+            imgsz=imgsz,
+            device=device,
+            dynamic=True  # Enables dynamic batching support
+        )
+        logger.info("========================================= SUCCESS =========================================")
+        logger.info("TensorRT Engine compiled and optimized successfully!")
+        logger.info(f"Saved optimized model to: {os.path.abspath(exported_path)}")
+        logger.info("===========================================================================================")
+
+    except Exception as e:
+        logger.error(f"An error occurred during TensorRT compilation: {e}")
+        logger.error(
+            "Please ensure you have the TensorRT Python API and CUDA toolkit properly installed. "
+            "Refer to docs/tensorrt_conversion.md for assistance."
+        )
+        sys.exit(1)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Ultralytics YOLO export to TensorRT engine int8 calibration data argument requirement

💡 Result:

When exporting an Ultralytics YOLO model to a TensorRT engine with INT8 quantization, the data argument is essential for providing the representative dataset required for post-training calibration [1][2]. Key requirements and considerations for INT8 calibration include: 1. Data Argument: You must specify the data argument, which should point to a dataset configuration YAML file (e.g., data="my_dataset.yaml") [1][3]. The calibration process typically utilizes the validation images defined in this YAML file [1][4]. 2. Fallback Behavior: If the data argument is omitted when int8=True is enabled, the export process will default to using one of the "small" example datasets (such as coco8.yaml) rather than throwing an error [1][5]. 3. Calibration Dataset Size: NVIDIA and Ultralytics recommend using at least 500 to 1,000 images that are representative of your specific data to minimize accuracy loss [1][3]. While this is a guideline rather than a strict hard requirement, larger, more representative sets generally yield better results [1][5][3]. 4. Calibration Fraction: You can use the fraction argument (float, default 1.0) to specify the portion of the dataset used for calibration [1][5]. Setting a value less than 1.0 is useful for experiments or when computational resources are constrained [2]. 5. Calibration Algorithm: The default algorithm used is MINMAX_CALIBRATION, which is generally recommended for most use cases [1]. Example export code snippet: from ultralytics import YOLO model = YOLO("path/to/model.pt") model.export( format="engine", int8=True, data="path/to/dataset.yaml", fraction=1.0 # Optional: adjust based on resource availability)

Citations:

1: https://github.com/ultralytics/ultralytics/blob/main/docs/en/integrations/tensorrt.md

2: https://github.com/ultralytics/ultralytics/blob/18b4666b/docs/macros/export-args.md

3: Export YOLOv8 by using TensorRT with INT8 calibration ultralytics/ultralytics#11848

4: INT8 calibration set clarification ultralytics/ultralytics#14121

5: https://docs.ultralytics.com/integrations/tensorrt

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show file around the referenced lines sed -n '1,200p' scripts/export_tensorrt.py | nl -ba | sed -n '50,120p' # Search for CLI args related to data/int8/fp16 rg -n --hidden --no-ignore -S "int8|fp16|--data|data=" scripts/export_tensorrt.py || true # Show full argument parsing section if present rg -n "argparse|ArgumentParser|add_argument" scripts/export_tensorrt.py || true

Repository: Devnil434/Eagle

Length of output: 102

🏁 Script executed:

#!/bin/bash set -euo pipefail # Print lines 40-130 with line numbers using awk (nl not available) awk 'NR>=40 && NR<=130 {printf "%4d\t%s\n", NR, $0}' scripts/export_tensorrt.py # Search for CLI args / forwarding to model.export rg -n "int8|fp16|--int8|--fp16|--data|data=|data\s*=" scripts/export_tensorrt.py || true rg -n "ArgumentParser|argparse|add_argument|click\." scripts/export_tensorrt.py || true # Also show the model.export call block for exact args passed rg -n "model\.export\(" -n scripts/export_tensorrt.py

Repository: Devnil434/Eagle

Length of output: 4489

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show the remainder around argument parsing and export invocation awk 'NR>=90 && NR<=190 {printf "%4d\t%s\n", NR, $0}' scripts/export_tensorrt.py # Search for any data/fraction/calibration-related CLI options or forwarding rg -n --hidden -S "(--data|data=|fraction|calib|calibration|dataset)" scripts/export_tensorrt.py || true # Search for any mutual-exclusion / validation of fp16 vs int8 rg -n --hidden -S "(int8.*fp16|fp16.*int8|mutual|exclusive|refus|if .*int8|if .*fp16)" scripts/export_tensorrt.py || true

Repository: Devnil434/Eagle

Length of output: 2069

Fix INT8 TensorRT export: add calibration data wiring
In scripts/export_tensorrt.py (model.export call), int8=int8 is passed without any data= calibration dataset argument, and the CLI also provides no --data option. Ultralytics’ INT8 TensorRT flow uses data for calibration; without it, export may use a small default dataset or behave unexpectedly. Add --data (and forward data= to model.export(...)), and error when --int8 is set but --data is missing. Also consider validating precision flags since --fp16 defaults to True, so enabling --int8 currently enables both half and int8.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/export_tensorrt.py` around lines 72 - 93, The model.export call in scripts/export_tensorrt.py is invoking INT8 conversion without passing a calibration dataset; update the CLI to accept a --data argument, forward that value as data=<data> to model.export(...), and add a pre-flight validation that raises/logs an error and exits when --int8 is specified but --data is missing; additionally validate precision flags (e.g., if int8 is True, ensure half/fp16 is disabled or warn and force-consistent behavior) so you don’t simultaneously enable both fp16 and int8 in the export flow.

coderabbitai · 2026-05-21T18:43:17Z

+    parser.add_argument(
+        "--fp16",
+        action="store_true",
+        default=True,
+        help="Enable FP16 (half-precision) float operations for faster inference (recommended)."
+    )
+    parser.add_argument(
+        "--int8",
+        action="store_true",
+        default=False,
+        help="Enable INT8 quantization (requires calibrating dataset)."
+    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

--fp16 flag cannot be disabled; store_true + default=True is contradictory.

With action="store_true" and default=True, the flag is always True regardless of whether the user passes --fp16 — there is no way to opt out of FP16 from the CLI. The same anti-pattern applies (less harmfully) to --int8, where the explicit default=False is redundant. Use argparse.BooleanOptionalAction so users can pass --no-fp16, or flip the semantics with a --no-fp16 flag.

🛠️ Proposed fix

parser.add_argument( "--fp16", - action="store_true", - default=True, + action=argparse.BooleanOptionalAction, + default=True, help="Enable FP16 (half-precision) float operations for faster inference (recommended)." ) parser.add_argument( "--int8", action="store_true", - default=False, help="Enable INT8 quantization (requires calibrating dataset)." )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

parser.add_argument(

"--fp16",

action="store_true",

default=True,

help="Enable FP16 (half-precision) float operations for faster inference (recommended)."

)

parser.add_argument(

"--int8",

action="store_true",

default=False,

help="Enable INT8 quantization (requires calibrating dataset)."

)

parser.add_argument(

"--fp16",

action=argparse.BooleanOptionalAction,

default=True,

help="Enable FP16 (half-precision) float operations for faster inference (recommended)."

)

parser.add_argument(

"--int8",

action="store_true",

help="Enable INT8 quantization (requires calibrating dataset)."

)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/export_tensorrt.py` around lines 106 - 117, The parser arguments for "--fp16" and "--int8" use action="store_true" with contradictory defaults (fp16 default=True) so users cannot disable FP16; change these to use argparse.BooleanOptionalAction for both flags (or provide an explicit "--no-fp16") and remove the contradictory default values so the flag can be toggled from the CLI; update the parser.add_argument calls for "--fp16" and "--int8" in scripts/export_tensorrt.py to use BooleanOptionalAction and sensible defaults (e.g., default=False for fp16 if you want FP16 off by default, or keep True but allow --no-fp16) and adjust help text accordingly.

coderabbitai · 2026-05-21T18:43:17Z

+        should_try_engine = self.model_path.endswith(".engine") or engine_path.exists()
+
+        if should_try_engine:
+            resolved_engine_path = self.model_path if self.model_path.endswith(".engine") else str(engine_path)
+
+            # TensorRT requires an NVIDIA GPU with CUDA
+            if "cuda" in self.device.lower():
+                try:
+                    logger.info(f"Attempting optimized TensorRT engine load: {resolved_engine_path}")
+                    self.load_tensorrt_model(resolved_engine_path)
+                    return
+                except Exception as e:
+                    logger.warning(
+                        f"Failed to load TensorRT engine '{resolved_engine_path}': {e}. "
+                        f"Triggering automatic fallback to standard model format..."
+                    )
+            else:
+                logger.warning(
+                    f"TensorRT engine '{resolved_engine_path}' cannot run on non-CUDA device '{self.device}'. "
+                    f"Triggering automatic fallback to standard model format..."
+                )
+
+        # Main loader routing based on model extension
+        if self.model_path.endswith(".onnx"):
+            self.load_onnx_model(self.model_path)
+        elif self.model_path.endswith(".pt"):
+            self.load_pytorch_model(self.model_path)
+        else:
+            # If explicitly requested .engine failed or file is generic, seek compatible counterpart
+            pt_path = parent_dir / f"{base_name}.pt"
+            onnx_path = parent_dir / f"{base_name}.onnx"
+
+            if pt_path.exists():
+                logger.info(f"Auto-fallback: Loading counterpart PyTorch model: {pt_path}")
+                self.load_pytorch_model(str(pt_path))
+            elif onnx_path.exists():
+                logger.info(f"Auto-fallback: Loading counterpart ONNX model: {onnx_path}")
+                self.load_onnx_model(str(onnx_path))
+            else:
+                logger.info(f"No counterpart found. Loading default fallback model path: {self.model_path}")
+                self.load_pytorch_model(self.model_path)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Case-sensitive extension matching and brittle string routing.

self.model_path.endswith(".engine" / ".onnx" / ".pt") misclassifies MODEL.PT, weights.ENGINE, etc., on case-insensitive filesystems (Windows, macOS HFS+). Use Path(self.model_path).suffix.lower() once and route on that. Also note that when an explicit .engine request fails on a CUDA device (line 97 fallback), control flows into the else branch on line 113 — which is correct, but only because .engine doesn't match the .onnx/.pt arms; worth a comment so the next reader doesn't "fix" the missing elif.

♻️ Proposed routing

- should_try_engine = self.model_path.endswith(".engine") or engine_path.exists() + suffix = path.suffix.lower() + should_try_engine = suffix == ".engine" or engine_path.exists() ... - resolved_engine_path = self.model_path if self.model_path.endswith(".engine") else str(engine_path) + resolved_engine_path = self.model_path if suffix == ".engine" else str(engine_path) ... - if self.model_path.endswith(".onnx"): + if suffix == ".onnx": self.load_onnx_model(self.model_path) - elif self.model_path.endswith(".pt"): + elif suffix == ".pt": self.load_pytorch_model(self.model_path) else: + # Reached when path has no extension, or when an explicit .engine load failed/was skipped. ...

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/detection/detection.py` around lines 86 - 126, The extension checks are case-sensitive and brittle: replace repeated self.model_path.endswith(...) checks with a single normalized extension variable (e.g., ext = Path(self.model_path).suffix.lower()) and route loading based on ext (compare to ".engine", ".onnx", ".pt"); keep the existing TensorRT attempt using resolved_engine_path and load_tensorrt_model(self, ...), and when an explicit ".engine" fails on CUDA let control fall through to the generic fallback branch (add a short inline comment near the else to explain this intentional flow) while using load_onnx_model and load_pytorch_model based on the normalized ext or found counterparts.

coderabbitai · 2026-05-21T18:43:17Z

+try:
+    import pycuda.driver as cuda
+    import pycuda.autoinit  # Automatically handles CUDA context creation/destruction
+    CUDA_AVAILABLE = True
+except ImportError:
+    cuda = None
+    CUDA_AVAILABLE = False


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail ls -la rg -n "trt_utils|TensorRTInference|is_tensorrt_supported|pycuda\.autoinit|pycuda\.driver|CUDA_AVAILABLE" -S services/detection || true sed -n '1,120p' services/detection/trt_utils.py sed -n '120,260p' services/detection/trt_utils.py

Repository: Devnil434/Eagle

Length of output: 8772

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "services\.detection\.trt_utils|detection\.trt_utils|trt_utils|is_tensorrt_supported\(" -S . || true rg -n "TensorRTInference\s*\(" -S . || true

Repository: Devnil434/Eagle

Length of output: 235

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "is_tensorrt_supported" -S . || true rg -n "TensorRTInference" -S . || true rg -n "trt_utils" -S . || true

Repository: Devnil434/Eagle

Length of output: 418

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "services\.detection\.trt_utils|detection\.trt_utils|trt_utils\.py|importlib.*trt_utils|__import__\(.*trt_utils" -S . || true rg -n "tensorrt_supported|CUDA_AVAILABLE|TRT_AVAILABLE" -S . || true rg -n "TensorRTInference" -S . || true

Repository: Devnil434/Eagle

Length of output: 887

🌐 Web query:

pycuda.autoinit import failure exception type LogicError vs ImportError

💡 Result:

When you encounter a failure during the import of pycuda.autoinit, the exception type you receive provides critical information about the nature of the problem, distinguishing between structural installation issues and runtime environment failures [1][2][3]. ImportError (or ModuleNotFoundError) An ImportError usually signifies a fundamental structural problem with the PyCUDA installation [2][3][4]. Causes: - Missing or incompatible binary files (e.g., DLLs on Windows or shared objects on Linux) [5][2]. - Mismatched versions between the installed PyCUDA package and the underlying CUDA toolkit (e.g., trying to use a library compiled for CUDA 6.5 with CUDA 7.0) [2][3]. - An incomplete installation, such as failing to clone submodules (e.g., missing the pycuda.compyte module) [4]. - Environment path issues where the system cannot locate the required CUDA libraries [5]. LogicError (or other pycuda._driver.Error subclasses) A LogicError indicates that the library was imported successfully, but the CUDA runtime failed to initialize or execute a specific operation [1][6][7]. Causes: - No CUDA-capable device is detected on the system [1]. - Attempts to interact with the GPU inside a context that has not been initialized or is invalid, which is common in multi-threaded environments like Flask or web servers [6]. - Conflicts with other libraries that also attempt to control the GPU, leading to context creation failures [8]. - Incorrect CUDA driver/runtime interaction, often resolved by verifying that the CUDA headers match the active driver [3]. In summary, if you see an ImportError, focus on fixing your installation (reinstalling from PyPI, ensuring dependencies are met, or rebuilding from source) [3][4]. If you see a LogicError, focus on your runtime environment (checking GPU availability, handling context creation properly in multi-threaded applications, or verifying driver compatibility) [1][6].

Citations:

1: https://stackoverflow.com/questions/79861471/pycuda-driver-error-cuinit-failed-unknown-error

2: https://stackoverflow.com/questions/30404269/pycuda-driver-import-error-windows

3: Can't import pycuda.driver: undefined symbol: cuIpcOpenMemHandle_v2 inducer/pycuda#409

4: No module named 'pycuda.compyte' inducer/pycuda#339

5: PyCuda can't find DLL inducer/pycuda#281

6: https://stackoverflow.com/questions/50601029/pycuda-with-flask-gives-pycuda-driver-logicerror-cumoduleloaddataex

7: https://ask.csdn.net/questions/8136673

8: Return StageCmdFailedError when I import pycuda.autoinit treeverse/dvc#8222

🌐 Web query:

pycuda.autoinit import failure exception type LogicError vs ImportError

💡 Result:

In PyCUDA, an failure when importing pycuda.autoinit typically results in one of several exception types depending on the root cause, rather than being strictly limited to LogicError or ImportError. An ImportError often occurs when the underlying C++ extension module (pycuda._driver) cannot be loaded [1][2][3]. This is common on Windows or when the PyCUDA installation is misconfigured, causing the Python interpreter to fail to locate or link the necessary dynamic libraries (DLLs or.so files) [1][2]. A LogicError or RuntimeError is more common when the import process itself succeeds, but the subsequent automated initialization logic fails [4][5][6]. For example: - LogicError (or generic Error from _driver): Often raised when the CUDA driver API call (cuInit) fails, such as when an incompatible driver version is installed, or when no CUDA-capable devices are detected [4][5]. - RuntimeError: Frequently encountered when the initialization successfully calls CUDA functions, but fails to create a CUDA context (e.g., make_default_context fails) [6]. This can happen if all GPUs are exhausted, permissions are insufficient, or the environment is restricted (e.g., in certain containers) [7][8][6]. In summary, an ImportError indicates a failure to load the PyCUDA module files, while LogicError or RuntimeError indicates that the module loaded successfully but the attempt to initialize the CUDA driver or context failed at runtime [1][2][4][6].

Citations:

1: PyCuda can't find DLL inducer/pycuda#281

2: https://stackoverflow.com/questions/30404269/pycuda-driver-import-error-windows

3: https://forums.developer.nvidia.com/t/importerror-no-module-named-driver/26751

4: https://stackoverflow.com/questions/79861471/pycuda-driver-error-cuinit-failed-unknown-error

5: https://wenku.csdn.net/answer/3579e9da34e14ada98a9cd2bae1e00d4

6: https://forums.developer.nvidia.com/t/make-default-context-wasnt-able-to-create-a-context/252243

7: https://forums.developer.nvidia.com/t/import-pycuda-autoinit-error-can-t-create-context-on-any-of-the-1detected-devices/217712

8: Return StageCmdFailedError when I import pycuda.autoinit treeverse/dvc#8222

Avoid importing pycuda.autoinit at module import time (catch CUDA-driver init failures)

In services/detection/trt_utils.py (imports around lines 23-29), pycuda.autoinit is imported at module load and only ImportError is caught. If pycuda is installed but CUDA driver/context initialization fails, pycuda.autoinit commonly raises non-ImportError exceptions (e.g., pycuda._driver.Error/LogicError/RuntimeError), which will crash module import and break the intended “safe import” behavior.

🛡️ Proposed fix

try: import pycuda.driver as cuda - import pycuda.autoinit # Automatically handles CUDA context creation/destruction CUDA_AVAILABLE = True -except ImportError: +except Exception: # ImportError or driver init failure cuda = None CUDA_AVAILABLE = False

Then lazily import pycuda.autoinit inside TensorRTInference.__init__ (wrapped in a try/except that also handles pycuda._driver.Error/driver-init failures) so probes and pytest collection can safely import the module.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/detection/trt_utils.py` around lines 23 - 29, The current module-level import of pycuda.autoinit can raise non-ImportError exceptions and crash imports; remove the autoinit import from the top-level try/except (keep only "import pycuda.driver as cuda" and set CUDA_AVAILABLE based on that import), and move the "import pycuda.autoinit" into a guarded, lazy import inside TensorRTInference.__init__; wrap that lazy import in a try/except that catches pycuda._driver.Error (and a broad Exception fallback) so driver/context-init failures are caught, set or clear CUDA_AVAILABLE accordingly, and ensure TensorRTInference.__init__ handles the case where autoinit failed without letting the module import fail.

coderabbitai · 2026-05-21T18:43:17Z

+        if not self.inputs:
+            raise ValueError("No input bindings allocated in the TensorRT engine.")
+
+        input_info = self.inputs[0]
+        # Fast copy to pagelocked host buffer
+        np.copyto(input_info["host"], input_data.ravel())


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add input shape/dtype validation before np.copyto.

np.copyto(input_info["host"], input_data.ravel()) fails with a generic ValueError if input_data.size != size and silently mis-feeds the engine on dtype mismatch (since ravel() keeps the source dtype while the destination is the binding dtype, which can trigger an implicit cast or raise TypeError). For a wrapper that is called per-frame in a detection loop, surface a clear error early.

🛡️ Proposed validation

input_info = self.inputs[0] - # Fast copy to pagelocked host buffer - np.copyto(input_info["host"], input_data.ravel()) + expected_size = input_info["host"].size + if input_data.size != expected_size: + raise ValueError( + f"Input size mismatch: got {input_data.size}, expected {expected_size} " + f"for binding '{input_info['name']}' with shape {input_info['shape']}." + ) + # Fast copy to pagelocked host buffer (dtype-cast if needed) + np.copyto(input_info["host"], input_data.ravel().astype(input_info["dtype"], copy=False))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/detection/trt_utils.py` around lines 127 - 132, Validate the incoming input_data against the engine input binding before calling np.copyto: retrieve the target host buffer and its expected element count and dtype from self.inputs[0] (e.g., input_info = self.inputs[0]; expected_size = input_info["host"].size; expected_dtype = input_info["host"].dtype), check that input_data.size == expected_size (or reshapeable to that size) and that input_data.dtype == expected_dtype; if sizes/dtypes mismatch either raise a clear ValueError/TypeError naming the binding, expected size/dtype and actual size/dtype, or explicitly convert with input_data.astype(expected_dtype, copy=False) and reshape to match before calling np.copyto(input_info["host"], ...). Ensure these checks/convert happen just before the np.copyto call that currently uses input_data.ravel().

feat: add TensorRT .engine support with auto-fallback and benchmarking

360ca28

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

Devnil434 added the gssoc:approved label May 24, 2026

Conversation

kashviporwal-byte commented May 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Key Enhancements Added

Why .engine Files are Not Included in this Repository

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kashviporwal-byte commented May 21, 2026 •

edited by coderabbitai Bot

Loading

Why `.engine` Files are Not Included in this Repository

coderabbitai Bot commented May 21, 2026 •

edited

Loading