feat: add TensorRT .engine support with auto-fallback and benchmarking#98
📝 Walkthrough
This PR adds TensorRT GPU-accelerated model inference support with smart fallback format selection, model export tooling, device-aware benchmarking, and performance comparison infrastructure. The changes enable loading and running YOLO detection models in optimized …
Changes
TensorRT Integration
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/detection/detection.py (1)
59-70: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Default `device="cpu"` makes the TensorRT path unreachable by default.
`_load_model_with_fallback` only attempts the engine when `"cuda" in self.device.lower()` (line 92), but `Detector` defaults `device` to `"cpu"`. The CLI (`main()` at line 245) constructs `Detector(model_name=args.model, confidence_threshold=args.conf)` without passing a device, and the PR objective claims the system "Auto-detects nearby `.engine` counterparts"; with these defaults it never will, even on a CUDA box where `yolov8n.engine` sits next to `yolov8n.pt`.
Either auto-detect (`torch.cuda.is_available()`) when `device` is unspecified, or expose `--device` on the CLI and document that engines require explicit `device="cuda"`.
🛡️ Proposed default-device resolution
     def __init__(
         self,
         model_name: str = settings.detector_model,
         confidence_threshold: float = 0.45,
-        device: str = "cpu",
+        device: str | None = None,
     ) -> None:
         self.model_path = model_name
         self.conf = confidence_threshold
-        self.device = device
+        if device is None:
+            try:
+                import torch
+                device = "cuda" if torch.cuda.is_available() else "cpu"
+            except Exception:
+                device = "cpu"
+        self.device = device
And add `--device` to the argparse block around line 240.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/detection/detection.py` around lines 59 - 70, The Detector currently defaults device="cpu" so _load_model_with_fallback never attempts TensorRT even on GPU hosts; change Detector.__init__ to accept device: Optional[str]=None and when device is None set self.device = "cuda" if torch.cuda.is_available() else "cpu" (keeping existing behavior if caller explicitly passes a value), and update main()'s argparse to add a --device flag and pass args.device into Detector(...) so engine auto-detection can run in _load_model_with_fallback when CUDA is available.
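To illustrate how the CLI side of this suggestion could be wired, here is a minimal sketch. The helper names `resolve_device` and `build_parser` are hypothetical and not part of the PR; the sketch only assumes that `torch` may or may not be installed.

```python
import argparse
from typing import Optional


def resolve_device(requested: Optional[str] = None) -> str:
    """Return the caller's device, or auto-detect CUDA when none is given."""
    if requested is not None:
        # An explicit choice always wins, so existing callers keep their behavior.
        return requested
    try:
        import torch  # imported lazily so CPU-only installs still work

        return "cuda" if torch.cuda.is_available() else "cpu"
    except Exception:
        # torch missing or broken: fall back to CPU instead of failing.
        return "cpu"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser exposing the --device flag."""
    parser = argparse.ArgumentParser(description="Detector CLI (illustrative)")
    parser.add_argument(
        "--device",
        type=str,
        default=None,
        help="e.g. 'cpu', 'cuda', 'cuda:0'; auto-detected when omitted",
    )
    return parser
```

In `main()`, the parsed value would then be passed through as something like `Detector(..., device=resolve_device(args.device))`.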
🧹 Nitpick comments (7)
docs/tensorrt_conversion.md (1)
78-81: 💤 Low value
Clarify scope of "Resilient Fallback".
The doc states the system falls back when the engine is "compiled on a different GPU"; TensorRT deserialization typically raises at load time in such cases, which the `Detector` does catch, so this is accurate. However, "corrupted" engines can also fail at first inference (not just load), and silent fallback at runtime is not described here. Consider clarifying that the fallback occurs at load time only, so users understand a corrupted-but-loadable engine could still crash mid-stream.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tensorrt_conversion.md` around lines 78 - 81, Update the "Resilient Fallback" text to explicitly state that the automatic fallback is triggered only on engine load/deserialization errors (caught by Detector) and not on later runtime/inference failures; mention that a corrupted-but-loadable .engine may still fail during inference and therefore will not be silently replaced, and suggest users enable runtime error handling or add inference-time checks if they need automatic fallback from .engine to .pt/.onnx during inference.
services/detection/trt_utils.py (2)
137-138: ⚡ Quick win
`execute_async_v2` is legacy; pair the API change with `execute_async_v3`.
If you migrate to the name-based engine API above, you must also switch from `execute_async_v2(bindings=[...], stream_handle=...)` to `execute_async_v3(stream_handle=...)` and bind addresses via `context.set_tensor_address(name, int(device_mem))` (typically done once during buffer allocation). NVIDIA's 8→10 guide flags this as the corresponding execution-side change.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/detection/trt_utils.py` around lines 137 - 138, The code currently calls self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle); replace this with the newer name-based API by switching to self.context.execute_async_v3(stream_handle=self.stream.handle) and ensure tensor addresses are bound earlier using self.context.set_tensor_address(tensor_name, int(device_mem)) when you allocate device buffers (do the set_tensor_address calls once during buffer allocation for each tensor instead of passing a bindings list at execute time).
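As a rough sketch of the name-based flow described here: the two helpers below are hypothetical, and `device_ptrs` (a mapping of tensor name to device address) is an assumed structure rather than what `trt_utils.py` actually stores. The TensorRT calls used (`num_io_tensors`, `get_tensor_name`, `set_tensor_address`, `execute_async_v3`) are available from TensorRT 8.5 onward; the objects are passed in so the sketch itself does not import `tensorrt`.

```python
from typing import Dict


def bind_tensor_addresses(engine, context, device_ptrs: Dict[str, int]) -> None:
    """Bind every I/O tensor by name; intended to run once after buffer allocation.

    `engine` and `context` are expected to behave like tensorrt.ICudaEngine and
    tensorrt.IExecutionContext. `device_ptrs` is a hypothetical name -> address map.
    """
    for index in range(engine.num_io_tensors):
        name = engine.get_tensor_name(index)
        if name not in device_ptrs:
            raise KeyError(f"no device buffer allocated for tensor '{name}'")
        context.set_tensor_address(name, int(device_ptrs[name]))


def launch(context, stream_handle: int) -> bool:
    """Enqueue inference on the given CUDA stream; no bindings list is passed."""
    return context.execute_async_v3(stream_handle=stream_handle)
```

Binding once at allocation time keeps the per-frame path down to the host/device copies plus a single `launch` call.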
155-161: ⚡ Quick win
`__del__` does not release CUDA resources deterministically.
Clearing the Python lists drops references to `pycuda.driver.DeviceAllocation` objects, but `self.context`, `self.engine`, and `self.stream` survive, and the order of finalizer execution at interpreter shutdown is undefined, so device memory can be freed after the CUDA context has already been torn down (raising `LogicError: cuMemFree failed: invalid context`). Replace with an explicit `close()` plus context-manager protocol; keep `__del__` as a best-effort fallback.
♻️ Proposed cleanup
-    def __del__(self) -> None:
-        """
-        Cleans up GPU bindings and device pointers when the class object is garbage collected.
-        """
-        self.bindings.clear()
-        self.inputs.clear()
-        self.outputs.clear()
+    def close(self) -> None:
+        """Release GPU buffers, context, engine, and stream in a safe order."""
+        for buf in (*self.inputs, *self.outputs):
+            try:
+                buf["device"].free()
+            except Exception:
+                pass
+        self.inputs.clear()
+        self.outputs.clear()
+        self.bindings.clear()
+        self.context = None
+        self.engine = None
+        self.stream = None
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc, tb):
+        self.close()
+
+    def __del__(self) -> None:
+        try:
+            self.close()
+        except Exception:
+            pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/detection/trt_utils.py` around lines 155 - 161, The destructor __del__ currently only clears lists and can run after CUDA context teardown; implement a deterministic close() method that releases GPU resources in a safe order (set current context if needed, free DeviceAllocation objects referenced by self.bindings/self.inputs/self.outputs, destroy/free self.stream, then release self.context and self.engine), add context-manager support via __enter__ returning self and __exit__ calling close(), and make __del__ a best-effort fallback that calls close() inside a try/except to avoid exceptions during interpreter shutdown; reference the methods/attributes __del__, close, __enter__, __exit__, self.context, self.engine, self.stream, self.bindings, self.inputs, self.outputs when updating the class.
services/detection/detection.py (2)
128-141: 💤 Low value
Collapse redundant YOLO loader methods; Ultralytics already chooses the backend from the file extension.
`load_tensorrt_model`, `load_onnx_model`, and `load_pytorch_model` all do the same `self.model = YOLO(model_path, task="detect")`; Ultralytics picks the runtime via AutoBackend from suffixes like `.pt`/`.onnx`/`.engine`, while `task="detect"` controls the detection head/output. Three separate public methods add API surface area for no functional gain; consider a single `_load_model(model_path, kind)` (for logging) or inline at call sites.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/detection/detection.py` around lines 128 - 141, Collapse the three redundant loaders into a single loader (e.g., replace load_tensorrt_model, load_onnx_model, load_pytorch_model with one method like load_model or _load_model) that centralizes logger usage and calls self.model = YOLO(model_path, task="detect"); update all callers to use the new method and remove the three old public methods to shrink the API surface; keep an optional `kind` or derive backend from the file extension only for logging context (use logger.info to include model_path and derived kind) so behavior relies on Ultralytics' AutoBackend decision.
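A single loader along these lines might look like the sketch below. The names `model_kind` and `load_model` are hypothetical, and `factory` stands in for `ultralytics.YOLO` so the example runs without that dependency. The suffix is lower-cased so that names such as `model.ENGINE` are also recognized.

```python
import logging
from pathlib import Path
from typing import Any, Callable

logger = logging.getLogger(__name__)

# Suffix -> label used only for logging; the backend choice is left to the factory.
_KIND_BY_SUFFIX = {".engine": "TensorRT", ".onnx": "ONNX", ".pt": "PyTorch"}


def model_kind(model_path: str) -> str:
    """Derive a human-readable backend label from a case-normalized suffix."""
    suffix = Path(model_path).suffix.lower()
    return _KIND_BY_SUFFIX.get(suffix, "unknown")


def load_model(model_path: str, factory: Callable[..., Any]) -> Any:
    """Load any supported format through one code path.

    `factory` is a stand-in for ultralytics.YOLO (an assumption of this sketch).
    """
    logger.info("Loading %s model from %s", model_kind(model_path), model_path)
    return factory(model_path, task="detect")
```

In the real class the call would presumably be `load_model(path, YOLO)`, with the result assigned to `self.model`.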
128-131: 🏗️ Heavy lift
Remove or integrate unused TensorRT utilities
`services/detection/detection.py`'s `load_tensorrt_model()` routes `.engine` files straight to Ultralytics via `YOLO(model_path, task="detect")` and never uses `TensorRTInference` / `is_tensorrt_supported()`.
Repo-wide search shows `TensorRTInference` and `is_tensorrt_supported()` are only defined in `services/detection/trt_utils.py` and are not referenced anywhere else, so the low-level TensorRT path (`execute_async_v2`, buffer/DMA setup, etc.) is dead code on the runtime path.
Either delete `services/detection/trt_utils.py`, or wire it in (gate with `is_tensorrt_supported()` and use `TensorRTInference` in `load_tensorrt_model()`), or if Ultralytics' TRT backend is the intended route, drop the low-level module.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/detection/detection.py` around lines 128 - 131, The load_tensorrt_model function currently ignores the low-level TRT utilities; either remove the dead module services/detection/trt_utils.py entirely, or update load_tensorrt_model to detect and use those utilities: check is_tensorrt_supported() and if true and the model_path endswith('.engine') instantiate and use TensorRTInference (instead of YOLO(...)) for inference, otherwise fall back to YOLO(model_path, task="detect"); ensure imports for is_tensorrt_supported and TensorRTInference are added to services/detection/detection.py and that the branch cleanly falls back to the existing YOLO path.
tests/test_tensorrt_routing.py (1)
50-77: 💤 Low value
`assert_called_with` only inspects the last call; strengthen the fallback assertions.
In `test_routing_engine_cpu_fallback` and `test_routing_engine_load_failure_fallback`, the assertion passes as long as the final `YOLO(...)` invocation used `yolov8n.pt`. It does not verify that the `.engine` path was actually attempted (failure test) or short-circuited (CPU test), so a regression that silently skips the engine attempt would still go green. Consider asserting against `mock_yolo.call_args_list` to pin down the intended sequence.
♻️ Example for the load-failure case
-        detector = Detector(model_name="yolov8n.engine", device="cuda:0")
-        # Should fallback to yolov8n.pt
-        mock_yolo.assert_called_with("yolov8n.pt", task="detect")
+        detector = Detector(model_name="yolov8n.engine", device="cuda:0")
+        # Engine must be attempted first, then .pt as fallback
+        calls = [call.args[0] for call in mock_yolo.call_args_list]
+        assert calls == ["yolov8n.engine", "yolov8n.pt"], calls
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_tensorrt_routing.py` around lines 50 - 77, Update the two tests to assert the full call sequence against mock_yolo.call_args_list: in test_routing_engine_cpu_fallback ensure Detector(model_name="yolov8n.engine", device="cpu") results in exactly one YOLO(...) invocation with "yolov8n.pt" (assert len(call_args_list)==1 and call_args_list[0] matches "yolov8n.pt") to verify the engine path was short‑circuited on CPU; in test_routing_engine_load_failure_fallback ensure the failure path attempts the engine first and then falls back by asserting call_args_list[0] was called with "yolov8n.engine" and call_args_list[1] was called with "yolov8n.pt" (and that there are at least two calls), referencing mock_yolo, Detector and Path.exists to locate the behavior to change.
benchmark.py (1)
95-99: 💤 Low value
Prefer `.to(device)` over `.cuda()` to honor the requested device index.
`fake_tensor.cuda()` always lands on the current default CUDA device, so a `device="cuda:1"` request silently warms up on `cuda:0`. Using `tensor.to(device)` keeps the warmup tensor and the predict device aligned, which matters for multi-GPU benchmark runs and avoids hidden cross-device copies during `model.predict(..., device=device)`.
♻️ Proposed diff
-        fake_tensor = torch.rand(1, 3, img_size, img_size)
-        if "cuda" in device.lower():
-            fake_tensor = fake_tensor.cuda()
+        fake_tensor = torch.rand(1, 3, img_size, img_size)
+        if "cuda" in device.lower():
+            fake_tensor = fake_tensor.to(device)

-        frame_to_process = torch.rand(1, 3, img_size, img_size)
-        if "cuda" in device.lower():
-            frame_to_process = frame_to_process.cuda()
+        frame_to_process = torch.rand(1, 3, img_size, img_size)
+        if "cuda" in device.lower():
+            frame_to_process = frame_to_process.to(device)
Also applies to: 131-132
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmark.py` around lines 95 - 99, The warmup tensor creation uses fake_tensor.cuda(), which ignores explicit device strings like "cuda:1"; change the warmup code to move the tensor with fake_tensor.to(device) so the tensor and model.predict(..., device=device) use the same device (update the fake_tensor assignment near the initial warmup and the similar warmup block later that also creates fake_tensor before calling model.predict). Ensure you use the existing variables fake_tensor, device, img_size and keep the conditional check for CUDA in device.lower() when deciding to move the tensor.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmark.py`:
- Around line 257-268: The device selection logic in benchmark.py sets
device="cuda" whenever path.endswith(".engine") is true, even if CUDA isn't
available; update the condition so CUDA is chosen only when the model is an
.engine file AND torch.cuda.is_available() is true (otherwise fall back to
"cpu") before calling benchrunner.run_full_pipeline_benchmark; locate the device
assignment near the loop over models and adjust the boolean expression that sets
device accordingly.
- Around line 309-334: The --compare flag is defined with store_true but
default=True so args.compare is always truthy, making the else branch
unreachable; update the argparse setup so --compare is opt-in (set
default=False) or use a mutually exclusive group between --model and --compare,
then adjust the control flow that calls run_comparative_benchmark(benchrunner,
candidate_models, num_frames=...) and
benchrunner.run_full_pipeline_benchmark(model_path=..., num_frames=...)
accordingly (references: parser.add_argument("--compare", ...), args.compare,
run_comparative_benchmark, benchrunner.run_full_pipeline_benchmark,
candidate_models).
In `@docs/tensorrt_conversion.md`:
- Around line 61-69: Update the docs to reflect the corrected CLI behavior for
the --fp16 flag: document that --fp16 remains enabled by default but can be
disabled with the generated --no-fp16 opt-out (mirror the BooleanOptionalAction
change made in scripts/export_tensorrt.py) and ensure the table lists `--fp16`
(bool, Default: True) and `--no-fp16` as the opt-out; also add a blank line
immediately before and after the Markdown table to satisfy MD058.
In `@scripts/export_tensorrt.py`:
- Around line 106-117: The parser arguments for "--fp16" and "--int8" use
action="store_true" with contradictory defaults (fp16 default=True) so users
cannot disable FP16; change these to use argparse.BooleanOptionalAction for both
flags (or provide an explicit "--no-fp16") and remove the contradictory default
values so the flag can be toggled from the CLI; update the parser.add_argument
calls for "--fp16" and "--int8" in scripts/export_tensorrt.py to use
BooleanOptionalAction and sensible defaults (e.g., default=False for fp16 if you
want FP16 off by default, or keep True but allow --no-fp16) and adjust help text
accordingly.
- Around line 72-93: The model.export call in scripts/export_tensorrt.py is
invoking INT8 conversion without passing a calibration dataset; update the CLI
to accept a --data argument, forward that value as data=<data> to
model.export(...), and add a pre-flight validation that raises/logs an error and
exits when --int8 is specified but --data is missing; additionally validate
precision flags (e.g., if int8 is True, ensure half/fp16 is disabled or warn and
force-consistent behavior) so you don’t simultaneously enable both fp16 and int8
in the export flow.
In `@services/detection/detection.py`:
- Around line 86-126: The extension checks are case-sensitive and brittle:
replace repeated self.model_path.endswith(...) checks with a single normalized
extension variable (e.g., ext = Path(self.model_path).suffix.lower()) and route
loading based on ext (compare to ".engine", ".onnx", ".pt"); keep the existing
TensorRT attempt using resolved_engine_path and load_tensorrt_model(self, ...),
and when an explicit ".engine" fails on CUDA let control fall through to the
generic fallback branch (add a short inline comment near the else to explain
this intentional flow) while using load_onnx_model and load_pytorch_model based
on the normalized ext or found counterparts.
In `@services/detection/trt_utils.py`:
- Around line 127-132: Validate the incoming input_data against the engine input
binding before calling np.copyto: retrieve the target host buffer and its
expected element count and dtype from self.inputs[0] (e.g., input_info =
self.inputs[0]; expected_size = input_info["host"].size; expected_dtype =
input_info["host"].dtype), check that input_data.size == expected_size (or
reshapeable to that size) and that input_data.dtype == expected_dtype; if
sizes/dtypes mismatch either raise a clear ValueError/TypeError naming the
binding, expected size/dtype and actual size/dtype, or explicitly convert with
input_data.astype(expected_dtype, copy=False) and reshape to match before
calling np.copyto(input_info["host"], ...). Ensure these checks/convert happen
just before the np.copyto call that currently uses input_data.ravel().
- Around line 23-29: The current module-level import of pycuda.autoinit can
raise non-ImportError exceptions and crash imports; remove the autoinit import
from the top-level try/except (keep only "import pycuda.driver as cuda" and set
CUDA_AVAILABLE based on that import), and move the "import pycuda.autoinit" into
a guarded, lazy import inside TensorRTInference.__init__; wrap that lazy import
in a try/except that catches pycuda._driver.Error (and a broad Exception
fallback) so driver/context-init failures are caught, set or clear
CUDA_AVAILABLE accordingly, and ensure TensorRTInference.__init__ handles the
case where autoinit failed without letting the module import fail.
---
Outside diff comments:
In `@services/detection/detection.py`:
- Around line 59-70: The Detector currently defaults device="cpu" so
_load_model_with_fallback never attempts TensorRT even on GPU hosts; change
Detector.__init__ to accept device: Optional[str]=None and when device is None
set self.device = "cuda" if torch.cuda.is_available() else "cpu" (keeping
existing behavior if caller explicitly passes a value), and update main()'s
argparse to add a --device flag and pass args.device into Detector(...) so
engine auto-detection can run in _load_model_with_fallback when CUDA is
available.
---
Nitpick comments:
In `@benchmark.py`:
- Around line 95-99: The warmup tensor creation uses fake_tensor.cuda(), which
ignores explicit device strings like "cuda:1"; change the warmup code to move
the tensor with fake_tensor.to(device) so the tensor and model.predict(...,
device=device) use the same device (update the fake_tensor assignment near the
initial warmup and the similar warmup block later that also creates fake_tensor
before calling model.predict). Ensure you use the existing variables
fake_tensor, device, img_size and keep the conditional check for CUDA in
device.lower() when deciding to move the tensor.
In `@docs/tensorrt_conversion.md`:
- Around line 78-81: Update the "Resilient Fallback" text to explicitly state
that the automatic fallback is triggered only on engine load/deserialization
errors (caught by Detector) and not on later runtime/inference failures; mention
that a corrupted-but-loadable .engine may still fail during inference and
therefore will not be silently replaced, and suggest users enable runtime error
handling or add inference-time checks if they need automatic fallback from
.engine to .pt/.onnx during inference.
In `@services/detection/detection.py`:
- Around line 128-141: Collapse the three redundant loaders into a single loader
(e.g., replace load_tensorrt_model, load_onnx_model, load_pytorch_model with one
method like load_model or _load_model) that centralizes logger usage and calls
self.model = YOLO(model_path, task="detect"); update all callers to use the new
method and remove the three old public methods to shrink the API surface; keep
an optional `kind` or derive backend from the file extension only for logging
context (use logger.info to include model_path and derived kind) so behavior
relies on Ultralytics' AutoBackend decision.
- Around line 128-131: The load_tensorrt_model function currently ignores the
low-level TRT utilities; either remove the dead module
services/detection/trt_utils.py entirely, or update load_tensorrt_model to
detect and use those utilities: check is_tensorrt_supported() and if true and
the model_path endswith('.engine') instantiate and use TensorRTInference
(instead of YOLO(...)) for inference, otherwise fall back to YOLO(model_path,
task="detect"); ensure imports for is_tensorrt_supported and TensorRTInference
are added to services/detection/detection.py and that the branch cleanly falls
back to the existing YOLO path.
In `@services/detection/trt_utils.py`:
- Around line 137-138: The code currently calls
self.context.execute_async_v2(bindings=self.bindings,
stream_handle=self.stream.handle); replace this with the newer name-based API by
switching to self.context.execute_async_v3(stream_handle=self.stream.handle) and
ensure tensor addresses are bound earlier using
self.context.set_tensor_address(tensor_name, int(device_mem)) when you allocate
device buffers (do the set_tensor_address calls once during buffer allocation
for each tensor instead of passing a bindings list at execute time).
- Around line 155-161: The destructor __del__ currently only clears lists and
can run after CUDA context teardown; implement a deterministic close() method
that releases GPU resources in a safe order (set current context if needed, free
DeviceAllocation objects referenced by self.bindings/self.inputs/self.outputs,
destroy/free self.stream, then release self.context and self.engine), add
context-manager support via __enter__ returning self and __exit__ calling
close(), and make __del__ a best-effort fallback that calls close() inside a
try/except to avoid exceptions during interpreter shutdown; reference the
methods/attributes __del__, close, __enter__, __exit__, self.context,
self.engine, self.stream, self.bindings, self.inputs, self.outputs when updating
the class.
In `@tests/test_tensorrt_routing.py`:
- Around line 50-77: Update the two tests to assert the full call sequence
against mock_yolo.call_args_list: in test_routing_engine_cpu_fallback ensure
Detector(model_name="yolov8n.engine", device="cpu") results in exactly one
YOLO(...) invocation with "yolov8n.pt" (assert len(call_args_list)==1 and
call_args_list[0] matches "yolov8n.pt") to verify the engine path was
short‑circuited on CPU; in test_routing_engine_load_failure_fallback ensure the
failure path attempts the engine first and then falls back by asserting
call_args_list[0] was called with "yolov8n.engine" and call_args_list[1] was
called with "yolov8n.pt" (and that there are at least two calls), referencing
mock_yolo, Detector and Path.exists to locate the behavior to change.
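The call-sequence assertion requested for `tests/test_tensorrt_routing.py` can be sketched in isolation as follows. `load_with_fallback` is a toy stand-in for the `Detector` loader, not the project's code, so only the mocking pattern carries over.

```python
from unittest.mock import MagicMock


def load_with_fallback(yolo, engine_path: str, fallback_path: str):
    """Toy stand-in for the Detector's loader: try the engine, then fall back."""
    try:
        return yolo(engine_path, task="detect")
    except Exception:
        return yolo(fallback_path, task="detect")


def test_engine_attempted_before_fallback():
    sentinel = object()
    # First call (the engine) raises; second call (the .pt file) succeeds.
    mock_yolo = MagicMock(side_effect=[RuntimeError("bad engine"), sentinel])

    model = load_with_fallback(mock_yolo, "yolov8n.engine", "yolov8n.pt")

    assert model is sentinel
    # call_args_list records every invocation in order, unlike assert_called_with,
    # which only looks at the most recent one.
    attempted = [call.args[0] for call in mock_yolo.call_args_list]
    assert attempted == ["yolov8n.engine", "yolov8n.pt"], attempted
```

A regression that skipped the engine attempt would leave `attempted` as `["yolov8n.pt"]` and fail the final assertion.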
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 74207dee-bf66-48fa-a84b-a206af124e96
📒 Files selected for processing (6)
- benchmark.py
- docs/tensorrt_conversion.md
- scripts/export_tensorrt.py
- services/detection/detection.py
- services/detection/trt_utils.py
- tests/test_tensorrt_routing.py
for label, path in models.items():
    if os.path.exists(path):
        try:
            # Decide device automatically based on model suffix
            device = "cuda" if path.endswith(".engine") or torch.cuda.is_available() else "cpu"
            res = benchrunner.run_full_pipeline_benchmark(model_path=path, num_frames=num_frames, device=device)
            res["format"] = label
            results.append(res)
        except Exception as e:
            print(f"❌ Failed to run benchmark for {label} ({path}): {e}")
    else:
        print(f"⚠️ Skipping comparison for format '{label}' since file was not found at '{path}'.")
Device picker can request CUDA when CUDA is unavailable.
device = "cuda" if path.endswith(".engine") or torch.cuda.is_available() else "cpu"
Operator precedence is `(path.endswith(".engine")) or (torch.cuda.is_available())`, so any `.engine` path forces `device="cuda"` even on a CPU-only host. That conflicts with the PR's "graceful fallback" promise: `YOLO("yolov8n.engine", task="detect").predict(..., device="cuda")` will raise on a machine without CUDA rather than skip the format. Note that, unlike `Detector`, this comparative runner uses `YOLO` directly and does not benefit from the loader's fallback logic, so the guard needs to live here.
🛡️ Proposed fix
-            # Decide device automatically based on model suffix
-            device = "cuda" if path.endswith(".engine") or torch.cuda.is_available() else "cpu"
-            res = benchrunner.run_full_pipeline_benchmark(model_path=path, num_frames=num_frames, device=device)
+            # Skip .engine on hosts without CUDA; otherwise pick CUDA when available
+            if path.endswith(".engine") and not torch.cuda.is_available():
+                print(f"⚠️ Skipping '{label}' — '{path}' requires CUDA, which is unavailable on this host.")
+                continue
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+            res = benchrunner.run_full_pipeline_benchmark(model_path=path, num_frames=num_frames, device=device)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmark.py` around lines 257 - 268, The device selection logic in
benchmark.py sets device="cuda" whenever path.endswith(".engine") is true, even
if CUDA isn't available; update the condition so CUDA is chosen only when the
model is an .engine file AND torch.cuda.is_available() is true (otherwise fall
back to "cpu") before calling benchrunner.run_full_pipeline_benchmark; locate
the device assignment near the loop over models and adjust the boolean
expression that sets device accordingly.
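The intended decision table can be captured in a small pure function, which also makes it easy to unit-test without a GPU. `pick_benchmark_device` is a hypothetical helper; `cuda_available` would be supplied by the caller, for example from `torch.cuda.is_available()`.

```python
from typing import Optional


def pick_benchmark_device(model_path: str, cuda_available: bool) -> Optional[str]:
    """Return the device to benchmark on, or None when the format must be skipped.

    TensorRT engines need CUDA, so an .engine file on a CPU-only host is skipped
    rather than forced onto "cuda".
    """
    if model_path.lower().endswith(".engine") and not cuda_available:
        return None
    return "cuda" if cuda_available else "cpu"
```

Inside the comparison loop, a `None` result would translate into printing a skip message and `continue`, and any other value would be passed as `device=` to `run_full_pipeline_benchmark`.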
 if __name__ == "__main__":
-    int8_path = "yolov8n_int8_openvino_model"
+    import argparse
+    parser = argparse.ArgumentParser(description="Run Eagle performance benchmarks")
+    parser.add_argument("--model", type=str, default=None, help="Path to a specific model to benchmark")
+    parser.add_argument("--compare", action="store_true", default=True, help="Run comparisons across formats (.pt, .onnx, .engine)")
+    parser.add_argument("--frames", type=int, default=100, help="Number of frames to benchmark")
+    args = parser.parse_args()

     REDIS_ENV_URL = os.getenv("REDIS_URL", settings.REDIS_URL)

     benchrunner = PipelineBenchmark(redis_url=REDIS_ENV_URL)
-    benchrunner.run_full_pipeline_benchmark(model_path=int8_path, num_frames=100)
\ No newline at end of file
+
+    if args.model:
+        # Benchmark specific model
+        benchrunner.run_full_pipeline_benchmark(model_path=args.model, num_frames=args.frames)
+    elif args.compare:
+        # Cross-format comparison candidate paths
+        candidate_models = {
+            "PyTorch (.pt)": "yolov8n.pt",
+            "ONNX (.onnx)": "yolov8n.onnx",
+            "TensorRT (.engine)": "yolov8n.engine"
+        }
+        run_comparative_benchmark(benchrunner, candidate_models, num_frames=args.frames)
+    else:
+        # Default single run
+        int8_path = "yolov8n_int8_openvino_model"
+        benchrunner.run_full_pipeline_benchmark(model_path=int8_path, num_frames=args.frames)
\ No newline at end of file
--compare default makes the else branch unreachable and removes opt-out.
`parser.add_argument("--compare", action="store_true", default=True, ...)` with `store_true` and `default=True` means `args.compare` is always truthy regardless of whether the flag is passed. As a result:
- The `else` branch at L331-333 (single OpenVINO run) is dead code.
- Users have no way to disable comparative mode without supplying `--model`, which contradicts a typical opt-in `--compare` semantic.
Either default to `False` (idiomatic for `store_true`) or model the choices as a mutually exclusive group.
🐛 Proposed fix
-    parser.add_argument("--compare", action="store_true", default=True, help="Run comparisons across formats (.pt, .onnx, .engine)")
+    parser.add_argument("--compare", action="store_true", help="Run comparisons across formats (.pt, .onnx, .engine)")
Or, if compare-by-default is the intent, drop the unreachable `else` branch and document that behavior explicitly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmark.py` around lines 309 - 334, The --compare flag is defined with
store_true but default=True so args.compare is always truthy, making the else
branch unreachable; update the argparse setup so --compare is opt-in (set
default=False) or use a mutually exclusive group between --model and --compare,
then adjust the control flow that calls run_comparative_benchmark(benchrunner,
candidate_models, num_frames=...) and
benchrunner.run_full_pipeline_benchmark(model_path=..., num_frames=...)
accordingly (references: parser.add_argument("--compare", ...), args.compare,
run_comparative_benchmark, benchrunner.run_full_pipeline_benchmark,
candidate_models).
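For the mutually exclusive variant, a minimal sketch could look like this. `build_parser` and `select_mode` are hypothetical helpers that only return which branch would run; the flag names and help strings mirror the ones quoted from `benchmark.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run Eagle performance benchmarks")
    # --model and --compare cannot be combined; argparse rejects the pair itself.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--model", type=str, default=None,
                      help="Path to a specific model to benchmark")
    mode.add_argument("--compare", action="store_true",
                      help="Run comparisons across formats (.pt, .onnx, .engine)")
    parser.add_argument("--frames", type=int, default=100,
                        help="Number of frames to benchmark")
    return parser


def select_mode(args: argparse.Namespace) -> str:
    """Map parsed arguments to the branch that should run."""
    if args.model:
        return "single"
    if args.compare:
        return "compare"
    return "default"  # now reachable: neither flag was given
```

With `store_true` left at its implicit `False` default, running the script with no flags reaches the default single-run branch again.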
### Command Parameters
| Flag | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `--model` | `str` | `yolov8n.pt` | Path to the source `.pt` or `.onnx` model file to compile. |
| `--fp16` | `bool` | `True` | Enables FP16 half-precision optimization (highly recommended). |
| `--int8` | `bool` | `False` | Enables INT8 quantization (requires calibrating dataset). |
| `--imgsz` | `int` | `640` | Resolution width/height of input frames (default: 640). |
| `--device` | `str` | `cuda:0` | GPU device ID to execute compiling (default: `cuda:0`). |
Docs reflect the misleading --fp16 default; update alongside CLI fix.
The table documents --fp16 as a bool defaulting to True, which matches the current (broken) CLI behavior where the flag cannot be disabled. Once the CLI is corrected to use BooleanOptionalAction (see review on scripts/export_tensorrt.py), please document the --no-fp16 opt-out here as well. Also, per markdownlint MD058, surround the table with blank lines (already mostly present — verify there is a blank line directly before Line 62 and after Line 68).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 62-62: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/tensorrt_conversion.md` around lines 61 - 69, Update the docs to reflect
the corrected CLI behavior for the --fp16 flag: document that --fp16 remains
enabled by default but can be disabled with the generated --no-fp16 opt-out
(mirror the BooleanOptionalAction change made in scripts/export_tensorrt.py) and
ensure the table lists `--fp16` (bool, Default: True) and `--no-fp16` as the
opt-out; also add a blank line immediately before and after the Markdown table
to satisfy MD058.
try:
    # Ultralytics natively wraps the ONNX -> TensorRT conversion process
    exported_path = model.export(
        format="engine",
        half=fp16,
        int8=int8,
        imgsz=imgsz,
        device=device,
        dynamic=True  # Enables dynamic batching support
    )
    logger.info("========================================= SUCCESS =========================================")
    logger.info("TensorRT Engine compiled and optimized successfully!")
    logger.info(f"Saved optimized model to: {os.path.abspath(exported_path)}")
    logger.info("===========================================================================================")

except Exception as e:
    logger.error(f"An error occurred during TensorRT compilation: {e}")
    logger.error(
        "Please ensure you have the TensorRT Python API and CUDA toolkit properly installed. "
        "Refer to docs/tensorrt_conversion.md for assistance."
    )
    sys.exit(1)
🧩 Analysis chain
🌐 Web query:
Ultralytics YOLO export to TensorRT engine int8 calibration data argument requirement
💡 Result:
When exporting an Ultralytics YOLO model to a TensorRT engine with INT8 quantization, the data argument is essential for providing the representative dataset required for post-training calibration [1][2]. Key requirements and considerations for INT8 calibration include:

1. Data Argument: You must specify the data argument, which should point to a dataset configuration YAML file (e.g., data="my_dataset.yaml") [1][3]. The calibration process typically utilizes the validation images defined in this YAML file [1][4].
2. Fallback Behavior: If the data argument is omitted when int8=True is enabled, the export process will default to using one of the "small" example datasets (such as coco8.yaml) rather than throwing an error [1][5].
3. Calibration Dataset Size: NVIDIA and Ultralytics recommend using at least 500 to 1,000 images that are representative of your specific data to minimize accuracy loss [1][3]. While this is a guideline rather than a strict hard requirement, larger, more representative sets generally yield better results [1][5][3].
4. Calibration Fraction: You can use the fraction argument (float, default 1.0) to specify the portion of the dataset used for calibration [1][5]. Setting a value less than 1.0 is useful for experiments or when computational resources are constrained [2].
5. Calibration Algorithm: The default algorithm used is MINMAX_CALIBRATION, which is generally recommended for most use cases [1].

Example export code snippet:

from ultralytics import YOLO

model = YOLO("path/to/model.pt")
model.export(
    format="engine",
    int8=True,
    data="path/to/dataset.yaml",
    fraction=1.0,  # Optional: adjust based on resource availability
)
Citations:
- 1: https://github.com/ultralytics/ultralytics/blob/main/docs/en/integrations/tensorrt.md
- 2: https://github.com/ultralytics/ultralytics/blob/18b4666b/docs/macros/export-args.md
- 3: Export YOLOv8 by using TensorRT with INT8 calibration ultralytics/ultralytics#11848
- 4: INT8 calibration set clarification ultralytics/ultralytics#14121
- 5: https://docs.ultralytics.com/integrations/tensorrt
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show file around the referenced lines
sed -n '1,200p' scripts/export_tensorrt.py | nl -ba | sed -n '50,120p'
# Search for CLI args related to data/int8/fp16
rg -n --hidden --no-ignore -S "int8|fp16|--data|data=" scripts/export_tensorrt.py || true
# Show full argument parsing section if present
rg -n "argparse|ArgumentParser|add_argument" scripts/export_tensorrt.py || true
Repository: Devnil434/Eagle
Length of output: 102
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Print lines 40-130 with line numbers using awk (nl not available)
awk 'NR>=40 && NR<=130 {printf "%4d\t%s\n", NR, $0}' scripts/export_tensorrt.py
# Search for CLI args / forwarding to model.export
rg -n "int8|fp16|--int8|--fp16|--data|data=|data\s*=" scripts/export_tensorrt.py || true
rg -n "ArgumentParser|argparse|add_argument|click\." scripts/export_tensorrt.py || true
# Also show the model.export call block for exact args passed
rg -n "model\.export\(" -n scripts/export_tensorrt.py
Repository: Devnil434/Eagle
Length of output: 4489
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the remainder around argument parsing and export invocation
awk 'NR>=90 && NR<=190 {printf "%4d\t%s\n", NR, $0}' scripts/export_tensorrt.py
# Search for any data/fraction/calibration-related CLI options or forwarding
rg -n --hidden -S "(--data|data=|fraction|calib|calibration|dataset)" scripts/export_tensorrt.py || true
# Search for any mutual-exclusion / validation of fp16 vs int8
rg -n --hidden -S "(int8.*fp16|fp16.*int8|mutual|exclusive|refus|if .*int8|if .*fp16)" scripts/export_tensorrt.py || true
Repository: Devnil434/Eagle
Length of output: 2069
Fix INT8 TensorRT export: add calibration data wiring
In scripts/export_tensorrt.py (model.export call), int8=int8 is passed without any data= calibration dataset argument, and the CLI also provides no --data option. Ultralytics’ INT8 TensorRT flow uses data for calibration; without it, export may use a small default dataset or behave unexpectedly. Add --data (and forward data= to model.export(...)), and error when --int8 is set but --data is missing. Also consider validating precision flags since --fp16 defaults to True, so enabling --int8 currently enables both half and int8.
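One way the suggested wiring could look is sketched below. The helper names `parse_export_args` and `build_export_kwargs` are hypothetical, and turning `half` off whenever `int8` is requested is only one possible policy, not something the PR or Ultralytics mandates. The keyword names (`format`, `half`, `int8`, `data`) are the export arguments cited in the analysis above; `argparse.BooleanOptionalAction` needs Python 3.9 or newer.

```python
import argparse


def parse_export_args(argv=None) -> argparse.Namespace:
    """Hypothetical CLI: refuse --int8 unless a calibration dataset is given."""
    parser = argparse.ArgumentParser(description="TensorRT export sketch")
    parser.add_argument("--model", default="yolov8n.pt")
    parser.add_argument("--fp16", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--int8", action="store_true")
    parser.add_argument("--data", default=None,
                        help="Dataset YAML used for INT8 calibration")
    args = parser.parse_args(argv)
    if args.int8 and not args.data:
        # parser.error prints usage and exits with status 2.
        parser.error("--int8 requires --data <dataset.yaml> for calibration")
    return args


def build_export_kwargs(args: argparse.Namespace) -> dict:
    """Build keyword arguments for model.export(); assumed policy: INT8 wins over FP16."""
    kwargs = {
        "format": "engine",
        "half": args.fp16 and not args.int8,
        "int8": args.int8,
    }
    if args.int8:
        kwargs["data"] = args.data
    return kwargs


if __name__ == "__main__":
    print(build_export_kwargs(parse_export_args()))
```

The resulting dictionary would then be passed as `model.export(**kwargs)`; failing fast on a missing `--data` avoids silently calibrating against a small default dataset.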
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/export_tensorrt.py` around lines 72 - 93, The model.export call in
scripts/export_tensorrt.py is invoking INT8 conversion without passing a
calibration dataset; update the CLI to accept a --data argument, forward that
value as data=<data> to model.export(...), and add a pre-flight validation that
raises/logs an error and exits when --int8 is specified but --data is missing;
additionally validate precision flags (e.g., if int8 is True, ensure half/fp16
is disabled or warn and force-consistent behavior) so you don’t simultaneously
enable both fp16 and int8 in the export flow.
parser.add_argument(
    "--fp16",
    action="store_true",
    default=True,
    help="Enable FP16 (half-precision) float operations for faster inference (recommended)."
)
parser.add_argument(
    "--int8",
    action="store_true",
    default=False,
    help="Enable INT8 quantization (requires calibrating dataset)."
)
--fp16 flag cannot be disabled; store_true + default=True is contradictory.
With action="store_true" and default=True, the flag is always True regardless of whether the user passes --fp16 — there is no way to opt out of FP16 from the CLI. The same anti-pattern applies (less harmfully) to --int8, where the explicit default=False is redundant. Use argparse.BooleanOptionalAction so users can pass --no-fp16, or flip the semantics with a --no-fp16 flag.
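The difference between the two argparse configurations can be checked in isolation. The snippet below is a standalone illustration rather than code from scripts/export_tensorrt.py; the parser names `broken` and `fixed` are made up for the comparison, and `BooleanOptionalAction` requires Python 3.9 or newer.

```python
import argparse

# Current pattern: the flag can only ever be True.
broken = argparse.ArgumentParser()
broken.add_argument("--fp16", action="store_true", default=True)

# Suggested pattern: argparse generates a matching --no-fp16 switch.
fixed = argparse.ArgumentParser()
fixed.add_argument("--fp16", action=argparse.BooleanOptionalAction, default=True)

print(broken.parse_args([]).fp16)            # True
print(broken.parse_args(["--fp16"]).fp16)    # True, no way to get False
print(fixed.parse_args([]).fp16)             # True
print(fixed.parse_args(["--no-fp16"]).fp16)  # False
```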
🛠️ Proposed fix
 parser.add_argument(
     "--fp16",
-    action="store_true",
-    default=True,
+    action=argparse.BooleanOptionalAction,
+    default=True,
     help="Enable FP16 (half-precision) float operations for faster inference (recommended)."
 )
 parser.add_argument(
     "--int8",
     action="store_true",
-    default=False,
     help="Enable INT8 quantization (requires calibrating dataset)."
 )

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-parser.add_argument(
-    "--fp16",
-    action="store_true",
-    default=True,
-    help="Enable FP16 (half-precision) float operations for faster inference (recommended)."
-)
-parser.add_argument(
-    "--int8",
-    action="store_true",
-    default=False,
-    help="Enable INT8 quantization (requires calibrating dataset)."
-)
+parser.add_argument(
+    "--fp16",
+    action=argparse.BooleanOptionalAction,
+    default=True,
+    help="Enable FP16 (half-precision) float operations for faster inference (recommended)."
+)
+parser.add_argument(
+    "--int8",
+    action="store_true",
+    help="Enable INT8 quantization (requires calibrating dataset)."
+)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/export_tensorrt.py` around lines 106 - 117, The parser arguments for
"--fp16" and "--int8" use action="store_true" with contradictory defaults (fp16
default=True) so users cannot disable FP16; change these to use
argparse.BooleanOptionalAction for both flags (or provide an explicit
"--no-fp16") and remove the contradictory default values so the flag can be
toggled from the CLI; update the parser.add_argument calls for "--fp16" and
"--int8" in scripts/export_tensorrt.py to use BooleanOptionalAction and sensible
defaults (e.g., default=False for fp16 if you want FP16 off by default, or keep
True but allow --no-fp16) and adjust help text accordingly.
should_try_engine = self.model_path.endswith(".engine") or engine_path.exists()

if should_try_engine:
    resolved_engine_path = self.model_path if self.model_path.endswith(".engine") else str(engine_path)

    # TensorRT requires an NVIDIA GPU with CUDA
    if "cuda" in self.device.lower():
        try:
            logger.info(f"Attempting optimized TensorRT engine load: {resolved_engine_path}")
            self.load_tensorrt_model(resolved_engine_path)
            return
        except Exception as e:
            logger.warning(
                f"Failed to load TensorRT engine '{resolved_engine_path}': {e}. "
                f"Triggering automatic fallback to standard model format..."
            )
    else:
        logger.warning(
            f"TensorRT engine '{resolved_engine_path}' cannot run on non-CUDA device '{self.device}'. "
            f"Triggering automatic fallback to standard model format..."
        )

# Main loader routing based on model extension
if self.model_path.endswith(".onnx"):
    self.load_onnx_model(self.model_path)
elif self.model_path.endswith(".pt"):
    self.load_pytorch_model(self.model_path)
else:
    # If explicitly requested .engine failed or file is generic, seek compatible counterpart
    pt_path = parent_dir / f"{base_name}.pt"
    onnx_path = parent_dir / f"{base_name}.onnx"

    if pt_path.exists():
        logger.info(f"Auto-fallback: Loading counterpart PyTorch model: {pt_path}")
        self.load_pytorch_model(str(pt_path))
    elif onnx_path.exists():
        logger.info(f"Auto-fallback: Loading counterpart ONNX model: {onnx_path}")
        self.load_onnx_model(str(onnx_path))
    else:
        logger.info(f"No counterpart found. Loading default fallback model path: {self.model_path}")
        self.load_pytorch_model(self.model_path)
Case-sensitive extension matching and brittle string routing.
self.model_path.endswith(".engine" / ".onnx" / ".pt") misclassifies MODEL.PT, weights.ENGINE, etc., on case-insensitive filesystems (Windows, macOS HFS+). Use Path(self.model_path).suffix.lower() once and route on that. Also note that when an explicit .engine request fails on a CUDA device (line 97 fallback), control flows into the else branch on line 113 — which is correct, but only because .engine doesn't match the .onnx/.pt arms; worth a comment so the next reader doesn't "fix" the missing elif.
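A minimal, standalone sketch of the normalized-suffix idea follows. The function `model_format` and its return labels are hypothetical and are not part of `Detector`; they only show that a single lowercased suffix lookup classifies mixed-case filenames consistently.

```python
from pathlib import Path

# Hypothetical mapping from normalized suffix to a loader label.
_FORMATS = {".engine": "tensorrt", ".onnx": "onnx", ".pt": "pytorch"}


def model_format(model_path: str) -> str:
    """Classify a model file by its lowercased extension."""
    suffix = Path(model_path).suffix.lower()
    # Anything else (no extension, unknown extension) goes to the fallback arm.
    return _FORMATS.get(suffix, "unknown")


if __name__ == "__main__":
    for name in ("yolov8n.pt", "MODEL.PT", "weights.ENGINE", "yolov8n"):
        print(name, "->", model_format(name))
```

Computing the suffix once also keeps the engine check and the loader routing from drifting apart if a new format is added later.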
♻️ Proposed routing
-        should_try_engine = self.model_path.endswith(".engine") or engine_path.exists()
+        suffix = path.suffix.lower()
+        should_try_engine = suffix == ".engine" or engine_path.exists()
 ...
-            resolved_engine_path = self.model_path if self.model_path.endswith(".engine") else str(engine_path)
+            resolved_engine_path = self.model_path if suffix == ".engine" else str(engine_path)
 ...
-        if self.model_path.endswith(".onnx"):
+        if suffix == ".onnx":
             self.load_onnx_model(self.model_path)
-        elif self.model_path.endswith(".pt"):
+        elif suffix == ".pt":
             self.load_pytorch_model(self.model_path)
         else:
+            # Reached when path has no extension, or when an explicit .engine load failed/was skipped.
 ...

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/detection/detection.py` around lines 86 - 126, The extension checks
are case-sensitive and brittle: replace repeated self.model_path.endswith(...)
checks with a single normalized extension variable (e.g., ext =
Path(self.model_path).suffix.lower()) and route loading based on ext (compare to
".engine", ".onnx", ".pt"); keep the existing TensorRT attempt using
resolved_engine_path and load_tensorrt_model(self, ...), and when an explicit
".engine" fails on CUDA let control fall through to the generic fallback branch
(add a short inline comment near the else to explain this intentional flow)
while using load_onnx_model and load_pytorch_model based on the normalized ext
or found counterparts.
try:
    import pycuda.driver as cuda
    import pycuda.autoinit  # Automatically handles CUDA context creation/destruction
    CUDA_AVAILABLE = True
except ImportError:
    cuda = None
    CUDA_AVAILABLE = False
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
ls -la
rg -n "trt_utils|TensorRTInference|is_tensorrt_supported|pycuda\.autoinit|pycuda\.driver|CUDA_AVAILABLE" -S services/detection || true
sed -n '1,120p' services/detection/trt_utils.py
sed -n '120,260p' services/detection/trt_utils.py
Repository: Devnil434/Eagle
Length of output: 8772
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "services\.detection\.trt_utils|detection\.trt_utils|trt_utils|is_tensorrt_supported\(" -S . || true
rg -n "TensorRTInference\s*\(" -S . || true
Repository: Devnil434/Eagle
Length of output: 235
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "is_tensorrt_supported" -S . || true
rg -n "TensorRTInference" -S . || true
rg -n "trt_utils" -S . || true
Repository: Devnil434/Eagle
Length of output: 418
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "services\.detection\.trt_utils|detection\.trt_utils|trt_utils\.py|importlib.*trt_utils|__import__\(.*trt_utils" -S . || true
rg -n "tensorrt_supported|CUDA_AVAILABLE|TRT_AVAILABLE" -S . || true
rg -n "TensorRTInference" -S . || true
Repository: Devnil434/Eagle
Length of output: 887
🌐 Web query:
pycuda.autoinit import failure exception type LogicError vs ImportError
💡 Result:
When you encounter a failure during the import of pycuda.autoinit, the exception type you receive provides critical information about the nature of the problem, distinguishing between structural installation issues and runtime environment failures [1][2][3].

ImportError (or ModuleNotFoundError)

An ImportError usually signifies a fundamental structural problem with the PyCUDA installation [2][3][4]. Causes:
- Missing or incompatible binary files (e.g., DLLs on Windows or shared objects on Linux) [5][2].
- Mismatched versions between the installed PyCUDA package and the underlying CUDA toolkit (e.g., trying to use a library compiled for CUDA 6.5 with CUDA 7.0) [2][3].
- An incomplete installation, such as failing to clone submodules (e.g., missing the pycuda.compyte module) [4].
- Environment path issues where the system cannot locate the required CUDA libraries [5].

LogicError (or other pycuda._driver.Error subclasses)

A LogicError indicates that the library was imported successfully, but the CUDA runtime failed to initialize or execute a specific operation [1][6][7]. Causes:
- No CUDA-capable device is detected on the system [1].
- Attempts to interact with the GPU inside a context that has not been initialized or is invalid, which is common in multi-threaded environments like Flask or web servers [6].
- Conflicts with other libraries that also attempt to control the GPU, leading to context creation failures [8].
- Incorrect CUDA driver/runtime interaction, often resolved by verifying that the CUDA headers match the active driver [3].

In summary, if you see an ImportError, focus on fixing your installation (reinstalling from PyPI, ensuring dependencies are met, or rebuilding from source) [3][4]. If you see a LogicError, focus on your runtime environment (checking GPU availability, handling context creation properly in multi-threaded applications, or verifying driver compatibility) [1][6].
Citations:
- 1: https://stackoverflow.com/questions/79861471/pycuda-driver-error-cuinit-failed-unknown-error
- 2: https://stackoverflow.com/questions/30404269/pycuda-driver-import-error-windows
- 3: Can't import pycuda.driver: undefined symbol: cuIpcOpenMemHandle_v2 inducer/pycuda#409
- 4: No module named 'pycuda.compyte' inducer/pycuda#339
- 5: PyCuda can't find DLL inducer/pycuda#281
- 6: https://stackoverflow.com/questions/50601029/pycuda-with-flask-gives-pycuda-driver-logicerror-cumoduleloaddataex
- 7: https://ask.csdn.net/questions/8136673
- 8: Return StageCmdFailedError when I import pycuda.autoinit treeverse/dvc#8222
🌐 Web query:
pycuda.autoinit import failure exception type LogicError vs ImportError
💡 Result:
In PyCUDA, a failure when importing pycuda.autoinit typically results in one of several exception types depending on the root cause, rather than being strictly limited to LogicError or ImportError.

An ImportError often occurs when the underlying C++ extension module (pycuda._driver) cannot be loaded [1][2][3]. This is common on Windows or when the PyCUDA installation is misconfigured, causing the Python interpreter to fail to locate or link the necessary dynamic libraries (DLLs or .so files) [1][2].

A LogicError or RuntimeError is more common when the import process itself succeeds, but the subsequent automated initialization logic fails [4][5][6]. For example:
- LogicError (or generic Error from _driver): Often raised when the CUDA driver API call (cuInit) fails, such as when an incompatible driver version is installed, or when no CUDA-capable devices are detected [4][5].
- RuntimeError: Frequently encountered when the initialization successfully calls CUDA functions, but fails to create a CUDA context (e.g., make_default_context fails) [6]. This can happen if all GPUs are exhausted, permissions are insufficient, or the environment is restricted (e.g., in certain containers) [7][8][6].

In summary, an ImportError indicates a failure to load the PyCUDA module files, while LogicError or RuntimeError indicates that the module loaded successfully but the attempt to initialize the CUDA driver or context failed at runtime [1][2][4][6].
Citations:
- 1: PyCuda can't find DLL inducer/pycuda#281
- 2: https://stackoverflow.com/questions/30404269/pycuda-driver-import-error-windows
- 3: https://forums.developer.nvidia.com/t/importerror-no-module-named-driver/26751
- 4: https://stackoverflow.com/questions/79861471/pycuda-driver-error-cuinit-failed-unknown-error
- 5: https://wenku.csdn.net/answer/3579e9da34e14ada98a9cd2bae1e00d4
- 6: https://forums.developer.nvidia.com/t/make-default-context-wasnt-able-to-create-a-context/252243
- 7: https://forums.developer.nvidia.com/t/import-pycuda-autoinit-error-can-t-create-context-on-any-of-the-1detected-devices/217712
- 8: Return StageCmdFailedError when I import pycuda.autoinit treeverse/dvc#8222
Avoid importing pycuda.autoinit at module import time (catch CUDA-driver init failures)
In services/detection/trt_utils.py (imports around lines 23-29), pycuda.autoinit is imported at module load and only ImportError is caught. If pycuda is installed but CUDA driver/context initialization fails, pycuda.autoinit commonly raises non-ImportError exceptions (e.g., pycuda._driver.Error/LogicError/RuntimeError), which will crash module import and break the intended “safe import” behavior.
🛡️ Proposed fix
 try:
     import pycuda.driver as cuda
-    import pycuda.autoinit  # Automatically handles CUDA context creation/destruction
     CUDA_AVAILABLE = True
-except ImportError:
+except Exception:  # ImportError or driver init failure
     cuda = None
     CUDA_AVAILABLE = False

Then lazily import pycuda.autoinit inside TensorRTInference.__init__ (wrapped in a try/except that also handles pycuda._driver.Error/driver-init failures) so probes and pytest collection can safely import the module.
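The lazy import described here might be sketched as follows. `init_cuda_context` is a hypothetical helper name, not something defined in trt_utils.py, and the broad `except Exception` is a deliberate assumption so that both a missing package and a driver or context failure end up on the same fallback path.

```python
import importlib


def init_cuda_context() -> bool:
    """Try to create a CUDA context on demand; never raise to the caller."""
    try:
        # Importing pycuda.autoinit initializes the driver and creates a context,
        # so it is deferred until an engine is actually about to be loaded.
        importlib.import_module("pycuda.autoinit")
    except Exception as exc:  # ImportError, driver init failure, context failure
        print(f"CUDA context unavailable, falling back: {exc}")
        return False
    return True


if __name__ == "__main__":
    print("CUDA ready:", init_cuda_context())
```

A constructor such as `TensorRTInference.__init__` could call a helper like this first and raise its own clear error when it returns False, which leaves plain module import free of side effects.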
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/detection/trt_utils.py` around lines 23 - 29, The current
module-level import of pycuda.autoinit can raise non-ImportError exceptions and
crash imports; remove the autoinit import from the top-level try/except (keep
only "import pycuda.driver as cuda" and set CUDA_AVAILABLE based on that
import), and move the "import pycuda.autoinit" into a guarded, lazy import
inside TensorRTInference.__init__; wrap that lazy import in a try/except that
catches pycuda._driver.Error (and a broad Exception fallback) so
driver/context-init failures are caught, set or clear CUDA_AVAILABLE
accordingly, and ensure TensorRTInference.__init__ handles the case where
autoinit failed without letting the module import fail.
if not self.inputs:
    raise ValueError("No input bindings allocated in the TensorRT engine.")

input_info = self.inputs[0]
# Fast copy to pagelocked host buffer
np.copyto(input_info["host"], input_data.ravel())
Add input shape/dtype validation before np.copyto.
np.copyto(input_info["host"], input_data.ravel()) fails with a generic ValueError if input_data.size != size and silently mis-feeds the engine on dtype mismatch (since ravel() keeps the source dtype while the destination is the binding dtype, which can trigger an implicit cast or raise TypeError). For a wrapper that is called per-frame in a detection loop, surface a clear error early.
🛡️ Proposed validation
 input_info = self.inputs[0]
-# Fast copy to pagelocked host buffer
-np.copyto(input_info["host"], input_data.ravel())
+expected_size = input_info["host"].size
+if input_data.size != expected_size:
+    raise ValueError(
+        f"Input size mismatch: got {input_data.size}, expected {expected_size} "
+        f"for binding '{input_info['name']}' with shape {input_info['shape']}."
+    )
+# Fast copy to pagelocked host buffer (dtype-cast if needed)
+np.copyto(input_info["host"], input_data.ravel().astype(input_info["dtype"], copy=False))

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/detection/trt_utils.py` around lines 127 - 132, Validate the
incoming input_data against the engine input binding before calling np.copyto:
retrieve the target host buffer and its expected element count and dtype from
self.inputs[0] (e.g., input_info = self.inputs[0]; expected_size =
input_info["host"].size; expected_dtype = input_info["host"].dtype), check that
input_data.size == expected_size (or reshapeable to that size) and that
input_data.dtype == expected_dtype; if sizes/dtypes mismatch either raise a
clear ValueError/TypeError naming the binding, expected size/dtype and actual
size/dtype, or explicitly convert with input_data.astype(expected_dtype,
copy=False) and reshape to match before calling np.copyto(input_info["host"],
...). Ensure these checks/convert happen just before the np.copyto call that
currently uses input_data.ravel().
Description

This pull request introduces high-performance NVIDIA TensorRT `.engine` model support inside the Eagle Surveillance framework to significantly accelerate inference speed, lower latency, and optimize GPU memory boundaries.

Importantly, it adds a hardware-resilient, smart auto-fallback layer ensuring that the application remains fully functional out-of-the-box on any machine (even those without NVIDIA GPUs or TensorRT libraries) by gracefully redirecting to standard PyTorch (`.pt`) or ONNX (`.onnx`) models without crashing.

Key Enhancements Added

Format Routing & Smart Fallback (`services/detection/detection.py`):
- `Detector` to support `.pt`, `.onnx`, and `.engine` loaders.
- `.engine` counterpart (e.g., `yolov8n.engine` for `yolov8n.pt`).

Low-level TensorRT Inference Utilities (`services/detection/trt_utils.py`):
- `TensorRTInference` utility class utilizing the native `tensorrt` and `pycuda` API bindings.

Model Conversion Utility (`scripts/export_tensorrt.py`):

Multi-Format Pipeline Benchmarking (`benchmark.py`):
- `.pt`, `.onnx`, and `.engine` files.
- `docs/benchmarks/comparison_report.md` comparing Throughput (FPS), Latency (ms), and peak VRAM/RAM memory metrics.

Technical Documentation (`docs/tensorrt_conversion.md`):

Comprehensive Unit Tests (`tests/test_tensorrt_routing.py`):

Why `.engine` Files are Not Included in this Repository

TensorRT engines are hardware-specific serialized binaries.

Unlike standard `.pt` or `.onnx` models (which are hardware-agnostic), a TensorRT `.engine` file compiles the model's layers directly into optimized GPU memory maps and CUDA kernels tailored exclusively to the host machine's exact GPU microarchitecture (e.g., CUDA Compute Capability, Tensor Core architecture, VRAM size) and the active CUDA/TensorRT driver versions.

Distributing pre-compiled `.engine` files in a repository is not feasible because they will fail to run on any computer with a different GPU card or driver configuration.

Instead, developers and deployment systems should generate their tailored `.engine` models locally in one click by running the newly introduced conversion utility: