Releases: NVIDIA/TensorRT-Edge-LLM
v0.6.1
v0.6.0
TensorRT Edge-LLM 0.6.0 Release 2026-03-16
We are very excited to announce the 0.6.0 release of TensorRT Edge-LLM!
- TensorRT Edge-LLM is featured at GTC 2026! See our blog post for details.
- Our developer roadmap for H1 2026 is listed in #32.
- Check out our flagship demo "Run Physical AI Models Faster on Jetson" at GTC 2026. The demo uses TensorRT Edge-LLM as one of its backends to showcase the ASR/LLM/TTS capabilities of NVIDIA Jetson AGX Thor.
Breaking Changes
- Due to the ViT Attention Plugin and split QKV changes, ONNX models exported by previous versions are not compatible with 0.6.0.
Key Features
Model Support
- Added Nemotron-Nano-9B-v2 support via mamba_ssm and causal_conv1d
- Added day-0 support for Nemotron-3-Nano-4B
- Added Qwen3-30B-A3B-GPTQ-Int4 support via INT4 MoE Plugin
- Added Qwen3-ASR and Qwen3-TTS end-to-end support
Performance Improvements
- Added cutedsl FMHA kernels to improve prefill performance on Blackwell
- Used the ViT Attention Plugin with fmha-v2 and cutedsl kernels to speed up multi-image ViT performance
- Updated the Attention Plugin to split Q, K, and V, reducing memory usage
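The split-QKV change can be pictured with a small NumPy sketch (the shapes and variable names below are illustrative assumptions, not the plugin's actual implementation): a fused QKV projection materializes one `[tokens, 3*hidden]` intermediate buffer, whereas projecting with the sliced weights produces Q, K, and V separately, so K and V can be written straight into the KV cache and the large transient buffer never exists.

```python
import numpy as np

# Hypothetical shapes for illustration (not taken from the release notes)
hidden = 64                                   # model hidden size
rng = np.random.default_rng(0)
x = rng.standard_normal((8, hidden))          # 8 tokens of activations
w_qkv = rng.standard_normal((3 * hidden, hidden))  # fused QKV projection weight

# Fused path: one GEMM yields a single [tokens, 3*hidden] buffer that must be
# held in memory before Q, K, V are sliced out of it.
fused = x @ w_qkv.T
q_f, k_f, v_f = np.split(fused, 3, axis=-1)

# Split path: three GEMMs against the sliced weights; K and V could be written
# directly into the KV cache, avoiding the transient 3*hidden buffer.
w_q, w_k, w_v = np.split(w_qkv, 3, axis=0)
q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T

# Both paths compute identical projections.
assert np.allclose(q, q_f) and np.allclose(k, k_f) and np.allclose(v, v_f)
```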
Runtime Maturity
- Added LoRA support for Speculative Decoding
- Fixed several compiler warnings and documented the exceptions thrown by functions
- Added coverage tests
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @Jasper-NV @yuanyao-nv @fkaster-nvidia @poweiw @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe
v0.5.0
TensorRT Edge-LLM 0.5.0 Release 2026-02-19
We are very excited to announce the second release of TensorRT Edge-LLM! We welcome community contributions and have merged the first community PR (#13 by @taoz27) in this release. Our developer roadmap for H1 2026 is listed in #32.
Breaking Changes
- Due to the new standalone embedding processing module, ONNX models exported by previous versions are not compatible with 0.5.0.
Key Features
- Implemented and adopted a standalone embedding processing module, reducing multi-modal modeling complexity and the EAGLE inference memory footprint
- Added FP8 KV Cache support
- Unified TensorRT execution context for prefill and decode to reduce memory footprint
- Added vanilla decoding support to the speculative decoding runtime
- Used collision-resistant hashing for CUDA graphs
- Updated int4GroupwiseGemmPlugin to the TensorRT Plugin-v3 interface
- Refactored the documentation
- Added ViT attention mask and RoPE parameter caching to reduce recomputation for Qwen (Merged #13 by @taoz27)
- Added JetPack 6.2 compatibility
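The collision-resistant hashing item can be sketched in Python; this is an illustrative analogue (the `graph_key` helper and the assumption that the hash keys a graph cache are ours, not the actual runtime code). Keying each captured CUDA graph by a SHA-256 digest of its capture configuration ensures that distinct configurations cannot alias to the same cached graph, which weaker rolling hashes cannot guarantee.

```python
import hashlib

def graph_key(batch_size: int, seq_len: int, flags: tuple) -> str:
    """Collision-resistant cache key for a captured CUDA graph (sketch).

    Serialize every parameter that affects graph capture and hash it with
    SHA-256, so two different configurations cannot collide and replay
    the wrong graph.
    """
    payload = repr((batch_size, seq_len, flags)).encode()
    return hashlib.sha256(payload).hexdigest()

graphs = {}  # digest -> captured graph handle (stand-in dict)
key = graph_key(1, 128, ("decode",))
if key not in graphs:
    graphs[key] = object()  # real code would capture and store the graph here
```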
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @fkaster-nvidia @chemudupatiks @poweiw @charllll @mahu888 @xinrzhang-star @byte-deve @JCalafato @nvmbreughe
v0.4.0
TensorRT Edge-LLM 0.4.0 Release 2026-01-06
We are very excited to announce the first release of TensorRT Edge-LLM! TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. Please follow our Quick Start Guide to get started.
Key Components
- Flexible Python CLI exporter that converts HuggingFace checkpoints to ONNX
- C++ TensorRT engine builder
- C++ tokenizers and multi-modal processors
- C++ runtime, including vanilla decoding and EAGLE3 speculative decoding
- Optimized CUDA kernels for Multi-Head Attention, Sampling, and EAGLE3 utilities
- Examples for running inference, accuracy evaluations, and performance benchmarks
Model Support
- Llama3.x
- Qwen2/2.5/3 (Dense)
- Qwen2/2.5/3-VL (Dense)
- InternVL3
- Phi4-Multimodal
Please check the model support page for more details.
Key Features
Model Export
- nvfp4/fp8/int4 quantization
- nvfp4/fp8 lm_head quantization
- EAGLE3 draft quantization
- Vocab reduction
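Vocab reduction can be pictured with a small NumPy sketch (the shapes and the `kept_ids` set are illustrative assumptions, not the exporter's actual behavior): prune the `lm_head` weight to only the token ids the application needs, then map the reduced argmax back to the original vocabulary.

```python
import numpy as np

# Hypothetical sizes for illustration
vocab, hidden = 1000, 16
rng = np.random.default_rng(0)
lm_head = rng.standard_normal((vocab, hidden))  # full output projection

# Keep only the token ids we actually need (assumed subset).
kept_ids = np.array([0, 1, 2, 50, 999])
reduced = lm_head[kept_ids]                     # (len(kept_ids), hidden)

# At inference time, map a reduced-logit argmax back to the original id.
hidden_state = rng.standard_normal(hidden)
local = int(np.argmax(reduced @ hidden_state))
original_id = int(kept_ids[local])
```

The smaller projection shrinks both the engine and the per-step logits buffer, at the cost of only being able to emit the kept tokens.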
Runtime
- Multi-batch EAGLE3 speculative decoding for LLM and VLM
- Decoding CUDA Graph
- System prompt KV cache reuse
- OpenAI-style chat templates
- Dynamic LoRA switching
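The speculative decoding items above follow the standard draft-then-verify pattern. A minimal greedy-verification sketch (the generic algorithm, not the EAGLE3 runtime's actual code) accepts the longest prefix of the draft that the target model agrees with, then takes the target's own token at the first mismatch:

```python
def accept_draft(draft_tokens, target_argmax):
    """Greedy speculative-decoding verification (generic sketch).

    Accept each draft token that matches the target model's argmax at the
    same position; on the first mismatch, emit the target's token as the
    correction and stop. If the whole draft is accepted, the extra target
    position supplies one bonus token.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # correction token from the target model
            break
    else:
        # Whole draft accepted: append the bonus token if available.
        if len(target_argmax) > len(draft_tokens):
            accepted.append(target_argmax[len(draft_tokens)])
    return accepted

# Draft proposes 4 tokens; target agrees on the first two.
print(accept_draft([5, 7, 9, 2], [5, 7, 8, 2, 11]))  # -> [5, 7, 8]
```

Every accepted position amortizes one target-model forward pass across several emitted tokens, which is where the speedup comes from.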