Releases: NVIDIA/TensorRT-Edge-LLM
v0.6.1
v0.6.0
TensorRT Edge-LLM 0.6.0 Release 2026-03-16
We are very excited to announce the 0.6.0 release of TensorRT Edge-LLM!
- TensorRT Edge-LLM is featured at GTC 2026! See our blog post for details.
- Our developer roadmap for H1 2026 is listed in #32.
- Check out our flagship demo "Run Physical AI Models Faster on Jetson" at GTC 2026. The demo uses TensorRT Edge-LLM as one of its backends to showcase the ASR/LLM/TTS capabilities of NVIDIA Jetson AGX Thor.
Breaking Changes
- Due to the ViT Attention Plugin and split QKV changes, ONNX models exported by previous versions are not compatible with 0.6.0.
Key Features
Model Support
- Added Nemotron-Nano-9B-v2 support via mamba_ssm and causal_conv1d
- Added day-0 support for Nemotron-3-Nano-4B
- Added Qwen3-30B-A3B-GPTQ-Int4 support via INT4 MoE Plugin
- Added Qwen3-ASR and Qwen3-TTS end-to-end support
Performance Improvements
- Added cutedsl FMHA kernels to improve prefill performance on Blackwell
- Used the ViT Attention Plugin with fmha-v2 and cutedsl kernels to speed up multi-image ViT performance
- Updated the Attention Plugin to split Q, K, and V, reducing memory usage
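The split-QKV change can be pictured with a small NumPy sketch (the shapes and variable names below are illustrative assumptions, not the plugin's actual implementation): a fused QKV projection materializes one `[tokens, 3*hidden]` intermediate buffer, whereas projecting with the sliced weights produces Q, K, and V separately, so K and V can be written straight into the KV cache and the large transient buffer never exists.

```python
import numpy as np

# Hypothetical shapes for illustration (not taken from the release notes)
hidden = 64                                   # model hidden size
rng = np.random.default_rng(0)
x = rng.standard_normal((8, hidden))          # 8 tokens of activations
w_qkv = rng.standard_normal((3 * hidden, hidden))  # fused QKV projection weight

# Fused path: one GEMM yields a single [tokens, 3*hidden] buffer that must be
# held in memory before Q, K, V are sliced out of it.
fused = x @ w_qkv.T
q_f, k_f, v_f = np.split(fused, 3, axis=-1)

# Split path: three GEMMs against the sliced weights; K and V could be written
# directly into the KV cache, avoiding the transient 3*hidden buffer.
w_q, w_k, w_v = np.split(w_qkv, 3, axis=0)
q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T

# Both paths compute identical projections.
assert np.allclose(q, q_f) and np.allclose(k, k_f) and np.allclose(v, v_f)
```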
Runtime Maturity
- Added LoRA support for Speculative Decoding
- Fixed several compiler warnings and documented the exceptions thrown by functions
- Added coverage tests
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @Jasper-NV @yuanyao-nv @fkaster-nvidia @poweiw @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe
v0.5.0
TensorRT Edge-LLM 0.5.0 Release 2026-02-19
We are very excited to announce the second release of TensorRT Edge-LLM! We welcome community contributions and have merged the first community PR (#13 by @taoz27) in this release. Our developer roadmap for H1 2026 is listed in #32.
Breaking Changes
- Due to the new standalone embedding processing module, ONNX models exported by previous versions are not compatible with 0.5.0.
Key Features
- Implemented and adopted a standalone embedding processing module, reducing multi-modal modeling complexity and the EAGLE inference memory footprint
- Added FP8 KV Cache support
- Unified TensorRT execution context for prefill and decode to reduce memory footprint
- Added vanilla decoding support to the speculative decoding runtime
- Used collision-resistant hashing for CUDA graphs
- Updated int4GroupwiseGemmPlugin to the TensorRT Plugin-v3 interface
- Refactored the documentation
- Added ViT attention mask and RoPE parameter caching to reduce recomputation for Qwen (Merged #13 by @taoz27)
- Added JetPack 6.2 compatibility
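The collision-resistant hashing item can be sketched in Python; this is an illustrative analogue (the `graph_key` helper and the assumption that the hash keys a graph cache are ours, not the actual runtime code). Keying each captured CUDA graph by a SHA-256 digest of its capture configuration ensures that distinct configurations cannot alias to the same cached graph, which weaker rolling hashes cannot guarantee.

```python
import hashlib

def graph_key(batch_size: int, seq_len: int, flags: tuple) -> str:
    """Collision-resistant cache key for a captured CUDA graph (sketch).

    Serialize every parameter that affects graph capture and hash it with
    SHA-256, so two different configurations cannot collide and replay
    the wrong graph.
    """
    payload = repr((batch_size, seq_len, flags)).encode()
    return hashlib.sha256(payload).hexdigest()

graphs = {}  # digest -> captured graph handle (stand-in dict)
key = graph_key(1, 128, ("decode",))
if key not in graphs:
    graphs[key] = object()  # real code would capture and store the graph here
```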
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @fkaster-nvidia @chemudupatiks @poweiw @charllll @mahu888 @xinrzhang-star @byte-deve @JCalafato @nvmbreughe
v0.4.0
TensorRT Edge-LLM 0.4.0 Release 2026-01-06
We are very excited to announce the first release of TensorRT Edge-LLM! TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. Please follow our Quick Start Guide to get started.
Key Components
- Flexible Python CLI exporter that converts HuggingFace checkpoints to ONNX
- C++ TensorRT engine builder
- C++ tokenizers and multi-modal processors
- C++ runtime, including vanilla decoding and EAGLE3 speculative decoding
- Optimized CUDA kernels for Multi-Head Attention, Sampling, and EAGLE3 utilities
- Examples for running inference, accuracy evaluations, and performance benchmarks
Model Support
- Llama3.x
- Qwen2/2.5/3 (Dense)
- Qwen2/2.5/3-VL (Dense)
- InternVL3
- Phi4-Multimodal
Please check the model support page for more details.
Key Features
Model Export
- nvfp4/fp8/int4 quantization
- nvfp4/fp8 lm_head quantization
- EAGLE3 draft quantization
- Vocab reduction
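Vocab reduction can be pictured with a small NumPy sketch (the shapes and the `kept_ids` set are illustrative assumptions, not the exporter's actual behavior): prune the `lm_head` weight to only the token ids the application needs, then map the reduced argmax back to the original vocabulary.

```python
import numpy as np

# Hypothetical sizes for illustration
vocab, hidden = 1000, 16
rng = np.random.default_rng(0)
lm_head = rng.standard_normal((vocab, hidden))  # full output projection

# Keep only the token ids we actually need (assumed subset).
kept_ids = np.array([0, 1, 2, 50, 999])
reduced = lm_head[kept_ids]                     # (len(kept_ids), hidden)

# At inference time, map a reduced-logit argmax back to the original id.
hidden_state = rng.standard_normal(hidden)
local = int(np.argmax(reduced @ hidden_state))
original_id = int(kept_ids[local])
```

The smaller projection shrinks both the engine and the per-step logits buffer, at the cost of only being able to emit the kept tokens.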
Runtime
- Multi-batch EAGLE3 speculative decoding for LLM and VLM
- Decoding CUDA Graph
- System prompt KV cache reuse
- OpenAI-style chat templates
- Dynamic LoRA switching
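The speculative decoding items above follow the standard draft-then-verify pattern. A minimal greedy-verification sketch (the generic algorithm, not the EAGLE3 runtime's actual code) accepts the longest prefix of the draft that the target model agrees with, then takes the target's own token at the first mismatch:

```python
def accept_draft(draft_tokens, target_argmax):
    """Greedy speculative-decoding verification (generic sketch).

    Accept each draft token that matches the target model's argmax at the
    same position; on the first mismatch, emit the target's token as the
    correction and stop. If the whole draft is accepted, the extra target
    position supplies one bonus token.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # correction token from the target model
            break
    else:
        # Whole draft accepted: append the bonus token if available.
        if len(target_argmax) > len(draft_tokens):
            accepted.append(target_argmax[len(draft_tokens)])
    return accepted

# Draft proposes 4 tokens; target agrees on the first two.
print(accept_draft([5, 7, 9, 2], [5, 7, 8, 2, 11]))  # -> [5, 7, 8]
```

Every accepted position amortizes one target-model forward pass across several emitted tokens, which is where the speedup comes from.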