Releases: NVIDIA/TensorRT-Edge-LLM

v0.6.1

16 Apr 16:13
2620a97

TensorRT Edge-LLM 0.6.1 Release 2026-04-15

  • Added DriveOS 7.2.4 official support
  • Fixed an EAGLE draft model weight-loading issue so that the acceptance rate can be retrieved

v0.6.0

19 Mar 04:50
996623c

TensorRT Edge-LLM 0.6.0 Release 2026-03-16

We are very excited to announce the 0.6.0 release of TensorRT Edge-LLM!

  • TensorRT Edge-LLM is featured at GTC 2026! See our blog for details.
  • Our developer roadmap for H1 2026 is listed in #32.
  • Check out our flagship demo "Run Physical AI Models Faster on Jetson" at GTC 2026, which uses TensorRT Edge-LLM as one of its backends to showcase the ASR/LLM/TTS capabilities of NVIDIA Jetson AGX Thor.

Breaking Changes

  • Due to the ViT Attention Plugin and split QKV changes, ONNX models exported by previous versions are not compatible with 0.6.0.

Key Features

Model Support

  • Added Nemotron-Nano-9B-v2 support via mamba_ssm and causal_conv1d
  • Added day 0 support for Nemotron-3-Nano-4B
  • Added Qwen3-30B-A3B-GPTQ-Int4 support via INT4 MoE Plugin
  • Added Qwen3-ASR and Qwen3-TTS end-to-end support

Performance Improvements

  • Added cutedsl FMHA kernels to speed up prefill performance on Blackwell
  • Used the ViT Attention Plugin with fmha-v2 and cutedsl kernels to speed up multi-image ViT performance
  • Updated the Attention Plugin to split q, k and v, reducing memory usage
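The split-QKV change in the Attention Plugin can be illustrated with a minimal NumPy sketch. The shapes and names below are hypothetical illustrations, not the plugin's actual tensors: the point is that producing q, k and v separately avoids ever materializing the fused `[seq, 3*hidden]` activation.

```python
import numpy as np

hidden, seq = 256, 16                     # illustrative sizes
rng = np.random.default_rng(0)
x = rng.standard_normal((seq, hidden), dtype=np.float32)

# Fused projection: a single [hidden, 3*hidden] weight forces a
# [seq, 3*hidden] activation to be resident all at once.
w_qkv = rng.standard_normal((hidden, 3 * hidden), dtype=np.float32)
qkv = x @ w_qkv

# Split projections: q, k and v come from separate weights, so each
# [seq, hidden] buffer can be handed to the attention kernel on its own
# and the concatenated tensor never needs to exist.
w_q, w_k, w_v = np.split(w_qkv, 3, axis=1)
q, k, v = x @ w_q, x @ w_k, x @ w_v

# The math is unchanged; only peak activation memory differs.
assert np.allclose(qkv, np.concatenate([q, k, v], axis=1), atol=1e-4)
```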

Runtime Maturity

  • Added LoRA support for Speculative Decoding
  • Fixed several compiler warnings and documented the exceptions thrown by functions
  • Added coverage tests

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @Jasper-NV @yuanyao-nv @fkaster-nvidia @poweiw @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe

v0.5.0

20 Feb 00:36
8fe7fe1

TensorRT Edge-LLM 0.5.0 Release 2026-02-19

We are very excited to announce the second release of TensorRT Edge-LLM! We welcome community contributions and have merged the first community PR (#13 by @taoz27) in this release. Our developer roadmap for H1 2026 is listed in #32.

Breaking Changes

  • Due to the new standalone embedding processing module, ONNX models exported by previous versions are not compatible with 0.5.0.

Key Features

  • Implemented a standalone embedding processing module to reduce multi-modal modeling complexity and the EAGLE inference memory footprint
  • Added FP8 KV Cache support
  • Unified TensorRT execution context for prefill and decode to reduce memory footprint
  • Supported vanilla decoding for speculative decoding runtime
  • Used collision resistant hashing for CUDA graphs
  • Updated int4GroupwiseGemmPlugin to TensorRT Plugin-v3 interface
  • Refactored documentation
  • Added ViT attention mask and RoPE parameter caching to reduce recomputation for Qwen (Merged #13 by @taoz27)
  • Added JetPack 6.2 compatibility
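The CUDA-graph hashing item above can be sketched as follows. The key fields (batch size, sequence length, dtype) and the `graph_cache_key` helper are illustrative assumptions, not the runtime's actual cache key: the idea is that a collision-resistant digest prevents two different configurations from silently sharing one captured graph.

```python
import hashlib

def graph_cache_key(batch_size: int, seq_len: int, dtype: str) -> str:
    """Collision-resistant cache key for a captured CUDA graph.

    A weak hash could map two different shape configurations to the
    same key and silently replay the wrong graph; SHA-256 makes an
    accidental collision practically impossible.
    """
    payload = f"{batch_size}|{seq_len}|{dtype}".encode()
    return hashlib.sha256(payload).hexdigest()

graphs: dict[str, str] = {}               # key -> captured graph (stand-in)
key = graph_cache_key(8, 1024, "fp8")
if key not in graphs:                     # capture once, replay afterwards
    graphs[key] = "graph(batch=8, seq=1024, dtype=fp8)"
```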

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @fkaster-nvidia @chemudupatiks @poweiw @charllll @mahu888 @xinrzhang-star @byte-deve @JCalafato @nvmbreughe

v0.4.0

05 Jan 22:50
50a61d0

TensorRT Edge-LLM 0.4.0 Release 2026-01-06

We are very excited to announce the first release of TensorRT Edge-LLM! TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. Please follow our Quick Start Guide for usage instructions.

Key Components

Model Support

  • Llama3.x
  • Qwen2/2.5/3 (Dense)
  • Qwen2/2.5/3-VL (Dense)
  • InternVL3
  • Phi4-Multimodal

Please check the model support page for more details.

Key Features

Model Export

  • nvfp4/fp8/int4 quantization
  • nvfp4/fp8 lm_head quantization
  • EAGLE3 draft quantization
  • Vocab reduction
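Vocab reduction can be sketched as pruning the lm_head to the token ids a deployment actually uses. All sizes, the token-id subset, and the variable names below are hypothetical, not the exporter's API; the sketch only shows why the technique shrinks both the weight matrix and the per-step logits tensor.

```python
import numpy as np

vocab, hidden = 32_000, 512               # illustrative sizes
rng = np.random.default_rng(0)
lm_head = rng.standard_normal((vocab, hidden)).astype(np.float16)

# Keep only the token ids the deployment needs (a hypothetical subset);
# the reduced head shrinks both the weights and the per-step logits.
kept_ids = np.array([0, 1, 2, 13, 31_999])
reduced_head = lm_head[kept_ids]

# Sampled indices are local to the reduced head, so a mapping back to
# the original vocabulary is needed when emitting tokens.
local_to_global = {i: int(t) for i, t in enumerate(kept_ids)}
```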

Runtime

  • Multi-batch EAGLE3 speculative decoding for LLM and VLM
  • Decoding CUDA Graph
  • System prompt KVCache reuse
  • OpenAI-style chat template
  • Dynamic LoRA switching
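An OpenAI-style chat request is a list of role/content messages. The sketch below shows a hypothetical template rendering: the real template comes from the model's tokenizer configuration, and `render_chat` and the `<|role|>` markers are illustrative assumptions, not part of the runtime's API.

```python
# OpenAI-style chat input: a list of {"role", "content"} messages.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize CUDA graphs in one sentence."},
]

def render_chat(msgs: list[dict]) -> str:
    """Hypothetical template rendering; a real deployment would take
    the chat template from the model's tokenizer configuration."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in msgs]
    return "\n".join(parts) + "\n<|assistant|>\n"

prompt = render_chat(messages)
```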