Vision-Language Model (VLM) implementations using Google Gemma 3 4B for real-time camera analysis.
This repository provides two different approaches to using vision-language models with live camera feeds:
- VLM Video Chat - Optimized video chat interface
- ROS2 VLM Integration - Full robotics integration
```
VLM/
├── README.md              # This file
├── CLAUDE.md              # Development notes
├── vlm_video_chat/        # Optimized video chat interface
│   ├── README.md
│   ├── setup.sh
│   ├── run.sh
│   └── vlm_standalone.py
└── ros2_vlm/              # Full ROS2 integration
    ├── README.md          # Detailed architecture guide
    ├── demo.sh            # Main launcher
    ├── setup/             # Setup and testing
    ├── demos/             # Demo applications
    └── nodes/             # ROS2 nodes and processing
```
Run a persistent VLM server with smart Linux/macOS detection, optional vLLM, and audio transcription:

```
./serve.sh
```

The server starts on http://localhost:8080. Demos use it automatically when it is available.
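A script can probe whether the persistent server is up before falling back to in-process inference. A minimal sketch, endpoint-agnostic since the exact routes served by `serve.sh` are not documented here, so any HTTP response counts as "alive":

```python
import urllib.error
import urllib.request

SERVER_URL = "http://localhost:8080"

def server_available(url: str = SERVER_URL, timeout: float = 2.0) -> bool:
    """Return True if the persistent VLM server answers HTTP at `url`."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # server responded, just not with 200
    except (urllib.error.URLError, OSError):
        return False
```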
If you just want to test VLM with your camera:

```
cd ros2_vlm
sudo ./setup/install_ros2_packages.sh
./demo.sh
# Choose option 1 (Live Demo)
```

This gives you a single window with live camera + VLM analysis overlay.
If you're building ROS2 robotics applications:

```
cd ros2_vlm
sudo ./setup/install_ros2_packages.sh
./demo.sh
# Choose option 2 (ROS2 Demo)
```

This provides full ROS2 topics/services integration.
If you want the simplest possible setup:

```
cd vlm_video_chat
./setup.sh
./run.sh
```

This is an optimized VLM video chat with performance enhancements.
Purpose: Optimized VLM video chat interface
Use case: Quick testing, demonstrations, interactive VLM usage
Architecture: Single Python process with performance optimizations
Features:
- Clean video chat interface with real-time camera
- 10 quick prompt buttons with tooltips
- Optimized for RTX 5000 Ada (16GB GPU)
- Performance tuned: 512px inputs, KV cache, explicit device mapping
- Chat history with timestamps and processing times
Purpose: Production-ready VLM for robotics
Use case: Robot navigation, scene understanding, multi-node systems
Architecture: Hybrid system (ROS2 + conda environments)
Features:
- Two demo modes (integrated + ROS2)
- Real-time camera analysis
- ROS2 topic/service integration
- Keyboard shortcuts for quick prompts
- Full model output display
- GPU optimization for RTX 5000 Ada
Camera → VLM Processing → Chat Interface
(Single environment, simple setup)
Camera → ROS2 Node → VLM Subprocess → ROS2 Topics
(Environment isolation, production-ready)
Problem: ROS2 Jazzy requires Python 3.12, but VLM models work best in conda Python 3.10
Solution: Hybrid architecture where:
- ROS2 nodes run in system Python 3.12
- VLM processing runs in conda Python 3.10
- Subprocess communication bridges environments
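The bridge can be sketched as line-delimited JSON over the worker's stdin/stdout. This is an illustrative sketch: the interpreter path and wire format below are assumptions, not the repository's actual protocol, and the echo worker stands in for the real conda-side VLM process.

```python
import json
import subprocess
import sys

# Hypothetical path to the conda Python 3.10 interpreter that hosts the VLM.
VLM_PYTHON = "/opt/conda/envs/vlm/bin/python"

def query_vlm(proc, prompt, image_path):
    """Send one JSON request line to the worker and read one JSON reply."""
    proc.stdin.write(json.dumps({"prompt": prompt, "image": image_path}) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

# Stand-in worker so the bridge can be exercised without the VLM installed.
ECHO_WORKER = """
import sys, json
for line in sys.stdin:
    req = json.loads(line)
    print(json.dumps({"answer": "seen " + req["prompt"]}), flush=True)
"""

if __name__ == "__main__":
    proc = subprocess.Popen([sys.executable, "-c", ECHO_WORKER],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            text=True)
    reply = query_vlm(proc, "describe the scene", "/tmp/frame.jpg")
    print(reply["answer"])  # → seen describe the scene
    proc.terminate()
```

In the real system the ROS2 node would launch `VLM_PYTHON` instead of `sys.executable`, keeping each side in its own Python version while sharing only the pipe.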
Problem: Gemma 3 4B model has complex device mapping requirements
Solution: Smart device mapping that:
- Detects GPU memory (15.7GB RTX 5000 Ada)
- Uses explicit device placement for large GPUs
- Falls back to auto-mapping for smaller GPUs
- Handles CPU fallback gracefully
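The decision logic above can be sketched as a small pure function; the 15 GB threshold and the returned `device_map` strings are illustrative assumptions, not the repository's exact values.

```python
def detect_gpu_mem_gb():
    """Total memory of GPU 0 in GiB, or None when CUDA is unavailable."""
    try:
        import torch
        if torch.cuda.is_available():
            return torch.cuda.get_device_properties(0).total_memory / 1024**3
    except ImportError:
        pass
    return None

def choose_device_map(gpu_mem_gb):
    """Pick a transformers-style device_map from available GPU memory."""
    if gpu_mem_gb is None:
        return "cpu"          # graceful CPU fallback
    if gpu_mem_gb >= 15.0:    # e.g. RTX 5000 Ada reports ~15.7 GB
        return "cuda:0"       # explicit placement on the large GPU
    return "auto"             # let accelerate auto-map smaller GPUs
```

`choose_device_map(detect_gpu_mem_gb())` would then feed straight into `from_pretrained(..., device_map=...)`.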
System:
- Ubuntu 22.04/24.04
- USB camera at /dev/video0
- 8GB+ RAM (16GB+ recommended)
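A quick Linux-specific way to confirm the camera requirement is to check for V4L2 device nodes before running any demo:

```python
import glob
import os

def list_video_devices():
    """Return the V4L2 device nodes present on this host (e.g. /dev/video0)."""
    return sorted(glob.glob("/dev/video*"))

def camera_present(device="/dev/video0"):
    """True if the expected camera device node exists."""
    return os.path.exists(device)

if __name__ == "__main__":
    print(list_video_devices() or "no cameras found")
```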
GPU (Optional but Recommended):
- NVIDIA GPU with 8GB+ VRAM
- RTX 5000 Ada (16GB) for optimal performance
- CPU fallback available
Authentication:
- HuggingFace account
- Access to google/gemma-3-4b-it model
- HuggingFace CLI login
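A script can verify the login step before attempting to download the gated model. The token path below is the `huggingface_hub` default; a custom `HF_HOME` would move it.

```python
import os
from pathlib import Path

def hf_token_present():
    """True if `huggingface-cli login` stored a token or HF_TOKEN is set."""
    token_file = Path.home() / ".cache" / "huggingface" / "token"
    return token_file.exists() or bool(os.environ.get("HF_TOKEN"))
```

Note that a present token is necessary but not sufficient: access to google/gemma-3-4b-it must also have been granted on the Hub.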
With RTX 5000 Ada (16GB):
- Model loading: 3-5 seconds
- Analysis: 1-2 seconds per frame
- Memory usage: 8-10GB VRAM
CPU Fallback:
- Model loading: 10-15 seconds
- Analysis: 3-5 seconds per frame
- Memory usage: 6-8GB RAM
| Use Case | Project | Setup Complexity | Features |
|---|---|---|---|
| Quick Testing | ros2_vlm/ (Option 1) | Medium | Live camera + overlay |
| Robotics Development | ros2_vlm/ (Option 2) | Medium | Full ROS2 integration |
| Optimized Video Chat | vlm_video_chat/ | Low | Performance-tuned interface |
- Start with: the ros2_vlm/ project for most use cases
- Read: the detailed ros2_vlm/README.md for architecture understanding
- Test first: use the test script to verify VLM functionality
- Troubleshoot: check camera, GPU, and HuggingFace authentication
Each project directory contains its own detailed README with specific setup instructions and troubleshooting guides.