Vision-Language Navigation Robot

TurtleBot3 Waffle navigates to objects described in natural language inside a Gazebo environment it has never seen before. No pre-built map. No hand-coded waypoints.

"Go to the chair, move past the trash can, stop at the bed."

The robot parses natural language into a goal sequence, builds a semantic 3D map in real time using vision models, and navigates using Nav2. It explores autonomously when a target has not yet been seen.

Why This Problem

Most VLN systems assume a pre-built map or require extensive environment-specific training data. Real robots in real deployments face environments they have never seen. This system is designed to work zero-shot:

No pre-built semantic map — the robot builds one as it explores
No hand-coded waypoints — goals come from natural language
No environment-specific training — YOLO, CLIP, and Gemini are used zero-shot
Fully autonomous exploration when a target is not yet visible

This directly addresses PS6 (autonomous robotic exploration with natural language goals): the robot handles an unknown environment, builds semantic memory, and navigates — all from a single text or voice command.

How It Works

Pipeline

flowchart LR
    A(["Voice / Text"]) --> B["vln_parser<br/>Gemini 2.5 Flash"]
    B -->|"/vln/goal_stack"| C["Arbitrator<br/>10 Hz"]
    subgraph DM ["DualMap - 15 Hz"]
        direction TB
        D["RGB-D<br/>640x480"] --> E["YOLO + SAM<br/>+ FastSAM"]
        E --> F["CLIP ViT-L-14<br/>768-dim"]
        F --> G[("Global Map<br/>+ Local Map")]
    end
    G -->|"/action_path"| C
    G -->|"semantic_map.json 5s"| C
    C -->|"actions.yaml 100ms poll"| G
    C -->|"explore/resume"| H["explore_lite<br/>Wavefront BFS"]
    C -->|"NavigateToPose"| I["Nav2<br/>NavFn A* + DWB"]
    C -->|"NavigateToPose"| J["TD3 RL<br/>policy"]
    H --> I
    L["Cartographer<br/>Graph SLAM"] --> I
    L --> M["map_odom_bridge<br/>10 Hz"]
    M -->|"/odom_map"| D
    I --> K(["/cmd_vel"])
    J --> K

DualMap — Two-Layer Semantic Memory

Each camera frame is not processed independently. DualMap triggers a keyframe when the robot moves enough (≥5 cm, ≥3°, or ≥0.3 s), then runs the full perception pipeline on that keyframe. Objects observed across ≥3 stable keyframes with a low-mobility score get promoted to the global map as navigation targets.

flowchart TD
    T(["Keyframe trigger<br/>d_pos >= 5cm<br/>d_rot >= 3deg<br/>d_t >= 0.3s"])
    T --> A["RGB 640x480"]
    T --> B["Depth float32m"]
    A --> C["YOLOv8x-World v2<br/>conf=0.2"]
    A --> D["FastSAM-x<br/>conf=0.35"]
    C --> E["MobileSAM<br/>bbox to masks"]
    E --> F{"merge + filter<br/>IoU > 0.80 dedup<br/>area &lt; 80px removed"}
    D --> F
    B --> G["Unproject 3D<br/>fx=fy=565.6"]
    F --> G
    G --> H["DBSCAN<br/>eps=0.1 min=10"]
    H --> I["map frame<br/>via /odom_map"]
    F --> J["CLIP ViT-L-14<br/>0.7*img + 0.3*text<br/>768-dim"]
    I --> LM["Local Map<br/>10-frame window<br/>high-mobility objects"]
    J --> LM
    LM -->|"3+ stable keyframes<br/>low-mobility score"| GM[("Global Map<br/>nav targets<br/>A* graph")]

Mobility routing — CLIP cosine similarity against prototype names classifies objects at inference time. Low-mobility objects (chair, bed, shelf) go to the global map as navigation targets. High-mobility objects (cup, pillow, remote) go to the local map only and are never navigated to.

Arbitrator State Machine

stateDiagram-v2
    [*] --> IDLE
    IDLE --> QUERY : goal_stack received
    QUERY --> HOMING : DualMap path found under 10s
    QUERY --> HOMING : semantic_map.json hit
    QUERY --> EXPLORE : 10s timeout no hit
    EXPLORE --> HOMING : DualMap path found
    EXPLORE --> HOMING : semantic_map.json hit 5s check
    EXPLORE --> STUCK : 300s timeout
    STUCK --> IDLE : skip goal
    HOMING --> IDLE : Nav2 SUCCEEDED
    HOMING --> REPLAN : Nav2 FAILED
    REPLAN --> HOMING : retry max 3
    REPLAN --> IDLE : skip after 3 fails
    IDLE --> [*] : all goals done

QUERY waits 10 s for DualMap to find the target. On timeout, it falls back to semantic_map.json. If still nothing, it enters EXPLORE (frontier BFS) and checks the semantic map every 5 s until the object is discovered.

Stack

Layer	Tech
Simulation	Gazebo Harmonic · small_house.world
Robot	TurtleBot3 Waffle [1]
Middleware	ROS 2 Humble
SLAM	Cartographer (Graph SLAM)
Classical nav	Nav2 · NavFn A* · DWB (7 critics) [2]
RL nav	TD3 · 44→512→512→2 MLP · 2 training stages [3]
Exploration	explore_lite (Wavefront BFS)
Detection	YOLOv8x-World v2 (conf=0.2)
Segmentation	MobileSAM + FastSAM-x (conf=0.35)
Features	CLIP ViT-L-14 · 768-dim · 0.7×img + 0.3×text
Clustering	DBSCAN (eps=0.1 m, min_pts=10)
Language	Gemini 2.5 Flash · Pydantic structured output
STT	Whisper (local HuggingFace)
Web GUI	Next.js · rosbridge WebSocket · WebSocket STT server
Semantic mapping	DualMap architecture [4]

Installation

Prerequisites

ROS 2 Humble
Gazebo Harmonic
conda (for DualMap Python environment)
Node.js ≥18 (for web GUI only)
CUDA-capable GPU (for DualMap — YOLO, SAM, CLIP inference)

# ROS 2 packages needed beyond base install
sudo apt install ros-humble-cartographer-ros ros-humble-nav2-bringup \
                 ros-humble-rosbridge-server ros-humble-rviz2

1. Clone and Build ROS 2 Workspace

cd ~/dl_hackathon
colcon build --symlink-install
source install/setup.bash

2. Set Up DualMap Conda Environment

DualMap runs as a Python subprocess outside ROS 2. It needs its own environment with CUDA PyTorch, open_clip, ultralytics, and open3d.

cd ~/dl_hackathon/src/hackathon_updated/hackathon/scratch/DualMap
conda env create -f environment.yml
conda activate dualmap

The environment.yml installs: torch, torchvision, open_clip_torch, ultralytics==8.3.103, open3d, hydra-core, supervision==0.25.1, rerun-sdk==0.22.1, faiss-cpu, and others.

3. Download Model Weights

Place these inside src/hackathon_updated/hackathon/scratch/DualMap/model/:

File	Source
`yolov8x-worldv2.pt`	Ultralytics (auto-downloads on first run)
`mobile_sam.pt`	MobileSAM (Zhang et al., 2023) [8]
`FastSAM-x.pt`	FastSAM (Zhao et al., 2023) [7]
CLIP ViT-L-14	Auto-downloaded by open_clip on first run

CLIP weights (datacomp_xl_s13b_b90k) are fetched automatically by open_clip_torch on first use.

4. Set Gemini API Key

export GEMINI_API_KEY="your_key_here"

Used in src/vln_pipeline/vln_pipeline/structured_out.py via os.environ.get("GEMINI_API_KEY"). Get a key at aistudio.google.com.

5. Set TurtleBot3 Model

export TURTLEBOT3_MODEL=waffle

6. Install Web GUI Dependencies (GUI Mode Only)

cd ~/dl_hackathon/VLN-Robot-Control/my-app
npm install

Also install the Python STT server dependencies:

cd ~/dl_hackathon/VLN-Robot-Control
pip install -r requirements.txt
# Linux microphone support
sudo apt install portaudio19-dev
# spaCy model
python -m spacy download en_core_web_sm

Running

Option A — Topic Mode (No GUI)

Send goals via ROS 2 topic directly. No browser needed.

source ~/dl_hackathon/install/setup.bash
export TURTLEBOT3_MODEL=waffle
export GEMINI_API_KEY="your_key_here"

# Start full pipeline
ros2 launch turtlebot3_gazebo gz_vln_exploration.launch.py

# In another terminal — send a text instruction
ros2 topic pub --once /vln/instruction std_msgs/String "data: 'go to the chair'"

Option B — Web GUI Mode (Voice + Text in Browser)

The GUI launch adds rosbridge (port 9090) and an STT WebSocket server (port 8765). The browser connects to both, sends audio or text, and displays the robot's goal progress.

# Terminal 1 — start the full pipeline with GUI bridges
source ~/dl_hackathon/install/setup.bash
export TURTLEBOT3_MODEL=waffle
export GEMINI_API_KEY="your_key_here"
ros2 launch turtlebot3_gazebo gz_vln_gui.launch.py

# Terminal 2 — start the Next.js frontend
cd ~/dl_hackathon/VLN-Robot-Control/my-app
npm run dev

# Open browser
# http://localhost:3000

GUI flow:

Browser connects to rosbridge at ws://localhost:9090
Click microphone → browser streams Float32 PCM audio to STT server at ws://localhost:8765
Whisper [11] transcribes → Gemini [10] parses → GoalStack JSON returned to browser
Browser publishes GoalStack to /vln/goal_stack via rosbridge
Arbitrator picks up goal, DualMap searches, Nav2 navigates

You can also type a command directly in the text box — it skips Whisper and goes straight to Gemini parsing.

Launch Startup Sequence

Both launch files follow the same sequence:

Time	What Starts
0 s	Gazebo server + GUI, robot spawn, Cartographer, Nav2 (5 lifecycle nodes), map_odom_bridge, rviz2
0 s (GUI only)	rosbridge (port 9090), STT server (port 8765)
25 s	explore_lite, DualMap subprocess, vln_parser, arbitrator

The 25-second delay exists because Nav2 costmaps need LiDAR data to initialize. DualMap and explore_lite fail if attached before the costmap is ready.

Repository Layout

src/
│
├── vln_pipeline/                          # Main ROS 2 package
│   ├── vln_pipeline/
│   │   ├── arbitrator_node.py             # 10 Hz state machine: IDLE→QUERY→EXPLORE→HOMING
│   │   ├── vln_parser_node.py             # Subscribes /vln/instruction → calls Gemini → publishes /vln/goal_stack
│   │   ├── structured_out.py              # Pydantic GoalStack schema + Gemini client
│   │   ├── map_odom_bridge.py             # Reads Cartographer TF map→base_footprint → publishes /odom_map at 10 Hz
│   │   └── utils.py                       # Shared helpers (pose math, file IPC read/write)
│   ├── stt/
│   │   ├── stt_server.py                  # WebSocket server: audio → Whisper → Gemini → GoalStack JSON
│   │   └── speech_to_text.py             # Whisper wrapper
│   └── config/
│       └── vln_params.yaml                # arbitrator + vln_parser ROS params
│
├── hackathon_updated/hackathon/scratch/DualMap/   # Dual-level semantic mapping
│   ├── applications/
│   │   └── runner_ros.py                  # Entry point — ROS subscriber loop at 15 Hz
│   ├── dualmap/
│   │   ├── local_map.py                   # LocalMapManager: 10-frame sliding window, NetworkX graph
│   │   ├── global_map.py                  # GlobalMapManager: persistent objects, A* pathfinding
│   │   ├── perception.py                  # YOLO + MobileSAM + FastSAM + CLIP + DBSCAN pipeline
│   │   └── mobility.py                    # CLIP cosine similarity mobility classifier
│   ├── config/
│   │   ├── runner_ros.yaml                # ros_rate=15, use_fastsam, use_parallel, etc.
│   │   ├── actions.yaml                   # IPC file: arbitrator writes, DualMap polls every 100ms
│   │   └── system_config.yaml             # model paths, DBSCAN params, CLIP config
│   ├── model/                             # Model weight files
│   └── environment.yml                    # conda env with all Python deps
│
├── turtlebot3_simulations/
│   └── turtlebot3_gazebo/
│       └── launch/
│           ├── gz_vln_exploration.launch.py   # Main entry point — Nav2 + DualMap + language pipeline
│           ├── gz_vln_gui.launch.py           # GUI variant — adds rosbridge + STT WebSocket server
│           └── gz_rl_exploration.launch.py    # RL variant — TD3 replaces Nav2 bt_navigator
│
├── rl_nav/                                # TD3 inference node
│   └── rl_nav/
│       └── rl_nav_node.py                 # NavigateToPose action server backed by TD3 actor MLP
│
├── turtlebot3_drl/                        # TD3 training infrastructure
│   └── turtlebot3_drl/
│       ├── drl_agent/                     # TD3 actor/critic networks, replay buffer, training loop
│       ├── drl_environment/               # Gazebo environment reward, reset, step logic
│       └── drl_gazebo/                    # Gazebo plugin bridge for DRL training
│
├── m-explore-ros2/                        # explore_lite — frontier exploration
│   └── explore_lite/                      # Wavefront BFS over Nav2 global costmap
│
├── turtlebot3/                            # TurtleBot3 ROS 2 packages
├── turtlebot3_msgs/                       # TurtleBot3 message definitions
└── aws-robomaker-small-house-world/       # Gazebo small_house.world environment

VLN-Robot-Control/                         # Web GUI frontend
└── my-app/
    ├── app/                               # Next.js app router pages
    ├── components/                        # React components (voice control, status display)
    └── hooks/                             # Custom hooks (rosbridge connection, STT stream)

IPC Between Arbitrator and DualMap

DualMap runs as a Python subprocess (Hydra config), not a ROS 2 node. Communication is file-based to avoid the overhead of spinning up a full ROS 2 node inside the conda environment:

config/actions.yaml — arbitrator writes queries (what to search, whether to plan a path), DualMap polls every 100 ms
semantic_map.json — DualMap writes every 5 s, arbitrator reads during EXPLORE state to check if target has been discovered

# actions.yaml written by arbitrator
calculate_path: true
get_goal_mode: inquiry
inquiry_sentence: chair
trigger_find_next: false

RL Navigation

Reward Function

The reward at each step combines a dense progress signal, a collision penalty, and a sparse goal-arrival bonus:

r(t) = r_distance + r_collision + r_goal

r_distance — positive reward proportional to the reduction in Euclidean distance to the goal between consecutive steps. Provides a continuous learning signal to guide the robot toward the target.
r_collision — large negative penalty (−200) triggered when any LiDAR reading falls below the minimum safe threshold, driving the policy to learn obstacle clearance without explicit rules.
r_goal — large positive sparse reward (+200) when the robot reaches within a success radius of the goal, anchoring the terminal objective.

Two training stages apply the same reward structure in progressively harder environments: Stage 1 uses static obstacles so the robot learns basic goal-seeking and clearance; Stage 2 adds moving obstacles, forcing the policy to react to dynamic threats.

Twin Critics

TD3 maintains two independent critic networks, Q₁ and Q₂, each estimating the value of (state, action) pairs. The target for both critics uses the minimum of their estimates:

y = r + γ · min(Q₁'(s', ã), Q₂'(s', ã))

where ã is the action from the target actor with added clipped Gaussian noise (target policy smoothing). Taking the minimum prevents the overestimation bias that afflicts single-critic methods — if one network drifts high, the other clips it down. The two critics are updated independently but share the same replay buffer.

The actor is updated less frequently than the critics (delayed policy update, every 2 critic steps by default). This ensures the critic has converged sufficiently before the policy gradient step, reducing error accumulation over training.

Why TD3 over DDPG and DQN

Fujimoto et al. (2018) [3] prove that function approximation errors persist in actor-critic settings and lead to systematic overestimation of Q-values, motivating the choice of TD3 over its predecessors. Key comparisons:

DQN discretises the action space, selecting from a finite set of predefined linear and angular velocity pairs. While the robot can navigate around obstacles, the discrete nature of the policy produces abrupt velocity switches that cause unsmooth trajectories — a significant problem for a robot that must also maintain stable LiDAR scans for SLAM.

DDPG operates in continuous action space and is the direct predecessor of TD3. It uses a single critic, which makes it prone to overestimation bias. The approximate policy is updated using the gradient of an overestimated value function, resulting in a policy that can exploit estimation errors rather than learn the true objective.

TD3 [3] addresses all three failure modes of DDPG through:

Clipped Double Q-learning — two independent critics, target computed as min(Q₁', Q₂'), preventing runaway overestimation.
Delayed policy updates — the actor is updated only every second critic step, giving the critic time to stabilize before the policy gradient is applied.
Target policy smoothing — clipped Gaussian noise is added to target actions when computing critic targets, acting as a regularizer and preventing the policy from exploiting narrow peaks in the value landscape.

For the two-stage curriculum in this system, TD3's stability under noisy early-training value estimates is critical. The policy does not chase a poorly-calibrated critic signal during Stage 1, and the replay buffer allows efficient reuse of samples from both stages during Stage 2 fine-tuning.

Why RL Is Not Used in the Main Pipeline

The TD3 policy uses 2.0 rad/s max angular velocity — twice what Nav2 allows. When the robot spins fast, the LiDAR completes a 360° sweep while the heading is still changing. Cartographer receives a scan acquired across multiple orientations and treats it as a single-pose snapshot. Scan-to-submap matching fails, submaps misalign, and the global map drifts. Once the map drifts, DualMap's 3D object positions (which are transformed using /odom_map from Cartographer) also drift — the entire semantic pipeline breaks.

The main launch (gz_vln_exploration.launch.py) therefore uses Nav2 [2] with a 1.0 rad/s cap, which keeps Cartographer stable.

The fix — retrain with angular velocity capped at ≤1.0 rad/s, or add LiDAR motion distortion correction before feeding scans to Cartographer. Not completed due to time constraints.

Why RL Is the Right Choice for Dynamic Environments

Nav2 with A* + DWB replans on a costmap. If an obstacle moves after the plan is computed, Nav2 must detect the change in the costmap, trigger a replan, and issue new velocity commands — that loop has latency. TD3 reads the current LiDAR state and outputs velocity directly at 10 Hz with no planning step. A person walking across the robot's path is handled immediately rather than after a replan cycle.

Training

Two stages, both in Gazebo:

Stage 1 — static obstacles: empty room with fixed walls and boxes. Robot learns basic goal-reaching and collision avoidance.
Stage 2 — dynamic obstacles: same room with moving obstacles added. Robot learns to react to objects crossing its path.

Final weights: actor_stage2_episode*.pt (44→512→512→2 MLP, Tanh output). Loaded by rl_nav_node.py via the model_path ROS parameter.

References

ROBOTIS. TurtleBot3. https://github.com/ROBOTIS-GIT/turtlebot3, 2023.
S. Macenski, F. Martín, R. White, J. G. Clavero. The Marathon 2: A Navigation System. IEEE/RSJ IROS, 2020.
S. Fujimoto, H. van Hoof, D. Meger. Addressing Function Approximation Error in Actor-Critic Methods. ICML, pp. 1582–1591, 2018.
J. Jiang, Y. Zhu, Z. Wu, J. Song. DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes. arXiv:2506.01950, 2025.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
VLN-Robot-Control		VLN-Robot-Control
scripts		scripts
src		src
.gitignore		.gitignore
DRL_TRAINING.md		DRL_TRAINING.md
README.md		README.md
_figure.png		_figure.png
architecture_diagram.md		architecture_diagram.md
cmd.txt		cmd.txt
cyclonedds.xml		cyclonedds.xml
json		json
math		math
semantic_map.json		semantic_map.json
time		time

Folders and files

Latest commit

History

Repository files navigation

Vision-Language Navigation Robot

Why This Problem

How It Works

Pipeline

DualMap — Two-Layer Semantic Memory

Arbitrator State Machine

Stack

Installation

Prerequisites

1. Clone and Build ROS 2 Workspace

2. Set Up DualMap Conda Environment

3. Download Model Weights

4. Set Gemini API Key

5. Set TurtleBot3 Model

6. Install Web GUI Dependencies (GUI Mode Only)

Running

Option A — Topic Mode (No GUI)

Option B — Web GUI Mode (Voice + Text in Browser)

Launch Startup Sequence

Repository Layout

IPC Between Arbitrator and DualMap

RL Navigation

Reward Function

Twin Critics

Why TD3 over DDPG and DQN

Why RL Is Not Used in the Main Pipeline

Why RL Is the Right Choice for Dynamic Environments

Training

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages