TurtleBot3 Waffle navigates to objects described in natural language inside a Gazebo environment it has never seen before. No pre-built map. No hand-coded waypoints.
"Go to the chair, move past the trash can, stop at the bed."
The robot parses natural language into a goal sequence, builds a semantic 3D map in real time using vision models, and navigates using Nav2. It explores autonomously when a target has not yet been seen.
Most VLN systems assume a pre-built map or require extensive environment-specific training data. Real robots in real deployments face environments they have never seen. This system is designed to work zero-shot:
- No pre-built semantic map — the robot builds one as it explores
- No hand-coded waypoints — goals come from natural language
- No environment-specific training — YOLO, CLIP, and Gemini are used zero-shot
- Fully autonomous exploration when a target is not yet visible
This directly addresses PS6 (autonomous robotic exploration with natural language goals): the robot handles an unknown environment, builds semantic memory, and navigates — all from a single text or voice command.
flowchart LR
A(["Voice / Text"]) --> B["vln_parser<br/>Gemini 2.5 Flash"]
B -->|"/vln/goal_stack"| C["Arbitrator<br/>10 Hz"]
subgraph DM ["DualMap - 15 Hz"]
direction TB
D["RGB-D<br/>640x480"] --> E["YOLO + SAM<br/>+ FastSAM"]
E --> F["CLIP ViT-L-14<br/>768-dim"]
F --> G[("Global Map<br/>+ Local Map")]
end
G -->|"/action_path"| C
G -->|"semantic_map.json 5s"| C
C -->|"actions.yaml 100ms poll"| G
C -->|"explore/resume"| H["explore_lite<br/>Wavefront BFS"]
C -->|"NavigateToPose"| I["Nav2<br/>NavFn A* + DWB"]
C -->|"NavigateToPose"| J["TD3 RL<br/>policy"]
H --> I
L["Cartographer<br/>Graph SLAM"] --> I
L --> M["map_odom_bridge<br/>10 Hz"]
M -->|"/odom_map"| D
I --> K(["/cmd_vel"])
J --> K
Each camera frame is not processed independently. DualMap triggers a keyframe when the robot moves enough (≥5 cm, ≥3°, or ≥0.3 s), then runs the full perception pipeline on that keyframe. Objects observed across ≥3 stable keyframes with a low-mobility score get promoted to the global map as navigation targets.
flowchart TD
T(["Keyframe trigger<br/>d_pos >= 5cm<br/>d_rot >= 3deg<br/>d_t >= 0.3s"])
T --> A["RGB 640x480"]
T --> B["Depth float32m"]
A --> C["YOLOv8x-World v2<br/>conf=0.2"]
A --> D["FastSAM-x<br/>conf=0.35"]
C --> E["MobileSAM<br/>bbox to masks"]
E --> F{"merge + filter<br/>IoU > 0.80 dedup<br/>area < 80px removed"}
D --> F
B --> G["Unproject 3D<br/>fx=fy=565.6"]
F --> G
G --> H["DBSCAN<br/>eps=0.1 min=10"]
H --> I["map frame<br/>via /odom_map"]
F --> J["CLIP ViT-L-14<br/>0.7*img + 0.3*text<br/>768-dim"]
I --> LM["Local Map<br/>10-frame window<br/>high-mobility objects"]
J --> LM
LM -->|"3+ stable keyframes<br/>low-mobility score"| GM[("Global Map<br/>nav targets<br/>A* graph")]
Mobility routing — CLIP cosine similarity against prototype names classifies objects at inference time. Low-mobility objects (chair, bed, shelf) go to the global map as navigation targets. High-mobility objects (cup, pillow, remote) go to the local map only and are never navigated to.
stateDiagram-v2
[*] --> IDLE
IDLE --> QUERY : goal_stack received
QUERY --> HOMING : DualMap path found under 10s
QUERY --> HOMING : semantic_map.json hit
QUERY --> EXPLORE : 10s timeout no hit
EXPLORE --> HOMING : DualMap path found
EXPLORE --> HOMING : semantic_map.json hit 5s check
EXPLORE --> STUCK : 300s timeout
STUCK --> IDLE : skip goal
HOMING --> IDLE : Nav2 SUCCEEDED
HOMING --> REPLAN : Nav2 FAILED
REPLAN --> HOMING : retry max 3
REPLAN --> IDLE : skip after 3 fails
IDLE --> [*] : all goals done
QUERY waits 10 s for DualMap to find the target. On timeout, it falls back to semantic_map.json. If still nothing, it enters EXPLORE (frontier BFS) and checks the semantic map every 5 s until the object is discovered.
| Layer | Tech |
|---|---|
| Simulation | Gazebo Harmonic · small_house.world |
| Robot | TurtleBot3 Waffle [1] |
| Middleware | ROS 2 Humble |
| SLAM | Cartographer (Graph SLAM) |
| Classical nav | Nav2 · NavFn A* · DWB (7 critics) [2] |
| RL nav | TD3 · 44→512→512→2 MLP · 2 training stages [3] |
| Exploration | explore_lite (Wavefront BFS) |
| Detection | YOLOv8x-World v2 (conf=0.2) |
| Segmentation | MobileSAM + FastSAM-x (conf=0.35) |
| Features | CLIP ViT-L-14 · 768-dim · 0.7×img + 0.3×text |
| Clustering | DBSCAN (eps=0.1 m, min_pts=10) |
| Language | Gemini 2.5 Flash · Pydantic structured output |
| STT | Whisper (local HuggingFace) |
| Web GUI | Next.js · rosbridge WebSocket · WebSocket STT server |
| Semantic mapping | DualMap architecture [4] |
- ROS 2 Humble
- Gazebo Harmonic
- conda (for DualMap Python environment)
- Node.js ≥18 (for web GUI only)
- CUDA-capable GPU (for DualMap — YOLO, SAM, CLIP inference)
# ROS 2 packages needed beyond base install
sudo apt install ros-humble-cartographer-ros ros-humble-nav2-bringup \
ros-humble-rosbridge-server ros-humble-rviz2cd ~/dl_hackathon
colcon build --symlink-install
source install/setup.bashDualMap runs as a Python subprocess outside ROS 2. It needs its own environment with CUDA PyTorch, open_clip, ultralytics, and open3d.
cd ~/dl_hackathon/src/hackathon_updated/hackathon/scratch/DualMap
conda env create -f environment.yml
conda activate dualmapThe environment.yml installs: torch, torchvision, open_clip_torch, ultralytics==8.3.103, open3d, hydra-core, supervision==0.25.1, rerun-sdk==0.22.1, faiss-cpu, and others.
Place these inside src/hackathon_updated/hackathon/scratch/DualMap/model/:
| File | Source |
|---|---|
yolov8x-worldv2.pt |
Ultralytics (auto-downloads on first run) |
mobile_sam.pt |
MobileSAM (Zhang et al., 2023) [8] |
FastSAM-x.pt |
FastSAM (Zhao et al., 2023) [7] |
| CLIP ViT-L-14 | Auto-downloaded by open_clip on first run |
CLIP weights (datacomp_xl_s13b_b90k) are fetched automatically by open_clip_torch on first use.
export GEMINI_API_KEY="your_key_here"Used in src/vln_pipeline/vln_pipeline/structured_out.py via os.environ.get("GEMINI_API_KEY"). Get a key at aistudio.google.com.
export TURTLEBOT3_MODEL=wafflecd ~/dl_hackathon/VLN-Robot-Control/my-app
npm installAlso install the Python STT server dependencies:
cd ~/dl_hackathon/VLN-Robot-Control
pip install -r requirements.txt
# Linux microphone support
sudo apt install portaudio19-dev
# spaCy model
python -m spacy download en_core_web_smSend goals via ROS 2 topic directly. No browser needed.
source ~/dl_hackathon/install/setup.bash
export TURTLEBOT3_MODEL=waffle
export GEMINI_API_KEY="your_key_here"
# Start full pipeline
ros2 launch turtlebot3_gazebo gz_vln_exploration.launch.py
# In another terminal — send a text instruction
ros2 topic pub --once /vln/instruction std_msgs/String "data: 'go to the chair'"The GUI launch adds rosbridge (port 9090) and an STT WebSocket server (port 8765). The browser connects to both, sends audio or text, and displays the robot's goal progress.
# Terminal 1 — start the full pipeline with GUI bridges
source ~/dl_hackathon/install/setup.bash
export TURTLEBOT3_MODEL=waffle
export GEMINI_API_KEY="your_key_here"
ros2 launch turtlebot3_gazebo gz_vln_gui.launch.py
# Terminal 2 — start the Next.js frontend
cd ~/dl_hackathon/VLN-Robot-Control/my-app
npm run dev
# Open browser
# http://localhost:3000GUI flow:
- Browser connects to rosbridge at
ws://localhost:9090 - Click microphone → browser streams Float32 PCM audio to STT server at
ws://localhost:8765 - Whisper [11] transcribes → Gemini [10] parses → GoalStack JSON returned to browser
- Browser publishes GoalStack to
/vln/goal_stackvia rosbridge - Arbitrator picks up goal, DualMap searches, Nav2 navigates
You can also type a command directly in the text box — it skips Whisper and goes straight to Gemini parsing.
Both launch files follow the same sequence:
| Time | What Starts |
|---|---|
| 0 s | Gazebo server + GUI, robot spawn, Cartographer, Nav2 (5 lifecycle nodes), map_odom_bridge, rviz2 |
| 0 s (GUI only) | rosbridge (port 9090), STT server (port 8765) |
| 25 s | explore_lite, DualMap subprocess, vln_parser, arbitrator |
The 25-second delay exists because Nav2 costmaps need LiDAR data to initialize. DualMap and explore_lite fail if attached before the costmap is ready.
src/
│
├── vln_pipeline/ # Main ROS 2 package
│ ├── vln_pipeline/
│ │ ├── arbitrator_node.py # 10 Hz state machine: IDLE→QUERY→EXPLORE→HOMING
│ │ ├── vln_parser_node.py # Subscribes /vln/instruction → calls Gemini → publishes /vln/goal_stack
│ │ ├── structured_out.py # Pydantic GoalStack schema + Gemini client
│ │ ├── map_odom_bridge.py # Reads Cartographer TF map→base_footprint → publishes /odom_map at 10 Hz
│ │ └── utils.py # Shared helpers (pose math, file IPC read/write)
│ ├── stt/
│ │ ├── stt_server.py # WebSocket server: audio → Whisper → Gemini → GoalStack JSON
│ │ └── speech_to_text.py # Whisper wrapper
│ └── config/
│ └── vln_params.yaml # arbitrator + vln_parser ROS params
│
├── hackathon_updated/hackathon/scratch/DualMap/ # Dual-level semantic mapping
│ ├── applications/
│ │ └── runner_ros.py # Entry point — ROS subscriber loop at 15 Hz
│ ├── dualmap/
│ │ ├── local_map.py # LocalMapManager: 10-frame sliding window, NetworkX graph
│ │ ├── global_map.py # GlobalMapManager: persistent objects, A* pathfinding
│ │ ├── perception.py # YOLO + MobileSAM + FastSAM + CLIP + DBSCAN pipeline
│ │ └── mobility.py # CLIP cosine similarity mobility classifier
│ ├── config/
│ │ ├── runner_ros.yaml # ros_rate=15, use_fastsam, use_parallel, etc.
│ │ ├── actions.yaml # IPC file: arbitrator writes, DualMap polls every 100ms
│ │ └── system_config.yaml # model paths, DBSCAN params, CLIP config
│ ├── model/ # Model weight files
│ └── environment.yml # conda env with all Python deps
│
├── turtlebot3_simulations/
│ └── turtlebot3_gazebo/
│ └── launch/
│ ├── gz_vln_exploration.launch.py # Main entry point — Nav2 + DualMap + language pipeline
│ ├── gz_vln_gui.launch.py # GUI variant — adds rosbridge + STT WebSocket server
│ └── gz_rl_exploration.launch.py # RL variant — TD3 replaces Nav2 bt_navigator
│
├── rl_nav/ # TD3 inference node
│ └── rl_nav/
│ └── rl_nav_node.py # NavigateToPose action server backed by TD3 actor MLP
│
├── turtlebot3_drl/ # TD3 training infrastructure
│ └── turtlebot3_drl/
│ ├── drl_agent/ # TD3 actor/critic networks, replay buffer, training loop
│ ├── drl_environment/ # Gazebo environment reward, reset, step logic
│ └── drl_gazebo/ # Gazebo plugin bridge for DRL training
│
├── m-explore-ros2/ # explore_lite — frontier exploration
│ └── explore_lite/ # Wavefront BFS over Nav2 global costmap
│
├── turtlebot3/ # TurtleBot3 ROS 2 packages
├── turtlebot3_msgs/ # TurtleBot3 message definitions
└── aws-robomaker-small-house-world/ # Gazebo small_house.world environment
VLN-Robot-Control/ # Web GUI frontend
└── my-app/
├── app/ # Next.js app router pages
├── components/ # React components (voice control, status display)
└── hooks/ # Custom hooks (rosbridge connection, STT stream)
DualMap runs as a Python subprocess (Hydra config), not a ROS 2 node. Communication is file-based to avoid the overhead of spinning up a full ROS 2 node inside the conda environment:
config/actions.yaml— arbitrator writes queries (what to search, whether to plan a path), DualMap polls every 100 mssemantic_map.json— DualMap writes every 5 s, arbitrator reads during EXPLORE state to check if target has been discovered
# actions.yaml written by arbitrator
calculate_path: true
get_goal_mode: inquiry
inquiry_sentence: chair
trigger_find_next: falseThe reward at each step combines a dense progress signal, a collision penalty, and a sparse goal-arrival bonus:
r(t) = r_distance + r_collision + r_goal
- r_distance — positive reward proportional to the reduction in Euclidean distance to the goal between consecutive steps. Provides a continuous learning signal to guide the robot toward the target.
- r_collision — large negative penalty (−200) triggered when any LiDAR reading falls below the minimum safe threshold, driving the policy to learn obstacle clearance without explicit rules.
- r_goal — large positive sparse reward (+200) when the robot reaches within a success radius of the goal, anchoring the terminal objective.
Two training stages apply the same reward structure in progressively harder environments: Stage 1 uses static obstacles so the robot learns basic goal-seeking and clearance; Stage 2 adds moving obstacles, forcing the policy to react to dynamic threats.
TD3 maintains two independent critic networks, Q₁ and Q₂, each estimating the value of (state, action) pairs. The target for both critics uses the minimum of their estimates:
y = r + γ · min(Q₁'(s', ã), Q₂'(s', ã))
where ã is the action from the target actor with added clipped Gaussian noise (target policy smoothing). Taking the minimum prevents the overestimation bias that afflicts single-critic methods — if one network drifts high, the other clips it down. The two critics are updated independently but share the same replay buffer.
The actor is updated less frequently than the critics (delayed policy update, every 2 critic steps by default). This ensures the critic has converged sufficiently before the policy gradient step, reducing error accumulation over training.
Fujimoto et al. (2018) [3] prove that function approximation errors persist in actor-critic settings and lead to systematic overestimation of Q-values, motivating the choice of TD3 over its predecessors. Key comparisons:
DQN discretises the action space, selecting from a finite set of predefined linear and angular velocity pairs. While the robot can navigate around obstacles, the discrete nature of the policy produces abrupt velocity switches that cause unsmooth trajectories — a significant problem for a robot that must also maintain stable LiDAR scans for SLAM.
DDPG operates in continuous action space and is the direct predecessor of TD3. It uses a single critic, which makes it prone to overestimation bias. The approximate policy is updated using the gradient of an overestimated value function, resulting in a policy that can exploit estimation errors rather than learn the true objective.
TD3 [3] addresses all three failure modes of DDPG through:
- Clipped Double Q-learning — two independent critics, target computed as
min(Q₁', Q₂'), preventing runaway overestimation. - Delayed policy updates — the actor is updated only every second critic step, giving the critic time to stabilize before the policy gradient is applied.
- Target policy smoothing — clipped Gaussian noise is added to target actions when computing critic targets, acting as a regularizer and preventing the policy from exploiting narrow peaks in the value landscape.
For the two-stage curriculum in this system, TD3's stability under noisy early-training value estimates is critical. The policy does not chase a poorly-calibrated critic signal during Stage 1, and the replay buffer allows efficient reuse of samples from both stages during Stage 2 fine-tuning.
The TD3 policy uses 2.0 rad/s max angular velocity — twice what Nav2 allows. When the robot spins fast, the LiDAR completes a 360° sweep while the heading is still changing. Cartographer receives a scan acquired across multiple orientations and treats it as a single-pose snapshot. Scan-to-submap matching fails, submaps misalign, and the global map drifts. Once the map drifts, DualMap's 3D object positions (which are transformed using /odom_map from Cartographer) also drift — the entire semantic pipeline breaks.
The main launch (gz_vln_exploration.launch.py) therefore uses Nav2 [2] with a 1.0 rad/s cap, which keeps Cartographer stable.
The fix — retrain with angular velocity capped at ≤1.0 rad/s, or add LiDAR motion distortion correction before feeding scans to Cartographer. Not completed due to time constraints.
Nav2 with A* + DWB replans on a costmap. If an obstacle moves after the plan is computed, Nav2 must detect the change in the costmap, trigger a replan, and issue new velocity commands — that loop has latency. TD3 reads the current LiDAR state and outputs velocity directly at 10 Hz with no planning step. A person walking across the robot's path is handled immediately rather than after a replan cycle.
Two stages, both in Gazebo:
- Stage 1 — static obstacles: empty room with fixed walls and boxes. Robot learns basic goal-reaching and collision avoidance.
- Stage 2 — dynamic obstacles: same room with moving obstacles added. Robot learns to react to objects crossing its path.
Final weights: actor_stage2_episode*.pt (44→512→512→2 MLP, Tanh output). Loaded by rl_nav_node.py via the model_path ROS parameter.
-
ROBOTIS. TurtleBot3. https://github.com/ROBOTIS-GIT/turtlebot3, 2023.
-
S. Macenski, F. Martín, R. White, J. G. Clavero. The Marathon 2: A Navigation System. IEEE/RSJ IROS, 2020.
-
S. Fujimoto, H. van Hoof, D. Meger. Addressing Function Approximation Error in Actor-Critic Methods. ICML, pp. 1582–1591, 2018.
-
J. Jiang, Y. Zhu, Z. Wu, J. Song. DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes. arXiv:2506.01950, 2025.
