Rui Zhou* · Xander Yap* · Jianwen Cao · Allison Lau · Boyang Sun · Marc Pollefeys
(* Equal Contribution)
Memory Over Maps is a reconstruction-free approach to 3D object localization and navigation using retrieval and VLM re-ranking over posed RGB-D streams.
Real-time, interactive, open-vocabulary scene understanding from posed RGB-D images alone — no 3D reconstruction or scene graph is required. Type any natural-language query — a rare object (audio speaker), a functional place (where can I sleep?, where can I cook?, where can I eat?), a material (made of metal), a physical property (something that emits light), an abstract concept (cozy, festive, cluttered), or a spatial relationship (sofa next to the TV, door in the bedroom) — and the corresponding regions are highlighted. The reconstructed mesh shown in the demo is purely for visualization.
Follow docs/demo.md to set up and play with the real-time demo yourself.
```bash
git clone https://github.com/RuiZhou-cn/memory-over-maps.git
cd memory-over-maps
conda create -n MoM python=3.9 -y && conda activate MoM
bash scripts/install.sh
```

See docs/data.md for download and setup instructions for all datasets (Goat-Core, HM3D, HM3D-OVON, MP3D, SUN RGB-D).
The full pipeline (VLM + spatial fusion + multi-goal + keyframing) runs by default. Run any command with --help for the full list of options.
```bash
python -m src.cli.eval_goatcore
python -m src.cli.eval_hm3d
python -m src.cli.eval_ovon
python -m src.cli.eval_mp3d
python -m src.cli.eval_sunrgbd
```

Out of GPU memory? Use a smaller VLM (e.g. `vlm.model: Qwen/Qwen2.5-VL-3B-Instruct`) and reduce `sam3.batch_size` in your config.
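For the low-memory case, a config override might look like the fragment below. The key paths `vlm.model` and `sam3.batch_size` are the ones named in the tip above; the surrounding nesting and the exact value for `batch_size` are assumptions, so check them against your benchmark config (e.g. configs/hm3d.yaml):

```yaml
# Hypothetical low-VRAM override; only the two key paths are taken
# from the tip above, the rest of the layout is an assumption.
vlm:
  model: Qwen/Qwen2.5-VL-3B-Instruct  # smaller VLM for limited GPU memory

sam3:
  batch_size: 2  # lower further if you still run out of memory
```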
SigLIP2 is the default; other registered backbones can be selected from YAML. Override `retrieval.model` in the config that applies to your run (e.g. `configs/demo.yaml`, `configs/hm3d.yaml`):
```yaml
retrieval:
  model: Qwen/Qwen3-VL-Embedding-2B   # or a friendly name: qwen3-vl-2b, clip-large, align, flava
  extractor_kwargs:                   # backbone-specific kwargs (optional)
    instruction: "Retrieve images relevant to query."
    batch_size: 1                     # Qwen3-VL is token-heavy; keep small on 24GB
    max_pixels: 147456                # caps visual tokens per image
```

Qwen3-VL-Embedding requires the optional install step [6/6] in scripts/install.sh (clones third_party/qwen3-vl-embedding). Adding a new backbone is one file + one @register_extractor(...) decorator under src/models/retrieval/.
Table I — Goat-Core Localization. SR@5 (%) across scenes and query types (Average, Object, Image, Language).
(Demo visualizations: Sequence 1 | Sequence 2)
```
src/
├── cli/              # Evaluation entry points (one per benchmark)
├── pipelines/        # Paper pipeline steps (retrieval → localization → navigation)
├── envs/             # Dataset-specific configs and loaders (HM3D, MP3D, Goat-Core, OVON)
├── evaluation/       # Metrics accumulators and evaluation helpers
├── models/
│   ├── vlm/          # Vision-language model (Qwen2.5-VL)
│   ├── retrieval/    # Pluggable feature extractors (SigLIP2 default; CLIP, ALIGN, FLAVA, Qwen3-VL-Embedding) + FAISS search
│   ├── navigation/   # DD-PPO PointNav policy + multi-goal agent
│   └── segmentation/ # SAM3 text-prompted segmentation
└── utils/            # Projection, geometry, data loading, keyframing, spatial fusion
demo/                 # Interactive 3D viewer (viser)
checkpoints/
└── navigation/
    └── pointnav_weights.pth  # DD-PPO PointNav policy weights
configs/              # YAML configs (per-benchmark eval + demo)
scripts/              # Install + data preparation
```
If you use this code in your research, please cite:
```bibtex
@misc{zhou2026memorymaps3dobject,
  title={Memory Over Maps: 3D Object Localization Without Reconstruction},
  author={Rui Zhou and Xander Yap and Jianwen Cao and Allison Lau and Boyang Sun and Marc Pollefeys},
  year={2026},
  eprint={2603.20530},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2603.20530},
}
```