Rui Zhou* · Xander Yap* · Jianwen Cao · Allison Lau · Boyang Sun · Marc Pollefeys
(* Equal Contribution)
Memory Over Maps is a reconstruction-free approach to 3D object localization and navigation using retrieval and VLM re-ranking over posed RGB-D streams.
Real-time, interactive, open-vocabulary scene understanding from posed RGB-D images alone — no 3D reconstruction or scene graph is required. Type any natural-language query — a rare object (audio speaker), a functional place (where can I sleep?, where can I cook?, where can I eat?), a material (made of metal), a physical property (something that emits light), an abstract concept (cozy, festive, cluttered), or a spatial relationship (sofa next to the TV, door in the bedroom) — and the corresponding regions are highlighted. The reconstructed mesh shown in the demo is purely for visualization.
Follow docs/demo.md to set up and play with the real-time demo yourself.
```bash
git clone https://github.com/RuiZhou-cn/memory-over-maps.git
cd memory-over-maps
conda create -n MoM python=3.9 -y && conda activate MoM
bash scripts/install.sh
```

See docs/data.md for download and setup instructions for all datasets (Goat-Core, HM3D, HM3D-OVON, MP3D, SUN RGB-D).
The full pipeline (VLM + spatial fusion + multi-goal + keyframing) runs by default. Run any command with --help for the full list of options.
```bash
python -m src.cli.eval_goatcore
python -m src.cli.eval_hm3d
python -m src.cli.eval_ovon
python -m src.cli.eval_mp3d
python -m src.cli.eval_sunrgbd
```

Out of GPU memory? Use a smaller VLM (e.g. `vlm.model: Qwen/Qwen2.5-VL-3B-Instruct`) and reduce `sam3.batch_size` in your config.
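For the low-memory case, a config override might look like the fragment below. The key paths `vlm.model` and `sam3.batch_size` are the ones named in the tip above; the surrounding nesting and the exact value for `batch_size` are assumptions, so check them against your benchmark config (e.g. configs/hm3d.yaml):

```yaml
# Hypothetical low-VRAM override; only the two key paths are taken
# from the tip above, the rest of the layout is an assumption.
vlm:
  model: Qwen/Qwen2.5-VL-3B-Instruct  # smaller VLM for limited GPU memory

sam3:
  batch_size: 2  # lower further if you still run out of memory
```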
SigLIP2 is the default; other registered backbones can be selected from YAML. Override `retrieval.model` in the config that applies to your run (e.g. `configs/demo.yaml`, `configs/hm3d.yaml`):
```yaml
retrieval:
  model: Qwen/Qwen3-VL-Embedding-2B   # or a friendly name: qwen3-vl-2b, clip-large, align, flava
  extractor_kwargs:                   # backbone-specific kwargs (optional)
    instruction: "Retrieve images relevant to query."
    batch_size: 1                     # Qwen3-VL is token-heavy; keep small on 24GB
    max_pixels: 147456                # caps visual tokens per image
```

Qwen3-VL-Embedding requires the optional install step [6/6] in scripts/install.sh (clones third_party/qwen3-vl-embedding). Adding a new backbone is one file + one @register_extractor(...) decorator under src/models/retrieval/.
Table I — Goat-Core Localization. SR@5 (%) across scenes and query types (Average, Object, Image, Language).
(Demo visualizations: Sequence 1 | Sequence 2)
```
src/
├── cli/              # Evaluation entry points (one per benchmark)
├── pipelines/        # Paper pipeline steps (retrieval → localization → navigation)
├── envs/             # Dataset-specific configs and loaders (HM3D, MP3D, Goat-Core, OVON)
├── evaluation/       # Metrics accumulators and evaluation helpers
├── models/
│   ├── vlm/          # Vision-language model (Qwen2.5-VL)
│   ├── retrieval/    # Pluggable feature extractors (SigLIP2 default; CLIP, ALIGN, FLAVA, Qwen3-VL-Embedding) + FAISS search
│   ├── navigation/   # DD-PPO PointNav policy + multi-goal agent
│   └── segmentation/ # SAM3 text-prompted segmentation
└── utils/            # Projection, geometry, data loading, keyframing, spatial fusion
demo/                 # Interactive 3D viewer (viser)
checkpoints/
└── navigation/
    └── pointnav_weights.pth  # DD-PPO PointNav policy weights
configs/              # YAML configs (per-benchmark eval + demo)
scripts/              # Install + data preparation
```
If you use this code in your research, please cite:
```bibtex
@misc{zhou2026memorymaps3dobject,
  title={Memory Over Maps: 3D Object Localization Without Reconstruction},
  author={Rui Zhou and Xander Yap and Jianwen Cao and Allison Lau and Boyang Sun and Marc Pollefeys},
  year={2026},
  eprint={2603.20530},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2603.20530},
}
```