RuiZhou-cn/memory-over-maps

Memory Over Maps:
3D Object Localization Without Reconstruction

Rui Zhou* · Xander Yap* · Jianwen Cao · Allison Lau · Boyang Sun · Marc Pollefeys

(* Equal Contribution)


Overview

Memory Over Maps is a reconstruction-free approach to 3D object localization and navigation using retrieval and VLM re-ranking over posed RGB-D streams.

Real-Time Interactive Demo

Real-time, interactive, open-vocabulary scene understanding from posed RGB-D images alone: no 3D reconstruction or scene graph is required. Type any natural-language query and the corresponding regions are highlighted. A query can name a rare object (audio speaker), a functional place (where can I sleep?, where can I cook?, where can I eat?), a material (made of metal), a physical property (something that emits light), an abstract concept (cozy, festive, cluttered), or a spatial relationship (sofa next to the TV, door in the bedroom). The reconstructed mesh shown in the demo is purely for visualization.

Follow docs/demo.md to set up and play with the real-time demo yourself.

Installation

Clone and Install

git clone https://github.com/RuiZhou-cn/memory-over-maps.git
cd memory-over-maps
conda create -n MoM python=3.9 -y && conda activate MoM
bash scripts/install.sh

Datasets

See docs/data.md for download and setup instructions for all datasets (Goat-Core, HM3D, HM3D-OVON, MP3D, SUN RGB-D).

Evaluation

The full pipeline (VLM + spatial fusion + multi-goal + keyframing) runs by default. Run any command with --help for the full list of options.

python -m src.cli.eval_goatcore
python -m src.cli.eval_hm3d
python -m src.cli.eval_ovon
python -m src.cli.eval_mp3d
python -m src.cli.eval_sunrgbd

Out of GPU memory? Use a smaller VLM (e.g. vlm.model: Qwen/Qwen2.5-VL-3B-Instruct) and reduce sam3.batch_size in your config.
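
The dotted names above (vlm.model, sam3.batch_size) read as paths into a nested YAML config. As a minimal sketch of what such an override amounts to, the snippet below applies dotted-key overrides to a plain nested dict. The helper `apply_overrides` and the default values in `base` are hypothetical placeholders for illustration, not part of the repository; in practice you would edit the YAML file under configs/ directly.

```python
from copy import deepcopy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted-key overrides applied.

    Hypothetical helper for illustration; the repository may load and
    merge its YAML configs differently.
    """
    result = deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            # Create intermediate sections if they are missing.
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


# Placeholder defaults (assumed, NOT the repository's real values).
base = {
    "vlm": {"model": "placeholder-default-vlm"},
    "sam3": {"batch_size": 8},
}

# Lower-memory settings suggested in the text above; batch size 2 is an
# arbitrary example of "reduce sam3.batch_size".
low_memory = apply_overrides(
    base,
    {"vlm.model": "Qwen/Qwen2.5-VL-3B-Instruct", "sam3.batch_size": 2},
)
```

The original `base` dict is left untouched, so the default and the low-memory variants can be compared side by side.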

Swapping the retrieval backbone

SigLIP2 is the default; other registered backbones can be selected from YAML. Override retrieval.model in the config that applies to your run (e.g. configs/demo.yaml, configs/hm3d.yaml):

retrieval:
  model: Qwen/Qwen3-VL-Embedding-2B        # or a friendly name: qwen3-vl-2b, clip-large, align, flava
  extractor_kwargs:                        # backbone-specific kwargs (optional)
    instruction: "Retrieve images relevant to query."
    batch_size: 1                          # Qwen3-VL is token-heavy; keep small on 24GB
    max_pixels: 147456                     # caps visual tokens per image

Qwen3-VL-Embedding requires the optional install step [6/6] in scripts/install.sh (clones third_party/qwen3-vl-embedding). Adding a new backbone takes one file and one @register_extractor(...) decorator under src/models/retrieval/.
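
For readers unfamiliar with decorator-based registries, here is a minimal sketch of the general pattern. Only the decorator name `register_extractor` comes from this README; the registry dict, the `build_extractor` factory, the `ToyExtractor` class, and its `encode_text` method are assumptions made for illustration. Check src/models/retrieval/ for the actual base class and signatures before writing a real backbone.

```python
from typing import Callable, Dict, List

# Hypothetical registry; the repository's own registry may differ in
# structure, names, and signatures.
_EXTRACTORS: Dict[str, type] = {}


def register_extractor(*names: str) -> Callable[[type], type]:
    """Class decorator that files an extractor class under one or more names."""
    def decorator(cls: type) -> type:
        for name in names:
            if name in _EXTRACTORS:
                raise ValueError(f"extractor {name!r} is already registered")
            _EXTRACTORS[name] = cls
        return cls
    return decorator


def build_extractor(name: str, **extractor_kwargs):
    """Look up a registered extractor by name and instantiate it."""
    try:
        cls = _EXTRACTORS[name]
    except KeyError:
        known = sorted(_EXTRACTORS)
        raise KeyError(f"unknown extractor {name!r}; known: {known}") from None
    return cls(**extractor_kwargs)


@register_extractor("toy-backbone", "example/toy-backbone")
class ToyExtractor:
    """Stand-in backbone: embeds text as [length, vowel count]."""

    def __init__(self, batch_size: int = 1):
        self.batch_size = batch_size

    def encode_text(self, texts: List[str]) -> List[List[float]]:
        return [
            [float(len(t)), float(sum(c in "aeiou" for c in t))]
            for t in texts
        ]


# A config value such as retrieval.model would select the class by name,
# and extractor_kwargs would be forwarded to its constructor.
extractor = build_extractor("toy-backbone", batch_size=4)
```

Registering a class under both a full identifier and a short alias mirrors the "friendly name" idea in the YAML comment above, though how the repository maps aliases is not shown here.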

Results

Table I — Goat-Core Localization. SR@5 (%) across scenes and query types (Average, Object, Image, Language).

Goat-Core Results

Table II — Object Goal Navigation. SR and SPL (%) on HM3D, MP3D, and HM3D-OVON.

ObjectNav Results
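
For reference, SR is the fraction of successful episodes, and SPL (Success weighted by Path Length) is commonly defined as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest-path length to the goal, and p_i the length of the path the agent actually took. The sketch below computes both under that common definition. The `Episode` container and `sr_and_spl` function are hypothetical names, the episode values are made up, and the repository's accumulators in src/evaluation/ may be organized differently.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Episode:
    success: bool          # did the agent stop at the goal?
    shortest_path: float   # shortest start-to-goal distance l_i
    agent_path: float      # distance the agent actually traveled p_i


def sr_and_spl(episodes: Sequence[Episode]) -> Tuple[float, float]:
    """Return (SR, SPL) in percent over a non-empty list of episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    sr = sum(e.success for e in episodes) / n
    spl_sum = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero to SPL
        denom = max(e.agent_path, e.shortest_path)
        # Degenerate case: agent already at the goal with zero-length paths.
        spl_sum += e.shortest_path / denom if denom > 0 else 1.0
    return 100.0 * sr, 100.0 * spl_sum / n


# Made-up episodes for illustration only.
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=10.0),   # ratio 0.50
    Episode(success=True, shortest_path=4.0, agent_path=4.0),    # ratio 1.00
    Episode(success=False, shortest_path=3.0, agent_path=9.0),   # counts as 0
    Episode(success=True, shortest_path=6.0, agent_path=8.0),    # ratio 0.75
]
sr, spl = sr_and_spl(episodes)
```

Because each ratio is at most 1 and failures count as 0, SPL can never exceed SR; a large gap between the two indicates successful but inefficient paths.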

Table III — Text-to-Image Retrieval. AR@1 and AR@5 (%) on SUN RGB-D across sensor types.

SUN RGB-D Results

Real-World Experiments

Real-world experiment 1
Sequence 1
Real-world experiment 2
Sequence 2

Repository Structure

src/
├── cli/              # Evaluation entry points (one per benchmark)
├── pipelines/        # Paper pipeline steps (retrieval → localization → navigation)
├── envs/             # Dataset-specific configs and loaders (HM3D, MP3D, Goat-Core, OVON)
├── evaluation/       # Metrics accumulators and evaluation helpers
├── models/
│   ├── vlm/          # Vision-language model (Qwen2.5-VL)
│   ├── retrieval/    # Pluggable feature extractors (SigLIP2 default; CLIP, ALIGN, FLAVA, Qwen3-VL-Embedding) + FAISS search
│   ├── navigation/   # DD-PPO PointNav policy + multi-goal agent
│   └── segmentation/ # SAM3 text-prompted segmentation
└── utils/            # Projection, geometry, data loading, keyframing, spatial fusion
demo/                 # Interactive 3D viewer (viser)
checkpoints/
└── navigation/
    └── pointnav_weights.pth   # DD-PPO PointNav policy weights
configs/              # YAML configs (per-benchmark eval + demo)
scripts/              # Install + data preparation

Citation

If you use this code in your research, please cite:

@misc{zhou2026memorymaps3dobject,
      title={Memory Over Maps: 3D Object Localization Without Reconstruction},
      author={Rui Zhou and Xander Yap and Jianwen Cao and Allison Lau and Boyang Sun and Marc Pollefeys},
      year={2026},
      eprint={2603.20530},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.20530},
}
