One problem with LLMs is that, having been trained on text and text alone, they have a fairly poor representation of the world, and as a result they do badly at spatial reasoning in language.
We know this because of benchmark datasets such as StepGame, which (loosely speaking) describe scenes and then pose the models questions about where things are in that space.
So what I’m wondering is, as models are becoming more multimodal, does training a model on additional media besides just text help it make sense of the linguistic spatial concepts that we take for granted?
I’ll explore different types of models, such as Sentence-BERT, which is text-only, and CLIP, a model trained on image-caption pairs.
The goal isn't just to see which model performs better on spatial reasoning; it's to look inside the models' embeddings and figure out which spatial concepts benefit from the additional training. For this I'm going to use probing classifiers, which are a way of looking "under the hood," so to speak.
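To make the probing idea concrete, here is a minimal sketch: a simple classifier (logistic regression) is trained to predict a property, such as the spatial relation expressed in a caption, from frozen embeddings. The data below is synthetic stand-in data, not real embeddings; high probe accuracy would suggest the property is linearly decodable from the embedding space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 768-dim "embeddings" and binary relation labels
# (e.g. 0 = "left of", 1 = "right of"). Real experiments would use
# frozen SBERT / CLIP embeddings of the dataset captions instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 768))
y = rng.integers(0, 2, size=400)
X[:, 0] += 2.0 * (y - 0.5)  # plant a weak signal so the probe has something to find

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = probe.score(X_te, y_te)
print(f"probe accuracy: {accuracy:.2f}")
```

The probe is deliberately kept simple: if even a linear model can recover the relation from the embeddings, the information is plausibly encoded there rather than learned by the probe itself.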
N.B. We are encouraged to use AI to code in this class, so I'm disclosing that here. The research design and insights are all mine, based on the ML and NLP classes I'm taking currently.
Investigating whether visually grounded embeddings (CLIP) encode spatial relations better than purely distributional embeddings (BERT/SBERT) using probing classifiers.
Does visual contrastive training (CLIP) produce text embeddings that encode spatial relations better than purely distributional training (BERT/SBERT)? And which specific spatial relation types benefit — or remain resistant?
```
pip install -r requirements.txt
```

Development runs locally in VS Code. GPU-bound embedding extraction runs on Google Colab.
```
spatial_probing/
├── CLAUDE.md            # detailed implementation guide (not committed)
├── README.md            # this file
├── requirements.txt     # dependencies
├── .env                 # gitignored (local config)
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_embedding_extraction.ipynb
│   ├── 03_probing_experiments.ipynb
│   ├── 04_rsa_analysis.ipynb
│   └── 05_visualization.ipynb
│
├── src/
│   ├── __init__.py
│   ├── datasets.py      # dataset loading + preprocessing
│   ├── embedders.py     # model loading + embedding extraction
│   ├── probing.py       # probe training + evaluation
│   └── analysis.py      # RSA, ranking tasks, visualization helpers
│
└── results/
    ├── embeddings/      # cached .npy files (gitignored)
    ├── probes/          # saved probe models (gitignored)
    └── figures/         # output plots
```
- Sentence-BERT (`sentence-transformers/all-mpnet-base-v2`) — baseline distributional model
- CLIP Text (`openai/clip-vit-base-patch32` text encoder) — contrastive training, text-only
- CLIP Multimodal (text + image) — contrastive training, joint embedding
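A rough sketch of what the text embedders might look like (function names are mine; the model identifiers are the ones listed above, and both functions download weights on first use, hence the lazy imports):

```python
def embed_sbert(sentences):
    """Encode sentences with the SBERT baseline. Returns shape (n, 768)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode(sentences)

def embed_clip_text(sentences):
    """Encode sentences with CLIP's text tower. Returns shape (n, 512)."""
    import torch
    from transformers import CLIPModel, CLIPProcessor
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=sentences, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats.numpy()
```

Note the two spaces have different dimensionality (768 vs 512), which is one reason the analysis below leans on probes and RSA rather than comparing embeddings directly.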
- VSR (Visual Spatial Reasoning) — primary dataset, 10k examples with images
- SpartQA — text-only multi-hop spatial inference
- StepGame — structured complexity scaling
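For the probing setup, each dataset needs to be reduced to (caption, relation) pairs. Here is a minimal sketch for VSR-style records, assuming each record carries a caption, a spatial relation string, and a binary truth label (the field names are illustrative; the actual schema should be checked against the dataset):

```python
from collections import Counter

def relation_dataset(records):
    """Turn VSR-style records into (caption, relation) probing pairs.

    Keeps only captions whose relation actually holds (label == 1),
    so the probe learns the relation expressed, not caption truthfulness.
    """
    pairs = [(r["caption"], r["relation"]) for r in records if r["label"] == 1]
    counts = Counter(rel for _, rel in pairs)  # class balance check
    return pairs, counts

# Tiny illustrative records, not real VSR data.
records = [
    {"caption": "The cat is above the table.", "relation": "above", "label": 1},
    {"caption": "The dog is above the sofa.", "relation": "above", "label": 0},
    {"caption": "The mug is left of the laptop.", "relation": "left of", "label": 1},
]
pairs, counts = relation_dataset(records)
print(len(pairs))  # 2 — the label-0 caption is dropped
```

The class-balance counter matters because some relations (e.g. "above") are far more frequent than others, and an unbalanced probe can look deceptively accurate.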
- `src/datasets.py` — VSR loader
- `src/embedders.py` — SBERT and CLIP text embedders
- `notebooks/01_data_exploration.ipynb`
- `notebooks/02_embedding_extraction.ipynb`
- `src/probing.py` — logistic regression probe
- `notebooks/03_probing_experiments.ipynb`
- Add CLIP multimodal embedder
- `src/analysis.py` — RSA
- `notebooks/04_rsa_analysis.ipynb`
- `notebooks/05_visualization.ipynb`
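The RSA step can be sketched in a few lines: build each embedding space's representational dissimilarity matrix (pairwise cosine distances over the same items) and correlate the two with Spearman's rho. The arrays below are random stand-ins for real SBERT/CLIP embeddings:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(emb_a, emb_b):
    """Representational Similarity Analysis between two embedding spaces.

    rho near 1 means the two spaces impose a similar geometry on the
    items, regardless of their dimensionality.
    """
    rdm_a = pdist(emb_a, metric="cosine")  # condensed upper triangle
    rdm_b = pdist(emb_b, metric="cosine")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return rho

rng = np.random.default_rng(0)
sbert_like = rng.normal(size=(20, 768))  # stand-in for SBERT embeddings
clip_like = rng.normal(size=(20, 512))   # stand-in for CLIP text embeddings

print(rsa_score(sbert_like, sbert_like))  # a space compared with itself -> 1.0
print(rsa_score(sbert_like, clip_like))   # near 0 for unrelated random spaces
```

Because RSA only compares distance rankings, it sidesteps the fact that SBERT and CLIP text embeddings live in spaces of different dimension.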
- Radford et al. (2021) — Learning Transferable Visual Models From Natural Language Supervision (CLIP)
- Liu et al. (2022) — Visual Spatial Reasoning (VSR)
- Belinkov (2022) — Probing Classifiers: Promises, Shortcomings, and Advances (survey)
Nora Gully
University of Colorado Boulder · 4622 Machine Learning, Spring 2026