One problem with LLMs is that, having been trained on text and text alone, they have a fairly poor representation of the world, and as a result they do badly at spatial reasoning in language.
We know this because of benchmark datasets such as StepGame, which (loosely speaking) describe scenes and then pose the models questions about where things are in that space.
So what I’m wondering is, as models are becoming more multimodal, does training a model on additional media besides just text help it make sense of the linguistic spatial concepts that we take for granted?
I’ll explore different types of models, such as Sentence-BERT, which is text-only, and CLIP, a model trained on image-caption pairs.
The goal isn't just to see which model performs better on spatial reasoning; it's to look inside the models' embeddings and figure out which spatial concepts benefit from the additional training. For this I'm going to use probing classifiers, which are a way of looking "under the hood," so to speak.
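To make the probing idea concrete, here is a minimal sketch: a simple classifier (logistic regression) is trained to predict a property, such as the spatial relation expressed in a caption, from frozen embeddings. The data below is synthetic stand-in data, not real embeddings; high probe accuracy would suggest the property is linearly decodable from the embedding space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 768-dim "embeddings" and binary relation labels
# (e.g. 0 = "left of", 1 = "right of"). Real experiments would use
# frozen SBERT / CLIP embeddings of the dataset captions instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 768))
y = rng.integers(0, 2, size=400)
X[:, 0] += 2.0 * (y - 0.5)  # plant a weak signal so the probe has something to find

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = probe.score(X_te, y_te)
print(f"probe accuracy: {accuracy:.2f}")
```

The probe is deliberately kept simple: if even a linear model can recover the relation from the embeddings, the information is plausibly encoded there rather than learned by the probe itself.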
N.B. We are encouraged to use AI to code in this class, so I'm disclosing that here. The research design and insights are all mine, based on the ML and NLP classes I'm taking currently.
Investigating whether visually grounded embeddings (CLIP) encode spatial relations better than purely distributional embeddings (BERT/SBERT) using probing classifiers.
Does visual contrastive training (CLIP) produce text embeddings that encode spatial relations better than purely distributional training (BERT/SBERT)? And which specific spatial relation types benefit — or remain resistant?
```
pip install -r requirements.txt
```

Development runs locally in VS Code. GPU-bound embedding extraction runs on Google Colab.
```
spatial_probing/
├── CLAUDE.md            # detailed implementation guide (not committed)
├── README.md            # this file
├── requirements.txt     # dependencies
├── .env                 # gitignored (local config)
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_embedding_extraction.ipynb
│   ├── 03_probing_experiments.ipynb
│   ├── 04_rsa_analysis.ipynb
│   └── 05_visualization.ipynb
│
├── src/
│   ├── __init__.py
│   ├── datasets.py      # dataset loading + preprocessing
│   ├── embedders.py     # model loading + embedding extraction
│   ├── probing.py       # probe training + evaluation
│   └── analysis.py      # RSA, ranking tasks, visualization helpers
│
└── results/
    ├── embeddings/      # cached .npy files (gitignored)
    ├── probes/          # saved probe models (gitignored)
    └── figures/         # output plots
```
- Sentence-BERT (`sentence-transformers/all-mpnet-base-v2`) — baseline distributional model
- CLIP Text (`openai/clip-vit-base-patch32` text encoder) — contrastive training, text-only
- CLIP Multimodal (text + image) — contrastive training, joint embedding
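A rough sketch of what the text embedders might look like (function names are mine; the model identifiers are the ones listed above, and both functions download weights on first use, hence the lazy imports):

```python
def embed_sbert(sentences):
    """Encode sentences with the SBERT baseline. Returns shape (n, 768)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode(sentences)

def embed_clip_text(sentences):
    """Encode sentences with CLIP's text tower. Returns shape (n, 512)."""
    import torch
    from transformers import CLIPModel, CLIPProcessor
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=sentences, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats.numpy()
```

Note the two spaces have different dimensionality (768 vs 512), which is one reason the analysis below leans on probes and RSA rather than comparing embeddings directly.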
- VSR (Visual Spatial Reasoning) — primary dataset, 10k examples with images
- SpartQA — text-only multi-hop spatial inference
- StepGame — structured complexity scaling
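For the probing setup, each dataset needs to be reduced to (caption, relation) pairs. Here is a minimal sketch for VSR-style records, assuming each record carries a caption, a spatial relation string, and a binary truth label (the field names are illustrative; the actual schema should be checked against the dataset):

```python
from collections import Counter

def relation_dataset(records):
    """Turn VSR-style records into (caption, relation) probing pairs.

    Keeps only captions whose relation actually holds (label == 1),
    so the probe learns the relation expressed, not caption truthfulness.
    """
    pairs = [(r["caption"], r["relation"]) for r in records if r["label"] == 1]
    counts = Counter(rel for _, rel in pairs)  # class balance check
    return pairs, counts

# Tiny illustrative records, not real VSR data.
records = [
    {"caption": "The cat is above the table.", "relation": "above", "label": 1},
    {"caption": "The dog is above the sofa.", "relation": "above", "label": 0},
    {"caption": "The mug is left of the laptop.", "relation": "left of", "label": 1},
]
pairs, counts = relation_dataset(records)
print(len(pairs))  # 2 — the label-0 caption is dropped
```

The class-balance counter matters because some relations (e.g. "above") are far more frequent than others, and an unbalanced probe can look deceptively accurate.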
- `src/datasets.py` — VSR loader
- `src/embedders.py` — SBERT and CLIP text embedders
- `notebooks/01_data_exploration.ipynb`
- `notebooks/02_embedding_extraction.ipynb`
- `src/probing.py` — logistic regression probe
- `notebooks/03_probing_experiments.ipynb`
- Add CLIP multimodal embedder
- `src/analysis.py` — RSA
- `notebooks/04_rsa_analysis.ipynb`
- `notebooks/05_visualization.ipynb`
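The RSA step can be sketched in a few lines: build each embedding space's representational dissimilarity matrix (pairwise cosine distances over the same items) and correlate the two with Spearman's rho. The arrays below are random stand-ins for real SBERT/CLIP embeddings:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(emb_a, emb_b):
    """Representational Similarity Analysis between two embedding spaces.

    rho near 1 means the two spaces impose a similar geometry on the
    items, regardless of their dimensionality.
    """
    rdm_a = pdist(emb_a, metric="cosine")  # condensed upper triangle
    rdm_b = pdist(emb_b, metric="cosine")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return rho

rng = np.random.default_rng(0)
sbert_like = rng.normal(size=(20, 768))  # stand-in for SBERT embeddings
clip_like = rng.normal(size=(20, 512))   # stand-in for CLIP text embeddings

print(rsa_score(sbert_like, sbert_like))  # a space compared with itself -> 1.0
print(rsa_score(sbert_like, clip_like))   # near 0 for unrelated random spaces
```

Because RSA only compares distance rankings, it sidesteps the fact that SBERT and CLIP text embeddings live in spaces of different dimension.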
- Radford et al. (2021) — Learning Transferable Visual Models From Natural Language Supervision (CLIP)
- Liu et al. (2022) — Visual Spatial Reasoning (VSR)
- Belinkov (2022) — Probing Classifiers: Promises, Shortcomings, and Advances (survey)
Nora Gully
University of Colorado Boulder · 4622 Machine Learning, Spring 2026