Video-based Fall Detection using Multimodal Large Language Models

License: MIT Python 3.11+ Ruff vLLM pre-commit

Project Overview

This project provides the code for a master's thesis on video-based fall detection using Multimodal Large Language Models (MLLMs), specifically the detection of human falls and the subsequent fallen state. We also evaluate MLLMs on general human activity classes such as walking or standing to assess them on Human Activity Recognition (HAR).

The main experiments we conduct are:

  • Zero-shot: No exemplars are given, just the task instruction
  • Few-shot: Video exemplars with ground truth labels are supplied for In-Context Learning (ICL), selected randomly or via similarity-based retrieval
  • Chain-of-Thought (CoT): Zero-Shot CoT where the model produces its own reasoning trace

Quick Start

Requirements:

  1. Set up the environment with conda/uv
  2. Set the required and recommended environment variables

The main entrypoint is scripts/vllm_inference.py. Experiments are selected with the experiment override, e.g., experiment=zeroshot (the default is debug).

To run zero-shot experiments with InternVL3.5-8B, execute

python scripts/vllm_inference.py experiment=zeroshot model=internvl model.params=8B

To run few-shot experiments with random exemplar selection, execute

python scripts/vllm_inference.py experiment=fewshot model=qwenvl model.params=4B

To run few-shot experiments with similarity-based exemplar retrieval (requires precomputed embeddings, see below), execute

python scripts/vllm_inference.py experiment=fewshot_similarity model=qwenvl model.params=8B

To run CoT experiments with the default model, execute

python scripts/vllm_inference.py experiment=zeroshot_cot
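
When several runs are launched from a script, the commands above can be assembled programmatically. The following is a minimal sketch, not part of the repository: inference_command is a hypothetical helper, and it only formats the Hydra-style key=value overrides already shown in this section.

```python
import shlex


def inference_command(experiment="debug", overrides=None):
    """Build the shell command for one inference run (hypothetical helper)."""
    args = ["python", "scripts/vllm_inference.py", f"experiment={experiment}"]
    # Each override is passed as key=value, as in the commands above.
    args += [f"{key}={value}" for key, value in (overrides or {}).items()]
    return shlex.join(args)


cmd = inference_command("zeroshot", {"model": "internvl", "model.params": "8B"})
print(cmd)
# prints: python scripts/vllm_inference.py experiment=zeroshot model=internvl model.params=8B
```

The resulting string can be passed to a job scheduler or to subprocess.run(shlex.split(cmd)).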

Predictions and evaluation artifacts are saved under the configured output_dir, for example output_dir/predictions/<wandb-project>/ and output_dir/evaluation_results/<wandb-project>/.

Video Tensor Caching

Preprocessing videos (PyAV decode + resize/crop) is deterministic and can be cached to avoid repeating work across runs. Two independent cache layers are available:

Disk cache — preprocessed tensors saved as .pt files, persistent across runs.

# Pre-build the cache (must run before inference with cache_read_only=true)
python scripts/build_tensor_cache.py experiment=zeroshot data.cache_dir=outputs/tensor_cache

# Run inference using the cache (reads only, never writes)
python scripts/vllm_inference.py experiment=zeroshot data.cache_dir=outputs/tensor_cache

Each dataset × split × mode combination is stored in an isolated namespace, so changing num_frames, model_fps, or data.size automatically uses a new namespace and ignores stale entries. cache_read_only=true (the default) ensures inference never accidentally populates the cache; only build_tensor_cache.py writes.
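
To illustrate why changing a preprocessing setting cannot pick up stale entries, here is a minimal sketch of one way such a namespace could be derived. It is an assumption for illustration, not the repository's implementation: cache_namespace, the directory layout, the hashing scheme, and the example argument values are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cache_namespace(cache_dir, dataset, split, mode, **preprocessing):
    """Return a directory unique to one preprocessing configuration (hypothetical scheme)."""
    # Sort keys so identical settings always produce the same digest.
    payload = json.dumps(preprocessing, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return Path(cache_dir) / dataset / split / mode / digest


small = cache_namespace("outputs/tensor_cache", "oops", "cs", "test",
                        num_frames=16, model_fps=8, size=224)
large = cache_namespace("outputs/tensor_cache", "oops", "cs", "test",
                        num_frames=16, model_fps=8, size=448)
print(small != large)  # a different size maps to a different namespace
```

Because the digest depends on every preprocessing setting, entries written under an old configuration are never read under a new one.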

In-memory cache — lazy dict populated on first access, useful for the few-shot exemplar corpus where the same train videos are loaded repeatedly across batches.

python scripts/vllm_inference.py experiment=fewshot data.cache_in_memory=true

Only the exemplar corpus gets in-memory caching (not the test dataloader, which accesses each video once and uses forked worker processes anyway). Cache hit/miss stats are logged every 500 accesses and as a final summary after inference.

Both layers can be combined: the exemplar corpus will hit memory first, then fall back to disk, then decode from video.
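
The lookup order described above (memory, then disk, then decode) can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's code: LayeredCache is a hypothetical class, a plain dict stands in for the .pt files on disk, and the decode callable stands in for PyAV decoding.

```python
class LayeredCache:
    """Memory-first cache that falls back to a read-only disk layer, then to decoding."""

    def __init__(self, disk, decode, log_every=500):
        self.memory = {}
        self.disk = disk          # any mapping; stands in for the .pt files
        self.decode = decode      # called only when both cache layers miss
        self.log_every = log_every
        self.stats = {"memory": 0, "disk": 0, "decode": 0}

    def get(self, key):
        if key in self.memory:
            source, value = "memory", self.memory[key]
        elif key in self.disk:
            source, value = "disk", self.disk[key]
        else:
            source, value = "decode", self.decode(key)
        # Populate memory lazily; the disk layer is never written (read-only).
        self.memory[key] = value
        self.stats[source] += 1
        total = sum(self.stats.values())
        if total % self.log_every == 0:
            print(f"cache stats after {total} accesses: {self.stats}")
        return value


cache = LayeredCache(disk={"a.mp4": "tensor-a"}, decode=lambda key: f"decoded-{key}")
cache.get("a.mp4")  # served from disk, then kept in memory
cache.get("a.mp4")  # served from memory
cache.get("b.mp4")  # in neither layer, so it is decoded
print(cache.stats)
```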

Computing Embeddings

Similarity-based few-shot requires precomputed embeddings. To compute them:

python scripts/vllm_inference.py experiment=embed

This uses the Qwen3-VL-Embedding model and saves embeddings to outputs/embeddings/.
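
Once embeddings exist, exemplar retrieval reduces to a nearest-neighbour search over the training corpus. The sketch below assumes cosine similarity and a plain dict of embeddings; both are assumptions made for illustration, as are the function names and the toy vectors. The repository's retrieval logic may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_exemplars(query, corpus, k=2):
    """Return the ids of the k corpus embeddings most similar to the query."""
    ranked = sorted(corpus, key=lambda vid: cosine(query, corpus[vid]), reverse=True)
    return ranked[:k]


# Toy 3-dimensional embeddings; real embeddings are much larger.
corpus = {
    "train_fall": [0.9, 0.1, 0.0],
    "train_walk": [0.0, 1.0, 0.0],
    "train_lying": [0.7, 0.0, 0.7],
}
top = retrieve_exemplars([1.0, 0.0, 0.0], corpus, k=2)
print(top)  # ['train_fall', 'train_lying']
```

The retrieved ids would then select which labelled training videos are placed in the prompt as in-context exemplars.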

Fine-tuning

Supervised fine-tuning of Qwen3-VL with LoRA via TRL SFTTrainer, driven by Hydra.

python scripts/train_sft.py                     # default: training=quick
python scripts/train_sft.py training=full       # full run preset
python scripts/train_sft.py training=smoke      # short wiring check

Common overrides:

python scripts/train_sft.py model.params=4B     # different model
python scripts/train_sft.py wandb.mode=offline  # disable W&B sync
python scripts/train_sft.py training.max_steps=20
python scripts/train_sft.py training.attn_implementation=null  # disable flash attention 2

Pair training=full with the dataset group you want. Single-source splits live in config/dataset/omnifall/video/; multi-source mixes live in config/dataset/combined/video/:

python scripts/train_sft.py training=full dataset=omnifall/video/oops
python scripts/train_sft.py training=full dataset=omnifall/video/staged-cs
python scripts/train_sft.py training=full dataset=omnifall/video/staged-cv
python scripts/train_sft.py training=full dataset=omnifall/video/staged-oops
python scripts/train_sft.py training=full dataset=omnifall/video/all
python scripts/train_sft.py training=full dataset=combined/video/wanfall-rand-staged-cs-oops

By default dataset_val mirrors dataset (see config/training_config.yaml). To override the val set independently:

python scripts/train_sft.py training=full \
    dataset=omnifall/video/staged-cs \
    dataset@dataset_val=omnifall/video/cmdfall

For staged-only training, evaluating on cmdfall (the representative staged benchmark) avoids leakage; mixed-training runs use the matching *-cmdfall eval groups (omnifall/video/oops-cmdfall, combined/video/wanfall-oops-cmdfall).

Outputs land under outputs/training/<run_name>/, with the final adapter at outputs/training/<run_name>/adapter. Load it at inference time via the lora config group in inference_config.yaml:

python scripts/vllm_inference.py \
    model.params=8B \
    lora.path=outputs/training/<run_name>/adapter \
    lora.max_rank=8

LoRA-rank ablation runner

scripts/ablations/run_sft_ablations.py sweeps LoRA r (alpha = 2·r) across placements (attn, mlp, both) and dataset groups:

# dry run: print every accelerate launch command
python scripts/ablations/run_sft_ablations.py --dry-run

# default sweep (r in {4,8,16,32}, placement=both, dataset=oops)
python scripts/ablations/run_sft_ablations.py

# data-mix ablation
python scripts/ablations/run_sft_ablations.py \
    --rank 16 --dataset staged staged-oops staged-oops-wanfall
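
The sweep described above is a Cartesian product over ranks, placements, and dataset groups, with alpha tied to the rank. A minimal sketch of that grid follows; ablation_grid and the dictionary keys are hypothetical and do not reflect how run_sft_ablations.py is written internally.

```python
from itertools import product


def ablation_grid(ranks=(4, 8, 16, 32), placements=("both",), datasets=("oops",)):
    """Enumerate one LoRA configuration per (rank, placement, dataset) combination."""
    return [
        # alpha follows the rule stated above: alpha = 2 * r
        {"r": r, "alpha": 2 * r, "placement": placement, "dataset": dataset}
        for r, placement, dataset in product(ranks, placements, datasets)
    ]


grid = ablation_grid()
for run in grid:
    print(run)
```

With the defaults this yields four runs; sweeping all three placements (attn, mlp, both) yields twelve.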

Multi-GPU training

accelerate launch --config_file config/accelerate/ddp_bf16.yaml \
    --num_processes 4 scripts/train_sft.py training=quick

# or:
torchrun --nproc_per_node=4 scripts/train_sft.py training=quick

config/accelerate/ddp_bf16.yaml is a single-node DDP + bf16 setup; --num_processes overrides the value in the file.

For larger models or longer sequences, run with DeepSpeed ZeRO-2 (optimizer + gradient sharding):

accelerate launch --config_file config/accelerate/deepspeed_zero2.yaml \
    --num_processes 4 scripts/train_sft.py training=full

# or pass the JSON directly without an accelerate config:
torchrun --nproc_per_node=4 scripts/train_sft.py training=full \
    training.deepspeed=config/deepspeed/zero2.json

Relevant configs:

  • config/training_config.yaml — root config; composes model, prompt, dataset, lora, training.
  • config/training/smoke.yaml, quick.yaml, full.yaml.
  • config/lora/train.yaml — PEFT LoRA hyperparameters.
  • config/accelerate/ddp_bf16.yaml, deepspeed_zero2.yaml.
  • config/deepspeed/zero2.json — DeepSpeed ZeRO-2 config consumed by the above.

Configuration options

Besides the experiment setting, the main configuration options are:

  1. vLLM configs in config/vllm (default: default; use debug for faster warm-up times)
  2. Sampling configs in config/sampling (e.g., greedy, qwen3_instruct)
  3. Model configs in config/model (default: qwenvl)
  4. Prompt configs in config/prompt (default: default) with text-based output and Role Prompt

Other settings include:

  1. Data processing options, e.g., data.size=224 or data.split=cv
  2. Hardware settings, notably
    • batch_size: specifies how many videos are loaded into memory at once. Reduce it if RAM-constrained
    • num_workers: number of worker processes for data loading
  3. Wandb logging config, notably
    • wandb.mode (online, offline or disabled)
    • wandb.project (also configured by experiment)

Debugging options

  • num_samples (int): constrain the number of samples used for inference
  • vllm.use_mock (bool): if True, skip the vLLM engine and produce random predictions, which is useful for debugging code paths that do not depend on vLLM
  • vllm=debug for faster warm-up times
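
A mock backend of the kind vllm.use_mock describes can be as simple as sampling a label per video. The sketch below is an assumption for illustration only: mock_predict and the LABELS list are hypothetical and are not the repository's label set or mock implementation.

```python
import random

# Hypothetical label set; the project's actual classes may differ.
LABELS = ["fall", "fallen", "walk", "stand"]


def mock_predict(num_videos, labels=LABELS, seed=0):
    """Return one random label per video without touching any model."""
    rng = random.Random(seed)  # seeded so debugging runs are repeatable
    return [rng.choice(labels) for _ in range(num_videos)]


preds = mock_predict(5)
print(preds)
```

Seeding the generator keeps the rest of the pipeline (metrics, logging, file output) deterministic while it is being debugged.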

Tech Stack

We use the vLLM inference engine, which is optimized for high-throughput, memory-efficient LLM inference and supports multimodal inputs. Hydra is used for configuration management (see above).

Create the environment

  1. Install Conda
  2. Run
make env
conda activate cu130_vllm20_py312
  3. Install additional dependencies using uv (installed inside the conda environment)
make install

This installs vLLM + flash-attn, the inference and fine-tuning Python deps (transformers, peft, trl, accelerate, datasets, ...), dev tools, and the package itself in editable mode.

At the time of writing, vLLM is compiled for cu130 by default. If you need a different version of CUDA, you have to install vLLM from source.

Environment variables

Required

OMNIFALL_ROOT=path/to/omnifall
VLLM_WORKER_MULTIPROC_METHOD=spawn

Recommended

These variables should be set before launching the vLLM inference script.

CUDA_VISIBLE_DEVICES=0 # or e.g., 0,1
VLLM_CONFIGURE_LOGGING=0
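
A launcher script can fail fast when a required variable is missing. The following is a minimal sketch; missing_env is a hypothetical helper and not part of the repository. It checks only the two variables listed as required above.

```python
import os

REQUIRED = ("OMNIFALL_ROOT", "VLLM_WORKER_MULTIPROC_METHOD")


def missing_env(env=None, required=REQUIRED):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example = {"OMNIFALL_ROOT": "path/to/omnifall"}
print(missing_env(example))  # ['VLLM_WORKER_MULTIPROC_METHOD']
```

Calling missing_env() with no arguments checks the current process environment.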
