Skip to content

NEBULIS-Lab/E-RECAP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

100 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

E-RECAP: Embodied REplanning with Cost-Aware Pruning

NEBULIS Lab arXiv PyTorch Transformers DeepSpeed Accelerate

This project implements E-RECAP (Embodied REplanning with Cost-Aware Pruning), a system-level, drop-in method for accelerating replanning in embodied agents by cost-aware pruning of planner context. E-RECAP operates as a Planner optimization module that can be seamlessly integrated into embodied AI systems without modifying task definitions, environments, or control policies.

Overview

In embodied AI systems, agents frequently need to replan due to partial observability, dynamic environments, and execution uncertainties. When using LLM/VLM as high-level planners, each replanning cycle requires processing long contexts that accumulate over time, making replanning a major computational bottleneck—especially in multi-agent settings where context grows with the number of agents.

E-RECAP addresses this by:

  • Learning task-agnostic token importance from large-scale instruction-following data (Dolly-15k, Alpaca, Self-Instruct)
  • Cost-aware dynamic pruning of planner context during replanning, reducing computation while preserving decision quality
  • System-level integration that works with any Transformer-based planner without modifying perception or control modules

E-RECAP is evaluated in both single-agent and cooperative multi-agent settings, with embodied evaluation planned on Habitat-Lab (PointNav/ObjectNav tasks).

Project Structure

E-RECAP/
├── checkpoints/          # Model weights and checkpoints
│   ├── pruning_module.pt    # Stage 2 trained Token Pruner (required for inference)
│   ├── saliency.pt          # Stage 1 saliency baseline (optional)
│   └── <model-name>/        # Your local model directory (e.g., llama2-7b, mistral-7b, etc.)
│       ├── config.json          # Model configuration (required)
│       ├── model.safetensors    # Model weights (or model.bin, required)
│       ├── tokenizer.json       # Tokenizer configuration (required)
│       └── ...                  # Other model files (e.g., generation_config.json, etc.)
│
├── data/                 # Datasets
│   └── raw/                 # Raw data files (e.g., Dolly-15k)
│
├── results/              # Experimental results and reports
│   ├── fig/                 # Visualization figures
│   └── part1_sum.md         # Stage 1 summary report
│
├── scripts/              # Execution scripts
│   ├── run_stage1.sh        # Stage 1: Saliency computation
│   ├── run_stage2.sh        # Stage 2: Pruning module training
│   ├── run_inference.sh     # Single GPU inference
│   ├── run_inference_multigpu.sh  # Multi-GPU inference
│   ├── check_pruning_baselines.py # Sanity-check Random/Recency baselines
│   ├── tune_prune_layers.py  # Tune pruning-layer placement (paper heuristic)
│   ├── check_full_env.sh    # Environment check
│   └── install.sh           # Dependency installation
│
└── src/                  # Source code
    ├── stage1_saliency.py        # Stage 1: Gradient × hidden states
    ├── stage2_pruning.py         # Stage 2: Learnable Token Pruner
    ├── erecap_model.py            # Core model with pruning logic
    ├── inference_erecap.py        # Single GPU inference
    ├── inference_erecap_multigpu.py  # Multi-GPU inference
    ├── multigpu_test.py          # Multi-GPU memory profiling
    └── multi_agent/              # Cooperative multi-agent planning
        ├── cooperative_planner.py    # Main planner with E-RECAP integration
        ├── context_buffer.py         # Shared planning context buffer
        ├── structured_output.py      # Structured agent output parser
        ├── agent_config.py           # Agent configuration definitions
        ├── task_definitions.py       # Task step definitions
        ├── framework_wrapper.py      # Optional CrewAI/LangChain support
        └── framework_optional/       # Optional framework files (not in Git)
            └── agents_config.json    # Agent config for CrewAI (if used)

Quick Start

Requirements

  • Python 3.10+
  • CUDA 12.1+
  • ≥50GB disk space for model storage

Hardware Requirements

Note: The hardware requirements depend on the model you choose to use. The following are our test configurations, but you can run E-RECAP on any hardware that meets the minimum requirements for your selected model.

Our test setup:

  • 8× NVIDIA RTX 5880 Ada Generation (48GB VRAM each)
    • Single GPU mode: Uses one GPU
    • Multi-GPU mode: Uses all 8 GPUs

Recommended VRAM by model:

Model Params Rec VRAM Model Params Rec VRAM
LLaMA-27B / 13B~14 GB / ~26 GBLLaMA-3 / 3.18B / 70B~16 GB / ~140 GB
LLaMA-3.21B / 3B / 11B / 90B~2 GB / ~6 GB / ~22 GB / ~180 GBMistral7B8 GB
Qwen27B / 14B / 32B / 72B~14 GB / ~28 GB / ~64 GB / ~144 GBQwen2.57B / 14B12 GB / 16 GB
Qwen30.6B / 1.7B / 4B / 8B / 14B / 32B~1.2 GB / ~3.4 GB / ~8 GB / ~16 GB / ~28 GB / ~64 GBYi6B / 13B8 GB / 16 GB
DeepSeek-LLM7B / 67B~14 GB / ~134 GBGemma-29B~18 GB
Phi-33.8B / 7B~7.6 GB / ~14 GBChatGLM36B8 GB
Baichuan27B / 13B~14 GB / ~26 GBInternLM27B / 20B12 GB / 24 GB

Installation

Prerequisites:

  • Install CUDA 12.1+ (includes nvcc compiler) and NVIDIA GPU drivers
  • Verify CUDA installation: nvcc --version and nvidia-smi

Install Python packages:

pip install -r requirements.txt

Note: PyTorch will automatically use the installed CUDA version. For CUDA 12.x, install PyTorch with:

pip install torch --index-url https://download.pytorch.org/whl/cu121

File Organization

Required files and their locations:

  1. Model filescheckpoints/<model-name>/

    • Place your HuggingFace-compatible model here
    • Must include: config.json, model weights (.safetensors or .bin), tokenizer files
    • Example structure:
      checkpoints/
      └── your-model-name/
          ├── config.json
          ├── model.safetensors (or model-*.safetensors)
          ├── tokenizer.json
          └── ...
      
  2. Pruning modulecheckpoints/pruning_module.pt

    • Generated by Stage 2 training
    • Model-specific (tied to model's hidden_size)
    • Required for inference
  3. Saliency baselinecheckpoints/saliency.pt

    • Generated by Stage 1 (optional)
    • Used for training pruning module in Stage 2
  4. Training datadata/raw/dolly15k/ or dolly15k/

    • Primary training data:
    • Optional auxiliary data: Textualized embodied samples from ALFRED, TEACh, BabyAI, BEHAVIOR-1K, ProcTHOR
      • Used at lower frequency to refine replanning-aware saliency patterns
      • Not required for E-RECAP to work, but helps improve replanning sensitivity
    • Can use any HuggingFace-compatible dataset
  5. Resultsresults/

    • All benchmark results and logs are saved here

Usage

Model Setup

Note: This repository does not ship model weights or training checkpoints. You should provide your own HuggingFace-compatible model directory under checkpoints/<model-name>/. E-RECAP supports many transformer backbones; pruning modules and saliency checkpoints are generated locally by Stage 1/2.

Place your model files:

  1. Download or copy your model to checkpoints/<model-name>/ directory

    • The model directory should contain config.json, model weights (.safetensors or .bin), and tokenizer files
    • Example: checkpoints/llama2-7b/, checkpoints/mistral-7b/, etc.
  2. Configure model path

    • You can keep the default MODEL_PATH in code, or override it from CLI:
      python3 -u src/inference_erecap.py --mode profile --model_path checkpoints/<your-model-name>
    • If you want to use a different pruning checkpoint without editing code:
      python3 -u src/inference_erecap.py --mode generate --model_path checkpoints/<your-model-name> --pruning_ckpt checkpoints/pruning_module.pt

Pre-flight Check

Verify that model and checkpoints exist (run from project root):

# Check model (replace <model-name> with your actual model directory)
ls -lh checkpoints/<model-name>/config.json

# Check checkpoints
ls -lh checkpoints/pruning_module.pt checkpoints/saliency.pt

If model or checkpoints are missing:

  • Model: Download/copy your model to checkpoints/<model-name>/
  • Pruning module: Run Stage 2 to train (see below)
  • Saliency: Run Stage 1 to generate (optional)

Quick Verification

Test that all components are ready:

python3 -c "
import sys
sys.path.insert(0, 'src')
from inference_erecap import load_model_and_pruners, MODEL_PATH, PRUNING_CKPT
import os
print('✓ MODEL_PATH:', MODEL_PATH)
print('✓ MODEL_PATH exists:', os.path.exists(MODEL_PATH))
print('✓ PRUNING_CKPT exists:', os.path.exists(PRUNING_CKPT))
print('✓ All checks passed!')
"

Note: If MODEL_PATH doesn't exist, edit src/inference_erecap.py and set MODEL_PATH to your model directory path.

Stage 1: Saliency Computation (Optional)

Only needed if checkpoints/saliency.pt doesn't exist:

bash scripts/run_stage1.sh 1000

Note: This stage uses Dolly-15k dataset for training and the model specified in src/stage1_saliency.py. Make sure to set the correct model path there if using a different model.

Stage 2: Pruning Module Training (Required)

Only needed if checkpoints/pruning_module.pt doesn't exist:

bash scripts/run_stage2.sh 1e-4 2

Parameters:

  • First argument: Learning rate (default: 1e-4)
  • Second argument: Number of epochs (default: 2)

Note: This stage trains a model-specific pruning module using Dolly-15k dataset. The trained pruning_module.pt is tied to the model's hidden_size. If you change models, you may need to retrain the pruning module if the new model has a different hidden_size.

Inference

Single GPU Inference:

  1. Prefill-only benchmark (fast, ~5-10 minutes):

    bash scripts/run_inference.sh profile prefill
  2. End-to-end benchmark (includes decode, ~15-30 minutes):

    bash scripts/run_inference.sh profile end2end
  3. Text generation test (quick verification):

    bash scripts/run_inference.sh generate "Hello, E-RECAP!"

Multi-GPU Inference (for long sequences, 32K+ tokens):

# Multi-GPU profiling
bash scripts/run_inference_multigpu.sh profile

# Multi-GPU text generation
bash scripts/run_inference_multigpu.sh generate "Your prompt here"

Cooperative Multi-Agent Planning (with E-RECAP context pruning):

# Run E-RECAP version (default: 15-step iterative replanning task)
bash scripts/run_cooperative_replanning.sh --keep_ratio 0.7 --save_results

# Run 10 times for longer evaluation (ensures >5 minutes runtime)
bash scripts/run_cooperative_replanning.sh --keep_ratio 0.7 --num_runs 10 --save_results

# Run baseline (no pruning) for comparison
bash scripts/run_cooperative_replanning.sh --baseline --save_results

# Run baseline 10 times
bash scripts/run_cooperative_replanning.sh --baseline --num_runs 10 --save_results

# Compare results (for 10 runs)
python3 src/multi_agent/compare_baseline_erecap.py \
    --baseline_file results/cooperative_planning_iterative_replanning_baseline_10runs.json \
    --erecap_file results/cooperative_planning_iterative_replanning_0.7_10runs.json

Run single configuration directly:

cd src
python3 -u inference_erecap.py \
  --mode profile \
  --config keep07 \
  --benchmark_mode prefill \
  --lengths 1024 2048 4096

Results Location

  • Single GPU results: results/latency_results_keep*.json
  • Multi-GPU results: results/latency_erecap_multigpu.json
  • Baseline results: results/latency_baseline_keep*.json

Available Scripts

The scripts/ directory contains helper scripts for common tasks:

Core Scripts

  • run_inference.sh: Single-GPU inference and benchmarking

    • bash scripts/run_inference.sh profile prefill - Prefill-only benchmark
    • bash scripts/run_inference.sh profile end2end - End-to-end benchmark
    • bash scripts/run_inference.sh generate "prompt" - Text generation
  • run_inference_multigpu.sh: Multi-GPU inference for long sequences

    • bash scripts/run_inference_multigpu.sh profile - Multi-GPU profiling
    • bash scripts/run_inference_multigpu.sh generate "prompt" - Multi-GPU generation
  • run_stage1.sh: Generate saliency baseline (optional)

    • bash scripts/run_stage1.sh [num_samples] - Default: 1000 samples
    • Advanced (no code edits): python -u src/stage1_saliency.py --model_path <LOCAL_MODEL_DIR> --data_path dolly15k --prune_layers 4 7 10 13 16 19 22 25
  • run_stage2.sh: Train pruning module (required if missing)

    • bash scripts/run_stage2.sh [learning_rate] [epochs] - Default: 1e-4, 2 epochs
    • Advanced (no overwrites): python3 -u src/stage2_pruning.py --model_path <LOCAL_MODEL_DIR> --saliency_path checkpoints/saliency.pt --output_path checkpoints/pruning_module.pt --prune_layers 4 7 10 13 16 19 22 25

Utility Scripts

  • install.sh: Install Python dependencies and PyTorch

    • bash scripts/install.sh
  • check_full_env.sh: Comprehensive environment check

    • bash scripts/check_full_env.sh - Verifies GPU, CUDA, Python, dependencies
  • run_plot_latency.sh: Generate latency comparison plots

    • bash scripts/run_plot_latency.sh [output_dir] - Default: results/fig
  • run_multigpu_test.sh: Test multi-GPU memory usage

    • bash scripts/run_multigpu_test.sh - Memory profiling for long sequences

Evaluation Scripts (Optional)

  • run_longbench.sh: Run LongBench evaluation

    • bash scripts/run_longbench.sh [task] [type] [num_samples] - Default: hotpotqa, baseline, 30
  • run_longbench_setup.sh: Setup LongBench evaluation

    • bash scripts/run_longbench_setup.sh [task] [model] [pruning_module] [output]
  • run_ablation.sh: Run ablation study

    • bash scripts/run_ablation.sh - Generates ablation results
  • run_cooperative_replanning.sh: Cooperative multi-agent planning with E-RECAP

    • bash scripts/run_cooperative_replanning.sh --keep_ratio 0.7 --save_results - Run with default task
    • bash scripts/run_cooperative_replanning.sh --task_type embodied - Run embodied replanning scenario

Key Features

  • Cost-Aware Pruning: Remove redundant tokens during prefill to reduce computation (up to 71% token reduction, 2-40× speedup depending on sequence length and GPU configuration)
  • Layer-wise Pruning: Progressive pruning across Transformer layers (8 pruning points: layers 4, 7, 10, 13, 16, 19, 22, 25)
  • Multi-GPU Support: Automatic distributed inference for long sequences (tested up to 32K tokens, achieving 20.7× average speedup on 8× RTX 5880)
  • Learnable Pruning Module: Lightweight MLP (hidden_size → hidden_size/4 → 1) trained on instruction-following data
  • Cooperative Multi-Agent Planning: Sequential multi-agent replanning with E-RECAP context pruning (K=2-8 agents, see Multi-Agent Planning)
  • Quality Preservation: Maintains task success rate while significantly reducing computation (typically <2% quality degradation at keep_ratio=0.7)

Multi-Agent Planning

E-RECAP supports cooperative multi-agent planning where multiple agents operate sequentially, each receiving a shared planning context pruned by E-RECAP's cost-aware token pruning module. This setting captures multi-agent replanning characteristics—context growth, information aggregation, and iterative plan revision—while maintaining strict control over experimental variables.

Note: This is a planning-level multi-agent setting (not multi-robot physical control). Multiple planning agents contribute information sequentially to a shared context, which is pruned by E-RECAP before each agent invocation. This design systematically amplifies context growth to evaluate E-RECAP's scalability (K=2-8 agents).

Quick example (uses default task, no input required):

# Run with default task description
bash scripts/run_cooperative_replanning.sh --keep_ratio 0.7 --save_results

# Or Python (default task included)
python3 src/multi_agent/run_cooperative_test.py --keep_ratio 0.7 --save_results

Key features:

  • Shared Context Buffer: Accumulates task descriptions, plans, constraints, and agent contributions
  • E-RECAP Pruning: Context pruned before each agent invocation to control growth
  • Structured Output: Agent outputs in structured format (observations, conflicts, plan patches)
  • Framework Compatible: Optional CrewAI/LangChain support (see below)

Optional Framework Support (CrewAI/LangChain):

To enable CrewAI or LangChain integration, install the optional dependencies:

pip install crewai>=0.28.8 langchain>=0.1.17 langchain-community>=0.1.17

Then copy the agent configuration file to the framework_optional directory (if you have one):

# Create framework_optional directory if it doesn't exist
mkdir -p src/multi_agent/framework_optional

# Copy your agents_config.json file to the framework_optional directory
# The file should follow the format expected by CrewAI/LangChain

Note: Framework support is optional. The default implementation works without CrewAI/LangChain. When enabled, frameworks are used only for scheduling/role-assignment, while prompt construction, context buffering, and pruning remain under E-RECAP control. See src/multi_agent/framework_wrapper.py for implementation details.

Baseline comparison:

# Run baseline (no pruning) for comparison
bash scripts/run_cooperative_replanning.sh --baseline --save_results

# Compare baseline vs E-RECAP results
python3 src/multi_agent/compare_baseline_erecap.py \
    --baseline_file results/cooperative_planning_cooperative_baseline.json \
    --erecap_file results/cooperative_planning_cooperative_0.7.json

For detailed implementation, see paper/part3_sum.md.

Embodied Evaluation (Planned)

E-RECAP is designed for embodied AI replanning scenarios. Planned evaluation includes:

  • Platform: Habitat-Lab (PointNav, ObjectNav tasks)
  • Scenes: Matterport3D (MP3D), Gibson, Replica
  • Setting: Cooperative multi-agent replanning (K=2-8 agents)
  • Metrics: Success Rate, SPL, token cost, latency, replanning frequency
  • Baselines: No-Pruning, Random-Pruning (token-count-matched), Heuristic-Pruning(recency) (token-count-matched)

The Habitat integration will evaluate E-RECAP's effectiveness in real embodied replanning scenarios where context naturally grows through plan-execute-observe-replan cycles. See paper/habitat_integration_design.md for detailed design.

Model Configuration

E-RECAP supports any HuggingFace-compatible Transformer model. To use a different model:

  1. Place model files in checkpoints/<your-model-name>/

  2. Point scripts to your model

    • Inference: pass --model_path (and optionally --pruning_ckpt) to src/inference_erecap.py
    • Stage 1: pass --model_path / --data_path / --prune_layers to src/stage1_saliency.py
    • Stage 2: pass --model_path / --data_path / --saliency_path / --output_path / --prune_layers to src/stage2_pruning.py
  3. Train pruning module (if switching to a model with different hidden_size):

    • The pruning module is model-specific and depends on the model's hidden_size
    • If your new model has the same hidden_size, you can reuse the existing pruning_module.pt
    • Otherwise, retrain by running Stage 2 with the new model

About

InfoComm 2026 Workshop

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published