This project implements E-RECAP (Embodied REplanning with Cost-Aware Pruning), a system-level, drop-in method for accelerating replanning in embodied agents by cost-aware pruning of planner context. E-RECAP operates as a Planner optimization module that can be seamlessly integrated into embodied AI systems without modifying task definitions, environments, or control policies.
In embodied AI systems, agents frequently need to replan due to partial observability, dynamic environments, and execution uncertainties. When using LLM/VLM as high-level planners, each replanning cycle requires processing long contexts that accumulate over time, making replanning a major computational bottleneck—especially in multi-agent settings where context grows with the number of agents.
E-RECAP addresses this by:
- Learning task-agnostic token importance from large-scale instruction-following data (Dolly-15k, Alpaca, Self-Instruct)
- Cost-aware dynamic pruning of planner context during replanning, reducing computation while preserving decision quality
- System-level integration that works with any Transformer-based planner without modifying perception or control modules
E-RECAP is evaluated in both single-agent and cooperative multi-agent settings, with embodied evaluation planned on Habitat-Lab (PointNav/ObjectNav tasks).
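At its core, cost-aware pruning keeps only the highest-importance fraction of context tokens while preserving their order. A minimal, self-contained sketch (toy tokens and hand-picked scores stand in for the learned importance model):

```python
def prune_context(tokens, scores, keep_ratio=0.7):
    """Keep the top keep_ratio fraction of tokens by importance score,
    preserving their original order (a toy stand-in for E-RECAP's learned pruner)."""
    k = max(1, int(len(tokens) * keep_ratio))
    # pick the k highest-scoring positions, then restore document order
    keep = sorted(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    return [tokens[i] for i in keep]

tokens = ["goal:", "fetch", "the", "red", "mug", "from", "the", "kitchen"]
scores = [0.90, 0.80, 0.10, 0.70, 0.85, 0.20, 0.10, 0.75]
print(prune_context(tokens, scores, keep_ratio=0.5))
# → ['goal:', 'fetch', 'mug', 'kitchen']
```

Low-salience filler tokens are dropped while task-critical tokens survive, which is why a shorter planner context can preserve decision quality.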
Project structure:

```
E-RECAP/
├── checkpoints/                      # Model weights and checkpoints
│   ├── pruning_module.pt             # Stage 2 trained Token Pruner (required for inference)
│   ├── saliency.pt                   # Stage 1 saliency baseline (optional)
│   └── <model-name>/                 # Your local model directory (e.g., llama2-7b, mistral-7b, etc.)
│       ├── config.json               # Model configuration (required)
│       ├── model.safetensors         # Model weights (or model.bin, required)
│       ├── tokenizer.json            # Tokenizer configuration (required)
│       └── ...                       # Other model files (e.g., generation_config.json, etc.)
│
├── data/                             # Datasets
│   └── raw/                          # Raw data files (e.g., Dolly-15k)
│
├── results/                          # Experimental results and reports
│   ├── fig/                          # Visualization figures
│   └── part1_sum.md                  # Stage 1 summary report
│
├── scripts/                          # Execution scripts
│   ├── run_stage1.sh                 # Stage 1: Saliency computation
│   ├── run_stage2.sh                 # Stage 2: Pruning module training
│   ├── run_inference.sh              # Single GPU inference
│   ├── run_inference_multigpu.sh     # Multi-GPU inference
│   ├── check_pruning_baselines.py    # Sanity-check Random/Recency baselines
│   ├── tune_prune_layers.py          # Tune pruning-layer placement (paper heuristic)
│   ├── check_full_env.sh             # Environment check
│   └── install.sh                    # Dependency installation
│
└── src/                              # Source code
    ├── stage1_saliency.py            # Stage 1: Gradient × hidden states
    ├── stage2_pruning.py             # Stage 2: Learnable Token Pruner
    ├── erecap_model.py               # Core model with pruning logic
    ├── inference_erecap.py           # Single GPU inference
    ├── inference_erecap_multigpu.py  # Multi-GPU inference
    ├── multigpu_test.py              # Multi-GPU memory profiling
    └── multi_agent/                  # Cooperative multi-agent planning
        ├── cooperative_planner.py    # Main planner with E-RECAP integration
        ├── context_buffer.py         # Shared planning context buffer
        ├── structured_output.py      # Structured agent output parser
        ├── agent_config.py           # Agent configuration definitions
        ├── task_definitions.py       # Task step definitions
        ├── framework_wrapper.py      # Optional CrewAI/LangChain support
        └── framework_optional/       # Optional framework files (not in Git)
            └── agents_config.json    # Agent config for CrewAI (if used)
```
Requirements:
- Python 3.10+
- CUDA 12.1+
- ≥50 GB disk space for model storage
Note: The hardware requirements depend on the model you choose to use. The following are our test configurations, but you can run E-RECAP on any hardware that meets the minimum requirements for your selected model.
Our test setup:
- 8× NVIDIA RTX 5880 Ada Generation (48GB VRAM each)
- Single GPU mode: Uses one GPU
- Multi-GPU mode: Uses all 8 GPUs
Recommended VRAM by model:

| Model | Params | Recommended VRAM |
|---|---|---|
| LLaMA-2 | 7B / 13B | ~14 GB / ~26 GB |
| LLaMA-3 / 3.1 | 8B / 70B | ~16 GB / ~140 GB |
| LLaMA-3.2 | 1B / 3B / 11B / 90B | ~2 GB / ~6 GB / ~22 GB / ~180 GB |
| Mistral | 7B | ~8 GB |
| Qwen2 | 7B / 14B / 32B / 72B | ~14 GB / ~28 GB / ~64 GB / ~144 GB |
| Qwen2.5 | 7B / 14B | ~12 GB / ~16 GB |
| Qwen3 | 0.6B / 1.7B / 4B / 8B / 14B / 32B | ~1.2 GB / ~3.4 GB / ~8 GB / ~16 GB / ~28 GB / ~64 GB |
| Yi | 6B / 13B | ~8 GB / ~16 GB |
| DeepSeek-LLM | 7B / 67B | ~14 GB / ~134 GB |
| Gemma-2 | 9B | ~18 GB |
| Phi-3 | 3.8B / 7B | ~7.6 GB / ~14 GB |
| ChatGLM3 | 6B | ~8 GB |
| Baichuan2 | 7B / 13B | ~14 GB / ~26 GB |
| InternLM2 | 7B / 20B | ~12 GB / ~24 GB |
Prerequisites:
- Install CUDA 12.1+ (includes the nvcc compiler) and NVIDIA GPU drivers
- Verify the CUDA installation:

```bash
nvcc --version
nvidia-smi
```

Install Python packages:

```bash
pip install -r requirements.txt
```

Note: PyTorch will automatically use the installed CUDA version. For CUDA 12.x, install PyTorch with:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

Required files and their locations:
- Model files → `checkpoints/<model-name>/`
  - Place your HuggingFace-compatible model here
  - Must include: `config.json`, model weights (`.safetensors` or `.bin`), tokenizer files
  - Example structure:
    ```
    checkpoints/
    └── your-model-name/
        ├── config.json
        ├── model.safetensors (or model-*.safetensors)
        ├── tokenizer.json
        └── ...
    ```
- Pruning module → `checkpoints/pruning_module.pt`
  - Generated by Stage 2 training
  - Model-specific (tied to the model's `hidden_size`)
  - Required for inference
- Saliency baseline → `checkpoints/saliency.pt`
  - Generated by Stage 1 (optional)
  - Used for training the pruning module in Stage 2
- Training data → `data/raw/dolly15k/` or `dolly15k/`
  - Primary training data: Dolly-15k (HuggingFace), Alpaca (HuggingFace), processed Self-Instruct (HuggingFace)
    - Used for Stage 1 (saliency computation) and Stage 2 (pruning module training)
    - Learns task-agnostic token importance priors across diverse reasoning patterns
  - Optional auxiliary data: textualized embodied samples from ALFRED, TEACh, BabyAI, BEHAVIOR-1K, ProcTHOR
    - Used at lower frequency to refine replanning-aware saliency patterns
    - Not required for E-RECAP to work, but helps improve replanning sensitivity
  - Any HuggingFace-compatible dataset can be used
- Results → `results/`
  - All benchmark results and logs are saved here
Note: This repository does not ship model weights or training checkpoints. You should provide your own HuggingFace-compatible model directory under checkpoints/<model-name>/. E-RECAP supports many transformer backbones; pruning modules and saliency checkpoints are generated locally by Stage 1/2.
Place your model files:

- Download or copy your model to the `checkpoints/<model-name>/` directory
  - The model directory should contain `config.json`, model weights (`.safetensors` or `.bin`), and tokenizer files
  - Example: `checkpoints/llama2-7b/`, `checkpoints/mistral-7b/`, etc.

Configure the model path:

- You can keep the default `MODEL_PATH` in code, or override it from the CLI:
  ```bash
  python3 -u src/inference_erecap.py --mode profile --model_path checkpoints/<your-model-name>
  ```
- To use a different pruning checkpoint without editing code:
  ```bash
  python3 -u src/inference_erecap.py --mode generate --model_path checkpoints/<your-model-name> --pruning_ckpt checkpoints/pruning_module.pt
  ```
Verify that the model and checkpoints exist (run from the project root):

```bash
# Check model (replace <model-name> with your actual model directory)
ls -lh checkpoints/<model-name>/config.json

# Check checkpoints
ls -lh checkpoints/pruning_module.pt checkpoints/saliency.pt
```

If the model or checkpoints are missing:
- Model: download/copy your model to `checkpoints/<model-name>/`
- Pruning module: run Stage 2 to train it (see below)
- Saliency: run Stage 1 to generate it (optional)
Test that all components are ready:

```bash
python3 -c "
import sys
sys.path.insert(0, 'src')
from inference_erecap import load_model_and_pruners, MODEL_PATH, PRUNING_CKPT
import os
print('✓ MODEL_PATH:', MODEL_PATH)
print('✓ MODEL_PATH exists:', os.path.exists(MODEL_PATH))
print('✓ PRUNING_CKPT exists:', os.path.exists(PRUNING_CKPT))
print('✓ All checks passed!')
"
```

Note: If `MODEL_PATH` doesn't exist, edit `src/inference_erecap.py` and set `MODEL_PATH` to your model directory path.
Only needed if `checkpoints/saliency.pt` doesn't exist:

```bash
bash scripts/run_stage1.sh 1000
```

Note: This stage trains on the Dolly-15k dataset with the model specified in `src/stage1_saliency.py`. Set the correct model path there if you use a different model.
Only needed if `checkpoints/pruning_module.pt` doesn't exist:

```bash
bash scripts/run_stage2.sh 1e-4 2
```

Parameters:
- First argument: learning rate (default: 1e-4)
- Second argument: number of epochs (default: 2)

Note: This stage trains a model-specific pruning module on the Dolly-15k dataset. The trained `pruning_module.pt` is tied to the model's `hidden_size`; if you switch to a model with a different `hidden_size`, retrain the pruning module.
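Because the pruning module is tied to `hidden_size`, a quick way to decide whether an existing checkpoint can be reused is to compare the module's input width against the `hidden_size` in the new model's `config.json` (a sketch under stated assumptions: the helper name is ours, but `hidden_size` is a standard HuggingFace config field):

```python
import json
import os
import tempfile

def pruner_reusable(ckpt_in_features: int, model_dir: str) -> bool:
    """True if an existing pruning module (first-layer width = ckpt_in_features)
    matches the hidden_size of the model stored in model_dir."""
    with open(os.path.join(model_dir, "config.json")) as f:
        return json.load(f)["hidden_size"] == ckpt_in_features

# demo against a throwaway config (hidden_size=4096, as in 7B-class LLaMA models)
model_dir = tempfile.mkdtemp()
with open(os.path.join(model_dir, "config.json"), "w") as f:
    json.dump({"hidden_size": 4096}, f)
print(pruner_reusable(4096, model_dir))   # → True
print(pruner_reusable(5120, model_dir))   # → False: rerun Stage 2 for this model
```

If the check fails, rerun Stage 2 against the new model before attempting inference.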
Single GPU Inference:

- Prefill-only benchmark (fast, ~5-10 minutes):
  ```bash
  bash scripts/run_inference.sh profile prefill
  ```
- End-to-end benchmark (includes decode, ~15-30 minutes):
  ```bash
  bash scripts/run_inference.sh profile end2end
  ```
- Text generation test (quick verification):
  ```bash
  bash scripts/run_inference.sh generate "Hello, E-RECAP!"
  ```
Multi-GPU Inference (for long sequences, 32K+ tokens):

```bash
# Multi-GPU profiling
bash scripts/run_inference_multigpu.sh profile

# Multi-GPU text generation
bash scripts/run_inference_multigpu.sh generate "Your prompt here"
```

Cooperative Multi-Agent Planning (with E-RECAP context pruning):

```bash
# Run the E-RECAP version (default: 15-step iterative replanning task)
bash scripts/run_cooperative_replanning.sh --keep_ratio 0.7 --save_results

# Run 10 times for a longer evaluation (ensures >5 minutes runtime)
bash scripts/run_cooperative_replanning.sh --keep_ratio 0.7 --num_runs 10 --save_results

# Run the baseline (no pruning) for comparison
bash scripts/run_cooperative_replanning.sh --baseline --save_results

# Run the baseline 10 times
bash scripts/run_cooperative_replanning.sh --baseline --num_runs 10 --save_results

# Compare results (for 10 runs)
python3 src/multi_agent/compare_baseline_erecap.py \
    --baseline_file results/cooperative_planning_iterative_replanning_baseline_10runs.json \
    --erecap_file results/cooperative_planning_iterative_replanning_0.7_10runs.json
```

Run a single configuration directly:

```bash
cd src
python3 -u inference_erecap.py \
    --mode profile \
    --config keep07 \
    --benchmark_mode prefill \
    --lengths 1024 2048 4096
```

Result locations:
- Single GPU results: `results/latency_results_keep*.json`
- Multi-GPU results: `results/latency_erecap_multigpu.json`
- Baseline results: `results/latency_baseline_keep*.json`
The `scripts/` directory contains helper scripts for common tasks:

- `run_inference.sh`: Single-GPU inference and benchmarking
  - `bash scripts/run_inference.sh profile prefill` - prefill-only benchmark
  - `bash scripts/run_inference.sh profile end2end` - end-to-end benchmark
  - `bash scripts/run_inference.sh generate "prompt"` - text generation
- `run_inference_multigpu.sh`: Multi-GPU inference for long sequences
  - `bash scripts/run_inference_multigpu.sh profile` - multi-GPU profiling
  - `bash scripts/run_inference_multigpu.sh generate "prompt"` - multi-GPU generation
- `run_stage1.sh`: Generate the saliency baseline (optional)
  - `bash scripts/run_stage1.sh [num_samples]` - default: 1000 samples
  - Advanced (no code edits): `python -u src/stage1_saliency.py --model_path <LOCAL_MODEL_DIR> --data_path dolly15k --prune_layers 4 7 10 13 16 19 22 25`
- `run_stage2.sh`: Train the pruning module (required if missing)
  - `bash scripts/run_stage2.sh [learning_rate] [epochs]` - default: 1e-4, 2 epochs
  - Advanced (no overwrites): `python3 -u src/stage2_pruning.py --model_path <LOCAL_MODEL_DIR> --saliency_path checkpoints/saliency.pt --output_path checkpoints/pruning_module.pt --prune_layers 4 7 10 13 16 19 22 25`
- `install.sh`: Install Python dependencies and PyTorch
  - `bash scripts/install.sh`
- `check_full_env.sh`: Comprehensive environment check
  - `bash scripts/check_full_env.sh` - verifies GPU, CUDA, Python, dependencies
- `run_plot_latency.sh`: Generate latency comparison plots
  - `bash scripts/run_plot_latency.sh [output_dir]` - default: `results/fig`
- `run_multigpu_test.sh`: Test multi-GPU memory usage
  - `bash scripts/run_multigpu_test.sh` - memory profiling for long sequences
- `run_longbench.sh`: Run LongBench evaluation
  - `bash scripts/run_longbench.sh [task] [type] [num_samples]` - default: hotpotqa, baseline, 30
- `run_longbench_setup.sh`: Set up LongBench evaluation
  - `bash scripts/run_longbench_setup.sh [task] [model] [pruning_module] [output]`
- `run_ablation.sh`: Run ablation study
  - `bash scripts/run_ablation.sh` - generates ablation results
- `run_cooperative_replanning.sh`: Cooperative multi-agent planning with E-RECAP
  - `bash scripts/run_cooperative_replanning.sh --keep_ratio 0.7 --save_results` - run with the default task
  - `bash scripts/run_cooperative_replanning.sh --task_type embodied` - run the embodied replanning scenario
- Cost-Aware Pruning: Remove redundant tokens during prefill to reduce computation (up to 71% token reduction, 2-40× speedup depending on sequence length and GPU configuration)
- Layer-wise Pruning: Progressive pruning across Transformer layers (8 pruning points: layers 4, 7, 10, 13, 16, 19, 22, 25)
- Multi-GPU Support: Automatic distributed inference for long sequences (tested up to 32K tokens, achieving 20.7× average speedup on 8× RTX 5880)
- Learnable Pruning Module: Lightweight MLP (hidden_size → hidden_size/4 → 1) trained on instruction-following data
- Cooperative Multi-Agent Planning: Sequential multi-agent replanning with E-RECAP context pruning (K=2-8 agents, see Multi-Agent Planning)
- Quality Preservation: Maintains task success rate while significantly reducing computation (typically <2% quality degradation at keep_ratio=0.7)
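The layer-wise scheme above can be illustrated with a shape-level sketch: an untrained stand-in for the `hidden_size → hidden_size/4 → 1` scorer, applied at the eight pruning points so the kept token set shrinks progressively. This is NumPy with random weights; only the control flow and shapes mirror the real module, and applying `keep_ratio` per pruning point (rather than globally) is an assumption of this toy:

```python
import numpy as np

HIDDEN, SEQ, KEEP_RATIO = 64, 100, 0.7
PRUNE_LAYERS = {4, 7, 10, 13, 16, 19, 22, 25}       # the 8 pruning points

rng = np.random.default_rng(0)
# lightweight scorer: hidden_size -> hidden_size/4 -> 1 (untrained here)
W1 = 0.02 * rng.standard_normal((HIDDEN, HIDDEN // 4))
W2 = 0.02 * rng.standard_normal((HIDDEN // 4, 1))

def score(h):
    return (np.maximum(h @ W1, 0.0) @ W2).squeeze(-1)   # one score per token

h = rng.standard_normal((SEQ, HIDDEN))
for layer in range(28):                     # e.g. a 28-layer backbone
    # ... h = transformer_layer(h) would run here ...
    if layer in PRUNE_LAYERS:
        k = max(1, int(len(h) * KEEP_RATIO))
        keep = np.sort(np.argsort(score(h))[-k:])       # top-k, order preserved
        h = h[keep]
print(len(h))   # tokens surviving 8 progressive pruning rounds
```

Note that per-point pruning compounds (0.7⁸ ≈ 5.8% of tokens would remain in this toy schedule); how the real configuration distributes the budget across layers is governed by the trained module and `keep_ratio` setting.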
E-RECAP supports cooperative multi-agent planning where multiple agents operate sequentially, each receiving a shared planning context pruned by E-RECAP's cost-aware token pruning module. This setting captures multi-agent replanning characteristics—context growth, information aggregation, and iterative plan revision—while maintaining strict control over experimental variables.
Note: This is a planning-level multi-agent setting (not multi-robot physical control). Multiple planning agents contribute information sequentially to a shared context, which is pruned by E-RECAP before each agent invocation. This design systematically amplifies context growth to evaluate E-RECAP's scalability (K=2-8 agents).
Quick example (uses the default task, no input required):

```bash
# Run with the default task description
bash scripts/run_cooperative_replanning.sh --keep_ratio 0.7 --save_results

# Or via Python (default task included)
python3 src/multi_agent/run_cooperative_test.py --keep_ratio 0.7 --save_results
```

Key features:
- Shared Context Buffer: accumulates task descriptions, plans, constraints, and agent contributions
- E-RECAP Pruning: context is pruned before each agent invocation to control growth
- Structured Output: agent outputs in a structured format (observations, conflicts, plan patches)
- Framework Compatible: optional CrewAI/LangChain support (see below)
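The sequential loop behind these features can be sketched in a few lines. All names here are hypothetical (the real logic lives in `cooperative_planner.py` and `context_buffer.py`), and a simple recency heuristic stands in for the learned pruner:

```python
def run_round(agents, buffer, prune, keep_ratio=0.7):
    """Invoke each agent in turn on the shared, pruned planning context."""
    for agent in agents:
        context = prune(buffer, keep_ratio)   # prune before every invocation
        buffer.append(agent(context))         # structured contribution appended
    return buffer

# toy pruner: keep the most recent fraction (stand-in for learned token scores)
prune = lambda buf, r: buf[-max(1, int(len(buf) * r)):]
agents = [lambda ctx, i=i: f"agent{i}: saw {len(ctx)} context entries" for i in range(3)]
buffer = run_round(agents, ["task: tidy the kitchen"], prune)
print(buffer[-1])   # → 'agent2: saw 2 context entries'
```

Even in this toy, the buffer grows with every agent while each invocation sees only a bounded slice, which is exactly the context-growth pressure E-RECAP's pruning is designed to control.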
Optional Framework Support (CrewAI/LangChain):

To enable CrewAI or LangChain integration, install the optional dependencies (quote the version specifiers so the shell doesn't treat `>=` as redirection):

```bash
pip install "crewai>=0.28.8" "langchain>=0.1.17" "langchain-community>=0.1.17"
```

Then copy the agent configuration file into the `framework_optional` directory (if you have one):

```bash
# Create the framework_optional directory if it doesn't exist
mkdir -p src/multi_agent/framework_optional

# Copy your agents_config.json file there; it should follow the
# format expected by CrewAI/LangChain
```

Note: Framework support is optional. The default implementation works without CrewAI/LangChain. When enabled, frameworks are used only for scheduling/role assignment, while prompt construction, context buffering, and pruning remain under E-RECAP control. See `src/multi_agent/framework_wrapper.py` for implementation details.
Baseline comparison:

```bash
# Run the baseline (no pruning) for comparison
bash scripts/run_cooperative_replanning.sh --baseline --save_results

# Compare baseline vs E-RECAP results
python3 src/multi_agent/compare_baseline_erecap.py \
    --baseline_file results/cooperative_planning_cooperative_baseline.json \
    --erecap_file results/cooperative_planning_cooperative_0.7.json
```

For detailed implementation, see `paper/part3_sum.md`.
E-RECAP is designed for embodied AI replanning scenarios. Planned evaluation includes:
- Platform: Habitat-Lab (PointNav, ObjectNav tasks)
- Scenes: Matterport3D (MP3D), Gibson, Replica
- Setting: Cooperative multi-agent replanning (K=2-8 agents)
- Metrics: Success Rate, SPL, token cost, latency, replanning frequency
- Baselines: No-Pruning, Random-Pruning (token-count-matched), Heuristic-Pruning (recency, token-count-matched)
The Habitat integration will evaluate E-RECAP's effectiveness in real embodied replanning scenarios where context naturally grows through plan-execute-observe-replan cycles. See paper/habitat_integration_design.md for detailed design.
E-RECAP supports any HuggingFace-compatible Transformer model. To use a different model:

- Place model files in `checkpoints/<your-model-name>/`
- Point scripts to your model:
  - Inference: pass `--model_path` (and optionally `--pruning_ckpt`) to `src/inference_erecap.py`
  - Stage 1: pass `--model_path` / `--data_path` / `--prune_layers` to `src/stage1_saliency.py`
  - Stage 2: pass `--model_path` / `--data_path` / `--saliency_path` / `--output_path` / `--prune_layers` to `src/stage2_pruning.py`
- Train the pruning module (if switching to a model with a different `hidden_size`):
  - The pruning module is model-specific and depends on the model's `hidden_size`
  - If your new model has the same `hidden_size`, you can reuse the existing `pruning_module.pt`
  - Otherwise, retrain by running Stage 2 with the new model