Echo Workload Tracer

Echo Workload Tracer is a tool designed for tracing and analyzing the workloads of different deep learning frameworks. Currently, it supports PyTorch, DeepSpeed and Megatron-LM frameworks.

Project Overview

The Echo Workload Tracer focuses on capturing runtime information and generating detailed workload graphs from deep learning training jobs, using only 1 GPU device. This data is essential for analyzing performance, optimizing resource utilization, and simulating distributed training at scale.

PyTorch Support: Comprehensive tracing for PyTorch models, including:
- Support for HuggingFace Transformers models
- Support for libs (e.g., transformers, torchvision) models and custom PyTorch models
- Support training parallel mode like DDP
- Capturing both forward and backward passes and extracting execution graphs and runtime data

Note：We are developing PyTorch-tracer to support PyTorch native training framework torchtitan. Tracers of Deepspeed and Megatron-LM are under active development. We would keep updating these.

Installation

Prerequisites

NVIDIA GPU with CUDA support (at least 1 GPU device)

Setup Instructions

Clone the repository

git clone https://github.com/fwyc0573/Echo-workload-tracer.git
cd Echo-workload-tracer

export PYTHONPATH=$PYTHONPATH:/path/to/Echo-workload-tracer

Setup Conda environment

conda env create -f environment.yaml
conda activate simulator_echo

Usage

PyTorch Tracing

Using the Example Script

# Basic usage (local model)
./pytorch_tracing_run.sh --model gpt2 --batch_size 32 --sequence_length 512 --num_gpus 1 --model_source local

# Advanced usage (huggingface model)
./pytorch_tracing_run_huggingface.sh --model deepseek-ai/deepseek-coder-1.3b-base --model_source huggingface --batch_size 2 --sequence_length 256 --num_repeats 5 --num_gpus 1

# Advanced usage (ddp mode)
./pytorch_tracing_run_ddp.sh --model gpt2 --batch_size 1 --bucket_cap_mb 10 --sequence_length 512 --num_gpus 2 --model_source local

Common Parameters

--framework: Framework to use for workload tracing (choices: 'PyTorch', 'DeepSpeed', 'Megatron-LM', default: 'PyTorch')
--base_path: Path to save the output (default: 'output/')
--num_gpus: Number of GPUs to use in training (default: 1)

PyTorch Parameters

--model: Model to benchmark (default: 'gpt2')
--model_source: Model source (choices: 'huggingface', 'local', default: 'local')
--batch_size: Batch size for training/inference (default: 16)
--sequence_length: Sequence length for input data (default: 512)
--num_repeats: Number of repetitions for averaging results (default: 1)
--pytorch_ops_profiling: Enable operations profiling for PyTorch workload
--pytorch_graph_profiling: Enable graph profiling for PyTorch workload
--pytorch_ddp: Enable PyTorch DistributedDataParallel (DDP) mode
--bucket_cap_mb: Communication bucket size used in DDP mode (default: 25)

Output

The tracer generates the following outputs:

output/
├── logs/
│   └── PyTorch/
│       ├── config_[model_name]_bs[batch_size]_seq[seq_length].json   # Tracing config files for each run
├── PyTorch/
│   ├── pytorch_graph_profiling/
│   │   └── [model_source]/
│   │       └── [model_name]/
│   │           ├── forward_graph_profiling_bs[batch_size]_seq[seq_length].json  # Forward graph profiling
│   │           ├── backward_graph_profiling_bs[batch_size]_seq[seq_length].json # Backward graph profiling
│   │           └── global_graph_profiling_bs[batch_size]_seq[seq_length].json   # Global graph profiling
│   └── pytorch_ops_profiling/
│       └── [model_source]/
│           └── [model_name]/
│               ├── forward_ops_profiling_bs[batch_size]_seq[seq_length].json    # Forward ops profiling
│               ├── backward_ops_profiling_bs[batch_size]_seq[seq_length].json   # Backward ops profiling
│               ├── global_ops_profiling_bs[batch_size]_seq[seq_length].json     # Global ops profiling
│               └── PyTorch tracer_[timestamp].log                               # Tracer log files

Development Guide

Adding New Model Support

To add support for additional model types:

Define the model loading function in the appropriate tracer file
Register the model source in tracer_arguments.py
Handle any model-specific operations or patterns

Citation

If you use this tool in your research, please cite our paper:

@article{echo2024,
  title={Echo: Simulating Distributed Training At Scale},
  author={Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu},
  journal={arXiv preprint arXiv:2412.12487},
  year={2024}
}

Contact

Please email Yicheng Feng for questions or issues related to this project.

License

This project is licensed under the MIT license - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
examples		examples
legacy		legacy
output		output
tracer_core		tracer_core
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yaml		environment.yaml
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Echo Workload Tracer

Project Overview

Installation

Prerequisites

Setup Instructions

Usage

PyTorch Tracing

Using the Example Script

Common Parameters

PyTorch Parameters

Output

Development Guide

Adding New Model Support

Citation

Contact

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Echo Workload Tracer

Project Overview

Installation

Prerequisites

Setup Instructions

Usage

PyTorch Tracing

Using the Example Script

Common Parameters

PyTorch Parameters

Output

Development Guide

Adding New Model Support

Citation

Contact

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages