A Llama2 streaming-output API in the OpenAI style, with support for multi-GPU inference for models of 13B or larger.
## Install

- Install `llama` from the official repository.
- Download the Llama2 weights from this repository; the `pth` format is recommended.
- Clone this repo:

  ```shell
  git clone --depth=1 https://github.com/firslov/llama2-api.git
  ```

- Install the requirements:

  ```shell
  pip install -r requirements.txt
  ```

## Usage

Set the arguments in `run_api.sh`, then run:
```shell
./run_api.sh
```

## Note

- 8/9/2023: The `torch.distributed` module imposes a maximum timeout of 30 minutes. Since I couldn't find a suitable solution within `torch.distributed`, I resorted to a less elegant approach: sending periodic POST requests to reset the timeout.
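The periodic-POST workaround above can be sketched as a background keepalive thread. The endpoint path, request payload, and 10-minute interval below are illustrative assumptions, not the exact values this repo uses:

```python
import json
import threading
import time
import urllib.error
import urllib.request


def start_keepalive(url: str, interval_s: float = 600.0) -> threading.Thread:
    """Periodically POST to `url` so the distributed workers never sit
    idle past torch.distributed's 30-minute collective timeout.

    The endpoint, payload, and 10-minute default interval are
    illustrative assumptions, not the repo's exact values.
    """

    def _loop() -> None:
        while True:
            body = json.dumps({"prompt": "", "max_tokens": 0}).encode()
            req = urllib.request.Request(
                url,
                data=body,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            try:
                # Any request that reaches every rank resets the timeout;
                # failures are swallowed so the keepalive never crashes.
                urllib.request.urlopen(req, timeout=30)
            except (urllib.error.URLError, OSError):
                pass
            time.sleep(interval_s)

    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t
```

Running the loop in a daemon thread means it dies with the server process and never blocks shutdown.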
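Separately, for reference, a minimal client for the OpenAI-style streaming output might look like the sketch below. The URL, port, model name, and chunk schema are assumptions based on the standard OpenAI chat-completions SSE format; check `run_api.sh` and the server code for the actual values:

```python
import json
import urllib.request

# Hypothetical address; check run_api.sh for the actual host and port.
API_URL = "http://127.0.0.1:8000/v1/chat/completions"


def parse_sse_chunk(line: str):
    """Return the text delta carried by one `data: {...}` SSE line,
    or None for blank lines, non-data lines, and the [DONE] sentinel."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0].get("delta", {}).get("content", "")


def stream_chat(prompt: str):
    """Yield content deltas from the streaming endpoint as they arrive."""
    body = json.dumps({
        "model": "llama-2-13b",  # assumed model name
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            delta = parse_sse_chunk(raw.decode("utf-8"))
            if delta:
                yield delta
```

With a server running, `for piece in stream_chat("Hello"): print(piece, end="")` would print the completion token by token.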