A Llama2 streaming-output API in the OpenAI style, with support for multi-GPU inference for models of 13B or larger.
## Install

- Install `llama` from the official repository.
- Download the Llama2 weights from this repository; the `pth` format is recommended.
- Clone this repo:

  ```shell
  git clone --depth=1 https://github.com/firslov/llama2-api.git
  ```

- Install the requirements:

  ```shell
  pip install -r requirements.txt
  ```

## Usage

Set the arguments in `run_api.sh`, then run:
```shell
./run_api.sh
```

## Note

- 8/9/2023: The `torch.distributed` module imposes a maximum timeout of 30 minutes. Since I couldn't find a suitable solution within `torch.distributed`, I resorted to a less elegant approach: sending periodic POST requests to reset the timeout.
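The periodic-POST workaround above can be sketched as a background keepalive thread. The endpoint path, request payload, and 10-minute interval below are illustrative assumptions, not the exact values this repo uses:

```python
import json
import threading
import time
import urllib.error
import urllib.request


def start_keepalive(url: str, interval_s: float = 600.0) -> threading.Thread:
    """Periodically POST to `url` so the distributed workers never sit
    idle past torch.distributed's 30-minute collective timeout.

    The endpoint, payload, and 10-minute default interval are
    illustrative assumptions, not the repo's exact values.
    """

    def _loop() -> None:
        while True:
            body = json.dumps({"prompt": "", "max_tokens": 0}).encode()
            req = urllib.request.Request(
                url,
                data=body,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            try:
                # Any request that reaches every rank resets the timeout;
                # failures are swallowed so the keepalive never crashes.
                urllib.request.urlopen(req, timeout=30)
            except (urllib.error.URLError, OSError):
                pass
            time.sleep(interval_s)

    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t
```

Running the loop in a daemon thread means it dies with the server process and never blocks shutdown.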
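Separately, for reference, a minimal client for the OpenAI-style streaming output might look like the sketch below. The URL, port, model name, and chunk schema are assumptions based on the standard OpenAI chat-completions SSE format; check `run_api.sh` and the server code for the actual values:

```python
import json
import urllib.request

# Hypothetical address; check run_api.sh for the actual host and port.
API_URL = "http://127.0.0.1:8000/v1/chat/completions"


def parse_sse_chunk(line: str):
    """Return the text delta carried by one `data: {...}` SSE line,
    or None for blank lines, non-data lines, and the [DONE] sentinel."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0].get("delta", {}).get("content", "")


def stream_chat(prompt: str):
    """Yield content deltas from the streaming endpoint as they arrive."""
    body = json.dumps({
        "model": "llama-2-13b",  # assumed model name
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            delta = parse_sse_chunk(raw.decode("utf-8"))
            if delta:
                yield delta
```

With a server running, `for piece in stream_chat("Hello"): print(piece, end="")` would print the completion token by token.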