GPU Optimized LLM Inference Platform

High-performance LLM inference platform using vLLM, NVIDIA Triton Inference Server, FastAPI, and Docker for scalable GPU-accelerated AI serving.

Features

	Feature
⚡	High-throughput LLM serving using vLLM
🧠	OpenAI-compatible chat completion API
🚀	GPU-optimized inference pipeline
📦	NVIDIA Triton integration for tensor/model serving
🔥	FastAPI backend with REST endpoints
📊	Concurrent benchmarking support
🐳	Dockerized deployment
⚙️	Environment-based configuration
📈	Latency monitoring and performance testing

System Architecture

                    ┌────────────────────┐
                    │   Client / User    │
                    └─────────┬──────────┘
                              │
                              ▼
                    ┌────────────────────┐
                    │    FastAPI API     │
                    │   Gateway Layer    │
                    └───────┬────────────┘
                            │
          ┌─────────────────┴─────────────────┐
          │                                   │
          ▼                                   ▼
 ┌──────────────────┐              ┌────────────────────┐
 │      vLLM        │              │  NVIDIA Triton     │
 │  LLM Inference   │              │ Inference Server   │
 │ Continuous Batch │              │  Tensor Serving    │
 │    KV Cache      │              │ Dynamic Batching   │
 └──────────────────┘              └────────────────────┘

Tech Stack

Component	Technology
Backend API	FastAPI
LLM Serving	vLLM
GPU Model Serving	NVIDIA Triton
Containerization	Docker
Benchmarking	Python Requests + ThreadPool
Tensor Handling	NumPy
API Client	OpenAI SDK
Deployment	Docker Compose

Why vLLM?

vLLM is optimized specifically for LLM text generation.

Key Optimizations:

Continuous batching
PagedAttention
Efficient KV-cache management
High GPU utilization
Low latency inference

Best For: Chatbots · AI agents · Copilots · LLM APIs · Multi-user inference systems

Why Triton?

NVIDIA Triton is used for general GPU model serving.

Triton Handles:

Embedding models
TensorRT pipelines
Rerankers
Vision models
Speech models
Multi-model orchestration

Current Implementation: This project includes a sample identity model to demonstrate Triton model repository structure, tensor routing, GPU inference requests, and FastAPI ↔ Triton integration.

Project Structure

GPU_Inferencevllm/
│
├── app/
│   ├── main.py
│   ├── vllm_client.py
│   └── triton_client.py
│
├── triton/
│   └── model_repository/
│       └── identity_model/
│           ├── 1/
│           │   └── model.py
│           └── config.pbtxt
│
├── benchmark.py
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md

Setup

1. Clone Repository

git clone https://github.com/yourusername/GPU_Inferencevllm.git
cd GPU_Inferencevllm

2. Create Virtual Environment

python -m venv .venv
source .venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment Variables

Create a .env file:

VLLM_BASE_URL=http://host.docker.internal:8000/v1
VLLM_API_KEY=dummy
VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
TRITON_URL=localhost:8001

5. Start Docker Services

docker compose up --build

This starts the FastAPI backend and Triton inference server.

6. Start vLLM Server

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --port 8000

API Endpoints

`GET /` — Root

{
  "message": "GPU Optimized LLM Inference Platform",
  "routes": ["/generate", "/triton/health", "/triton/identity"]
}

`POST /generate` — Generate Text

Request:

{
  "prompt": "Explain GPU optimization",
  "max_tokens": 100,
  "temperature": 0.2
}

Response:

{
  "prompt": "Explain GPU optimization",
  "response": "GPU optimization improves...",
  "latency_seconds": 1.42
}

`GET /triton/health` — Triton Health Check

{
  "server_live": true,
  "server_ready": true
}

`POST /triton/identity` — Triton Identity Model

Request:

{
  "values": [1.0, 2.0, 3.0]
}

Response:

{
  "input": [1.0, 2.0, 3.0],
  "output": [1.0, 2.0, 3.0],
  "latency_seconds": 0.01
}

Request Flow

LLM Generation:

Client → POST /generate → FastAPI → vLLM Client → vLLM Server → LLM Response

Triton Inference:

Client → POST /triton/identity → FastAPI → Triton Client → Triton Server → Tensor Output

Benchmarking

Run concurrent benchmark tests:

python benchmark.py

Metrics collected:

Total requests
Requests/sec
Average latency
Min / Max latency

Roadmap

Enterprise Use Cases

AI copilots
RAG systems
Document intelligence
AI chat platforms
GPU inference gateways
Enterprise LLM serving
Multi-model AI infrastructure

Key Learning Outcomes

This project demonstrates:

GPU inference optimization
Production-grade AI serving
vLLM architecture
Triton inference workflows
FastAPI orchestration
Docker deployment
Performance benchmarking
Scalable AI system design

Author

Rushi Iname

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GPU Optimized LLM Inference Platform

Features

System Architecture

Tech Stack

Why vLLM?

Why Triton?

Project Structure

Setup

1. Clone Repository

2. Create Virtual Environment

3. Install Dependencies

4. Configure Environment Variables

5. Start Docker Services

6. Start vLLM Server

API Endpoints

`GET /` — Root

`POST /generate` — Generate Text

`GET /triton/health` — Triton Health Check

`POST /triton/identity` — Triton Identity Model

Request Flow

Benchmarking

Roadmap

Enterprise Use Cases

Key Learning Outcomes

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
app		app
triton/model_repository/identity_model		triton/model_repository/identity_model
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

GPU Optimized LLM Inference Platform

Features

System Architecture

Tech Stack

Why vLLM?

Why Triton?

Project Structure

Setup

1. Clone Repository

2. Create Virtual Environment

3. Install Dependencies

4. Configure Environment Variables

5. Start Docker Services

6. Start vLLM Server

API Endpoints

GET / — Root

POST /generate — Generate Text

GET /triton/health — Triton Health Check

POST /triton/identity — Triton Identity Model

Request Flow

Benchmarking

Roadmap

Enterprise Use Cases

Key Learning Outcomes

Author

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`GET /` — Root

`POST /generate` — Generate Text

`GET /triton/health` — Triton Health Check

`POST /triton/identity` — Triton Identity Model

Packages