Skip to content

Rushikeshiname/GPU-Optimized-LLM-Inference-Platform

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

GPU Optimized LLM Inference Platform

High-performance LLM inference platform using vLLM, NVIDIA Triton Inference Server, FastAPI, and Docker for scalable GPU-accelerated AI serving.


Features

Feature
⚑ High-throughput LLM serving using vLLM
🧠 OpenAI-compatible chat completion API
πŸš€ GPU-optimized inference pipeline
πŸ“¦ NVIDIA Triton integration for tensor/model serving
πŸ”₯ FastAPI backend with REST endpoints
πŸ“Š Concurrent benchmarking support
🐳 Dockerized deployment
βš™οΈ Environment-based configuration
πŸ“ˆ Latency monitoring and performance testing

System Architecture

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   Client / User    β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚    FastAPI API     β”‚
                    β”‚   Gateway Layer    β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚                                   β”‚
          β–Ό                                   β–Ό
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚      vLLM        β”‚              β”‚  NVIDIA Triton     β”‚
 β”‚  LLM Inference   β”‚              β”‚ Inference Server   β”‚
 β”‚ Continuous Batch β”‚              β”‚  Tensor Serving    β”‚
 β”‚    KV Cache      β”‚              β”‚ Dynamic Batching   β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Tech Stack

Component Technology
Backend API FastAPI
LLM Serving vLLM
GPU Model Serving NVIDIA Triton
Containerization Docker
Benchmarking Python Requests + ThreadPool
Tensor Handling NumPy
API Client OpenAI SDK
Deployment Docker Compose

Why vLLM?

vLLM is optimized specifically for LLM text generation.

Key Optimizations:

  • Continuous batching
  • PagedAttention
  • Efficient KV-cache management
  • High GPU utilization
  • Low latency inference

Best For: Chatbots Β· AI agents Β· Copilots Β· LLM APIs Β· Multi-user inference systems


Why Triton?

NVIDIA Triton is used for general GPU model serving.

Triton Handles:

  • Embedding models
  • TensorRT pipelines
  • Rerankers
  • Vision models
  • Speech models
  • Multi-model orchestration

Current Implementation: This project includes a sample identity model to demonstrate Triton model repository structure, tensor routing, GPU inference requests, and FastAPI ↔ Triton integration.


Project Structure

GPU_Inferencevllm/
β”‚
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ main.py
β”‚   β”œβ”€β”€ vllm_client.py
β”‚   └── triton_client.py
β”‚
β”œβ”€β”€ triton/
β”‚   └── model_repository/
β”‚       └── identity_model/
β”‚           β”œβ”€β”€ 1/
β”‚           β”‚   └── model.py
β”‚           └── config.pbtxt
β”‚
β”œβ”€β”€ benchmark.py
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ docker-compose.yml
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ .env.example
└── README.md

Setup

1. Clone Repository

git clone https://github.com/yourusername/GPU_Inferencevllm.git
cd GPU_Inferencevllm

2. Create Virtual Environment

python -m venv .venv
source .venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment Variables

Create a .env file:

VLLM_BASE_URL=http://host.docker.internal:8000/v1
VLLM_API_KEY=dummy
VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
TRITON_URL=localhost:8001

5. Start Docker Services

docker compose up --build

This starts the FastAPI backend and Triton inference server.

6. Start vLLM Server

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --port 8000

API Endpoints

GET / β€” Root

{
  "message": "GPU Optimized LLM Inference Platform",
  "routes": ["/generate", "/triton/health", "/triton/identity"]
}

POST /generate β€” Generate Text

Request:

{
  "prompt": "Explain GPU optimization",
  "max_tokens": 100,
  "temperature": 0.2
}

Response:

{
  "prompt": "Explain GPU optimization",
  "response": "GPU optimization improves...",
  "latency_seconds": 1.42
}

GET /triton/health β€” Triton Health Check

{
  "server_live": true,
  "server_ready": true
}

POST /triton/identity β€” Triton Identity Model

Request:

{
  "values": [1.0, 2.0, 3.0]
}

Response:

{
  "input": [1.0, 2.0, 3.0],
  "output": [1.0, 2.0, 3.0],
  "latency_seconds": 0.01
}

Request Flow

LLM Generation:

Client β†’ POST /generate β†’ FastAPI β†’ vLLM Client β†’ vLLM Server β†’ LLM Response

Triton Inference:

Client β†’ POST /triton/identity β†’ FastAPI β†’ Triton Client β†’ Triton Server β†’ Tensor Output

Benchmarking

Run concurrent benchmark tests:

python benchmark.py

Metrics collected:

  • Total requests
  • Requests/sec
  • Average latency
  • Min / Max latency

Roadmap

  • Add embedding models on Triton
  • Add reranker pipelines
  • TensorRT optimization
  • Streaming responses
  • Kubernetes deployment
  • Autoscaling GPU workers
  • Prometheus + Grafana monitoring
  • RAG pipeline integration
  • Redis KV-cache layer
  • Multi-model Triton ensembles

Enterprise Use Cases

  • AI copilots
  • RAG systems
  • Document intelligence
  • AI chat platforms
  • GPU inference gateways
  • Enterprise LLM serving
  • Multi-model AI infrastructure

Key Learning Outcomes

This project demonstrates:

  • GPU inference optimization
  • Production-grade AI serving
  • vLLM architecture
  • Triton inference workflows
  • FastAPI orchestration
  • Docker deployment
  • Performance benchmarking
  • Scalable AI system design

Author

Rushi Iname

About

High-performance LLM inference platform using vLLM, NVIDIA Triton Inference Server, FastAPI, and Docker for scalable GPU-accelerated AI serving

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors