High-performance LLM inference platform using vLLM, NVIDIA Triton Inference Server, FastAPI, and Docker for scalable GPU-accelerated AI serving.
| Feature | |
|---|---|
| β‘ | High-throughput LLM serving using vLLM |
| π§ | OpenAI-compatible chat completion API |
| π | GPU-optimized inference pipeline |
| π¦ | NVIDIA Triton integration for tensor/model serving |
| π₯ | FastAPI backend with REST endpoints |
| π | Concurrent benchmarking support |
| π³ | Dockerized deployment |
| βοΈ | Environment-based configuration |
| π | Latency monitoring and performance testing |
ββββββββββββββββββββββ
β Client / User β
βββββββββββ¬βββββββββββ
β
βΌ
ββββββββββββββββββββββ
β FastAPI API β
β Gateway Layer β
βββββββββ¬βββββββββββββ
β
βββββββββββββββββββ΄ββββββββββββββββββ
β β
βΌ βΌ
ββββββββββββββββββββ ββββββββββββββββββββββ
β vLLM β β NVIDIA Triton β
β LLM Inference β β Inference Server β
β Continuous Batch β β Tensor Serving β
β KV Cache β β Dynamic Batching β
ββββββββββββββββββββ ββββββββββββββββββββββ
| Component | Technology |
|---|---|
| Backend API | FastAPI |
| LLM Serving | vLLM |
| GPU Model Serving | NVIDIA Triton |
| Containerization | Docker |
| Benchmarking | Python Requests + ThreadPool |
| Tensor Handling | NumPy |
| API Client | OpenAI SDK |
| Deployment | Docker Compose |
vLLM is optimized specifically for LLM text generation.
Key Optimizations:
- Continuous batching
- PagedAttention
- Efficient KV-cache management
- High GPU utilization
- Low latency inference
Best For: Chatbots Β· AI agents Β· Copilots Β· LLM APIs Β· Multi-user inference systems
NVIDIA Triton is used for general GPU model serving.
Triton Handles:
- Embedding models
- TensorRT pipelines
- Rerankers
- Vision models
- Speech models
- Multi-model orchestration
Current Implementation: This project includes a sample identity model to demonstrate Triton model repository structure, tensor routing, GPU inference requests, and FastAPI β Triton integration.
GPU_Inferencevllm/
β
βββ app/
β βββ main.py
β βββ vllm_client.py
β βββ triton_client.py
β
βββ triton/
β βββ model_repository/
β βββ identity_model/
β βββ 1/
β β βββ model.py
β βββ config.pbtxt
β
βββ benchmark.py
βββ Dockerfile
βββ docker-compose.yml
βββ requirements.txt
βββ .env.example
βββ README.md
git clone https://github.com/yourusername/GPU_Inferencevllm.git
cd GPU_Inferencevllmpython -m venv .venv
source .venv/bin/activatepip install -r requirements.txtCreate a .env file:
VLLM_BASE_URL=http://host.docker.internal:8000/v1
VLLM_API_KEY=dummy
VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
TRITON_URL=localhost:8001docker compose up --buildThis starts the FastAPI backend and Triton inference server.
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--port 8000{
"message": "GPU Optimized LLM Inference Platform",
"routes": ["/generate", "/triton/health", "/triton/identity"]
}Request:
{
"prompt": "Explain GPU optimization",
"max_tokens": 100,
"temperature": 0.2
}Response:
{
"prompt": "Explain GPU optimization",
"response": "GPU optimization improves...",
"latency_seconds": 1.42
}{
"server_live": true,
"server_ready": true
}Request:
{
"values": [1.0, 2.0, 3.0]
}Response:
{
"input": [1.0, 2.0, 3.0],
"output": [1.0, 2.0, 3.0],
"latency_seconds": 0.01
}LLM Generation:
Client β POST /generate β FastAPI β vLLM Client β vLLM Server β LLM Response
Triton Inference:
Client β POST /triton/identity β FastAPI β Triton Client β Triton Server β Tensor Output
Run concurrent benchmark tests:
python benchmark.pyMetrics collected:
- Total requests
- Requests/sec
- Average latency
- Min / Max latency
- Add embedding models on Triton
- Add reranker pipelines
- TensorRT optimization
- Streaming responses
- Kubernetes deployment
- Autoscaling GPU workers
- Prometheus + Grafana monitoring
- RAG pipeline integration
- Redis KV-cache layer
- Multi-model Triton ensembles
- AI copilots
- RAG systems
- Document intelligence
- AI chat platforms
- GPU inference gateways
- Enterprise LLM serving
- Multi-model AI infrastructure
This project demonstrates:
- GPU inference optimization
- Production-grade AI serving
- vLLM architecture
- Triton inference workflows
- FastAPI orchestration
- Docker deployment
- Performance benchmarking
- Scalable AI system design
Rushi Iname