A FastAPI-based server implementation for DeepSeek R1 language models, providing OpenAI-compatible API endpoints for chat completions and text generation. This server supports multiple models with optimized GPU memory management and automatic model loading.
- OpenAI-compatible API endpoints (/v1/chat/completions)
- Multi-GPU support with optimized memory management
- Automatic model loading and intelligent unloading
- 8-bit quantization for efficient memory usage
- Flash Attention 2 support
- Real-time monitoring and health checks
- NUMA-aware optimizations
- Streaming and non-streaming responses
- Optimized for NVIDIA A2 GPUs
- Automatic device mapping for multi-GPU setups
- Comprehensive monitoring and diagnostics
- Production-ready error handling
- Thermal management and overload protection
- Intelligent model persistence and unloading
- Extensive testing suite with stress testing capabilities
The server implements an intelligent model persistence strategy:
- Core Models (Always Loaded)
  - deepseek-r1-distil-32b: Primary large model
  - deepseek-r1-distil-14b: Balanced performance model
  - deepseek-r1-distil-8b: Fast routing model
- Specialized Models (Dynamic Loading)
  - mixtral-8x7b: Unloads after 1 hour of inactivity
  - idefics-80b: Unloads after 30 minutes of inactivity
  - fuyu-8b: Unloads immediately after use
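The unload timeouts above can be driven by a small background task that tracks when each model last served a request. The sketch below is illustrative only; the `unload_model` callable is a hypothetical placeholder, and the real logic belongs in the model manager:

```python
import asyncio
import time

# Idle timeouts in seconds, taken from the persistence policy above.
IDLE_TIMEOUTS = {
    "mixtral-8x7b": 60 * 60,  # 1 hour of inactivity
    "idefics-80b": 30 * 60,   # 30 minutes of inactivity
    "fuyu-8b": 0,             # unload immediately after use
}

last_used = {}  # model name -> monotonic timestamp of its last request


def mark_used(model_name: str) -> None:
    """Record that a model just served a request."""
    last_used[model_name] = time.monotonic()


async def idle_unload_loop(unload_model, interval: float = 60.0) -> None:
    """Periodically unload specialized models that have exceeded their idle timeout.

    `unload_model` is a hypothetical async callable that frees the model's GPU memory.
    """
    while True:
        now = time.monotonic()
        for name, timeout in IDLE_TIMEOUTS.items():
            ts = last_used.get(name)
            if ts is not None and now - ts >= timeout:
                await unload_model(name)
                last_used.pop(name, None)
        await asyncio.sleep(interval)
```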
Advanced thermal protection features:
thermal_management:
max_temperature: 87°C # Maximum safe temperature
thermal_throttle: 82°C # Start throttling
critical_temp: 85°C # Pause new requests
throttle_steps:
- {temp: 82°C, max_concurrent: 3}
- {temp: 84°C, max_concurrent: 2}
    - {temp: 85°C, max_concurrent: 1}
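As an illustration of how the throttle steps might translate into code, here is a minimal sketch that maps the hottest GPU temperature to a concurrency limit using pynvml (the NVML Python bindings). The baseline limit of 4 concurrent requests is an assumption, not a value from the config:

```python
import pynvml

# (temperature threshold in °C, max concurrent requests), mirroring throttle_steps above
THROTTLE_STEPS = [(82, 3), (84, 2), (85, 1)]
DEFAULT_MAX_CONCURRENT = 4  # assumed baseline when all GPUs are cool


def hottest_gpu_temp() -> int:
    """Return the highest temperature across all visible GPUs."""
    pynvml.nvmlInit()
    try:
        return max(
            pynvml.nvmlDeviceGetTemperature(
                pynvml.nvmlDeviceGetHandleByIndex(i), pynvml.NVML_TEMPERATURE_GPU
            )
            for i in range(pynvml.nvmlDeviceGetCount())
        )
    finally:
        pynvml.nvmlShutdown()


def allowed_concurrency() -> int:
    """Map the current hottest GPU temperature to a request concurrency limit."""
    limit = DEFAULT_MAX_CONCURRENT
    temp = hottest_gpu_temp()
    for threshold, max_concurrent in THROTTLE_STEPS:
        if temp >= threshold:
            limit = max_concurrent
    return limit
```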
# Run comprehensive stress test
./tests/stress_test.sh

Features:
- Sequential and parallel model loading
- Throughput measurement (tokens/sec)
- Rate limiting validation
- Memory usage patterns
- Temperature monitoring
- VRAM utilization tracking
# Run API endpoint tests
./tests/curl_tests.sh

Tests:
- Model loading/unloading
- Chat completions
- Streaming responses
- Error handling
- Rate limiting
# Run load test with custom parameters
python tests/load_test.py --concurrent 3 --requests 5 --delay 0.1

Capabilities:
- Concurrent request handling
- Response time measurement
- Success rate tracking
- Resource utilization monitoring
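For reference, the following sketch shows the general shape of such a load test using only the Python standard library: a thread pool fires concurrent chat requests and reports latency and success rate. It is not the actual tests/load_test.py; the endpoint and payload are taken from the API examples later in this README:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "deepseek-r1-distil-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
    "stream": False,
}


def one_request():
    """Send a single chat completion; return (success, latency_in_seconds)."""
    req = urllib.request.Request(
        URL, data=json.dumps(PAYLOAD).encode(), headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            resp.read()
            ok = resp.status == 200
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run(concurrent: int = 3, total: int = 5) -> None:
    with ThreadPoolExecutor(max_workers=concurrent) as pool:
        results = list(pool.map(lambda _: one_request(), range(total)))
    successes = sum(ok for ok, _ in results)
    mean_latency = sum(t for _, t in results) / len(results)
    print(f"success rate: {successes}/{total}, mean latency: {mean_latency:.2f}s")


if __name__ == "__main__":
    run()
```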
- Prerequisites
- Installation
- Configuration
- Running Models
- API Usage
- Monitoring
- Performance Optimization
- Troubleshooting
- Contributing
- License
Required software:
- Python 3.8+
- PyTorch 2.2.0+
- CUDA-capable GPUs (4x NVIDIA A2 16GB recommended)
- 64GB+ System RAM
Optional but recommended:
- NVIDIA Container Toolkit (for Docker deployment)
- NUMA-enabled system
- PCIe Gen4 support
- Clone the repository
- Install dependencies:

pip install -r requirements.txt

- Verify GPU setup:

python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Device count: {torch.cuda.device_count()}')"

- Run system optimization:

sudo ./scripts/gpu-optimize.sh

The server can be run in a Docker container with full GPU and NUMA support.
- Docker 20.10+
- NVIDIA Container Toolkit
- Docker Compose v2.x
- Dockerfile:
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive \
PYTHONUNBUFFERED=1 \
LANG=C.UTF-8 \
NVIDIA_VISIBLE_DEVICES=all \
NVIDIA_DRIVER_CAPABILITIES=compute,utility
# Install system dependencies
RUN apt-get update && apt-get install -y \
python3 \
python3-pip \
python3-dev \
numactl \
nvidia-utils-535 \
curl \
jq \
bc \
&& rm -rf /var/lib/apt/lists/*
# Set Python aliases
RUN ln -sf /usr/bin/python3 /usr/bin/python && \
ln -sf /usr/bin/pip3 /usr/bin/pip
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN chmod +x scripts/*.sh scripts/models/*.sh tests/*.sh
CMD ["./scripts/start-server.sh"]- docker-compose.yml:
version: '3.8'
services:
llm-server:
build: .
runtime: nvidia
ports:
- "8000:8000"
volumes:
- /tmp/models:/app/models # Persistent storage for downloaded models
environment:
- NVIDIA_VISIBLE_DEVICES=all
- NVIDIA_DRIVER_CAPABILITIES=compute,utility
- CUDA_VISIBLE_DEVICES=0,1,2,3
- CUDA_DEVICE_MAX_CONNECTIONS=1
- CUDA_DEVICE_ORDER=PCI_BUS_ID
- PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,expandable_segments:True
- TORCH_DISTRIBUTED_DEBUG=INFO
- TORCH_SHOW_CPP_STACKTRACES=1
- NUMA_GPU_NODE_PREFERRED=1
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ulimits:
memlock: -1
stack: 67108864
privileged: true # Needed for NUMA control
    network_mode: host  # Better performance for GPU communication

- Build the container:

docker-compose build

- Start the server:

docker-compose up -d

- View logs:

docker-compose logs -f

- Run tests:

docker-compose exec llm-server ./tests/stress_test.sh

Key features of the containerized setup:

- NVIDIA GPU support with proper driver mapping
- NUMA-aware CPU and memory allocation
- Persistent model storage
- Optimized network performance with host networking
- Proper resource limits and GPU capabilities
- Python 3 environment with all dependencies
.
├── scripts/
│ ├── gpu-optimize.sh # GPU optimization script
│ ├── start-server.sh # Server startup script with NUMA optimizations
│ ├── monitor.sh # Standalone monitoring script
│ └── models/
│ ├── model-template.sh # Base template for model scripts
│ └── run-model.sh # Generic model runner script
├── lib/
│ └── models/ # Model library and configurations
│ ├── __init__.py
│ ├── base_config.py
│ ├── model_configs.py
│ └── model_manager.py
├── config.yml # Main configuration file
├── server.py # Main server implementation
└── requirements.txt # Python dependencies
The config.yml file controls all aspects of the server and models:
# Server Configuration
server:
host: "0.0.0.0"
port: 8000
workers: 1
auto_load_enabled: true # Enable automatic model loading
# GPU Configuration
gpu:
devices: [0, 1, 2, 3] # Using all 4 GPUs
memory_per_gpu: "15GB" # Leave 1GB headroom
optimization:
torch_compile: true
mixed_precision: "bf16"
attn_implementation: "flash_attention_2"
# Model Configurations
models:
deepseek-r1-distil-32b:
name: "DeepSeek-R1-Distill-32B"
auto_load: false # Don't load on startup
device_map: {...} # Distributed across GPUs
deepseek-r1-distil-14b:
name: "DeepSeek-R1-Distill-14B"
auto_load: true # Load on startup
device: "auto" # Automatic device mapping
deepseek-r1-distil-8b:
name: "DeepSeek-R1-Distill-8B"
auto_load: true # Load on startup
    device: 0             # Single GPU

Models can be configured to load automatically when the server starts:
- Set auto_load_enabled: true in the server config
- Configure auto_load: true for specific models
- Models will load during server initialization
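To illustrate how these settings fit together, here is a minimal sketch that reads config.yml with PyYAML and lists the models eligible for auto-loading (key names follow the configuration example above; the actual loading is performed by the server and model manager):

```python
import yaml


def models_to_autoload(config_path: str = "config.yml"):
    """Return the model IDs that should be loaded at server startup."""
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Respect the global switch first.
    if not config.get("server", {}).get("auto_load_enabled", False):
        return []

    # Then collect every model flagged with auto_load: true.
    return [
        model_id
        for model_id, model_cfg in config.get("models", {}).items()
        if model_cfg.get("auto_load", False)
    ]


if __name__ == "__main__":
    # With the example config above: ['deepseek-r1-distil-14b', 'deepseek-r1-distil-8b']
    print(models_to_autoload())
```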
The server respects the following environment variables:
# GPU Configuration
CUDA_VISIBLE_DEVICES="0,1,2,3"
CUDA_DEVICE_MAX_CONNECTIONS="1"
CUDA_DEVICE_ORDER="PCI_BUS_ID"
# PyTorch Optimization
PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128,expandable_segments:True"
TORCH_DISTRIBUTED_DEBUG="INFO"
TORCH_SHOW_CPP_STACKTRACES="1"
# NUMA Configuration
NUMA_GPU_NODE_PREFERRED="1"
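If the server is launched directly rather than via start-server.sh, the same variables can be set from a Python entry point. A small sketch, assuming they are not already exported by the shell; CUDA-related variables should be set before torch initializes CUDA:

```python
import os

# Configure GPU visibility and the allocator before torch touches CUDA.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")
os.environ.setdefault("CUDA_DEVICE_MAX_CONNECTIONS", "1")
os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128,expandable_segments:True"
)

import torch  # imported after the environment is configured

print(f"Visible GPUs: {torch.cuda.device_count()}")
```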
The start-server.sh script provides optimized startup:

# Start server with NUMA optimizations
./scripts/start-server.sh

The run-model.sh script provides flexible model management:
# Basic usage
./scripts/models/run-model.sh --model MODEL_NAME
# Examples:
# Run 8B model with custom port
./scripts/models/run-model.sh --model deepseek-r1-distil-8b --port 8001
# Run 14B model with specific thread count
./scripts/models/run-model.sh --model deepseek-r1-distil-14b --threads 32

Run the GPU optimization script before starting the server:
# Optimize GPU settings
sudo ./scripts/gpu-optimize.sh

This script:
- Sets optimal GPU clock speeds
- Configures power limits
- Optimizes PCIe settings
- Sets up NUMA affinities
Recommended GPU configurations:
- DeepSeek-R1-Distill-32B
  - Distributed across 4 GPUs
  - ~15GB per GPU
  - 4K context length
  - Best for high-accuracy tasks
  - Recommended for: Complex reasoning, code generation
- DeepSeek-R1-Distill-14B
  - Auto device mapping
  - ~30GB total VRAM
  - 8K context length
  - Good balance of performance/memory
  - Recommended for: General use, balanced performance
- DeepSeek-R1-Distill-8B
  - Single GPU
  - ~15GB VRAM
  - 16K context length
  - Fastest inference
  - Recommended for: High-throughput, real-time applications
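These recommendations map fairly directly onto Hugging Face Transformers loading options. The sketch below is an illustrative approximation rather than the repo's actual loader (see lib/models/); the Hub model ID and the exact combination of options are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"  # assumed Hub ID for the 32B distill

# Cap each of the 4 GPUs at ~15GB, leaving ~1GB headroom as recommended above.
max_memory = {i: "15GiB" for i in range(4)}

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",                                          # spread layers across GPUs
    max_memory=max_memory,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Optionally compile the model, matching torch_compile: true in config.yml.
# model = torch.compile(model)
```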
# List available models
curl http://localhost:8000/v1/models

# Load a model
curl -X POST http://localhost:8000/v1/models/deepseek-r1-distil-8b/load

# Send a chat completion request
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distil-8b",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"stream": false
}'

# Streaming chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distil-8b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'

The chat completion endpoint accepts the following parameters:
{
"model": "string", // Model ID to use
"messages": [ // Array of messages
{
"role": "string", // "system", "user", or "assistant"
"content": "string" // Message content
}
],
"temperature": 0.7, // 0.0 to 2.0 (default: 0.7)
"top_p": 0.95, // 0.0 to 1.0 (default: 0.95)
"max_tokens": 2048, // Maximum tokens to generate
"stream": false, // Enable streaming responses
"stop": ["string"] // Optional array of stop sequences
}
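Because the endpoints are OpenAI-compatible, the official openai Python client can be pointed at the server. A minimal sketch, assuming the server runs locally on port 8000; the API key value is arbitrary here, so adjust it if your deployment enforces authentication:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming request
response = client.chat.completions.create(
    model="deepseek-r1-distil-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(response.choices[0].message.content)

# Streaming request
stream = client.chat.completions.create(
    model="deepseek-r1-distil-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```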
The /health endpoint provides comprehensive monitoring:

curl http://localhost:8000/health

Returns:
- GPU memory usage per device
- System memory status
- Loaded models
- Device mapping
- Process statistics
- NUMA topology
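For scripted monitoring, the endpoint can be polled with the standard library. The sketch below prints the raw payload rather than assuming specific keys, since the exact response schema is defined by server.py:

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8000/health"


def poll_health(interval: float = 10.0) -> None:
    """Print the /health payload on a fixed interval."""
    while True:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
            payload = json.load(resp)
        print(json.dumps(payload, indent=2))
        time.sleep(interval)


if __name__ == "__main__":
    poll_health()
```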
Use the dedicated monitoring script:
./scripts/monitor.sh

Features:
- Real-time GPU metrics
- Memory usage tracking
- NUMA status
- PCIe bandwidth
- Process information
- Temperature monitoring
- Power consumption
- Memory Allocation
  - Use 8-bit quantization
  - Enable gradient checkpointing
  - Implement proper cleanup
- Device Mapping
  - Balance model layers across GPUs
  - Consider NUMA topology
  - Optimize for PCIe bandwidth
- Inference Settings
  - Use Flash Attention 2
  - Enable torch.compile
  - Implement proper batching
- CPU Affinity
  - Bind processes to NUMA nodes
  - Align with GPU placement
  - Optimize thread count
- Memory Access
  - Local memory allocation
  - Minimize cross-node traffic
  - Monitor bandwidth
- GPU Memory Issues
  - Check GPU memory allocation in config
  - Use nvidia-smi to monitor usage
  - Consider using a smaller model
  - Check device mapping configuration
- Model Loading Failures
  - Verify auto-load settings
  - Check GPU memory availability
  - Review server logs for errors
  - Ensure correct device mapping
- Performance Issues
  - Run GPU optimization script
  - Check NUMA configuration
  - Monitor PCIe bandwidth
  - Review thread settings
- Memory Management
  - Leave 1GB headroom per GPU
  - Use appropriate device mapping
  - Monitor memory usage
  - Clean up unused models
- GPU Optimization
  - Run optimization script before starting
  - Use NUMA-aware configurations
  - Monitor GPU temperatures
  - Check PCIe bandwidth
- Model Selection
  - Use 32B for high accuracy (4K context)
  - Use 14B for balanced performance (8K context)
  - Use 8B for speed (16K context)
  - Consider auto-loading frequently used models
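The model selection guidance above can be encoded as a simple client-side routing helper. The sketch below is purely illustrative; the heuristic and thresholds are assumptions and not part of the server:

```python
# Approximate context limits (tokens) from the recommendations above.
MODEL_CONTEXT = {
    "deepseek-r1-distil-32b": 4096,
    "deepseek-r1-distil-14b": 8192,
    "deepseek-r1-distil-8b": 16384,
}


def choose_model(prompt_tokens: int, needs_high_accuracy: bool = False) -> str:
    """Pick a model following the best practices: 32B for accuracy,
    14B for balance, 8B for speed and long prompts."""
    if needs_high_accuracy and prompt_tokens <= MODEL_CONTEXT["deepseek-r1-distil-32b"]:
        return "deepseek-r1-distil-32b"
    if prompt_tokens <= MODEL_CONTEXT["deepseek-r1-distil-14b"]:
        return "deepseek-r1-distil-14b"
    return "deepseek-r1-distil-8b"
```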
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
- Fork the repository
- Create a virtual environment
- Install development dependencies:
pip install -r requirements-dev.txt

- Run tests:

pytest tests/

License: MIT

Acknowledgements:
- DeepSeek AI for the DeepSeek R1 models
- Hugging Face Transformers library
- FastAPI framework
- NVIDIA for GPU optimization guidance