GPU-accelerated LLaMA inference wrapper for legacy Vulkan-capable systems: a Pythonic way to run AI with knowledge (Ilm) on fire (Vulkan).
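For a sense of what such a wrapper does under the hood, here is a minimal sketch using llama-cpp-python built with a Vulkan-enabled llama.cpp backend. The build flag, model path, and prompt are assumptions for illustration; the repository's own API may look different.

```python
# Illustrative sketch only: assumes llama-cpp-python built with the Vulkan
# backend (e.g. CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-7b.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=-1,   # offload all layers to the GPU backend
    n_ctx=2048,
)

out = llm("Q: What is Vulkan? A:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```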
GPU-aware inference mesh for large-scale AI serving
🚀 ClipServe: A fast API server for embedding text and images and performing zero-shot classification using OpenAI’s CLIP model. Powered by FastAPI, Redis, and CUDA for lightning-fast, scalable AI applications. Transform texts and images into embeddings or classify images with custom labels, all through easy-to-use endpoints. 🌐📊
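As a rough sketch of the embedding-endpoint idea (not ClipServe's actual API), a FastAPI route serving CLIP text embeddings via Hugging Face transformers might look like the following; the route name, model id, and payload shape are illustrative assumptions.

```python
# Minimal sketch of a CLIP text-embedding endpoint using FastAPI + transformers.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import CLIPModel, CLIPProcessor

app = FastAPI()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class TextBatch(BaseModel):
    texts: list[str]

@app.post("/embed/text")
def embed_text(batch: TextBatch):
    inputs = processor(text=batch.texts, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return {"embeddings": features.cpu().tolist()}
```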
Docker-based GPU inference of machine learning models
Generating images with diffusion models on a mobile device, with an intranet GPU box as backend
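A minimal sketch of the GPU-box side of such a setup, assuming Hugging Face diffusers plus FastAPI; the checkpoint id, route, and PNG response format are illustrative, not the repository's actual interface.

```python
# Sketch: intranet GPU backend that a mobile client could call for images.
import io

import torch
from diffusers import StableDiffusionPipeline
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

class Prompt(BaseModel):
    text: str

@app.post("/generate")
def generate(prompt: Prompt):
    image = pipe(prompt.text).images[0]  # PIL image
    buf = io.BytesIO()
    image.save(buf, format="PNG")        # return PNG bytes to the mobile client
    return Response(content=buf.getvalue(), media_type="image/png")
```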
End-to-end scalable ML inference on EKS: KEDA-driven pod autoscaling with Prometheus custom metrics, Cluster Autoscaler for GPU node scaling, and NVIDIA GPU time-slicing to run multiple pods per GPU.
Instant setup scripts for cloud-based LLM development.
A tiny Python / ZMQ server to facilitate experimenting with DeepSeek-OCR
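A tiny ZMQ request/reply server of this kind might look like the sketch below, assuming pyzmq; the port is arbitrary and run_ocr() is a placeholder for whatever DeepSeek-OCR invocation the repository actually wires in.

```python
# Minimal pyzmq REP server sketch: receive image bytes, reply with OCR text.
import zmq

def run_ocr(image_bytes: bytes) -> str:
    # Placeholder: hand the image to the OCR model and return recognized text.
    return "<ocr output>"

def main() -> None:
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind("tcp://*:5555")  # illustrative port
    while True:
        image_bytes = sock.recv()                # client sends raw image bytes
        sock.send_string(run_ocr(image_bytes))   # reply with recognized text

if __name__ == "__main__":
    main()
```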
Secure, private LLM inference on RunPod GPUs with Tailscale networking. Zero public exposure.
A minimal, high-performance starter kit for running AI model inference on NVIDIA GPUs using CUDA. Includes environment setup, sample kernels, and guidance for integrating ONNX/TensorRT pipelines for fast, optimized inference on modern GPU hardware.
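For the ONNX side of such a pipeline, a hedged sketch using onnxruntime-gpu with the CUDA execution provider could look like this; the model file name and input shape are assumptions.

```python
# Sketch: run an ONNX model on an NVIDIA GPU via onnxruntime's CUDA provider.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # hypothetical exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example NCHW batch
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```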
ModelSpec is an open, declarative specification for describing how AI models, especially LLMs, are deployed, served, and operated in production. It captures execution, serving, and orchestration intent to enable validation, reasoning, and automation across modern AI infrastructure.