Splits LSTM inference into two gRPC microservices, “Prefill” (embedding) and “Decode”, with real-time Kafka → Flink → Grafana telemetry. This demo shows how to parallelize and scale the lightweight embedding stage independently of the heavier decode stage, track per-phase latency, and visualize the resulting metrics in Grafana.
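As a rough illustration of the split described above, the following is a minimal in-process sketch, not the repository's actual code. The names `prefill`, `decode`, `timed`, and `run_request` are hypothetical, the embedding and "LSTM" bodies are toy placeholders, and the gRPC and Kafka layers are left out entirely; the sketch only shows how each phase can be timed separately and turned into a telemetry record.

```python
import json
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for the two services; the real demo exposes them over gRPC.
EMBEDDING_DIM = 4


def prefill(token_ids: List[int]) -> List[List[float]]:
    """Lightweight stage: map each token id to a toy embedding vector."""
    return [[float(t % 7) / 7.0] * EMBEDDING_DIM for t in token_ids]


def decode(embeddings: List[List[float]]) -> float:
    """Heavier stage: placeholder for the LSTM pass (a simple recurrent sum)."""
    state = 0.0
    for vec in embeddings:
        state = 0.5 * state + sum(vec)
    return state


def timed(phase: str, fn: Callable[[Any], Any], arg: Any) -> Tuple[Any, Dict[str, Any]]:
    """Run one phase and return its result plus a per-phase latency record."""
    start = time.perf_counter()
    result = fn(arg)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, {"phase": phase, "latency_ms": latency_ms}


def run_request(token_ids: List[int]) -> Tuple[float, List[Dict[str, Any]]]:
    """Chain the two phases and collect one telemetry record per phase."""
    embeddings, prefill_metric = timed("prefill", prefill, token_ids)
    output, decode_metric = timed("decode", decode, embeddings)
    return output, [prefill_metric, decode_metric]


if __name__ == "__main__":
    output, metrics = run_request([1, 2, 3])
    for record in metrics:
        # Assumption: records shaped like this would be what gets published for telemetry.
        print(json.dumps(record))
```

Keeping the timing wrapper outside the phase functions means either stage could be moved behind a network call without changing how its latency is recorded.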
- Prerequisites
- Installation
- Project Layout
- Usage
- Architecture Diagram
- Development Notes
- Contributing & Issues
- License
- Operating System: Linux, macOS, or Windows Subsystem for Linux
- Git (for cloning the repo)
- Python 3.8+
- `venv` or `virtualenv` for isolated environments
- Docker & Docker Compose (if you choose to containerize Kafka, Flink, and Grafana)
- (Optional) Kafka & Flink CLI for local pipelines
- Grafana (for dashboard visualization)
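The prerequisites above can be checked with a short preflight script. This is a sketch under assumptions: the helper name `check_prerequisites` is hypothetical, it only looks for the `git` and `docker` executables on `PATH`, and it does not verify Docker Compose, Kafka, Flink, or Grafana.

```python
import shutil
import sys
from typing import Dict

MIN_PYTHON = (3, 8)          # minimum version listed in the prerequisites
TOOLS = ["git", "docker"]    # assumption: both are expected on PATH when used


def check_prerequisites() -> Dict[str, bool]:
    """Report whether the interpreter and each command-line tool look usable."""
    report = {"python>=3.8": sys.version_info >= MIN_PYTHON}
    for tool in TOOLS:
        # shutil.which returns None when the executable is not found on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing `docker` entry is only a problem if you plan to containerize Kafka, Flink, and Grafana.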
- Clone the repository
    git clone git@github.com:<your-username>/split-inference-grpc-demo.git
    cd split-inference-grpc-demo