A high-performance, terminal-integrated LLM inference and training engine built from scratch in Rust. This project implements a Transformer architecture with modern optimizations including NTK-Aware RoPE, Sliding Window Attention, Evicting KV Cache with Sink Tokens, RMSNorm, and support for up to 1M token context.
The system is organized as a workspace with specialized crates for modularity and performance:
- claude-core: The backbone of the system — Transformer blocks, NTK-Aware RoPE (Rotary Positional Embeddings), RMSNorm, Sliding Window Attention, Evicting KV Cache with Sink Token pinning, embedding layers, weight initialization, and sinusoidal/rotary position encodings.
- inference: A high-speed inference engine supporting SSE (Server-Sent Events) streaming via a web server, flexible token generation pipeline with unbounded context support, and configurable server/CLI modes.
- tokenizer: A custom Byte-Pair Encoding (BPE) implementation with advanced splitting rules, byte-level fallback encoding, and specialized training scripts.
- trainer: Full training pipeline with AdamW optimization, cosine LR scheduling with warmup, checkpoint save/resume, JSONL metrics logging, and streaming binary data loader.
- retrieval: RAG pipeline with text chunking, mean-pool embeddings, flat cosine similarity index, and FAISS-compatible configuration.
- quant: INT8 and 4-bit model quantization with per-group scaling (~4–8x compression) for consumer GPU inference.
- tensors: Backend abstraction layer with tensor operation utilities, broadcast checks, and CUDA/VRAM estimation.
- agent: An asynchronous orchestration layer that gives local models native access to file reads/writes and terminal execution.
- claude-tui: A terminal-based user interface built with Ratatui for real-time interaction with the models.
- utils: Shared config (YAML/JSON) loading, filesystem helpers, and logging setup.
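To make the trainer's "cosine LR scheduling with warmup" concrete, here is a minimal sketch of such a schedule. The function name `cosine_lr` and all of its parameters are hypothetical and chosen for illustration; they are not taken from the trainer crate, and the real scheduler may differ in details such as the warmup shape or the floor value.

```rust
use std::f64::consts::PI;

/// Learning rate at `step`: linear warmup, then cosine decay to `min_lr`.
/// Names and signature are illustrative assumptions, not the trainer crate's API.
pub fn cosine_lr(
    step: usize,
    warmup_steps: usize,
    total_steps: usize,
    max_lr: f64,
    min_lr: f64,
) -> f64 {
    if warmup_steps > 0 && step < warmup_steps {
        // Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) as f64 / warmup_steps as f64;
    }
    if step >= total_steps {
        // Past the end of the schedule, stay at the floor.
        return min_lr;
    }
    // Fraction of the decay phase completed, in [0, 1).
    let decay_steps = (total_steps - warmup_steps).max(1);
    let progress = (step - warmup_steps) as f64 / decay_steps as f64;
    // Cosine factor goes from 1.0 (start of decay) to 0.0 (end of decay).
    let cosine = 0.5 * (1.0 + (PI * progress).cos());
    min_lr + (max_lr - min_lr) * cosine
}
```

With 10 warmup steps out of 100, the rate climbs linearly to `max_lr`, sits halfway between `max_lr` and `min_lr` at the midpoint of the decay phase, and ends at `min_lr`.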
| Feature | Description | Crate |
|---|---|---|
| NTK-Aware RoPE Scaling | Extends context beyond training length (e.g., 4K → 1M) by scaling base frequency while keeping local features intact. | claude-core |
| Sliding Window Attention | O(N × W) attention instead of O(N²) — each token attends to a local window + global sink tokens. | claude-core |
| Evicting KV Cache | Ring-buffer KV cache with sink token pinning — bounded memory, graceful eviction of middle tokens. | claude-core |
| Attention Sink Tokens | First N tokens are pinned as global context anchors, stabilizing attention quality over long contexts. | claude-core |
| Safetensors Support | Integration with the Safetensors format for lightweight, memory-mapped weight loading. | claude-core |
| Agentic Tool Calling | Intercepts `<tool_call>` generation to read/write files and list directories. | agent |
| Terminal Execution | Native sub-process execution allowing the model to run bash/powershell commands on the system. | agent |
| Streaming Inference | SSE combined with Ratatui for real-time, char-by-char output streaming. | inference / claude-tui |
| Custom BPE Logic | Fully internal tokenizer implementation without external Python dependencies. | tokenizer |
| INT8/Q4 Quantization | Symmetric INT8 and per-group 4-bit quantization for ~4–8x model compression. | quant |
| Training Pipeline | AdamW + cosine LR scheduler + checkpoint save/resume + JSONL metrics logging. | trainer |
| RAG Pipeline | Text chunking, embeddings, flat cosine similarity search for retrieval-augmented generation. | retrieval |
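As an illustration of the symmetric INT8 scheme named in the table, the sketch below quantizes a weight slice with a single scale and maps it back. It is a simplified per-tensor version under stated assumptions: `quantize_int8` and `dequantize_int8` are hypothetical names, and the quant crate's per-group scaling and calibration are not reproduced here.

```rust
/// Symmetric INT8 quantization with one scale for the whole slice.
/// Returns the quantized values and the scale needed to dequantize.
/// Hypothetical helper for illustration; not the quant crate's API.
pub fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    // The largest magnitude maps to +/-127 so that zero stays exactly zero.
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q: Vec<i8> = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Inverse mapping: each stored byte is multiplied by the shared scale.
pub fn dequantize_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

Each weight shrinks from 4 bytes (`f32`) to 1 byte, which is where a roughly 4x figure comes from; the round-trip error per weight is at most half of `scale`. Per-group scaling applies the same idea to small blocks of weights, each with its own scale.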
The system achieves 1M token context through a 4-component hybrid architecture:
```mermaid
flowchart TB
    Input["Input Tokens"] --> Emb["Embedding"]
    Emb --> Block
    subgraph Block["Transformer Block × N"]
        direction TB
        LN1["RMSNorm"] --> Attn["Sliding Window Attention\n+ Sink Tokens"]
        Attn --> Res1["+ Residual"]
        Res1 --> LN2["RMSNorm"]
        LN2 --> MLP["MLP (GeLU)"]
        MLP --> Res2["+ Residual"]
    end
    ROPE["NTK-Aware RoPE\n(position scaling)"] -.-> Attn
    MASK["Sliding Window Mask\n(local + sink)"] -.-> Attn
    KVC["Evicting KV Cache\n(ring buffer + sink pinning)"] -.-> Attn
    Block --> LNF["Final RMSNorm"]
    LNF --> LMHead["LM Head"]
    LMHead --> Output["Logits"]
```
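The "Sliding Window Mask (local + sink)" node in the diagram can be sketched as a boolean mask. This is a minimal illustration, assuming a causal model in which each query sees the first `sinks` positions plus the most recent `window` positions (including itself); `can_attend` and `build_mask` are hypothetical names, and claude-core may represent the mask differently, for example as an additive tensor.

```rust
/// True if query position `q` may attend to key position `k`.
/// Assumes causal attention; names are illustrative, not claude-core's API.
pub fn can_attend(q: usize, k: usize, window: usize, sinks: usize) -> bool {
    if k > q {
        // Causal constraint: never look at future positions.
        return false;
    }
    // Either a pinned sink token or inside the local window ending at `q`.
    k < sinks || q - k < window
}

/// Dense n-by-n mask, row = query position, column = key position.
pub fn build_mask(n: usize, window: usize, sinks: usize) -> Vec<Vec<bool>> {
    (0..n)
        .map(|q| (0..n).map(|k| can_attend(q, k, window, sinks)).collect())
        .collect()
}
```

Every row has at most `window + sinks` allowed keys, so the attention work per token is bounded by a constant rather than growing with sequence length. A dense mask is used here only for readability; a long-context implementation would avoid materializing an N × N matrix.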
| Component | Problem Solved | Without It |
|---|---|---|
| NTK RoPE Scaling | Model sees positions > training length | Garbage output past training context |
| Sliding Window Attention | O(N²) → O(N × W) compute | OOM at ~100K tokens |
| Evicting KV Cache | Memory bounded, constant VRAM | OOM at ~50K tokens |
| Sink Tokens | Attention quality stays stable | Model loses coherence over long contexts |
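The eviction policy described for the KV cache (pinned sink tokens plus a bounded buffer of recent tokens) can be sketched with a generic container. This is an illustrative model only: `EvictingCache` is a hypothetical type that stores one opaque entry per token, whereas the real cache would hold key/value tensors per layer and head.

```rust
use std::collections::VecDeque;

/// Bounded cache: the first `n_sinks` entries are pinned forever,
/// the rest live in a FIFO buffer that drops its oldest entry when full.
/// Hypothetical sketch, not the claude-core implementation.
pub struct EvictingCache<T> {
    sinks: Vec<T>,
    recent: VecDeque<T>,
    n_sinks: usize,
    capacity: usize,
}

impl<T> EvictingCache<T> {
    pub fn new(capacity: usize, n_sinks: usize) -> Self {
        assert!(n_sinks < capacity, "capacity must leave room for recent entries");
        Self {
            sinks: Vec::with_capacity(n_sinks),
            recent: VecDeque::with_capacity(capacity - n_sinks),
            n_sinks,
            capacity,
        }
    }

    /// Append one token's entry, evicting the oldest non-sink entry if needed.
    pub fn push(&mut self, entry: T) {
        if self.sinks.len() < self.n_sinks {
            // The earliest tokens become sinks and are never evicted.
            self.sinks.push(entry);
            return;
        }
        if self.recent.len() == self.capacity - self.n_sinks {
            // Buffer is full: drop the oldest entry that is not a sink.
            self.recent.pop_front();
        }
        self.recent.push_back(entry);
    }

    /// Number of entries currently held; never exceeds `capacity`.
    pub fn len(&self) -> usize {
        self.sinks.len() + self.recent.len()
    }

    /// Entries in attention order: sinks first, then recent tokens.
    pub fn iter(&self) -> impl Iterator<Item = &T> {
        self.sinks.iter().chain(self.recent.iter())
    }
}
```

Pushing tokens 0 through 9 into a cache with capacity 5 and 2 sinks leaves tokens 0, 1, 7, 8, 9: the sinks survive, the middle of the sequence is evicted, and memory stays constant regardless of how many tokens are generated.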
| Context Length | GPU VRAM | Speed |
|---|---|---|
| 32K | ~24 GB | Fast |
| 128K | ~28 GB | Moderate |
| 512K | ~32 GB | Slow |
| 1M | ~40–48 GB | Very slow but feasible |
- Rust 1.70+
- LibTorch (Managed automatically via the tch-rs crate)
Model configuration supports both standard and long-context setups:
```yaml
# configs/model_config.yaml
n_embd: 768
n_head: 12
n_layer: 12
vocab_size: 50257
max_seq_len: 2048
original_max_seq_len: 2048     # Training context length
rope_base: 10000.0             # RoPE base frequency
window_size: 2048              # Sliding window size
sink_tokens: 4                 # Global context anchors
kv_cache_capacity: 2048        # Max KV cache tokens
```

For 1M context, see configs/million_context.toml.
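To show how `rope_base`, `original_max_seq_len`, and `max_seq_len` could interact, here is a sketch of the commonly cited NTK-aware formulation, in which the base is multiplied by `scale^(d / (d - 2))` for head dimension `d`. Whether claude-core uses exactly this formula is an assumption; `ntk_scaled_base` and `inv_frequencies` are hypothetical names. With the config above, the head dimension would be `n_embd / n_head = 64`.

```rust
/// NTK-aware adjustment of the RoPE base for a longer target context.
/// Uses the widely used rule base' = base * scale^(d / (d - 2));
/// an assumption about the method, not a copy of claude-core's code.
pub fn ntk_scaled_base(
    rope_base: f64,
    original_len: usize,
    target_len: usize,
    head_dim: usize,
) -> f64 {
    assert!(head_dim > 2 && head_dim % 2 == 0, "head_dim must be even and > 2");
    // No scaling is applied when the target fits in the training context.
    let scale = (target_len as f64 / original_len as f64).max(1.0);
    let d = head_dim as f64;
    rope_base * scale.powf(d / (d - 2.0))
}

/// Rotary inverse frequencies: base^(-2i / d) for i in 0..d/2.
pub fn inv_frequencies(base: f64, head_dim: usize) -> Vec<f64> {
    (0..head_dim / 2)
        .map(|i| base.powf(-(2.0 * i as f64) / head_dim as f64))
        .collect()
}
```

Under this rule the highest-frequency pair (index 0) is unchanged, which preserves fine-grained local position information, while the lowest-frequency pair is slowed down by exactly the scale factor so that distant positions stay within the range seen during training.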
Use the provided PowerShell script to train the BPE tokenizer on a raw text corpus:

```powershell
./scripts/train_tokenizer.ps1
```

Start the Axum-based inference server:

```bash
cargo run -p inference --bin inference-server
```

Launch the terminal chat interface:

```bash
cargo run -p claude-tui
```

- Long-Context Architecture: NTK RoPE, Sliding Window Attention, Evicting KV Cache, Sink Tokens.
- Quantization: INT8 and 4-bit per-group quantization with calibration stats.
- Training Pipeline: Checkpoint save/resume, cosine LR scheduler, JSONL metrics, streaming data loader.
- RAG Pipeline: Text chunking, embedding, flat cosine similarity index.
- GQA (Grouped-Query Attention): Adding support for GQA to further reduce memory bandwidth usage.
- Continuous Batching: Refactoring the generator for high-throughput multi-request processing.
- Markdown Rendering: Integration of rich text formatting within the terminal interface.
This project is licensed under the MIT License.
