docs: Add ADR-017 and DDD for Craftsman Ultra 30b 1bit BitNet integration #151
…tion
Research and architecture documentation for integrating BitNet b1.58 ternary quantization with GLM-4.7-Flash 30B-A3B MoE architecture into the RuvLLM serving runtime. Includes phased approach (expert replacement → full distillation → native training), CPU inference kernel strategy (TL1/TL2/I2_S), domain model with 7 bounded contexts, and memory budget analysis targeting <10GB for 30B-class CPU-only inference.
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
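A quick back-of-envelope check of the <10GB target. This is a sketch only; the parameter count, 2-bit packed ternary encoding, and 256-weight block size for FP16 scales are assumptions for illustration, not figures taken from the repo:

```rust
/// Rough memory-budget estimate for a 30B-parameter model whose quantized
/// weights are stored as 2-bit packed ternary values plus per-block FP16 scales.
fn estimate_weight_bytes(params: f64, bits_per_weight: f64, block_size: f64) -> f64 {
    let packed = params * bits_per_weight / 8.0;   // ternary payload
    let scales = (params / block_size) * 2.0;      // one FP16 scale per block
    packed + scales
}

fn main() {
    // Assumed: 30e9 quantized weights, 2 bits each, 256-weight blocks.
    let gb = estimate_weight_bytes(30e9, 2.0, 256.0) / 1e9;
    println!("~{gb:.1} GB for ternary weights + scales"); // ≈ 7.7 GB
    // FP16 embeddings, LM head, and router weights add a few hundred MB more,
    // which is why the <10GB CPU-only budget in the ADR remains plausible.
}
```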
ADR-017 updates:
- Add RLM Training Stack reuse section (GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore — ~70% code reuse ratio)
- Add AD-11: GRPO-guided distillation with per-expert reward scaling
- Add AD-12: Contrastive pre-training for expert routing validation
- Add AD-13: EWC++ cross-expert stability during sequential distillation
- Add AD-14: PolicyStore TernaryScale per-layer policy persistence
- Add AD-15: MemoryDistiller trajectory tracking for distillation quality
- Add AD-16: Full pipeline composition with expert-parallel distillation
- Update Options C/D with RLM component mapping tables
- Update consequences, risks, and validation criteria

DDD v2.0 updates:
- Add bounded context 3.8: RLM Training Orchestration (70% reused)
- Add 13 ubiquitous language terms for RLM concepts
- Update context map with RLM relationships
- Update Quantization Pipeline to delegate to RLM Training
- Add 7 new domain events for GRPO, EWC, and distillation lifecycle
- Update module structure with reused vs new file annotations
- Add 6 RLM-specific integration tests
- Add 3 new open questions for RLM scaling

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
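For context on AD-13: elastic weight consolidation adds a quadratic penalty that anchors parameters important to previously distilled experts while the next expert is trained. A minimal sketch of the penalty term; the function and field names here are illustrative, not the RLM EwcRegularizer API:

```rust
/// EWC-style penalty: L_total = L_task + (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2
/// `fisher` approximates per-parameter importance estimated from earlier expert
/// distillation runs; `theta_star` holds the anchored parameter values.
fn ewc_penalty(theta: &[f32], theta_star: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    theta
        .iter()
        .zip(theta_star)
        .zip(fisher)
        .map(|((t, t0), f)| f * (t - t0).powi(2))
        .sum::<f32>()
        * (lambda / 2.0)
}
```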
…SIMD)
ADR-017: Add AD-17 with detailed memory budget analysis showing that per-expert distillation fits in an A100 40GB (~15.5GB), while the full model requires 4×A100 80GB (~430GB). CPU SIMD training is infeasible at 200B+ tokens (~65 years on AVX2). Recommend GCP 4×A100 spot instances (~$1,300 for Phase 1) or DataCrunch H100 ($1.99/hr). Includes cost comparison across 6 platforms, per-phase infrastructure mapping, and the required CUDA device dispatch code change for RealContrastiveTrainer.
DDD: Add section 8.5 Training Infrastructure Model with expert-parallel GPU topology diagram, what-runs-where matrix, and required code change summary.
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
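The "~65 years on AVX2" figure is throughput arithmetic. A sketch of the calculation; the ~100 tok/s CPU training throughput is an assumed ballpark, not a benchmark from the PR:

```rust
/// Wall-clock years needed to push `tokens` through training at `tok_per_sec`.
fn training_years(tokens: f64, tok_per_sec: f64) -> f64 {
    const SECS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0;
    tokens / tok_per_sec / SECS_PER_YEAR
}

fn main() {
    // 200B-token Phase 1 budget:
    println!("AVX2 CPU @ ~100 tok/s  : {:.0} years", training_years(200e9, 100.0));  // ~63
    println!("Mac Studio @ ~1000 tok/s: {:.1} years", training_years(200e9, 1000.0)); // ~6.3
    // Both are far outside any useful schedule, hence the cloud-GPU recommendation.
}
```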
Research post-training quantization feasibility for GLM-4.7-Flash as a low-cost ($100, 2-4 hrs) validation step before full distillation ($1,300+).

ADR-017 changes:
- Restructured Option A from "Rejected" to tiered PTQ analysis (0A-0D)
- Added AD-18: PT-BitNet post-training quantization strategy
- Updated phased decision to A(0C) → D → C → B
- Added Phase 0 exit criteria and validation benchmarks
- Documented existing community GGUFs (bartowski, unsloth, ngxson)
- Identified RuvLLM IQ1_S dequant gap (type 19 parsed, not implemented)
- Added PT-BitNet, BitDistill, and STBLLM references

DDD v2.1 changes:
- Added 6 Phase 0 ubiquitous language terms (PT-BitNet, BITNET_T158, etc.)
- Updated Section 3.4 with dual-mode quantization pipeline (PTQ + distillation)
- Updated compatibility matrix with Phase 0 vs Phase 1+ columns
- Added 3 new open questions (calibration corpus, GGUF type, weight migration)

Key finding: IQ1_S ≠ BitNet b1.58. Generic codebook PTQ produces garbled output; PT-BitNet absmean ternary quantization is viable for kernel validation.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
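For reference, PT-BitNet-style absmean ternary quantization is a per-tensor operation with a single scale. A minimal sketch; the function name and shape are illustrative, not necessarily the quantizer.rs API:

```rust
/// BitNet b1.58 absmean quantization: scale by the mean absolute weight,
/// then round-and-clip each weight to {-1, 0, +1}.
fn absmean_ternary(weights: &[f32]) -> (Vec<i8>, f32) {
    assert!(!weights.is_empty());
    let eps = 1e-6_f32;
    let gamma = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    let scale = gamma.max(eps);
    let ternary = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (ternary, scale) // dequant: w ≈ ternary[i] as f32 * scale
}
```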
Update AD-17 and AD-18 to reflect that Phase 0 post-training quantization runs entirely on Mac Studio (Apple Silicon) at zero cost, eliminating the need for cloud GPU for the prototype phase.

Key changes:
- Phase 0 cost updated from ~$100 (cloud) to $0 (local Mac Studio)
- AD-18 now includes Mac Studio config compatibility matrix (M4 Max 36-128GB, M3 Ultra 96-512GB) with wall time estimates per config
- Added mmap strategy: FP16 weights demand-paged from disk, per-tensor quantization uses ~2-4MB working memory regardless of model size
- Metal GPU calibration via existing Candle integration (use_metal: true)
- ARM NEON for TL1 kernel validation (same ISA as production target)
- Updated throughput table with Mac Studio entries and Phase 0 column
- PtBitnetConfig gains use_mmap, use_metal_calibration, max_memory_gb fields
- Phase 0 exit criteria updated for Mac Studio local execution
- Updated infrastructure table: Phase 0 + router validation both $0 local

Mac Studio is ideal for Phase 0 (PTQ in hours, $0) but still infeasible for Phase 1+ training (200B tokens at 500-1000 tok/s = 6.5 years). This separation validates the phased cloud-for-training approach.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
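A sketch of what the extended config could look like with the three new fields named above. The field names use_mmap, use_metal_calibration, and max_memory_gb come from the commit; the doc comments and default values are assumptions:

```rust
/// Phase 0 PTQ configuration as described above. Defaults sketch a Mac Studio
/// run: demand-paged FP16 source weights and Metal-backed calibration.
#[derive(Debug, Clone)]
pub struct PtBitnetConfig {
    /// Demand-page FP16 source weights from disk instead of loading them fully;
    /// per-tensor quantization then needs only a few MB of working memory.
    pub use_mmap: bool,
    /// Run calibration forward passes on Metal via the existing Candle integration.
    pub use_metal_calibration: bool,
    /// Soft ceiling for resident memory during quantization.
    pub max_memory_gb: f32,
}

impl Default for PtBitnetConfig {
    fn default() -> Self {
        // Assumed defaults for illustration only.
        Self { use_mmap: true, use_metal_calibration: true, max_memory_gb: 32.0 }
    }
}
```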
Add Phase 0.5: RLM Post-Quantization Refinement — a $0 Mac Studio approach that uses the existing RLM stack (MicroLoRA, GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore) to refine the Phase 0 PTQ model by training only FP16 components (~1-2% of params).

ADR-017 changes:
- Added Phase 0.5 to phased decision: A(0C) → RLM Refinement → D → C → B
- Added AD-19: RLM Post-Quantization Refinement architecture
- Frozen ternary weights + trainable FP16 (LoRA, router, scales)
- ~200-400M trainable params (1-2% of 30B), 100-500M training tokens
- 100% RLM code reuse, 0% new training code
- 2-12 days on Mac Studio Metal, $0 cost
- Expected quality: ~70-80% of FP16 (up from 55-65% Phase 0 PTQ)
- Full pipeline diagram: Router repair → MicroLoRA injection → Scale opt
- Memory budget analysis: ~12-20 GB active RAM (fits any Mac Studio)
- Training schedule: 3-14 days total wall time
- Added Phase 0.5 exit criteria (11 items)
- Updated infrastructure table with Phase 0.5 row
- Updated consequences with RLM refinement benefits

DDD v2.2 changes:
- Added Section 3.8.1: Phase 0.5 RLM Refinement Mode
- Added 5 ubiquitous language terms (RLM Refinement, Frozen Ternary, LoRA Correction, Router Repair)
- Added 3 open questions (LoRA rank, GGUF persistence, Phase continuity)

Key insight: RLM trains ~1% of parameters → needs ~0.25% of the data (100-500M vs 200B tokens) → Mac Studio Metal is sufficient → $0 cost.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
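The "~200-400M trainable params" claim is easy to sanity-check with LoRA parameter accounting. A sketch under assumed GLM-4.7-Flash-like dimensions; the layer count, hidden size, per-expert FFN width, expert count, and rank are illustrative guesses, not architecture facts from the PR:

```rust
/// LoRA adds two low-rank factors per adapted matrix: A (r × d_in) and B (d_out × r).
fn lora_params(d_in: u64, d_out: u64, rank: u64) -> u64 {
    rank * (d_in + d_out)
}

fn main() {
    // Assumed shapes, for illustration only:
    let layers = 47u64;   // MoE layers
    let hidden = 2048u64; // model dim
    let ffn = 1408u64;    // per-expert FFN dim
    let experts = 64u64;  // routed experts per layer
    let rank = 8u64;

    // Adapting gate/up/down projections of every expert:
    let per_expert = 2 * lora_params(hidden, ffn, rank) + lora_params(ffn, hidden, rank);
    let total = layers * experts * per_expert;
    println!("~{:.0}M LoRA params", total as f64 / 1e6); // ≈ 249M
    // With router weights and per-layer ternary scales also trainable, the total
    // lands in the low hundreds of millions, i.e. 1-2% of a 30B model.
}
```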
Analyze RLM training stack GPU dependencies and document that Phase 0.5 runs entirely on pure CPU SIMD (NEON on aarch64) without Metal GPU. MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer are all pure ndarray; ContrastiveTrainer has explicit CPU fallback. Only ~2-3x slower than Metal. Extends platform support to Linux ARM64 and x86 (scalar).
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
Add bitnet/ module with absmean ternary quantizer, TernaryTensor type, BITNET_T158 dequantization, and comprehensive test suite (~1600 lines).

Components:
- quantizer.rs: PtBitnetConfig, absmean_ternary(), quantize_tensor()
- ternary_tensor.rs: TernaryTensor, pack/unpack 2-bit ternary encoding
- dequantize.rs: dequantize_bitnet_t158(), block dequant, error metrics
- tests.rs: packing roundtrips, quantization correctness, edge cases
- gguf/quantization.rs: BitnetT158 = 30 enum variant, block_size, bytes

Implements AD-1 (weight representation), AD-5 (GGUF extension), AD-18 (PT-BitNet quantization) from ADR-017.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
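The pack/unpack encoding in ternary_tensor.rs stores four ternary values per byte. A minimal sketch of one such scheme; mapping {-1, 0, +1} to the codes {0, 1, 2} is an obvious choice for illustration, not necessarily the exact on-disk layout:

```rust
/// Pack ternary values {-1, 0, +1} into 2 bits each, 4 values per byte.
fn pack_ternary(vals: &[i8]) -> Vec<u8> {
    vals.chunks(4)
        .map(|chunk| {
            chunk.iter().enumerate().fold(0u8, |byte, (i, &v)| {
                let code = (v + 1) as u8; // -1→0, 0→1, +1→2
                byte | (code << (2 * i))
            })
        })
        .collect()
}

/// Inverse of `pack_ternary`; `len` recovers the original (possibly non-multiple-of-4) length.
fn unpack_ternary(packed: &[u8], len: usize) -> Vec<i8> {
    (0..len)
        .map(|i| {
            let code = (packed[i / 4] >> (2 * (i % 4))) & 0b11;
            code as i8 - 1
        })
        .collect()
}

#[test]
fn roundtrip() {
    let vals = [-1i8, 0, 1, 1, 0, -1, 1];
    assert_eq!(unpack_ternary(&pack_ternary(&vals), vals.len()), vals);
}
```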
…lter tests
Wire dequantize_bitnet_t158 into gguf/quantization.rs dequantize_block() and dequantize_tensor() match arms. Add block wrapper that extracts FP16 scale from interleaved GGUF format. Add 179 lines of layer filter tests validating AD-2 (router/embed/head stay FP16, expert FFN quantized).
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
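The AD-2 layer filter boils down to a predicate over tensor names. A hedged sketch; the substring patterns below follow common llama.cpp-style GGUF naming and are illustrative, not the exact filter in the PR:

```rust
/// AD-2 policy sketch: expert FFN weights go to BITNET_T158, while embeddings,
/// the LM head, the MoE router gate, and norms stay FP16.
fn should_quantize_to_ternary(tensor_name: &str) -> bool {
    // Tensors that must remain FP16 under AD-2 (illustrative name patterns).
    const KEEP_FP16: &[&str] = &[
        "token_embd",              // token embeddings
        "output.weight",           // LM head
        "output_norm",             // final norm
        "ffn_gate_inp",            // MoE router gate
        "attn_norm", "ffn_norm",   // per-layer norms
    ];
    !KEEP_FP16.iter().any(|p| tensor_name.contains(p))
}
```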
Generated test coverage analysis for the PT-BitNet quantizer module, documenting coverage across quantization, packing, dequantization, error metrics, and layer filtering.
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
Research bitnet.cpp Rust port strategy: R3-Engine demonstrates 100% Safe Rust with dual targets (native AVX-512 + WASM SIMD128) achieving 80-117 tok/s. Recommend Approach C (reference R3-Engine patterns) over Python codegen. WASM SIMD128 maps the TL1 LUT to v128.swizzle for an estimated ~20-40 tok/s in the browser. Resolves open question #5 (WASM viability). Adds 6 new references, 5 new DDD terms, and 3 new open questions. DDD updated to v2.4.
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
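The "TL1 LUT → v128.swizzle" mapping means using the 16-lane byte shuffle as sixteen parallel table lookups, which matches the access pattern of a 16-entry TL1 ternary-pair table. A minimal sketch of that idea only; this is illustrative and not the tl1_wasm.rs implementation:

```rust
/// Sixteen LUT lookups at once on wasm32: each lane of `indices` (values 0..16)
/// selects one byte of the 16-entry table; lanes >= 16 yield 0.
#[cfg(target_arch = "wasm32")]
fn lut16_lookup(lut: [u8; 16], indices: [u8; 16]) -> [u8; 16] {
    use core::arch::wasm32::{u8x16_swizzle, v128};
    // v128 is a plain 16-byte SIMD value, so a byte-array transmute is a
    // convenient way to move data in and out in a sketch like this.
    let table: v128 = unsafe { core::mem::transmute(lut) };
    let idx: v128 = unsafe { core::mem::transmute(indices) };
    let gathered = u8x16_swizzle(table, idx);
    unsafe { core::mem::transmute(gathered) }
}
```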
…port, RLM refiner
Phase 0 + 0.5 implementation (4,283 lines across 6 new files):
- tl1_kernel.rs (879L): TL1 ternary GEMV with NEON SIMD + scalar fallback, INT8 activation quantization (absmax), LUT generation, 17 tests
- backend.rs (1,179L): full BitNetBackend implementing LlmBackend trait, GGUF model loading, MoE router (softmax gate + top-K), expert FFN (SwiGLU via TL1 GEMV), RMSNorm, embedding/LM head, 12 tests
- gguf_export.rs (662L): GGUF v3 writer for BITNET_T158, FP16 conversion, model export with BitNet metadata, validation, 8 tests
- rlm_refiner.rs (696L): Phase 0.5 orchestrator wiring MicroLoRA + EWC++ + GRPO + ContrastiveTrainer, SIMD-only mode (AD-20), checkpointing, 10 tests
- tl1_avx2.rs (414L): AVX2 SIMD kernel variant (x86_64 conditional)
- tl1_wasm.rs (453L): WASM SIMD128 kernel variant (wasm32 conditional)

All 72 bitnet tests pass. Fixed 2 pre-existing compilation errors in autodetect.rs and kernels/mod.rs.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
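Of the pieces listed above, the absmax INT8 activation quantization is the simplest to make concrete. A minimal sketch of the activation side of a TL1-style GEMV; function name and shapes are illustrative, not the tl1_kernel.rs code:

```rust
/// Symmetric per-vector INT8 quantization of activations: scale by max |x|
/// so the largest value maps to ±127.
fn quantize_activations_absmax(x: &[f32]) -> (Vec<i8>, f32) {
    let absmax = x.iter().fold(0.0_f32, |m, v| m.max(v.abs())).max(1e-8);
    let scale = absmax / 127.0;
    let q = x
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale) // dequant: x ≈ q[i] as f32 * scale
}
```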
Agent refinements to tl1_avx2.rs and tl1_wasm.rs — cleanup of unused imports and linter warnings. https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
Defines three ship/no-ship gates:
- Gate 1: Routing correctness (>= 85% teacher agreement)
- Gate 2: Citation correctness (precision >= 90%, recall >= 70%)
- Gate 3: Refusal calibration (F1 >= 0.85)

Includes JSONL trace schema, auto-labeling strategy using RuVector signals (redundancy, cluster disagreement, mincut fragility), and go/no-go rule requiring all gates to pass on same prompt suite run.
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
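A sketch of how the three gates compose into one go/no-go decision. The thresholds mirror the list above; the struct, method, and field names are illustrative rather than the eval.rs API:

```rust
/// Ship/no-ship gates from the eval plan: all three must pass on the same run.
struct GateResults {
    routing_agreement: f64, // fraction of prompts where student routing matches teacher
    citation_precision: f64,
    citation_recall: f64,
    refusal_f1: f64,
}

impl GateResults {
    fn gate1_routing(&self) -> bool { self.routing_agreement >= 0.85 }
    fn gate2_citation(&self) -> bool { self.citation_precision >= 0.90 && self.citation_recall >= 0.70 }
    fn gate3_refusal(&self) -> bool { self.refusal_f1 >= 0.85 }

    /// Go/no-go rule: every gate must pass on the same prompt-suite run.
    fn go(&self) -> bool {
        self.gate1_routing() && self.gate2_citation() && self.gate3_refusal()
    }
}

/// F1 from precision and recall, as used for the refusal gate.
fn f1(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 { 0.0 } else { 2.0 * precision * recall / (precision + recall) }
}
```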
AD-23: Phase-1 Distillation via External GPU Teacher Artifacts
- One-time GPU job produces behavioral artifacts (routing traces, sparse logits, preference labels) — not trained weights
- CPU-only refinement: router repair, LoRA correction, EWC++, policy optimization using teacher artifacts
- Acceptance criteria: 200-prompt suite, all 3 behavioral gates, stability under 10% corpus perturbation

expert_cache.rs: MoE expert hot-set caching (new file)
- ExpertCache with LRU/LFU/Adaptive eviction policies
- MoeBatchScheduler: reorder token execution by expert for cache reuse
- Prefetcher trait for future platform-specific prefetch intrinsics
- 12 tests (92/92 bitnet tests pass)

DDD v2.5: 6 new ubiquitous language terms (Teacher Artifact, Behavioral Distillation, Router Repair, Sparse Logits, Corpus Perturbation) and 4 new open questions (#27-30) for Phase-1 operability.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
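The MoeBatchScheduler idea is simply to visit tokens grouped by routed expert, so each expert's weights are touched once per batch instead of being evicted and reloaded. A minimal sketch of that reordering; it is illustrative, not the expert_cache.rs implementation:

```rust
use std::collections::BTreeMap;

/// Reorder token indices so that all tokens routed to the same expert are
/// processed consecutively, maximizing reuse of the expert currently resident
/// in the cache.
fn schedule_by_expert(token_experts: &[usize]) -> Vec<usize> {
    let mut by_expert: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for (token_idx, &expert) in token_experts.iter().enumerate() {
        by_expert.entry(expert).or_default().push(token_idx);
    }
    by_expert.into_values().flatten().collect()
}

#[test]
fn groups_tokens_by_expert() {
    // Tokens 0..5 routed to experts 2, 0, 2, 1, 0.
    assert_eq!(schedule_by_expert(&[2, 0, 2, 1, 0]), vec![1, 4, 3, 0, 2]);
}
```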
…rity hardening
New modules (4 files, 2,359 lines):
- rlm_embedder.rs (743L): RLM-style recursive sentence transformer with 3 variants (query-conditioned, corpus-conditioned, contradiction-aware twin), merge rule, BaseEmbedder/NeighborRetriever traits, 14 tests
- tokenizer.rs (418L): BPE tokenizer with GGUF vocab loading, encode/decode, special token handling, 10 tests
- trace.rs (554L): JSONL trace writer for routing, citation, refusal decisions, Jaccard similarity, manual JSON serialization, 10 tests
- eval.rs (644L): three behavioral gates (routing correctness >= 0.85, citation precision >= 0.90, refusal F1 >= 0.85), EvalSuite, 12 tests

Documentation:
- AD-24: RLM-Style Recursive Sentence Transformer Embedder — 3 variants, merge rule, training strategy, evaluation criteria, appliance fit
- DDD v2.6: 8 new ubiquitous language terms, 4 new open questions (#31-34)
- 3 new positive consequences (#31-33) for RLM embeddings

Security hardening (across 6 existing files):
- Path traversal validation in GGUF export
- Division-by-zero epsilon guards in quantizer
- Bounds validation on public function inputs
- NaN-safe softmax with -inf handling

138 tests pass, 0 compilation errors. Total bitnet module: 9,632 lines across 16 files.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
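The "NaN-safe softmax with -inf handling" item is the classic max-subtraction trick plus guards for fully masked rows. A minimal sketch of that hardening pattern, not the exact code from the PR:

```rust
/// Numerically safe softmax: treat NaN logits as masked, subtract the max before
/// exponentiating, and return an all-zero distribution instead of NaNs when every
/// input is -inf (fully masked).
fn softmax_safe(logits: &[f32]) -> Vec<f32> {
    let clean = |l: f32| if l.is_nan() { f32::NEG_INFINITY } else { l };
    let max = logits.iter().map(|&l| clean(l)).fold(f32::NEG_INFINITY, f32::max);
    if !max.is_finite() {
        return vec![0.0; logits.len()]; // all -inf (or empty): nothing to normalize
    }
    let exps: Vec<f32> = logits.iter().map(|&l| (clean(l) - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let inv = 1.0 / sum.max(f32::MIN_POSITIVE); // epsilon guard against division by zero
    exps.iter().map(|e| e * inv).collect()
}
```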
Implements AD-25 appliance deployment optimizations for the RLM recursive sentence transformer embedder targeting Raspberry Pi 5 + 7 STM32 coprocessors:
- Pi 5 config presets: pi5_optimized() (2-iter, 3-neighbor) and pi5_streaming() (1-iter)
- STM32 offload protocol: ComputeHash, FilterNeighbors, GateCheck, WatchdogPing, ScheduleReorder
- NullStm32 software fallback for development/cloud environments
- Batch embedding with per-chunk latency tracking and STM32 gate-checking
- Priority-scheduled batch embedding via STM32-driven reordering
- HashEmbedder: lightweight FNV-1a pseudo-embedder for testing/baseline
- FlatNeighborStore: in-memory neighbor retriever for small corpora (<100K chunks)
- EmbedderBenchmark: throughput, P95/P99 latency, peak memory reporting
- NEON-optimizable math: 4-element unrolled cosine_similarity, l2_normalize
- vec_accumulate_weighted and mean_embedding helpers
- 41 tests (27 new): STM32 protocol, batch, HashEmbedder, FlatNeighborStore, benchmark, integration

All 165 bitnet module tests pass.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
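The HashEmbedder baseline hashes text into a fixed-size pseudo-embedding with FNV-1a. A minimal sketch of that idea; the embedding dimension and per-dimension seeding scheme are illustrative, not the implementation in the PR:

```rust
/// FNV-1a 64-bit hash (standard offset basis and prime).
fn fnv1a(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0xcbf2_9ce4_8422_2325u64, |h, &b| {
        (h ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3)
    })
}

/// Deterministic pseudo-embedding for testing/baselines: hash the text once per
/// dimension with a different seed byte, map to [-1, 1], then L2-normalize.
fn hash_embed(text: &str, dim: usize) -> Vec<f32> {
    let mut v: Vec<f32> = (0..dim)
        .map(|i| {
            let mut bytes = text.as_bytes().to_vec();
            bytes.push(i as u8); // cheap per-dimension seed (dim assumed <= 256 here)
            (fnv1a(&bytes) as f32 / u64::MAX as f32) * 2.0 - 1.0
        })
        .collect();
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    v.iter_mut().for_each(|x| *x /= norm);
    v
}
```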
…kend
Resolves the three blocking gaps that prevented end-to-end inference:

1. **Real attention layer** (was pass-through placeholder):
- AttentionWeights struct with Q/K/V/O ternary projections
- GQA (Grouped Query Attention) with configurable num_heads / num_kv_heads
- Pre-computed RoPE cos/sin tables (apply_rope)
- Per-layer KV cache for autoregressive generation
- forward_token() for efficient single-token inference with cache
- forward_layer_cached() with full attention computation
- forward_layer_nocache() legacy path for backwards compatibility

2. **Tokenizer integration** (was raw bytes → token IDs):
- load_tokenizer_from_gguf() extracts vocab + merges from GGUF metadata
- Byte-level fallback tokenizer (260 tokens) when GGUF has no vocab
- TokenizerBridge implements crate-level Tokenizer trait
- tok() accessor for direct tokenizer access

3. **generate() uses tokenizer** (was returning [token_id] strings):
- Encodes prompt via BPE tokenizer before forward pass
- Decodes generated tokens back to text
- generate_cached() for KV-cached autoregressive generation
- get_embeddings() now uses tokenizer for text encoding
- reset_cache() to clear KV state between sequences

Tests: 174/174 bitnet tests pass (9 new: RoPE, KV cache, tokenizer roundtrip, attention weights, byte-level fallback, cache operations)

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
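For reference, the pre-computed RoPE tables amount to a pairwise rotation of query/key dimensions by a position-dependent angle. A minimal sketch of what apply_rope does, computing the angles inline rather than from cached tables; names and the interleaved pair layout are illustrative, not the exact backend.rs code:

```rust
/// Rotary position embedding: rotate consecutive (even, odd) pairs of a head
/// vector by an angle that depends on the token position and the pair index.
/// In the real backend the cos/sin values are pre-computed per (position, pair).
fn apply_rope(x: &mut [f32], pos: usize, theta_base: f32) {
    let half = x.len() / 2;
    for i in 0..half {
        let freq = theta_base.powf(-2.0 * i as f32 / x.len() as f32);
        let angle = pos as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}
// Typical call (base 10_000 is the conventional default): apply_rope(&mut q_head, position, 10_000.0);
```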
… validation
- TensorNameMapper resolves both llama.cpp (blk.*) and HuggingFace (model.layers.*) naming
- MLA (Multi-Head Latent Attention) with low-rank Q/KV compression (DeepSeek-V2 style)
- Stacked 3D expert tensor support (ffn_gate_exps → per-expert slicing)
- Shared expert + dense layer-0 support (MoeWithShared/Dense/Moe layer types)
- Updated BitNetModelConfig defaults to match GLM-4.7-Flash architecture
- Tensor discovery and model validation harness for GGUF files
- 188 passing tests (14 new)
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
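A sketch of the kind of mapping TensorNameMapper performs between HuggingFace and llama.cpp naming. The specific name pairs below follow common conventions for the two ecosystems and are illustrative only, not a verified or exhaustive table for GLM-4.7-Flash:

```rust
/// Map a HuggingFace-style tensor name to a llama.cpp-style GGUF name.
/// Only a few representative embedding/attention/norm tensors are shown.
fn hf_to_gguf(name: &str) -> Option<String> {
    let layer_prefix = "model.layers.";
    if !name.starts_with(layer_prefix) {
        return match name {
            "model.embed_tokens.weight" => Some("token_embd.weight".into()),
            "lm_head.weight" => Some("output.weight".into()),
            _ => None,
        };
    }
    let rest = &name[layer_prefix.len()..];
    let (idx, suffix) = rest.split_once('.')?;
    let mapped = match suffix {
        "self_attn.q_proj.weight" => "attn_q.weight",
        "self_attn.k_proj.weight" => "attn_k.weight",
        "self_attn.v_proj.weight" => "attn_v.weight",
        "self_attn.o_proj.weight" => "attn_output.weight",
        "input_layernorm.weight" => "attn_norm.weight",
        _ => return None,
    };
    Some(format!("blk.{idx}.{mapped}"))
}
```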
…pressed MLA KV cache
- Streaming generation API (generate_streaming) with per-token callback, early stopping, and GenerationStats for throughput metrics
- ExpertPredictor: transition-matrix based predictor that learns from routing history to predict next experts with Laplace smoothing
- CompressedMlaCache: stores compressed latents (c_kv + k_pe) instead of full K/V, achieving ~17.8x memory reduction for GLM-4.7-Flash
- 15 new tests (203 total bitnet tests, all passing)
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
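The ExpertPredictor's transition matrix with Laplace smoothing fits in a few lines. A sketch of the core idea; the type and method names are illustrative, not the implementation from this PR:

```rust
/// Count expert→expert transitions observed in routing history, then turn the
/// counts into Laplace-smoothed probabilities so unseen transitions still get a
/// small nonzero likelihood.
struct ExpertTransitions {
    counts: Vec<Vec<f64>>, // counts[from][to]
    num_experts: usize,
}

impl ExpertTransitions {
    fn new(num_experts: usize) -> Self {
        Self { counts: vec![vec![0.0; num_experts]; num_experts], num_experts }
    }

    /// Feed a routing history: the sequence of experts chosen for consecutive tokens.
    fn observe(&mut self, expert_sequence: &[usize]) {
        for pair in expert_sequence.windows(2) {
            self.counts[pair[0]][pair[1]] += 1.0;
        }
    }

    /// P(next = to | current = from) with add-one (Laplace) smoothing.
    fn prob(&self, from: usize, to: usize) -> f64 {
        let row_total: f64 = self.counts[from].iter().sum();
        (self.counts[from][to] + 1.0) / (row_total + self.num_experts as f64)
    }

    /// Most likely next expert, i.e. the prefetch target.
    fn predict_next(&self, from: usize) -> usize {
        (0..self.num_experts)
            .max_by(|&a, &b| self.prob(from, a).partial_cmp(&self.prob(from, b)).unwrap())
            .unwrap_or(0)
    }
}
```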
…ed SwiGLU, and zero-alloc paths
- Wire AVX2 TL1 GEMV SIMD dispatch into backend hot path via tl1_avx2 module with scalar LUT fallback for non-x86_64 platforms
- Add ScratchPool with 17 pre-allocated FP32 buffers for zero-alloc forward pass
- Fuse SwiGLU gate+up projections with 4-wide unrolled loop and unsafe indexing
- Optimize RMSNorm with 4-way parallel accumulator and fused scale pass
- Optimize softmax with reciprocal multiply instead of per-element division
- Optimize fp32_matvec_transposed with 4-wide unrolled dot product
- Optimize GQA attention with 4-wide unrolled score computation and skip for negligible weights
- Add routing history tracking via Mutex<Vec<Vec<usize>>> for expert prediction (interior mutability preserves LlmBackend Send+Sync trait compatibility)
- Pre-allocate KV caches (512 positions) in load_gguf()
- Add tl1_gemv_into() for zero-allocation GEMV into caller-provided buffers

All 203 bitnet tests pass.
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
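The fused SwiGLU step combines the gate activation and the element-wise multiply with the up projection in a single pass over the two projection outputs. A minimal scalar sketch of that fusion; the 4-wide unrolling and unsafe indexing mentioned above are deliberately omitted for clarity:

```rust
/// SwiGLU gating fused into one loop: out[i] = silu(gate[i]) * up[i],
/// where silu(x) = x * sigmoid(x). `gate` and `up` are the two FFN projections
/// of the same hidden state; the result then feeds the down projection.
fn swiglu_fused_into(gate: &[f32], up: &[f32], out: &mut [f32]) {
    debug_assert!(gate.len() == up.len() && up.len() == out.len());
    for i in 0..gate.len() {
        let g = gate[i];
        let silu = g / (1.0 + (-g).exp());
        out[i] = silu * up[i];
    }
}
```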
… tests
- Wire ExpertPredictor into MoE forward path: predicts likely-next experts
from routing history and issues software prefetch hints (volatile read of
first cache line of predicted expert gate_proj weights) before routing runs
- Rebuild predictor every 16 tokens from routing history (amortized cost)
- Fix routing history tracking to target first MoE layer (config.first_k_dense_replace)
instead of hardcoded layer_idx==0 (layer 0 is Dense in GLM-4.7-Flash)
- Integrate CompressedMlaCache as configurable mode (set_compressed_kv):
  stores only c_kv + k_pe (576 dims) instead of full K/V (10240 dims) per
  position (~17.8x memory reduction; see the sketch below), recomputing K_nope and V during attention
- Add mla_caches field initialized per-layer in load_gguf(), cleared in reset_cache()
- Add 13 new tests (216 total, all passing):
- E2E: forward produces logits, forward_token with KV cache, determinism,
different tokens give different logits, expert predictor builds from inference,
cache reset, compressed KV toggle, scratch pool allocation
- Benchmarks: forward_token throughput, TL1 GEMV dispatch, RMSNorm, softmax,
expert_forward performance
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
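Two details from the bullets above are worth making concrete: the prefetch hint is literally a volatile read of the predicted expert's first cache line, and the ~17.8x cache reduction is pure dimension arithmetic. A minimal sketch; the buffer contents and the bytes-per-element assumptions are illustrative and cancel out of the ratio anyway:

```rust
/// Software prefetch hint as described above: a volatile read of the first byte
/// of a predicted expert's gate_proj weights pulls its first cache line into
/// cache before routing actually selects an expert.
fn prefetch_hint(weights: &[u8]) {
    if let Some(first) = weights.first() {
        let _ = unsafe { std::ptr::read_volatile(first as *const u8) };
    }
}

fn main() {
    // The ~17.8x figure follows from the per-position dimensions cited above:
    let full_kv_dims = 10_240.0_f64; // full K + V per position
    let compressed_dims = 576.0_f64; // c_kv latent + k_pe (compressed MLA cache)
    println!("KV compression ≈ {:.1}x", full_kv_dims / compressed_dims); // ≈ 17.8
    // Trade-off: K_nope and V must be recomputed from the latent during attention.
    prefetch_hint(&[0u8; 64]); // dummy buffer, illustration only
}
```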