GlassBoxAI-Transformer

Large Language Model Inference Suite

GPU-Accelerated Transformer Implementation with Formal Verification

Overview

GlassBoxAI-Transformer is a suite of Large Language Model (LLM) inference implementations featuring:

Multiple GPU backends: CUDA and OpenCL acceleration, with Metal in development
Multiple language implementations: C++ and Rust
GGUF model format support: Load quantized models from llama.cpp ecosystem
Quantization support: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0 formats with GPU-accelerated dequantization
Facade pattern architecture: Clean API separation with deep introspection capabilities
Formal verification: 196 Kani-verified proof harnesses for memory safety guarantees
CISA/NSA Secure by Design alignment: Built following government cybersecurity guidance (12 of 15 requirements verified)

This project demonstrates enterprise-grade software engineering practices including comprehensive testing, formal verification, cross-platform compatibility, and security-first development.

Features

Core Capabilities

Feature	Description
GGUF Model Loading	Native support for llama.cpp GGUF format
Quantized Inference	GPU-accelerated Q2_K through Q8_0 dequantization
Multi-Head Attention	Grouped Query Attention (GQA) support
RoPE Embeddings	Rotary Position Embeddings with scaling
KV Cache	Efficient key-value caching for autoregressive generation
BPE Tokenization	Byte-Pair Encoding with chat template support
Sampling Methods	Temperature, Top-K, Top-P (nucleus) sampling
Streaming Output	Token-by-token generation output
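
As a rough illustration of how the Temperature, Top-K, and Top-P settings interact, the sketch below filters a logit vector and draws one token. It is not the project's sampler: the function name `sample_token` and the filter order (Top-K first, then Top-P) are assumptions made for this example.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Illustrative sampler (hypothetical helper, not the project's API)."""
    if temperature <= 0.0:
        # Treat a zero temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]

    # Top-K: keep only the k most probable tokens.
    probs.sort(key=lambda item: item[1], reverse=True)
    if top_k > 0:
        probs = probs[:top_k]

    # Top-P: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in probs:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize over the survivors and draw one token.
    mass = sum(p for _, p in kept)
    r = rng.random() * mass
    for idx, p in kept:
        r -= p
        if r <= 0.0:
            return idx
    return kept[-1][0]

print(sample_token([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```

Lower temperatures sharpen the distribution before filtering, so Top-K and Top-P then discard fewer plausible candidates.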

Distributed Inference (Layer 2 Ethernet)

Feature	Description
DTX Protocol	Custom Layer 2 Ethernet protocol (EtherType 0x9998)
Raw Socket Communication	Direct Ethernet frame transmission for minimal latency
Distributed Layer Offloading	Offload transformer layers to remote GPU nodes
Server/Client Architecture	TransformerServer and TransformerClient components
Multi-Client Support	Handle multiple concurrent inference clients
Chunked Tensor Transfer	Efficient large tensor transmission with checksums

CPU/GPU Offloading

Feature	Description
Mixed Device Execution	Run specific layers on CPU while others run on GPU
Layer-by-Layer Control	Specify exactly which layers to offload
Memory Optimization	Offload layers when GPU memory is limited
Automatic Fallback	CPU execution for unsupported operations

GPU Acceleration

Backend	Implementation	Performance	Status
CUDA	Native CUDA kernels with fused operations	Optimal for NVIDIA GPUs	✅ Stable
OpenCL	Cross-platform GPU kernels	AMD, Intel, NVIDIA support	✅ Stable
Metal	Apple GPU acceleration	M1/M2/M3 Mac support	🚧 In Development

Quantization Formats

Format	Bits/Weight	Description	Status
Q8_0	8.5	Simple 8-bit quantization	✅ Full support
Q6_K	6.6	6-bit K-quant	✅ Full support
Q5_K	5.5	5-bit K-quant	✅ Full support
Q4_K	4.5	4-bit K-quant (recommended)	✅ Full support
Q3_K	3.4	3-bit K-quant	✅ Full support
Q2_K	2.6	2-bit K-quant	⚠️ Experimental
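
The fractional Bits/Weight values come from block layouts: each block stores packed weights plus per-block scale data. The sketch below reproduces the table from block sizes as defined in ggml's block structs; treat those byte counts as assumptions to check against the ggml source, and the helper names as hypothetical.

```python
# (weights per block, bytes per block); sizes assumed from ggml's block structs.
BLOCK_LAYOUTS = {
    "Q8_0": (32, 34),    # 32 int8 values + one fp16 scale
    "Q6_K": (256, 210),
    "Q5_K": (256, 176),
    "Q4_K": (256, 144),
    "Q3_K": (256, 110),
    "Q2_K": (256, 84),
}

def bits_per_weight(fmt):
    """Average storage cost of one weight, including scale overhead."""
    weights, nbytes = BLOCK_LAYOUTS[fmt]
    return nbytes * 8 / weights

def estimated_size_gib(n_params, fmt):
    """Rough weight-only size; ignores tensors kept at higher precision."""
    return n_params * bits_per_weight(fmt) / 8 / 2**30

for fmt in BLOCK_LAYOUTS:
    print(f"{fmt}: {bits_per_weight(fmt):.4f} bits/weight")
print(f"7B parameters at Q4_K: ~{estimated_size_gib(7e9, 'Q4_K'):.1f} GiB")
```

The table rounds these to one decimal place (for example 6.5625 becomes 6.6 and 3.4375 becomes 3.4).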

Supported Models

GlassBoxAI-Transformer supports GGUF models from the llama.cpp ecosystem. Compatible model families include:

Model Family	Versions	Notes
Llama	Llama 2, Llama 3, Llama 3.1, Llama 3.2	Full support including all quantizations
Mistral	Mistral 7B, Mixtral 8x7B	MoE architectures supported
Qwen	Qwen 1.5, Qwen 2, Qwen 2.5	Including coder and chat variants
DeepSeek	DeepSeek v1, DeepSeek v2	⚠️ DeepSeek v3 not yet supported
Unsloth	All Unsloth fine-tuned models	Optimized GGUF exports from Unsloth

Note: Models must be in GGUF format. Convert other formats with llama.cpp's conversion script (convert_hf_to_gguf.py in current releases; older releases shipped convert.py), or download pre-quantized models from Hugging Face.

Safety & Security

Feature	Technology
Memory Safety	Rust ownership model
Formal Verification	196 Kani proof harnesses
Bounds Checking	Verified array access
Input Validation	CLI argument validation
CISA Compliance	12 of 15 requirements verified

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         GlassBoxAI-Transformer                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐  │
│  │  C++ CUDA    │  │ C++ OpenCL   │  │  C++ Metal   │  │   Rust CUDA     │  │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤  ├─────────────────┤  │
│  │ transformer  │  │ transformer- │  │ transformer_ │  │ rust_cuda/      │  │
│  │   .cu        │  │ opencl.cpp   │  │ metal.mm     │  │                 │  │
│  │ facaded_     │  │ facaded-     │  │ transformer_ │  │ facaded_rust_   │  │
│  │ transformer  │  │ transformer- │  │ kernels      │  │ cuda/           │  │
│  │   .cu        │  │ opencl.cpp   │  │ .metal       │  │  ├─ kani/       │  │
│  │              │  │              │  │              │  │  │  (196 proofs)│  │
│  │              │  │              │  │ 🚧 In Dev    │  │  └─ facade.rs   │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  └─────────────────┘  │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                        Shared Features                                  ││
│  │  • GGUF model format parsing with full metadata support                 ││
│  │  • K-quant dequantization (Q2_K through Q8_0)                           ││
│  │  • BPE tokenization with chat templates (Llama, ChatML, etc.)           ││
│  │  • Grouped Query Attention with RoPE                                    ││
│  │  • Consistent CLI interface across all implementations                  ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
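
The shared features above include Grouped Query Attention, which lets several query heads share one key/value head and so shrinks the KV cache. A minimal sketch of that mapping follows, assuming contiguous grouping (the usual convention in Llama-style models); `kv_head_for` and `kv_cache_bytes` are hypothetical helpers, not functions from this codebase.

```python
def kv_head_for(query_head, n_heads, n_kv_heads):
    """Map a query head to the KV head it shares (contiguous grouping assumed)."""
    if n_kv_heads <= 0 or n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a positive multiple of n_kv_heads")
    if not 0 <= query_head < n_heads:
        raise ValueError("query_head out of range")
    group_size = n_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values for every layer, KV head, and cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Example: 32 query heads sharing 8 KV heads (4 query heads per KV head).
mapping = [kv_head_for(h, 32, 8) for h in range(32)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]

# With fp16 values, 8 KV heads need a quarter of the cache that 32 would.
print(kv_cache_bytes(32, 8, 128, 2048) / kv_cache_bytes(32, 32, 128, 2048))  # 0.25
```

When the number of KV heads equals the number of attention heads, the mapping is the identity and the model behaves as ordinary multi-head attention.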

File Structure

GlassBoxAI-Transformer/
│
├── transformer.cu                  # C++ CUDA Transformer implementation
├── facaded_transformer.cu          # C++ CUDA Transformer with Facade pattern
├── agentic_transformer.cu          # C++ CUDA Transformer with agentic interface & CPU offloading
├── transformer-opencl.cpp          # C++ OpenCL Transformer implementation
├── facaded-transformer-opencl.cpp  # C++ OpenCL Transformer with Facade pattern
│
├── transformer_metal.mm            # C++ Metal Transformer (🚧 In Development)
├── transformer_kernels.metal       # Metal shader kernels (🚧 In Development)
│
├── transformer_gui.py              # Python/Qt GUI for interactive inference
│
├── rust_cuda/                      # Rust CUDA Transformer implementation
│   ├── Cargo.toml
│   └── src/
│       ├── main.rs                 # CLI entry point
│       ├── generator.rs            # Text generation logic
│       ├── model.rs                # Transformer model implementation
│       ├── gguf.rs                 # GGUF file parser
│       ├── tokenizer.rs            # BPE tokenizer
│       ├── quant.rs                # Dequantization routines
│       ├── kernels.rs              # CUDA kernel definitions
│       ├── error.rs                # Error types
│       ├── network.rs              # TransformerServer/Client for distributed inference
│       └── protocol.rs             # DTX Layer 2 Ethernet protocol implementation
│
├── facaded_rust_cuda/              # Rust CUDA Transformer with Facade pattern
│   ├── Cargo.toml
│   └── src/
│       ├── main.rs                 # CLI entry point
│       ├── facade.rs               # Introspection facade API
│       ├── model.rs                # Transformer model
│       ├── gguf.rs                 # GGUF file parser
│       ├── tokenizer.rs            # BPE tokenizer
│       ├── quant.rs                # Dequantization routines
│       ├── kernels.rs              # CUDA kernel definitions
│       ├── lora.rs                 # LoRA adapter implementation
│       ├── trainer.rs              # GPU training infrastructure
│       ├── error.rs                # Error types
│       └── kani/                   # Formal verification proofs (196 total)
│           ├── mod.rs              # Module index
│           ├── bounds.rs           # Bounds checking proofs (8)
│           ├── arithmetic.rs       # Arithmetic safety proofs (11)
│           ├── memory.rs           # Memory safety proofs (9)
│           ├── panics.rs           # No-panic proofs (12)
│           ├── enums.rs            # Enum exhaustion proofs (8)
│           ├── floats.rs           # Float safety proofs (11)
│           ├── tokenizer.rs        # Tokenizer proofs (12)
│           ├── quant.rs            # Quantization proofs (15)
│           ├── model.rs            # Model proofs (13)
│           ├── trainer.rs          # Training proofs (25)
│           ├── lora.rs             # LoRA proofs (70+)
│           └── README.md           # Verification documentation
│
├── models/                         # Model storage directory
│   └── *.gguf                      # GGUF model files
│
├── transformer_tests_cuda.sh       # CUDA test suite
├── transformer_tests_opencl.sh     # OpenCL test suite
│
├── license.md                      # MIT License
└── README.md                       # This file

Prerequisites

Required

Dependency	Version	Purpose
GCC/G++	11+	C++ compilation
CUDA Toolkit	12.0+	CUDA compilation
Rust	1.75+	Rust compilation

Optional

Dependency	Version	Purpose
OpenCL SDK	3.0	OpenCL compilation
Xcode	15+	Metal compilation (macOS)
Kani	0.67+	Formal verification

Installation & Compilation

C++ CUDA Implementation

# Standard Transformer
nvcc -O2 -arch=sm_86 -o transformer-cuda transformer.cu

# Facade Transformer
nvcc -O2 -arch=sm_86 -o facaded-transformer-cuda facaded_transformer.cu

Note: Adjust -arch=sm_XX to match your GPU's compute capability (e.g., sm_75 for Turing, sm_80 for A100-class Ampere, sm_86 for RTX 30-series Ampere, sm_89 for Ada Lovelace).

C++ OpenCL Implementation

# Standard Transformer
g++ -O2 -std=c++17 -o transformer-opencl transformer-opencl.cpp -lOpenCL

# Facade Transformer
g++ -O2 -std=c++17 -o facaded-transformer-opencl facaded-transformer-opencl.cpp -lOpenCL

C++ Metal Implementation (🚧 In Development)

# Requires macOS with Xcode command line tools
clang++ -O2 -std=c++17 -framework Metal -framework Foundation \
    -o transformer-metal transformer_metal.mm

Note: The Metal implementation is currently in active development. Basic functionality works but some features may be incomplete.

Rust CUDA Implementation

# Standard Transformer (run from the repository root)
(cd rust_cuda && cargo build --release)

# Facade Transformer (run from the repository root)
(cd facaded_rust_cuda && cargo build --release)

Build All

# Build all CUDA implementations
nvcc -O2 -arch=sm_86 -o transformer-cuda transformer.cu
nvcc -O2 -arch=sm_86 -o facaded-transformer-cuda facaded_transformer.cu
(cd rust_cuda && cargo build --release)
(cd facaded_rust_cuda && cargo build --release)

# Build all OpenCL implementations
g++ -O2 -std=c++17 -o transformer-opencl transformer-opencl.cpp -lOpenCL
g++ -O2 -std=c++17 -o facaded-transformer-opencl facaded-transformer-opencl.cpp -lOpenCL

CLI Reference

Standard Transformer Commands

The standard transformer implementations provide core LLM inference functionality with distributed inference support.

Usage

transformer-cuda <command> [options]
transformer-opencl <command> [options]
rust_cuda/target/release/glassbox-transformer <command> [options]

Commands

Command	Description
`generate`	Text generation from GGUF model
`info`	Display model information
`server`	Run as distributed inference server
`client`	Run as distributed inference client
`benchmark`	Run network/inference benchmarks
`test`	Run built-in tests
`help`	Show help information

Server Mode Options

Option	Description
`-i, --interface <name>`	Network interface (default: eth0)
`-l, --layers <n>`	Total transformer layers (default: 12)
`-e, --embed <dim>`	Embedding dimension (default: 768)
`-f, --ffn <dim>`	FFN hidden dimension (default: 3072)
`-a, --heads <n>`	Number of attention heads (default: 12)
`-k, --kvheads <n>`	Number of KV heads for GQA (default: 12)
`-q, --seq-len <n>`	Sequence length (default: 512)
`-v, --vocab-size <n>`	Vocabulary size (default: 50257)
`-x, --max-seq-len <n>`	Maximum sequence length (default: 2048)
`-m, --messages <n>`	Max messages to process (default: 100)
`-g, --gpu <yes/no>`	GPU availability (default: yes)
`-c, --clients <n>`	Max concurrent clients (default: 4)
`--quant <type>`	Quantization type: none, q4_0, q4_1, q5_0, q5_1, q8_0, q2_k, q3_k, q4_k, q5_k, q6_k
`--rope-base <n>`	RoPE base frequency (default: 10000.0)
`--rope-scale <n>`	RoPE scaling factor (default: 1.0)
`--eps <n>`	Layer norm epsilon (default: 1e-5)
`--dropout <n>`	Dropout rate 0.0-1.0 (default: 0.0)
`--verbose`	Enable verbose output
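
The `--rope-base` and `--rope-scale` options feed the rotary angle computation. The sketch below shows one common formulation, assuming linear position scaling (position divided by the scale factor) and rotation of consecutive pairs; the implementations here may pair dimensions or apply scaling differently, and `rope_rotate` is a hypothetical name.

```python
import math

def rope_rotate(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) of one head vector."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("head dimension must be even")
    pos = position / scale  # linear scaling (assumed convention)
    out = [0.0] * dim
    for i in range(dim // 2):
        theta = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]
# Attention scores depend only on the relative offset between positions.
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 3))
s2 = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(s1 - s2) < 1e-9)  # True
```

Under this convention, a scale of 2.0 makes position 1000 produce the same angles as position 500 would at scale 1.0, which is how linear scaling stretches the usable context.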

Client Mode Options

Option	Description
`-i, --interface <name>`	Network interface (default: eth0)
`-s, --server <mac>`	Server MAC address (required, XX:XX:XX:XX:XX:XX)
`-l, --layers <n>`	Total transformer layers (default: 12)
`-r, --remote <n>`	Remote layers to execute (default: 6)
`--start-layer <n>`	Starting layer for remote execution (default: auto)
`-e, --embed <dim>`	Embedding dimension (default: 768)
`-f, --ffn <dim>`	FFN hidden dimension (default: 3072)
`-a, --heads <n>`	Number of attention heads (default: 12)
`-k, --kvheads <n>`	KV heads for GQA (default: 12)
`-q, --seq-len <n>`	Sequence length (default: 512)
`-v, --vocab-size <n>`	Vocabulary size (default: 50257)
`-x, --max-seq-len <n>`	Maximum sequence length (default: 2048)
`--quant <type>`	Quantization type (see server options)
`--rope-base <n>`	RoPE base frequency (default: 10000.0)
`--rope-scale <n>`	RoPE scaling factor (default: 1.0)
`--eps <n>`	Layer norm epsilon (default: 1e-5)
`--no-cache`	Disable activation caching
`--no-grad-cache`	Disable gradient caching
`--timeout <ms>`	Connection timeout (default: 5000ms)
`--retries <n>`	Connection retry count (default: 3)
`--verbose`	Enable verbose output

Benchmark Mode Options

Option	Description
`-i, --interface <name>`	Network interface (default: eth0)
`-s, --server <mac>`	Server MAC address (required)
`-n, --iterations <n>`	Benchmark iterations (default: 10)
`-l, --layers <n>`	Transformer layers to benchmark (default: 12)
`-e, --embed <dim>`	Embedding dimension (default: 768)
`-q, --seq-len <n>`	Sequence length (default: 512)
`--batch-size <n>`	Batch size for benchmarking (default: 1)
`--warmup <n>`	Warmup iterations (default: 2)
`--output <file>`	Output results to CSV file
`--verbose`	Enable verbose output

Test Mode Options

Option	Description
`--all`	Run all tests
`--protocol`	Test protocol handling
`--config`	Test configuration
`--quant`	Test quantization/dequantization
`--kernels`	Test CUDA kernels (requires GPU)
`--network`	Test network layer
`--verbose`	Enable verbose test output

Examples

# Generate text from a prompt
./transformer-cuda generate -m models/tinyllama.gguf -p "Hello, world" -n 100

# Interactive chat mode with GPU acceleration
./transformer-cuda generate -m models/llama-7b.Q4_K_M.gguf -i -g

# Show model information
./transformer-cuda info -m models/tinyllama.gguf

# Run tests
./transformer-cuda test --all

# CPU offloading - run layers 0,1,2 on CPU, rest on GPU
./transformer-cuda generate -m models/llama.gguf -p "Hello" --cpu-layers 0,1,2

# All CPU execution (for systems without GPU)
./transformer-cuda generate -m models/tinyllama.gguf -p "Hello" --all-cpu

# Distributed inference - start server on remote GPU node
sudo ./transformer-cuda server --interface eth0

# Distributed inference - connect client to server
sudo ./transformer-cuda generate -m models/llama.gguf -p "Hello" \
    --client --interface eth0 --remote-mac aa:bb:cc:dd:ee:ff -r 6

Facade Transformer Commands

The facade implementations add deep introspection capabilities for analyzing model internals.

Commands

Command	Description
`server`	Run as distributed inference server
`client`	Run as distributed inference client
`facade`	Run facade mode with introspection
`generate`	Text generation from GGUF model
`benchmark`	Run network/inference benchmarks
`test`	Run built-in tests

Facade Mode Options

Option	Description
`--model <path>`	GGUF model file path (required)
`--tokenizer <path>`	Tokenizer JSON file path
`--prompt <text>`	Text prompt for generation
`--max-tokens <n>`	Maximum tokens to generate (default: 100)
`--temperature <n>`	Sampling temperature (default: 1.0)
`--top-k <n>`	Top-K sampling (default: 40)
`--top-p <n>`	Top-P nucleus sampling (default: 0.9)
`--inspect`	Enable introspection mode
`--show-attention`	Display attention weights
`--show-hidden <layer>`	Display hidden states for layer
`--show-qkv <layer>`	Display Q/K/V vectors for layer
`--show-logits`	Display output logits
`--show-entropy`	Display attention entropy per layer
`--show-saliency <pos>`	Display saliency map for token position
`--show-weights <layer>`	Display weight matrices for layer
`--show-tensors`	List all tensor names in model
`--dump-hidden <file>`	Dump hidden states to CSV file
`--dump-attention <file>`	Dump attention weights to CSV file
`--layer <n>`	Specific layer for inspection (default: all)
`--head <n>`	Specific attention head (default: all)
`--position <n>`	Specific token position (default: all)
`--verbose`	Enable verbose output
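
For readers unfamiliar with the quantity behind `--show-entropy`, the sketch below computes the Shannon entropy of one row of attention weights. It assumes natural-log units and a single row; how the facade aggregates entropy per layer is not specified here, and `attention_entropy` is a hypothetical helper.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of attention weights."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("attention weights must have positive mass")
    entropy = 0.0
    for w in weights:
        p = w / total
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy

print(attention_entropy([1.0, 0.0, 0.0, 0.0]))      # focused on one token
print(attention_entropy([0.25, 0.25, 0.25, 0.25]))  # spread evenly
```

Low entropy means a head attends to a few positions; values near log(n) for n positions mean the attention is close to uniform.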

Generate Mode Options

Option	Description
`-m, --model <path>`	Path to GGUF model file (required)
`-p, --prompt <text>`	Text prompt for generation
`-n, --tokens <n>`	Max tokens to generate (default: 256)
`-t, --temperature <n>`	Sampling temperature (default: 0.7)
`--top-k <n>`	Top-K sampling (default: 40)
`--top-p <n>`	Top-P/nucleus sampling (default: 0.9)
`-i, --interactive`	Interactive chat mode
`-g, --gpu`	Use GPU-accelerated inference

Test Mode Options (Facade)

Option	Description
`--all`	Run all tests
`--protocol`	Test protocol handling
`--config`	Test configuration
`--quant`	Test quantization/dequantization
`--kernels`	Test CUDA kernels (requires GPU)
`--network`	Test network layer
`--facade`	Test facade introspection functions
`--tokenizer`	Test tokenizer encode/decode
`--gguf`	Test GGUF model loading
`--verbose`	Enable verbose test output

Facade API Methods

The facade provides programmatic access to internal states:

Method	Description
`getHiddenState(layer, pos)`	Get hidden state vector
`getAttentionScores(layer, head)`	Get attention weights
`getQKV(layer, pos, type)`	Get Q/K/V vectors
`getLogits()`	Get output logits
`getLayerNormStats(layer)`	Get normalization statistics
`getTokenProbabilities(topK)`	Get token probabilities

Facade Examples

# Run facade with introspection
./facaded-transformer-cuda facade --model model.gguf --tokenizer tok.json \
    --prompt "Hello" --inspect

# Show attention weights for layer 0
./facaded-transformer-cuda facade --model model.gguf --prompt "Test" \
    --show-attention --layer 0

# Dump hidden states to CSV
./facaded-transformer-cuda facade --model model.gguf --prompt "Test" \
    --dump-hidden hidden.csv

# Generate text with GPU acceleration
./facaded-transformer-cuda generate -m models/llama.gguf -p "Hello" -g

# Interactive chat mode
./facaded-transformer-cuda generate -m models/llama.gguf -i -g

Rust Transformer Commands

The Rust implementations provide memory-safe inference with formal verification.

Commands (rust_cuda)

Command	Description
`generate`	Text generation from GGUF model
`server`	Run as distributed inference server
`client`	Run as distributed inference client
`info`	Display model information

Generate Mode Options (Rust)

Option	Description
`-m, --model <path>`	Path to GGUF model file (required)
`-p, --prompt <text>`	Text prompt for generation
`-n, --tokens <n>`	Max tokens to generate (default: 256)
`-t, --temperature <n>`	Sampling temperature (default: 0.7)
`--top-k <n>`	Top-K sampling (default: 40)
`--top-p <n>`	Top-P/nucleus sampling (default: 0.9)
`--rep-penalty <n>`	Repetition penalty (default: 1.1)
`-i, --interactive`	Interactive chat mode
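
The effect of `--rep-penalty` can be pictured with the widely used rule below: logits of tokens that already appeared are divided by the penalty when positive and multiplied by it when negative. Whether the Rust implementation uses exactly this rule is an assumption, and `apply_repetition_penalty` is a hypothetical name.

```python
def apply_repetition_penalty(logits, previous_tokens, penalty=1.1):
    """Return a copy of logits with already-seen tokens made less likely."""
    if penalty <= 0.0:
        raise ValueError("penalty must be positive")
    out = list(logits)
    for tok in set(previous_tokens):
        if 0 <= tok < len(out):
            # Dividing a positive logit or multiplying a negative one
            # both lower the token's probability when penalty > 1.
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = [2.0, -1.0, 0.5, 3.0]
print(apply_repetition_penalty(logits, previous_tokens=[0, 1, 1], penalty=2.0))
# [1.0, -2.0, 0.5, 3.0]
```

A penalty of 1.0 leaves the logits unchanged, which is why it serves as the "off" value.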

Server Mode Options (Rust)

Option	Description
`-i, --interface <name>`	Network interface (default: eth0)
`--server-id <n>`	Server ID for multi-server setups

Client Mode Options (Rust)

Option	Description
`-s, --server <mac>`	Server MAC address (required, XX:XX:XX:XX:XX:XX)
`-i, --interface <name>`	Network interface (default: eth0)
`-l, --layers <n>`	Total transformer layers (default: 12)
`-r, --remote <n>`	Remote layers to offload (default: 6)
`-e, --embed <dim>`	Embedding dimension (default: 768)
`--timeout <ms>`	Connection timeout in milliseconds (default: 5000)

Commands (facaded_rust_cuda)

Command	Description
`generate`	Text generation with introspection
`analyze`	Analyze model internals for a prompt
`inspect`	Interactive inspection mode
`info`	Display model information

Generate Mode Options (Rust Facade)

Option	Description
`-m, --model <path>`	Path to GGUF model file (required)
`-p, --prompt <text>`	Text prompt for generation
`-n, --tokens <n>`	Max tokens to generate (default: 256)
`-t, --temperature <n>`	Sampling temperature (default: 0.7)
`--top-k <n>`	Top-K sampling (default: 40)
`--top-p <n>`	Top-P/nucleus sampling (default: 0.9)
`-i, --interactive`	Interactive chat mode
`--show-hidden`	Show hidden state statistics
`--show-entropy`	Show attention entropy

Analyze Mode Options (Rust Facade)

Option	Description
`-m, --model <path>`	Path to GGUF model file (required)
`-p, --prompt <text>`	Text prompt to analyze (required)
`--layer <n>`	Layer to inspect (default: last)
`--head <n>`	Attention head to inspect (default: 0)
`--show-qkv`	Show Q/K/V vectors
`--show-logits`	Show top-k logits
`--show-saliency`	Show saliency map

Rust Examples

# Generate text (rust_cuda)
./rust_cuda/target/release/glassbox-transformer generate \
    -m models/tinyllama.gguf -p "Hello world" -n 100

# Interactive mode with repetition penalty
./rust_cuda/target/release/glassbox-transformer generate \
    -m models/llama.gguf -i --rep-penalty 1.2

# Start distributed server
sudo ./rust_cuda/target/release/glassbox-transformer server -i eth0

# Connect as client with layer offloading
sudo ./rust_cuda/target/release/glassbox-transformer client \
    -s aa:bb:cc:dd:ee:ff -i eth0 -r 6

# Analyze with Q/K/V inspection (facaded)
./facaded_rust_cuda/target/release/glassbox-transformer-facaded analyze \
    -m models/llama.gguf -p "What is AI?" --show-qkv --layer 0

# Generate with hidden state display (facaded)
./facaded_rust_cuda/target/release/glassbox-transformer-facaded generate \
    -m models/llama.gguf -p "Hello" --show-hidden --show-entropy

Agentic Transformer Commands

The agentic transformer provides an interactive agent interface with CPU offloading support.

Usage Modes

Mode	Description
Interactive	Run with no arguments for REPL mode
Script	`--script <file>` to run commands from file
Stdin	`--stdin` to read commands from pipe
Single	Direct prompt with `-p` flag

Generation Options

Option	Description
`-p, --prompt <text>`	Input prompt for generation
`-n, --max-tokens <n>`	Maximum tokens to generate (default: 5)
`-t, --temperature <n>`	Sampling temperature 0.0-2.0 (default: 1.0)
`--top-k <k>`	Top-K sampling (disable with -1)
`--top-p <p>`	Nucleus/Top-P sampling 0.0-1.0 (default: 1.0)
`--repetition-penalty <p>`	Penalize repeated tokens (default: 1.0)
`--context-length <n>`	Max context window size (default: 1024)
`--seed <s>`	Random seed for reproducibility

Scripting & Batch Options

Option	Description
`--script <file>`	Load and execute commands from script file
`--stdin`	Read commands from stdin (for piping)
`--log <file>`	Write session log to file (default: agent_history.log)
`--no-log`	Disable session logging
`-o, --output <file>`	Save generated text to file
`--json-output`	Format output as JSON

Inspection & Diagnostics

Option	Description
`--list-tensors`	List all tensors in model and exit
`--show-quant-stats`	Display quantization statistics (default: yes)
`--no-quant-stats`	Skip quantization statistics output
`--fp32-only`	Only load F32 tensors, skip quantized
`--test-dequant`	Test dequantization on all quantized tensors

Device & Performance Options

Option	Description
`--device <id>`	Select GPU device ID (default: 0)
`--batch-size <n>`	Batch size for processing (default: 1)
`--memory-limit <MB>`	Limit GPU memory usage in MB (0=unlimited)
`--benchmark`	Run benchmark tests after generation

CPU Offloading Options

Option	Description
`--cpu-layers <list>`	Run specified layers on CPU (e.g., `0,2,4`)
`--all-cpu`	Run all transformer layers on CPU (RAM)
`--all-gpu`	Run all transformer layers on GPU (default)

Interactive Agent Commands

Command	Description
`load <model.gguf> <tok.json>`	Load model and tokenizer
`run <prompt> [tokens] [temp]`	Run inference/generation
`info`	Display model architecture
`inspect [type]`	Inspect model (summary/performance/layers)
`list-tensors`	List all model tensors
`quant-stats`	Show quantization statistics
`save <filename>`	Save last output to file
`history`	Show action history
`help`	Show agent commands
`quit/exit`	Exit agent

Agentic Examples

# Interactive mode
./agentic_transformer

# Single generation with CPU offloading
./agentic_transformer model.gguf tokenizer.json \
    -p "Once upon a time" -n 50 -t 0.9 --cpu-layers 0,1,2

# Batch mode from script
./agentic_transformer model.gguf tokenizer.json \
    --script commands.txt --log batch.log

# Piped batch mode
echo "load model.gguf tok.json" | ./agentic_transformer --stdin

# List tensors and quantization stats
./agentic_transformer model.gguf tokenizer.json --list-tensors
./agentic_transformer model.gguf tokenizer.json --show-quant-stats

Distributed Inference (Layer 2 Ethernet)

Overview

GlassBoxAI-Transformer supports distributed inference over Layer 2 Ethernet using the DTX Protocol (Distributed Tensor eXchange). This enables offloading transformer layers to remote GPU nodes with minimal latency by bypassing the TCP/IP stack entirely.
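
To make "bypassing the TCP/IP stack" concrete: a DTX message travels as an ordinary Ethernet II frame whose EtherType field is 0x9998. The sketch below builds only the 14-byte header in front of an opaque payload. The DTX payload layout is not described here, so none is assumed; `parse_mac` and `build_frame` are hypothetical helpers, and the padding step is an assumption (many drivers pad short frames themselves).

```python
import struct

DTX_ETHERTYPE = 0x9998  # EtherType stated in this README
HEX_DIGITS = set("0123456789abcdefABCDEF")

def parse_mac(text):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    parts = text.split(":")
    if len(parts) != 6 or any(len(p) != 2 or not set(p) <= HEX_DIGITS for p in parts):
        raise ValueError(f"invalid MAC address: {text!r}")
    return bytes(int(p, 16) for p in parts)

def build_frame(dst_mac, src_mac, payload):
    """Ethernet II frame: destination, source, EtherType, then payload."""
    header = parse_mac(dst_mac) + parse_mac(src_mac) + struct.pack("!H", DTX_ETHERTYPE)
    # Pad short payloads to the 46-byte Ethernet minimum (assumption for this sketch).
    return header + payload.ljust(46, b"\x00")

frame = build_frame("aa:bb:cc:dd:ee:ff", "11:22:33:44:55:66", b"ping")
print(len(frame), frame[12:14].hex())  # 60 9998
```

Because nothing above Layer 2 is involved, frames are addressed by MAC rather than IP, which is why the client takes a server MAC address.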

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Distributed Transformer Inference                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────┐                    ┌─────────────────────┐        │
│  │   Client Node       │    Layer 2         │   Server Node       │        │
│  │                     │    Ethernet        │                     │        │
│  │  ┌───────────────┐  │   (DTX Protocol)   │  ┌───────────────┐  │        │
│  │  │ Local Layers  │  │  ───────────────►  │  │ Remote Layers │  │        │
│  │  │   0..N-1      │  │                    │  │     N..M      │  │        │
│  │  │   (GPU/CPU)   │  │  ◄───────────────  │  │    (GPU)      │  │        │
│  │  └───────────────┘  │   Tensor Results   │  └───────────────┘  │        │
│  │         │           │                    │         │           │        │
│  │         ▼           │                    │         ▼           │        │
│  │  ┌───────────────┐  │                    │  ┌───────────────┐  │        │
│  │  │   RawSocket   │  │                    │  │   RawSocket   │  │        │
│  │  │ EtherType:    │◄─┼────────────────────┼─►│ EtherType:    │  │        │
│  │  │   0x9998      │  │                    │  │   0x9998      │  │        │
│  │  └───────────────┘  │                    │  └───────────────┘  │        │
│  └─────────────────────┘                    └─────────────────────┘        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

DTX Protocol Messages

Message Type	Description
`HandshakeReq/Ack`	Connection establishment with capability exchange
`LayerConfig`	Configure which layers to process remotely
`ForwardStart/Chunk/Done`	Forward pass tensor transmission
`ForwardResult/Complete`	Forward pass results return
`BackwardStart/Chunk/Done`	Backward pass gradient transmission
`BackwardResult/Complete`	Backward pass results return
`Ping/Pong`	Connection health monitoring
`Disconnect`	Clean connection termination
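
The Start/Chunk/Done message groups imply that a tensor is split into frame-sized pieces and verified on arrival. The sketch below shows one way such chunking with checksums could work. The header layout (sequence number, chunk count, payload length, CRC32) and the use of CRC32 are assumptions for illustration, not the DTX wire format.

```python
import struct
import zlib

CHUNK_HEADER = struct.Struct("!IIHI")   # seq, total, length, crc32 (hypothetical layout)
MAX_PAYLOAD = 1500 - CHUNK_HEADER.size  # stay within a standard Ethernet MTU

def split_tensor(data, max_payload=MAX_PAYLOAD):
    """Yield header+payload chunks for a byte string."""
    total = max(1, -(-len(data) // max_payload))  # ceiling division
    for seq in range(total):
        part = data[seq * max_payload:(seq + 1) * max_payload]
        yield CHUNK_HEADER.pack(seq, total, len(part), zlib.crc32(part)) + part

def reassemble(chunks):
    """Verify every checksum and rebuild the original bytes in order."""
    parts = {}
    total = None
    for chunk in chunks:
        seq, total, length, crc = CHUNK_HEADER.unpack_from(chunk)
        part = chunk[CHUNK_HEADER.size:CHUNK_HEADER.size + length]
        if zlib.crc32(part) != crc:
            raise ValueError(f"checksum mismatch in chunk {seq}")
        parts[seq] = part
    if total is None or len(parts) != total:
        raise ValueError("missing chunks")
    return b"".join(parts[i] for i in range(total))

tensor = struct.pack("<1000f", *[i * 0.5 for i in range(1000)])  # 4000 bytes
chunks = list(split_tensor(tensor))
print(len(chunks), reassemble(reversed(chunks)) == tensor)  # 3 True
```

Carrying the sequence number in every chunk lets the receiver reorder frames and detect gaps without relying on arrival order.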

Requirements

Root privileges: Raw socket access requires root (for example via sudo) or, on Linux, the CAP_NET_RAW capability
Same Layer 2 network: Client and server must be on the same Ethernet segment
Network interface: Specify the interface name (e.g., eth0, enp0s3)

Example: Two-Node Setup

Server Node (has powerful GPU):

sudo ./transformer-cuda server --interface eth0
# Server listening on interface eth0 (aa:bb:cc:dd:ee:ff)
# Waiting for connections...

Client Node (limited GPU memory):

sudo ./transformer-cuda generate \
    -m models/llama-7b.Q4_K_M.gguf \
    -p "Explain quantum computing" \
    --client --interface eth0 \
    --remote-mac aa:bb:cc:dd:ee:ff \
    -r 16  # Offload 16 layers to remote server

CPU/GPU Offloading

Overview

For systems with limited GPU memory, GlassBoxAI-Transformer supports mixed device execution where specific transformer layers can be offloaded to CPU while the rest run on GPU.

Use Cases

Large models on small GPUs: Run 7B+ models on 4-6GB GPUs
Memory pressure: Avoid out-of-memory errors during inference
Debugging: Isolate CPU vs GPU computation issues
CPU-only systems: Run inference without any GPU

Configuration

# Offload first 4 layers to CPU
./transformer-cuda generate -m model.gguf -p "Hello" --cpu-layers 0,1,2,3

# Offload every other layer
./transformer-cuda generate -m model.gguf -p "Hello" --cpu-layers 0,2,4,6,8

# All CPU (no GPU required)
./transformer-cuda generate -m model.gguf -p "Hello" --all-cpu

# All GPU (default, maximum performance)
./transformer-cuda generate -m model.gguf -p "Hello" --all-gpu
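
The flags above amount to a per-layer device assignment. The sketch below shows how a `--cpu-layers` list could be turned into such a placement table; the function name and the validation rules are assumptions, not the project's parser.

```python
def assign_devices(n_layers, cpu_layers="", all_cpu=False):
    """Return a per-layer list of 'cpu' or 'gpu' placements."""
    if all_cpu:
        return ["cpu"] * n_layers
    on_cpu = set()
    for item in cpu_layers.split(","):
        item = item.strip()
        if not item:
            continue
        idx = int(item)  # raises ValueError on non-numeric input
        if not 0 <= idx < n_layers:
            raise ValueError(f"layer {idx} out of range 0..{n_layers - 1}")
        on_cpu.add(idx)
    return ["cpu" if i in on_cpu else "gpu" for i in range(n_layers)]

print(assign_devices(6, "0,2,4"))
# ['cpu', 'gpu', 'cpu', 'gpu', 'cpu', 'gpu']
```

Each switch between devices implies copying activations across the bus, so contiguous CPU ranges generally cost fewer transfers than alternating layers.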

Performance Considerations

Configuration	Relative Speed	GPU Memory Usage
All GPU	Fastest (baseline)	Highest
Mixed (few CPU layers)	~80-90% of all-GPU speed	Reduced
Mixed (many CPU layers)	~30-50% of all-GPU speed	Minimal
All CPU	Slowest	None

Training

Training Features

Feature	Description
Backpropagation	Full gradient computation through all transformer layers
Adam Optimizer	Adaptive learning rate with bias correction (β1=0.9, β2=0.999)
Gradient Clipping	Norm-based gradient clipping for training stability
Activation Caching	Efficient caching for backward pass computation
Cross-Entropy Loss	Fused softmax + cross-entropy loss computation
Learning Rate Control	Configurable learning rate with warmup support
LoRA Fine-Tuning	Parameter-efficient adaptation with low-rank matrices
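
For reference, the Adam update with bias correction listed above can be written in a few lines. This is the textbook algorithm using the β1 and β2 values from the table; the epsilon value and the function name `adam_step` are assumptions, and the GPU trainers may fuse or reorder these steps.

```python
import math

def adam_step(params, grads, m, v, step, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place Adam update; `step` starts at 1."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g        # first-moment estimate
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m[i] / (1.0 - beta1 ** step)           # bias correction
        v_hat = v[i] / (1.0 - beta2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# Minimize f(x) = (x - 3)^2 starting from x = 0.
x, m, v = [0.0], [0.0], [0.0]
for step in range(1, 2001):
    grad = [2.0 * (x[0] - 3.0)]
    adam_step(x, grad, m, v, step, lr=0.01)
print(round(x[0], 2))
```

Bias correction matters most in the first steps: without it the zero-initialized moment estimates would make early updates far smaller than the learning rate suggests.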

Train Command Options

Option	Description
`-m, --model <path>`	Path to GGUF model file (required)
`--lr <n>`	Learning rate (default: 1e-4)
`--epochs <n>`	Number of training epochs (default: 1)
`--batch-size <n>`	Batch size (default: 1)
`--grad-clip <n>`	Gradient clipping norm (default: 1.0)
`--train-text <text>`	Training text for fine-tuning
`--train-file <path>`	Load training text from file (whitespace-delimited)
`--verbose`	Show detailed training progress
`--help`	Show training help
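
The `--grad-clip` value is a bound on the gradient norm. A minimal sketch of norm-based clipping follows, assuming a single global L2 norm over all gradients (per-tensor clipping is also common, and which one the trainers use is not stated here); `clip_by_global_norm` is a hypothetical name.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, [round(g, 3) for g in clipped])  # 5.0 [0.6, 0.8]
```

Clipping rescales the whole gradient vector, so its direction is preserved and only the step length is limited.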

Implementation Status

Implementation	Training Status
facaded_transformer.cu	✅ Full CLI support
transformer.cu	✅ Full CLI support
facaded-transformer-opencl.cpp	✅ Full CLI support
transformer-opencl.cpp	✅ Full CLI support
Rust implementations	✅ GpuTrainer class available

Example Usage

# Basic training with inline text
./facaded-transformer-cuda train -m models/llama.gguf --train-text "Hello world"

# Training from a text file
./facaded-transformer-cuda train -m models/llama.gguf --train-file corpus.txt --epochs 10

# Training with custom parameters
./facaded-transformer-cuda train -m models/llama.gguf \
    --lr 0.0001 --epochs 10 --batch-size 4 --grad-clip 1.0 \
    --train-text "The quick brown fox" --verbose

# Fine-tuning from file with verbose output
./facaded-transformer-cuda train -m models/tinyllama.gguf \
    --epochs 100 --verbose --train-file training_data.txt

LoRA (Low-Rank Adaptation)

Overview

LoRA enables parameter-efficient fine-tuning by injecting trainable low-rank matrices into transformer layers. Instead of updating all model weights, LoRA freezes the base model and trains only small adapter matrices, which can cut the memory needed for gradients and optimizer state by roughly 10-100x while keeping quality close to that of full fine-tuning.

Original: Y = W × X
LoRA:     Y = W × X + (B × A) × X × scaling
          where A: (rank × in_dim), B: (out_dim × rank), scaling = alpha / rank
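
The formula can be sketched on the CPU as follows. This is an illustrative Rust version, not the project's code: `matvec` and `lora_forward` are hypothetical names, matrices are assumed to be row-major, and dropout is omitted. Computing A × X first keeps the extra work at rank × (in_dim + out_dim) multiply-adds instead of materializing the full out_dim × in_dim product B × A.

```rust
/// Row-major matrix-vector product: m is rows × cols.
fn matvec(m: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(m.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            m[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(w, v)| w * v)
                .sum::<f32>()
        })
        .collect()
}

/// y = W·x + scaling · B·(A·x)
/// w: out_dim × in_dim, a: rank × in_dim, b: out_dim × rank.
fn lora_forward(
    w: &[f32], a: &[f32], b: &[f32], x: &[f32],
    out_dim: usize, rank: usize, scaling: f32,
) -> Vec<f32> {
    let in_dim = x.len();
    let mut y = matvec(w, out_dim, in_dim, x);   // base path: W·x
    let temp = matvec(a, rank, in_dim, x);       // temp = A·x (rank values)
    let delta = matvec(b, out_dim, rank, &temp); // B·temp
    for (yi, di) in y.iter_mut().zip(&delta) {
        *yi += scaling * di;
    }
    y
}
```

When B is all zeros, as it is right after initialization, the adapter contributes nothing and the output equals the base model's W × X.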

Key Benefits

Benefit	Description
Memory Efficient	Only ~0.1-1% of parameters are trainable
Fast Training	Smaller gradient computation and optimizer state
Composable	Multiple LoRA adapters can be trained and swapped
Mergeable	Adapters can be merged into base weights for zero-overhead inference
Reversible	Original model preserved; adapters can be removed
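
To make the "Mergeable" row concrete: merging folds the adapter into the base matrix as W += scaling × (B × A), after which inference needs no extra matrix products. The sketch below is a hypothetical CPU version in Rust that assumes FP32, row-major base weights; it does not cover quantized weights and is not the project's `loraMerge()`.

```rust
/// Folds a LoRA adapter into the base weights: W += scaling · (B·A).
/// Row-major layout: w is out_dim × in_dim, a is rank × in_dim,
/// b is out_dim × rank.
fn lora_merge(
    w: &mut [f32], a: &[f32], b: &[f32],
    out_dim: usize, in_dim: usize, rank: usize, scaling: f32,
) {
    assert_eq!(w.len(), out_dim * in_dim);
    assert_eq!(a.len(), rank * in_dim);
    assert_eq!(b.len(), out_dim * rank);
    for o in 0..out_dim {
        for i in 0..in_dim {
            // (B·A)[o][i] = sum over r of B[o][r] * A[r][i]
            let mut acc = 0.0f32;
            for r in 0..rank {
                acc += b[o * rank + r] * a[r * in_dim + i];
            }
            w[o * in_dim + i] += scaling * acc;
        }
    }
}
```

Calling the same function with the negated scaling subtracts the adapter again. In floating point that can leave small rounding residue, so keeping the original weights (or the unmerged adapter) is the safer way to stay reversible.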

LoRA Configuration

Parameter	Default	Description
`rank`	16	Low-rank dimension (r). Higher = more capacity, more memory
`alpha`	32.0	Scaling factor. Effective scale = alpha/rank
`dropout`	0.05	Dropout between A and B matrices (training only)
`enableQ`	true	Apply LoRA to attention Q projection
`enableK`	true	Apply LoRA to attention K projection
`enableV`	true	Apply LoRA to attention V projection
`enableO`	true	Apply LoRA to attention output projection
`enableGate`	true	Apply LoRA to FFN gate/w1 projection
`enableUp`	true	Apply LoRA to FFN up/w3 projection
`enableDown`	true	Apply LoRA to FFN down/w2 projection
`freezeBase`	true	Freeze base model weights (recommended)
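
The defaults above can be captured in a small configuration type. The Rust struct below is a hypothetical mirror of the table (snake_case field names, invented helper methods), not the project's actual definition; it shows how the effective scale alpha/rank follows from the defaults and why a zero rank has to be rejected before dividing.

```rust
/// Hypothetical mirror of the configuration table; not the project's struct.
#[derive(Debug, Clone, PartialEq)]
struct LoraConfig {
    rank: usize,
    alpha: f32,
    dropout: f32,
    enable_q: bool,
    enable_k: bool,
    enable_v: bool,
    enable_o: bool,
    enable_gate: bool,
    enable_up: bool,
    enable_down: bool,
    freeze_base: bool,
}

impl Default for LoraConfig {
    fn default() -> Self {
        LoraConfig {
            rank: 16,
            alpha: 32.0,
            dropout: 0.05,
            enable_q: true,
            enable_k: true,
            enable_v: true,
            enable_o: true,
            enable_gate: true,
            enable_up: true,
            enable_down: true,
            freeze_base: true,
        }
    }
}

impl LoraConfig {
    /// Effective scale = alpha / rank; None when rank is zero.
    fn scaling(&self) -> Option<f32> {
        if self.rank == 0 {
            None
        } else {
            Some(self.alpha / self.rank as f32)
        }
    }

    /// Number of projections per layer that receive an adapter.
    fn enabled_adapters(&self) -> usize {
        [
            self.enable_q, self.enable_k, self.enable_v, self.enable_o,
            self.enable_gate, self.enable_up, self.enable_down,
        ]
        .iter()
        .filter(|&&on| on)
        .count()
    }
}
```

With the defaults, `scaling()` is 32 / 16 = 2.0 and all seven adapters are enabled.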

LoRA Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         LoRA Adapter                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Input (dim)                                                       │
│         │                                                           │
│         ├──────────────────────────────┐                            │
│         │                              │                            │
│         ▼                              ▼                            │
│   ┌───────────┐                  ┌───────────┐                      │
│   │  Base W   │                  │ A matrix  │ (rank × in_dim)      │
│   │  (frozen) │                  │ (trained) │                      │
│   └─────┬─────┘                  └─────┬─────┘                      │
│         │                              │                            │
│         │                              ▼                            │
│         │                        ┌───────────┐                      │
│         │                        │  Dropout  │ (training only)      │
│         │                        └─────┬─────┘                      │
│         │                              │                            │
│         │                              ▼                            │
│         │                        ┌───────────┐                      │
│         │                        │ B matrix  │ (out_dim × rank)     │
│         │                        │ (trained) │                      │
│         │                        └─────┬─────┘                      │
│         │                              │                            │
│         │                              ▼                            │
│         │                        × scaling (alpha/rank)             │
│         │                              │                            │
│         ▼                              ▼                            │
│       ┌─────────────────────────────────┐                           │
│       │            Output = Base + LoRA │                           │
│       └─────────────────────────────────┘                           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Per-Layer Adapters

Each transformer layer can have up to 7 LoRA adapters:

Adapter	Projection	Typical Dimensions
Q	Query projection	dim → dim
K	Key projection	dim → kv_dim
V	Value projection	dim → kv_dim
O	Output projection	dim → dim
Gate	FFN gate/w1	dim → ffn_dim
Up	FFN up/w3	dim → ffn_dim
Down	FFN down/w2	ffn_dim → dim
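
Each adapter holds an A matrix (rank × in_dim) and a B matrix (out_dim × rank), so it adds rank × (in_dim + out_dim) trainable parameters. The Rust sketch below applies that to the seven projections in the table; the function names are hypothetical.

```rust
/// Trainable parameters in one adapter:
/// A is rank × in_dim, B is out_dim × rank.
fn adapter_params(rank: u64, in_dim: u64, out_dim: u64) -> u64 {
    rank * (in_dim + out_dim)
}

/// Parameters for all seven adapters in one transformer layer.
fn layer_params(rank: u64, dim: u64, kv_dim: u64, ffn_dim: u64) -> u64 {
    adapter_params(rank, dim, dim)           // Q
        + adapter_params(rank, dim, kv_dim)  // K
        + adapter_params(rank, dim, kv_dim)  // V
        + adapter_params(rank, dim, dim)     // O
        + adapter_params(rank, dim, ffn_dim) // Gate
        + adapter_params(rank, dim, ffn_dim) // Up
        + adapter_params(rank, ffn_dim, dim) // Down
}
```

As a worked example with assumed TinyLlama-like dimensions (dim = 2048, kv_dim = 256, ffn_dim = 5632, 22 layers) and rank 16, one layer adds 573,440 parameters and the whole model 12,615,680, which is about 50 MB of FP32 adapter weights.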

Memory Comparison

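For a 7B parameter model with rank=16, assuming LLaMA-7B-style dimensions (dim = 4096, kv_dim = 4096, ffn_dim = 11008, 32 layers). Each adapter adds rank × (in_dim + out_dim) trainable parameters:

Configuration	Trainable Parameters	Memory (FP32, weights only)
Full Fine-Tuning	~7,000,000,000	~28 GB
LoRA (all 7 projections)	~40,000,000	~160 MB
LoRA (attention only)	~17,000,000	~67 MB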

Implementation Status

Implementation	LoRA Status	Features
facaded_transformer.cu	✅ Full support	CUDA kernels, Adam optimizer
facaded-transformer-opencl.cpp	✅ Full support	CPU functions, backward pass
facaded_rust_cuda/	✅ Full support	Kani-verified, CISA compliant

CUDA Kernels (facaded_transformer.cu)

// LoRA CUDA Kernels
loraInitAKernel()      // Initialize A with small random values
loraInitBKernel()      // Initialize B to zeros
loraForwardAKernel()   // temp = A @ input
loraForwardBKernel()   // out += scaling * B @ temp
loraDropoutKernel()    // Apply inverted dropout
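
The CUDA kernel bodies are not reproduced here. As a reference for what `loraDropoutKernel()` is described as doing, the following CPU sketch in Rust shows inverted dropout: each element is zeroed with probability p and the survivors are scaled by 1 / (1 - p), so inference needs no rescaling. The `XorShift64` generator and the function name are assumptions made to keep the sketch dependency-free; they are not part of the project.

```rust
/// Tiny xorshift64 generator so the sketch has no dependencies.
/// Illustration only; the seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    /// Returns a value in [0, 1) built from the top 24 bits of the state.
    fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 40) as f32 / (1u64 << 24) as f32
    }
}

/// Inverted dropout: zero each element with probability `p` and scale
/// the survivors by 1 / (1 - p).
fn inverted_dropout(values: &mut [f32], p: f32, rng: &mut XorShift64) {
    assert!((0.0..1.0).contains(&p), "p must be in [0, 1)");
    if p == 0.0 {
        return; // dropout disabled (for example at inference time)
    }
    let keep_scale = 1.0 / (1.0 - p);
    for v in values.iter_mut() {
        *v = if rng.next_f32() < p { 0.0 } else { *v * keep_scale };
    }
}
```

With p = 0.5 every element of the result is either 0 or twice its input value, which keeps the expected activation unchanged.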

OpenCL/CPU Functions (facaded-transformer-opencl.cpp)

// LoRA CPU Functions
loraInitA()           // Initialize A matrix
loraInitB()           // Initialize B matrix (zeros)
loraForwardA()        // Forward: temp = A @ input
loraForwardB()        // Forward: out += scaling * B @ temp
loraDropout()         // Apply dropout (training)
loraBackwardB()       // Gradient w.r.t. B
loraBackwardTemp()    // Gradient w.r.t. temp
loraBackwardA()       // Gradient w.r.t. A
loraMerge()           // Merge adapter into base weights
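
The three backward functions correspond to the chain rule applied to the forward pass temp = A @ input, out += scaling * B @ temp. A minimal Rust sketch of those gradients follows, assuming row-major matrices and no dropout; `LoraGrads` and `lora_backward` are hypothetical names and the project's signatures may differ.

```rust
/// Gradients for one adapter (row-major, same shapes as A and B).
struct LoraGrads {
    d_a: Vec<f32>, // rank × in_dim
    d_b: Vec<f32>, // out_dim × rank
}

/// Given d_out = dLoss/dOut, computes:
///   dB[o][r]  = scaling * d_out[o] * temp[r]
///   d_temp[r] = scaling * sum over o of B[o][r] * d_out[o]
///   dA[r][i]  = d_temp[r] * x[i]
fn lora_backward(
    a: &[f32], b: &[f32], x: &[f32], d_out: &[f32],
    rank: usize, scaling: f32,
) -> LoraGrads {
    let (in_dim, out_dim) = (x.len(), d_out.len());
    assert_eq!(a.len(), rank * in_dim);
    assert_eq!(b.len(), out_dim * rank);
    // Recompute temp = A·x; a real implementation would reuse the cached value.
    let temp: Vec<f32> = (0..rank)
        .map(|r| (0..in_dim).map(|i| a[r * in_dim + i] * x[i]).sum::<f32>())
        .collect();
    let mut d_b = vec![0.0f32; out_dim * rank];
    let mut d_temp = vec![0.0f32; rank];
    for o in 0..out_dim {
        for r in 0..rank {
            d_b[o * rank + r] = scaling * d_out[o] * temp[r];
            d_temp[r] += scaling * b[o * rank + r] * d_out[o];
        }
    }
    let mut d_a = vec![0.0f32; rank * in_dim];
    for r in 0..rank {
        for i in 0..in_dim {
            d_a[r * in_dim + i] = d_temp[r] * x[i];
        }
    }
    LoraGrads { d_a, d_b }
}
```

Because B is initialized to zeros, d_temp (and therefore the gradient for A) is zero on the very first step; only B moves until it becomes non-zero.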

Rust Implementation (facaded_rust_cuda/)

The Rust implementation includes 70+ LoRA-specific Kani proofs for CISA/NSA compliance:

// Configuration validation (CISA #1, #5)
LoRAConfig::validate()      // Reject invalid configs
LoRAConfig::try_scaling()   // Safe scaling factor computation

// Memory safety (CISA #4, #15)
calculate_adapter_memory()  // Checked arithmetic
validate_memory_budget()    // 1 GB budget enforcement
safe_add_params()           // Overflow-safe accumulation

// File parsing security (CISA #1, #15)
load()                      // Validates all untrusted file fields

Security Constants (CISA #15)

Constant	Value	Purpose
`MAX_LORA_RANK`	256	Prevent excessive memory allocation
`MAX_MODEL_DIM`	65,536	Bound dimension sizes
`MAX_FFN_DIM`	131,072	Bound FFN dimensions
`MAX_LAYERS`	256	Limit layer count
`MAX_LORA_NAME_LEN`	1,024	Prevent DoS via long names
`MAX_LORA_MEMORY_BUDGET`	1 GB	Total LoRA memory limit
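
The sketch below shows how limits like these can be enforced with checked arithmetic. It reuses three constant names from the table, but the function names, their signatures, the use of `MAX_FFN_DIM` as the bound for both dimensions, and reading "1 GB" as 2^30 bytes are all assumptions; the project's `calculate_adapter_memory()` and `validate_memory_budget()` may be shaped differently.

```rust
const MAX_LORA_RANK: u64 = 256;
const MAX_FFN_DIM: u64 = 131_072;
const MAX_LORA_MEMORY_BUDGET: u64 = 1 << 30; // assumed: 1 GB = 2^30 bytes

/// FP32 bytes for one adapter, or None if an input is out of range
/// or the arithmetic would overflow.
fn adapter_bytes(rank: u64, in_dim: u64, out_dim: u64) -> Option<u64> {
    if rank == 0 || rank > MAX_LORA_RANK {
        return None;
    }
    if in_dim == 0 || out_dim == 0 || in_dim > MAX_FFN_DIM || out_dim > MAX_FFN_DIM {
        return None;
    }
    // params = rank * (in_dim + out_dim), 4 bytes each
    in_dim
        .checked_add(out_dim)?
        .checked_mul(rank)?
        .checked_mul(4)
}

/// Adds one adapter to a running total; None if the total would
/// overflow or exceed the budget.
fn add_within_budget(total: u64, bytes: u64) -> Option<u64> {
    let new_total = total.checked_add(bytes)?;
    if new_total <= MAX_LORA_MEMORY_BUDGET {
        Some(new_total)
    } else {
        None
    }
}
```

Returning `Option` rather than panicking keeps untrusted values, such as dimensions read from an adapter file, from ever reaching an allocation unchecked.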

Example Usage

# Train with LoRA (all adapters enabled by default)
./facaded_transformer_cuda train -m models/llama.gguf \
    --lora-rank 16 --lora-alpha 32 --train-file data.txt

# Train with attention-only LoRA (faster, less memory)
./facaded_transformer_cuda train -m models/llama.gguf \
    --lora-layers q,k,v,o --train-text "Training data"

# Train with FFN-only LoRA
./facaded_transformer_cuda train -m models/llama.gguf \
    --lora-layers gate,up,down --epochs 10

# Save and load LoRA adapters
./facaded_transformer_cuda train -m model.gguf \
    --lora-save adapter.lora --train-file corpus.txt
./facaded_transformer_cuda generate -m model.gguf \
    --lora-load adapter.lora -p "Hello"

Testing

Running All Tests

# Run CUDA tests
./transformer_tests_cuda.sh

# Run OpenCL tests
./transformer_tests_opencl.sh

# Run Rust tests
(cd rust_cuda && cargo test)
(cd facaded_rust_cuda && cargo test)

Test Categories

Each test suite covers:

Category	Tests
Help & Usage	Command-line interface verification
Model Loading	GGUF parsing and validation
Quantization	Dequantization accuracy
Tokenization	BPE encoding/decoding
Generation	End-to-end inference
Introspection	Facade API functionality
Error Handling	Invalid input handling

Test Output Example

=========================================
Transformer CUDA Comprehensive Test Suite
=========================================

Group: Quantization Tests
Test: FP16 conversion
  ✓ FP16 to FP32 conversion passed
Test: FP16 zero conversion
  ✓ FP16 zero conversion passed
Test: Quantization type enum
  ✓ Quantization type enum values correct

Group: Facade Tests
Test: Facade initialization
  ✓ Facade starts unloaded
Test: Facade getters
  ✓ Unloaded facade returns zeros

=== Test Results ===
Passed: 15
Failed: 0
Total:  15
====================
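
The first tests in the sample output exercise FP16 to FP32 conversion. For reference, a self-contained Rust sketch of that conversion is shown below; `f16_to_f32` is a hypothetical name and this is not the code under test. It handles signed zero, subnormals, infinities and NaN in addition to normal values.

```rust
/// Converts IEEE 754 half-precision bits to f32.
fn f16_to_f32(h: u16) -> f32 {
    let sign = ((h >> 15) & 1) as u32;
    let exp = ((h >> 10) & 0x1F) as u32;
    let frac = (h & 0x3FF) as u32;
    let bits = match (exp, frac) {
        // Signed zero.
        (0, 0) => sign << 31,
        // Subnormal: value = frac × 2^-24, exactly representable in f32.
        (0, _) => {
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        // Infinity (frac == 0) or NaN (frac != 0).
        (0x1F, _) => (sign << 31) | 0x7F80_0000 | (frac << 13),
        // Normal: rebias the exponent from 15 to 127 (add 112).
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits)
}
```

For example, 0x3C00 converts to 1.0, 0xC000 to -2.0, and 0x7BFF to 65504, the largest finite half-precision value.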

Formal Verification with Kani

Overview

The Rust Facade implementation includes 196 Kani proof harnesses. Each harness uses bounded model checking to prove that a class of bugs (panics, out-of-bounds access, arithmetic overflow) cannot occur for any input within the harness's bounds, which goes beyond what example-based testing can show.

Verification Categories

The proof suite addresses all 15 CISA security verification requirements: 12 are covered by proof harnesses and 3 are not applicable to the verified code:

#	Requirement	Module(s)	Status
1	Strict Bound Checks	`bounds.rs`, `quant.rs`, `model.rs`	✅
2	Pointer Validity Proofs	`memory.rs`	✅
3	No-Panic Guarantee	`panics.rs`	✅
4	Integer Overflow Prevention	`arithmetic.rs`, `model.rs`	✅
5	Division-by-Zero Exclusion	`arithmetic.rs`	✅
6	Global State Consistency	N/A (no shared mutable state)	⚪
7	Deadlock-Free Logic	N/A (no locks in verified code)	⚪
8	Input Sanitization Bounds	`tokenizer.rs`	✅
9	Result Coverage Audit	`panics.rs`, `enums.rs`	✅
10	Memory Leak/Leakage Proofs	`memory.rs`	✅
11	Constant-Time Execution	N/A (no cryptographic secrets)	⚪
12	State Machine Integrity	`enums.rs`, `model.rs`	✅
13	Enum Exhaustion	`enums.rs`	✅
14	Floating-Point Sanity	`floats.rs`	✅
15	Resource Limit Compliance	`memory.rs`, `model.rs`	✅

Module Proof Counts

Module	Harnesses	Purpose
`bounds.rs`	8	Array/slice bounds checking
`arithmetic.rs`	11	Overflow/division-by-zero prevention
`memory.rs`	9	Memory safety & resource limits
`panics.rs`	12	No-panic guarantees
`enums.rs`	8	Exhaustive enum matching
`floats.rs`	11	Floating-point safety
`tokenizer.rs`	12	Input sanitization
`quant.rs`	15	Quantization arithmetic
`model.rs`	13	Model loading safety
`trainer.rs`	25	Training infrastructure safety
`lora.rs`	70	LoRA adapter safety & CISA compliance
Total	196

Key Kani Proofs

Bounds Checking Proofs

verify_get_scale_min_k4_bounds ✓
verify_q8_0_dequant_bounds ✓
verify_tokenizer_decode_bounds ✓

Quantization Safety Proofs

verify_q4k_scale_extraction ✓
verify_q6k_bit_reconstruction ✓
verify_bytes_calculation ✓

Arithmetic Safety Proofs

verify_block_count_no_overflow ✓
verify_scale_arithmetic_no_overflow ✓
verify_qk_k_division_safe ✓

Memory Safety Proofs

verify_bytemuck_alignment ✓
verify_block_struct_sizes ✓
verify_tensor_security_budget ✓

LoRA Safety Proofs

verify_validate_rejects_zero_rank ✓ (CISA #5: Division-by-zero)
verify_lora_scaling_factor ✓ (CISA #5: Safe division)
verify_adapter_memory_checked ✓ (CISA #4: Overflow prevention)
verify_memory_budget_enforcement ✓ (CISA #15: Resource limits)
verify_load_rejects_excessive_name_len ✓ (CISA #15: DoS prevention)
verify_try_new_rejects_zero_heads ✓ (CISA #5: Division-by-zero)
verify_backward_pass_safe ✓ (CISA #14: Floating-point safety)

Running Kani Verification

# Run all proofs
cd facaded_rust_cuda
cargo kani

# Run specific proof
cargo kani --harness verify_q8_0_dequant_bounds

# Run all proofs whose names contain a substring (here: verify_q)
cargo kani --harness verify_q

Why Formal Verification Matters

Traditional testing can only verify specific test cases. Formal verification with Kani:

Exhaustively checks all possible inputs within defined bounds
Proves absence of panics, buffer overflows, and undefined behavior for the properties each harness covers
Catches edge cases that random testing might miss
Provides machine-checked, rather than sample-based, assurance for safety-critical code
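
To show what a proof harness looks like, here is a small hypothetical example in the style of the proofs listed earlier. The two functions and both harnesses are illustrative and are not taken from `facaded_rust_cuda`; the 34-byte block size assumes the Q8_0 layout of a 2-byte FP16 scale followed by 32 one-byte quants. `kani::any()` stands for every possible value of its type, so the assertions are checked for all inputs rather than for sampled ones.

```rust
/// Bytes needed for `num_blocks` Q8_0 blocks, or None on overflow.
fn q8_0_bytes(num_blocks: usize) -> Option<usize> {
    num_blocks.checked_mul(34)
}

/// Effective LoRA scale; returns None instead of dividing by zero.
fn scaling_or_none(alpha: f32, rank: u32) -> Option<f32> {
    if rank == 0 {
        None
    } else {
        Some(alpha / rank as f32)
    }
}

// Compiled only when running under `cargo kani`.
#[cfg(kani)]
mod proofs {
    use super::*;

    #[kani::proof]
    fn verify_q8_0_bytes_no_overflow() {
        let n: usize = kani::any(); // every possible usize
        if let Some(bytes) = q8_0_bytes(n) {
            assert!(bytes / 34 == n); // the result was not wrapped
        }
    }

    #[kani::proof]
    fn verify_zero_rank_rejected() {
        let alpha: f32 = kani::any(); // every f32, including NaN
        assert!(scaling_or_none(alpha, 0).is_none());
    }
}
```

Under a normal `cargo build` or `cargo test` the `proofs` module is compiled out, so the harnesses add no cost to the shipped binary.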

CISA/NSA Compliance

Secure by Design

This project follows CISA (Cybersecurity and Infrastructure Security Agency) and NSA (National Security Agency) Secure by Design principles:

Principle	Implementation
Memory Safety	In safe Rust, ownership, borrowing, and bounds checks rule out buffer overflows, use-after-free, and data races
Formal Verification	196 Kani proofs mathematically verify absence of critical bugs
Input Validation	All CLI inputs validated before processing
Defense in Depth	Multiple layers of safety (language, compiler, runtime checks)
Secure Defaults	Safe default configurations throughout
Transparency	Open source with full code visibility

Compliance Checklist

Attestation

This codebase has been developed following secure software development lifecycle (SSDLC) practices and demonstrates:

196 formal verification proofs passed (Kani proofs across 11 modules including LoRA)
Zero warnings compilation across all implementations
Consistent API across all language/backend combinations
Production-ready code quality

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Author

Matthew Abbott
Email: mattbachg@gmail.com

Built with precision. Verified with rigor. Secured by design.
