# Requires macOS with Xcode command line tools
clang++ -O2 -std=c++17 -framework Metal -framework Foundation \
-o transformer-metal transformer_metal.mm
Note: The Metal implementation is currently in active development. Basic functionality works but some features may be incomplete.
Starting layer for remote execution (default: auto)
| Option | Description |
| --- | --- |
| `-e, --embed <dim>` | Embedding dimension (default: 768) |
| `-f, --ffn <dim>` | FFN hidden dimension (default: 3072) |
| `-a, --heads <n>` | Number of attention heads (default: 12) |
| `-k, --kvheads <n>` | KV heads for GQA (default: 12) |
| `-q, --seq-len <n>` | Sequence length (default: 512) |
| `-v, --vocab-size <n>` | Vocabulary size (default: 50257) |
| `-x, --max-seq-len <n>` | Maximum sequence length (default: 2048) |
| `--quant <type>` | Quantization type (see server options) |
| `--rope-base <n>` | RoPE base frequency (default: 10000.0) |
| `--rope-scale <n>` | RoPE scaling factor (default: 1.0) |
| `--eps <n>` | Layer norm epsilon (default: 1e-5) |
| `--no-cache` | Disable activation caching |
| `--no-grad-cache` | Disable gradient caching |
| `--timeout <ms>` | Connection timeout (default: 5000ms) |
| `--retries <n>` | Connection retry count (default: 3) |
| `--verbose` | Enable verbose output |
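The defaults above imply a few derived sizes. The following is a minimal sketch, assuming the common convention that the head dimension is `embed / heads` and that the K/V projections have `kvheads × head_dim` columns. `derived_dims` is a hypothetical helper for illustration, not part of the project.

```python
def derived_dims(embed=768, heads=12, kv_heads=12):
    """Derive per-head sizes from the CLI defaults (assumed convention)."""
    if embed % heads != 0:
        raise ValueError("embed must be divisible by heads")
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads for GQA")
    head_dim = embed // heads
    return {
        "head_dim": head_dim,
        "kv_dim": kv_heads * head_dim,
        "queries_per_kv_head": heads // kv_heads,
    }

print(derived_dims())             # defaults: plain multi-head attention
print(derived_dims(kv_heads=4))   # GQA: three query heads share each KV head
```

With fewer KV heads than attention heads, the K/V projections (and any KV cache) shrink proportionally, which is the usual motivation for GQA.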
Benchmark Mode Options
| Option | Description |
| --- | --- |
| `-i, --interface <name>` | Network interface (default: eth0) |
| `-s, --server <mac>` | Server MAC address (required) |
| `-n, --iterations <n>` | Benchmark iterations (default: 10) |
| `-l, --layers <n>` | Transformer layers to benchmark (default: 12) |
| `-e, --embed <dim>` | Embedding dimension (default: 768) |
| `-q, --seq-len <n>` | Sequence length (default: 512) |
| `--batch-size <n>` | Batch size for benchmarking (default: 1) |
| `--warmup <n>` | Warmup iterations (default: 2) |
| `--output <file>` | Output results to CSV file |
| `--verbose` | Enable verbose output |
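The interaction between `--warmup`, `--iterations`, and `--output` can be sketched as follows. This is an illustration of the general warmup-then-measure pattern, not the project's benchmark code; the CSV column names are hypothetical.

```python
import csv
import io
import time

def benchmark(fn, iterations=10, warmup=2):
    """Run fn `warmup` times untimed, then time each of `iterations` runs."""
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms

def to_csv(timings_ms):
    """Serialize timings as CSV text (hypothetical two-column layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["iteration", "latency_ms"])
    for i, t in enumerate(timings_ms):
        writer.writerow([i, f"{t:.3f}"])
    return buf.getvalue()

calls = []
timings = benchmark(lambda: calls.append(1))  # stand-in for one inference pass
print(to_csv(timings))
```

Warmup runs are discarded because the first iterations typically include one-off costs such as memory allocation and cache population.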
Test Mode Options
| Option | Description |
| --- | --- |
| `--all` | Run all tests |
| `--protocol` | Test protocol handling |
| `--config` | Test configuration |
| `--quant` | Test quantization/dequantization |
| `--kernels` | Test CUDA kernels (requires GPU) |
| `--network` | Test network layer |
| `--verbose` | Enable verbose test output |
Examples
# Generate text from a prompt
./transformer-cuda generate -m models/tinyllama.gguf -p "Hello, world" -n 100
# Interactive chat mode with GPU acceleration
./transformer-cuda generate -m models/llama-7b.Q4_K_M.gguf -i -g
# Show model information
./transformer-cuda info -m models/tinyllama.gguf
# Run tests
./transformer-cuda test --all
# CPU offloading - run layers 0,1,2 on CPU, rest on GPU
./transformer-cuda generate -m models/llama.gguf -p "Hello" --cpu-layers 0,1,2
# All CPU execution (for systems without GPU)
./transformer-cuda generate -m models/tinyllama.gguf -p "Hello" --all-cpu
# Distributed inference - start server on remote GPU node
sudo ./transformer-cuda --server --interface eth0
# Distributed inference - connect client to server
sudo ./transformer-cuda generate -m models/llama.gguf -p "Hello" \
--client --interface eth0 --remote-mac aa:bb:cc:dd:ee:ff -r 6
Facade Transformer Commands
The facade implementations add deep introspection capabilities for analyzing model internals.
Commands
| Command | Description |
| --- | --- |
| `server` | Run as distributed inference server |
| `client` | Run as distributed inference client |
| `facade` | Run facade mode with introspection |
| `generate` | Text generation from GGUF model |
| `benchmark` | Run network/inference benchmarks |
| `test` | Run built-in tests |
Facade Mode Options
| Option | Description |
| --- | --- |
| `--model <path>` | GGUF model file path (required) |
| `--tokenizer <path>` | Tokenizer JSON file path |
| `--prompt <text>` | Text prompt for generation |
| `--max-tokens <n>` | Maximum tokens to generate (default: 100) |
| `--temperature <n>` | Sampling temperature (default: 1.0) |
| `--top-k <n>` | Top-K sampling (default: 40) |
| `--top-p <n>` | Top-P nucleus sampling (default: 0.9) |
| `--inspect` | Enable introspection mode |
| `--show-attention` | Display attention weights |
| `--show-hidden <layer>` | Display hidden states for layer |
| `--show-qkv <layer>` | Display Q/K/V vectors for layer |
| `--show-logits` | Display output logits |
| `--show-entropy` | Display attention entropy per layer |
| `--show-saliency <pos>` | Display saliency map for token position |
| `--show-weights <layer>` | Display weight matrices for layer |
| `--show-tensors` | List all tensor names in model |
| `--dump-hidden <file>` | Dump hidden states to CSV file |
| `--dump-attention <file>` | Dump attention weights to CSV file |
| `--layer <n>` | Specific layer for inspection (default: all) |
| `--head <n>` | Specific attention head (default: all) |
| `--position <n>` | Specific token position (default: all) |
| `--verbose` | Enable verbose output |
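As an illustration of what `--show-entropy` reports, attention entropy is commonly defined as the Shannon entropy of each row of attention weights. The sketch below assumes that definition and natural-log units; how the project aggregates entropy per layer is not specified here.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row that sums to 1."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly over 4 tokens
peaked = [1.0, 0.0, 0.0, 0.0]        # attention fixed on a single token
print(attention_entropy(uniform))    # maximal for 4 tokens: log(4)
print(attention_entropy(peaked))     # minimal: zero
```

High entropy indicates diffuse attention and low entropy indicates a head that focuses on few positions.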
Generate Mode Options
| Option | Description |
| --- | --- |
| `-m, --model <path>` | Path to GGUF model file (required) |
| `-p, --prompt <text>` | Text prompt for generation |
| `-n, --tokens <n>` | Max tokens to generate (default: 256) |
| `-t, --temperature <n>` | Sampling temperature (default: 0.7) |
| `--top-k <n>` | Top-K sampling (default: 40) |
| `--top-p <n>` | Top-P/nucleus sampling (default: 0.9) |
| `-i, --interactive` | Interactive chat mode |
| `-g, --gpu` | Use GPU-accelerated inference |
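The three sampling options usually compose in a fixed order: scale logits by temperature, keep the top-K candidates, then keep the smallest set whose cumulative probability reaches top-P. The sketch below follows that common ordering as an assumption; the project's sampler may differ in details such as tie-breaking or treating temperature 0 as greedy decoding.

```python
import math
import random

def sample(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Temperature-scale logits, keep top-k, then the smallest top-p nucleus."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                   # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k > 0:
        probs = probs[:top_k]
    kept, cumulative = [], 0.0
    for p, i in probs:                                # build the nucleus
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    r = rng.random() * sum(p for p, _ in kept)        # renormalize implicitly
    acc = 0.0
    for p, i in kept:
        acc += p
        if r < acc:
            return i
    return kept[-1][1]

print(sample([1.0, 2.0, 3.0], rng=random.Random(0)))
```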
Test Mode Options (Facade)
| Option | Description |
| --- | --- |
| `--all` | Run all tests |
| `--protocol` | Test protocol handling |
| `--config` | Test configuration |
| `--quant` | Test quantization/dequantization |
| `--kernels` | Test CUDA kernels (requires GPU) |
| `--network` | Test network layer |
| `--facade` | Test facade introspection functions |
| `--tokenizer` | Test tokenizer encode/decode |
| `--gguf` | Test GGUF model loading |
| `--verbose` | Enable verbose test output |
Facade API Methods
The facade provides programmatic access to internal states:
| Method | Description |
| --- | --- |
| `getHiddenState(layer, pos)` | Get hidden state vector |
| `getAttentionScores(layer, head)` | Get attention weights |
| `getQKV(layer, pos, type)` | Get Q/K/V vectors |
| `getLogits()` | Get output logits |
| `getLayerNormStats(layer)` | Get normalization statistics |
| `getTokenProbabilities(topK)` | Get token probabilities |
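To make the relationship between `getLogits()` and `getTokenProbabilities(topK)` concrete, here is a hedged sketch of the usual computation: a softmax over the logits followed by selecting the `topK` largest entries. `top_k_probabilities` is a hypothetical stand-in written for illustration; the facade's actual return type and ordering are not documented in this section.

```python
import math

def top_k_probabilities(logits, k):
    """Softmax the logits and return the k most likely (token_id, prob) pairs."""
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    ranked = sorted(enumerate(exps), key=lambda t: t[1], reverse=True)
    return [(i, e / total) for i, e in ranked[:k]]

top = top_k_probabilities([2.0, 0.5, 3.0, -1.0], k=2)
print(top)
```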
Facade Examples
# Run facade with introspection
./facaded-transformer-cuda facade --model model.gguf --tokenizer tok.json \
--prompt "Hello" --inspect
# Show attention weights for layer 0
./facaded-transformer-cuda facade --model model.gguf --prompt "Test" \
--show-attention --layer 0
# Dump hidden states to CSV
./facaded-transformer-cuda facade --model model.gguf --prompt "Test" \
--dump-hidden hidden.csv
# Generate text with GPU acceleration
./facaded-transformer-cuda generate -m models/llama.gguf -p "Hello" -g
# Interactive chat mode
./facaded-transformer-cuda generate -m models/llama.gguf -i -g
Rust Transformer Commands
The Rust implementations provide memory-safe inference with formal verification.
Commands (rust_cuda)
| Command | Description |
| --- | --- |
| `generate` | Text generation from GGUF model |
| `server` | Run as distributed inference server |
| `client` | Run as distributed inference client |
| `info` | Display model information |
Generate Mode Options (Rust)
| Option | Description |
| --- | --- |
| `-m, --model <path>` | Path to GGUF model file (required) |
| `-p, --prompt <text>` | Text prompt for generation |
| `-n, --tokens <n>` | Max tokens to generate (default: 256) |
| `-t, --temperature <n>` | Sampling temperature (default: 0.7) |
| `--top-k <n>` | Top-K sampling (default: 40) |
| `--top-p <n>` | Top-P/nucleus sampling (default: 0.9) |
| `--rep-penalty <n>` | Repetition penalty (default: 1.1) |
| `-i, --interactive` | Interactive chat mode |
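A repetition penalty such as `--rep-penalty 1.1` is often applied to the logits of tokens that already appear in the output. The sketch below uses one widely used rule (divide positive logits by the penalty, multiply negative ones) as an assumption; it is not confirmed to be the rule this project implements.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Return a copy of logits with already-generated tokens made less likely."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty   # shrink positive logits toward zero
        else:
            out[token_id] *= penalty   # push negative logits further down
    return out

result = apply_repetition_penalty([2.2, -1.0, 0.5], generated_ids=[0, 1])
print(result)
```

A penalty of 1.0 leaves the logits unchanged, so values slightly above 1.0 give a mild discouragement of loops.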
Server Mode Options (Rust)
| Option | Description |
| --- | --- |
| `-i, --interface <name>` | Network interface (default: eth0) |
| `--server-id <n>` | Server ID for multi-server setups |
Client Mode Options (Rust)
| Option | Description |
| --- | --- |
| `-s, --server <mac>` | Server MAC address (required, XX:XX:XX:XX:XX:XX) |
| `-i, --interface <name>` | Network interface (default: eth0) |
| `-l, --layers <n>` | Total transformer layers (default: 12) |
| `-r, --remote <n>` | Remote layers to offload (default: 6) |
| `-e, --embed <dim>` | Embedding dimension (default: 768) |
| `--timeout <ms>` | Connection timeout (default: 5000) |
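The `XX:XX:XX:XX:XX:XX` format expected by `--server` can be validated before any socket is opened. The following is a minimal sketch of such a check; `parse_mac` is a hypothetical helper, not the project's parser.

```python
import re

_MAC_RE = re.compile(r"[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}")

def parse_mac(text):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes, or raise ValueError."""
    if not _MAC_RE.fullmatch(text):
        raise ValueError(f"invalid MAC address: {text!r}")
    return bytes(int(part, 16) for part in text.split(":"))

print(parse_mac("aa:bb:cc:dd:ee:ff").hex())
```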
Commands (facaded_rust_cuda)
| Command | Description |
| --- | --- |
| `generate` | Text generation with introspection |
| `analyze` | Analyze model internals for a prompt |
| `inspect` | Interactive inspection mode |
| `info` | Display model information |
Generate Mode Options (Rust Facade)
| Option | Description |
| --- | --- |
| `-m, --model <path>` | Path to GGUF model file (required) |
| `-p, --prompt <text>` | Text prompt for generation |
| `-n, --tokens <n>` | Max tokens to generate (default: 256) |
| `-t, --temperature <n>` | Sampling temperature (default: 0.7) |
| `--top-k <n>` | Top-K sampling (default: 40) |
| `--top-p <n>` | Top-P/nucleus sampling (default: 0.9) |
| `-i, --interactive` | Interactive chat mode |
| `--show-hidden` | Show hidden state statistics |
| `--show-entropy` | Show attention entropy |
Analyze Mode Options (Rust Facade)
| Option | Description |
| --- | --- |
| `-m, --model <path>` | Path to GGUF model file (required) |
| `-p, --prompt <text>` | Text prompt to analyze (required) |
| `--layer <n>` | Layer to inspect (default: last) |
| `--head <n>` | Attention head to inspect (default: 0) |
| `--show-qkv` | Show Q/K/V vectors |
| `--show-logits` | Show top-k logits |
| `--show-saliency` | Show saliency map |
Rust Examples
# Generate text (rust_cuda)
./rust_cuda/target/release/glassbox-transformer generate \
-m models/tinyllama.gguf -p "Hello world" -n 100
# Interactive mode with repetition penalty
./rust_cuda/target/release/glassbox-transformer generate \
-m models/llama.gguf -i --rep-penalty 1.2
# Start distributed server
sudo ./rust_cuda/target/release/glassbox-transformer server -i eth0
# Connect as client with layer offloading
sudo ./rust_cuda/target/release/glassbox-transformer client \
-s aa:bb:cc:dd:ee:ff -i eth0 -r 6
# Analyze with Q/K/V inspection (facaded)
./facaded_rust_cuda/target/release/glassbox-transformer-facaded analyze \
-m models/llama.gguf -p "What is AI?" --show-qkv --layer 0
# Generate with hidden state display (facaded)
./facaded_rust_cuda/target/release/glassbox-transformer-facaded generate \
-m models/llama.gguf -p "Hello" --show-hidden --show-entropy
Agentic Transformer Commands
The agentic transformer provides an interactive agent interface with CPU offloading support.
Usage Modes
| Mode | Description |
| --- | --- |
| Interactive | Run with no arguments for REPL mode |
| Script | `--script <file>` to run commands from file |
| Stdin | `--stdin` to read commands from pipe |
| Single | Direct prompt with `-p` flag |
Generation Options
| Option | Description |
| --- | --- |
| `-p, --prompt <text>` | Input prompt for generation |
| `-n, --max-tokens <n>` | Maximum tokens to generate (default: 5) |
| `-t, --temperature <n>` | Sampling temperature 0.0-2.0 (default: 1.0) |
| `--top-k <k>` | Top-K sampling (disable with -1) |
| `--top-p <p>` | Nucleus/Top-P sampling 0.0-1.0 (default: 1.0) |
| `--repetition-penalty <p>` | Penalize repeated tokens (default: 1.0) |
| `--context-length <n>` | Max context window size (default: 1024) |
| `--seed <s>` | Random seed for reproducibility |
Scripting & Batch Options
| Option | Description |
| --- | --- |
| `--script <file>` | Load and execute commands from script file |
| `--stdin` | Read commands from stdin (for piping) |
| `--log <file>` | Write session log to file (default: agent_history.log) |
| `--no-log` | Disable session logging |
| `-o, --output <file>` | Save generated text to file |
| `--json-output` | Format output as JSON |
Inspection & Diagnostics
| Option | Description |
| --- | --- |
| `--list-tensors` | List all tensors in model and exit |
| `--show-quant-stats` | Display quantization statistics (default: yes) |
| `--no-quant-stats` | Skip quantization statistics output |
| `--fp32-only` | Only load F32 tensors, skip quantized |
| `--test-dequant` | Test dequantization on all quantized tensors |
Device & Performance Options
| Option | Description |
| --- | --- |
| `--device <id>` | Select GPU device ID (default: 0) |
| `--batch-size <n>` | Batch size for processing (default: 1) |
| `--memory-limit <MB>` | Limit GPU memory usage in MB (0=unlimited) |
| `--benchmark` | Run benchmark tests after generation |
CPU Offloading Options
| Option | Description |
| --- | --- |
| `--cpu-layers <list>` | Run specified layers on CPU (e.g., 0,2,4) |
| `--all-cpu` | Run all transformer layers on CPU (RAM) |
| `--all-gpu` | Run all transformer layers on GPU (default) |
Interactive Agent Commands
| Command | Description |
| --- | --- |
| `load <model.gguf> <tok.json>` | Load model and tokenizer |
| `run <prompt> [tokens] [temp]` | Run inference/generation |
| `info` | Display model architecture |
| `inspect [type]` | Inspect model (summary/performance/layers) |
| `list-tensors` | List all model tensors |
| `quant-stats` | Show quantization statistics |
| `save <filename>` | Save last output to file |
| `history` | Show action history |
| `help` | Show agent commands |
| `quit/exit` | Exit agent |
Agentic Examples
# Interactive mode
./agentic_transformer
# Single generation with CPU offloading
./agentic_transformer model.gguf tokenizer.json \
-p "Once upon a time" -n 50 -t 0.9 --cpu-layers 0,1,2
# Batch mode from script
./agentic_transformer model.gguf tokenizer.json \
--script commands.txt --log batch.log
# Piped batch mode
echo "load model.gguf tok.json" | ./agentic_transformer --stdin
# List tensors and quantization stats
./agentic_transformer model.gguf tokenizer.json --list-tensors
./agentic_transformer model.gguf tokenizer.json --show-quant-stats
Distributed Inference (Layer 2 Ethernet)
Overview
GlassBoxAI-Transformer supports distributed inference over Layer 2 Ethernet using the DTX Protocol (Distributed Tensor eXchange). This enables offloading transformer layers to remote GPU nodes with minimal latency by bypassing the TCP/IP stack entirely.
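To illustrate what bypassing the TCP/IP stack means at the byte level, the sketch below packs a tensor into raw Ethernet II frames: destination MAC, source MAC, EtherType, then payload. It is not the DTX wire format, which this section does not specify. The EtherType `0x88B5` (an IEEE local experimental value), the 4-byte sequence number, and `build_frames` itself are assumptions made for illustration.

```python
import struct

ETHERTYPE_EXPERIMENTAL = 0x88B5  # assumption: IEEE local experimental EtherType, not DTX's real value
MAX_PAYLOAD = 1500               # standard Ethernet MTU in bytes

def build_frames(dst_mac, src_mac, tensor_bytes):
    """Split a tensor buffer into raw Ethernet II frames (no IP/TCP headers)."""
    frames = []
    chunk_size = MAX_PAYLOAD - 4  # reserve 4 bytes for a hypothetical sequence number
    header = dst_mac + src_mac + struct.pack("!H", ETHERTYPE_EXPERIMENTAL)
    for seq, start in enumerate(range(0, len(tensor_bytes), chunk_size)):
        chunk = tensor_bytes[start:start + chunk_size]
        frames.append(header + struct.pack("!I", seq) + chunk)
    return frames

dst = bytes.fromhex("aabbccddeeff")
src = bytes.fromhex("112233445566")
activations = struct.pack("<768f", *([0.5] * 768))  # one 768-dim FP32 vector
frames = build_frames(dst, src, activations)
print(len(frames), [len(f) for f in frames])
```

Sending such frames requires a raw socket bound to the interface, which is why the server and client examples run under `sudo`.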
For systems with limited GPU memory, GlassBoxAI-Transformer supports mixed device execution where specific transformer layers can be offloaded to CPU while the rest run on GPU.
Use Cases
Large models on small GPUs: Run 7B+ models on 4-6GB GPUs
Memory pressure: Avoid out-of-memory errors during inference
Debugging: Isolate CPU vs GPU computation issues
CPU-only systems: Run inference without any GPU
Configuration
# Offload first 4 layers to CPU
./transformer-cuda generate -m model.gguf -p "Hello" --cpu-layers 0,1,2,3
# Offload every other layer
./transformer-cuda generate -m model.gguf -p "Hello" --cpu-layers 0,2,4,6,8
# All CPU (no GPU required)
./transformer-cuda generate -m model.gguf -p "Hello" --all-cpu
# All GPU (default, maximum performance)
./transformer-cuda generate -m model.gguf -p "Hello" --all-gpu
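The effect of these flags can be pictured as a per-layer device map. The sketch below is a hypothetical illustration of how a `--cpu-layers` list might be turned into such a map; `layer_devices` is not a function from the project.

```python
def layer_devices(num_layers, cpu_layers="", all_cpu=False):
    """Map each layer index to 'cpu' or 'gpu' from a --cpu-layers style list."""
    if all_cpu:
        return ["cpu"] * num_layers
    on_cpu = {int(tok) for tok in cpu_layers.split(",") if tok.strip()}
    bad = sorted(i for i in on_cpu if not 0 <= i < num_layers)
    if bad:
        raise ValueError(f"layer indices out of range: {bad}")
    return ["cpu" if i in on_cpu else "gpu" for i in range(num_layers)]

print(layer_devices(6, "0,2,4"))   # alternate CPU and GPU layers
print(layer_devices(6))            # default: everything on GPU
```

Each CPU/GPU boundary in the map implies an activation transfer between host and device, so alternating layers generally costs more transfers than offloading one contiguous range.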
Performance Considerations
| Configuration | Performance | Memory Usage |
| --- | --- | --- |
| All GPU | Fastest | Highest GPU memory |
| Mixed (few CPU layers) | ~80-90% of all-GPU speed | Reduced GPU memory |
| Mixed (many CPU layers) | ~30-50% of all-GPU speed | Minimal GPU memory |
| All CPU | Slowest | No GPU memory |
Training
Training Features
| Feature | Description |
| --- | --- |
| Backpropagation | Full gradient computation through all transformer layers |
| Adam Optimizer | Adaptive learning rate with bias correction (β1=0.9, β2=0.999) |
| Gradient Clipping | Norm-based gradient clipping for training stability |
| Activation Caching | Efficient caching for backward pass computation |
| Cross-Entropy Loss | Fused softmax + cross-entropy loss computation |
| Learning Rate Control | Configurable learning rate with warmup support |
| LoRA Fine-Tuning | Parameter-efficient adaptation with low-rank matrices |
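Two of the features above, norm-based gradient clipping and Adam with bias correction, can be sketched in a few lines. This is the textbook formulation using β1=0.9 and β2=0.999 from the table; the `eps=1e-8` value and the function names are assumptions, and the project's GPU kernels may organize the computation differently.

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Scale the gradient vector down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    return [g * max_norm / norm for g in grads]

def adam_step(params, grads, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = beta1 * mi + (1 - beta1) * g          # first-moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g      # second-moment estimate
        m_hat = mi / (1 - beta1 ** t)              # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_params, new_m, new_v

grads = clip_by_norm([3.0, 4.0], max_norm=1.0)     # norm 5 is rescaled to norm 1
params, m, v = adam_step([1.0, 1.0], grads, [0.0, 0.0], [0.0, 0.0], t=1)
print(grads, params)
```

On the first step, bias correction makes the update magnitude roughly equal to the learning rate regardless of gradient scale.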
Train Command Options
| Option | Description |
| --- | --- |
| `-m, --model <path>` | Path to GGUF model file (required) |
| `--lr <n>` | Learning rate (default: 1e-4) |
| `--epochs <n>` | Number of training epochs (default: 1) |
| `--batch-size <n>` | Batch size (default: 1) |
| `--grad-clip <n>` | Gradient clipping norm (default: 1.0) |
| `--train-text <text>` | Training text for fine-tuning |
| `--train-file <path>` | Load training text from file (whitespace-delimited) |
| `--verbose` | Show detailed training progress |
| `--help` | Show training help |
Implementation Status
| Implementation | Training Status |
| --- | --- |
| `facaded_transformer.cu` | ✅ Full CLI support |
| `transformer.cu` | ✅ Full CLI support |
| `facaded-transformer-opencl.cpp` | ✅ Full CLI support |
| `transformer-opencl.cpp` | ✅ Full CLI support |
| Rust implementations | ✅ GpuTrainer class available |
Example Usage
# Basic training with inline text
./facaded-transformer-cuda train -m models/llama.gguf --train-text "Hello world"
# Training from a text file
./facaded-transformer-cuda train -m models/llama.gguf --train-file corpus.txt --epochs 10
# Training with custom parameters
./facaded-transformer-cuda train -m models/llama.gguf \
--lr 0.0001 --epochs 10 --batch-size 4 --grad-clip 1.0 \
--train-text "The quick brown fox" --verbose
# Fine-tuning from file with verbose output
./facaded-transformer-cuda train -m models/tinyllama.gguf \
--epochs 100 --verbose --train-file training_data.txt
LoRA (Low-Rank Adaptation)
Overview
LoRA enables parameter-efficient fine-tuning by injecting trainable low-rank matrices into transformer layers. Instead of updating all model weights, LoRA freezes the base model and trains small adapter matrices, reducing memory usage by 10-100x while maintaining performance.
Original: Y = W × X
LoRA: Y = W × X + (B × A) × X × scaling
where A: (rank × in_dim), B: (out_dim × rank)
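The formula can be checked numerically with a tiny example. The sketch below uses plain Python lists so it runs without dependencies; it computes the adapter path as `B (A x)` without forming `B A`, then confirms that merging `W' = W + scaling · B A` gives the same output. The matrices and helper names are illustrative only.

```python
def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def matmul(P, Q):
    cols = list(zip(*Q))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in P]

def lora_forward(W, A, B, x, scaling):
    """Y = W x + scaling * B (A x), computed without forming B A."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))   # A: rank x in_dim, B: out_dim x rank
    return [b + scaling * l for b, l in zip(base, low_rank)]

def merge(W, A, B, scaling):
    """Fold the adapter into the base weights: W' = W + scaling * B A."""
    BA = matmul(B, A)
    return [[w + scaling * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]   # out_dim=2, in_dim=3
A = [[0.5, -1.0, 2.0]]                   # rank=1
B = [[1.0], [3.0]]
x = [1.0, 2.0, 3.0]
y_adapter = lora_forward(W, A, B, x, scaling=0.5)
y_merged = matvec(merge(W, A, B, scaling=0.5), x)
print(y_adapter, y_merged)   # both are [7.25, 11.75]
```

Keeping `A` and `B` separate costs two small matrix-vector products per call, while the merged form costs nothing extra at inference time.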
Key Benefits
| Benefit | Description |
| --- | --- |
| Memory Efficient | Only ~0.1-1% of parameters are trainable |
| Fast Training | Smaller gradient computation and optimizer state |
| Composable | Multiple LoRA adapters can be trained and swapped |
| Mergeable | Adapters can be merged into base weights for zero-overhead inference |
| Reversible | Original model preserved; adapters can be removed |
LoRA Configuration
| Parameter | Default | Description |
| --- | --- | --- |
| `rank` | 16 | Low-rank dimension (r). Higher = more capacity, more memory |
Each transformer layer can have up to 7 LoRA adapters:
| Adapter | Projection | Typical Dimensions |
| --- | --- | --- |
| Q | Query projection | dim → dim |
| K | Key projection | dim → kv_dim |
| V | Value projection | dim → kv_dim |
| O | Output projection | dim → dim |
| Gate | FFN gate/w1 | dim → ffn_dim |
| Up | FFN up/w3 | dim → ffn_dim |
| Down | FFN down/w2 | ffn_dim → dim |
Memory Comparison
For a 7B parameter model with rank=16:
| Configuration | Trainable Parameters | Memory (FP32) |
| --- | --- | --- |
| Full Fine-Tuning | ~7,000,000,000 | ~28 GB |
| LoRA (all layers) | ~8,000,000 | ~32 MB |
| LoRA (attention only) | ~4,000,000 | ~16 MB |
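Trainable-parameter counts follow directly from the adapter shapes: an adapter for a projection `in_dim → out_dim` holds `rank × (in_dim + out_dim)` values. The sketch below evaluates this under assumed LLaMA-7B-like sizes (dim=4096, kv_dim=4096, ffn_dim=11008, 32 layers), which are not taken from this project. The totals depend strongly on those sizes and on which projections receive adapters, so they should be read as order-of-magnitude guidance.

```python
def lora_params(in_dim, out_dim, rank):
    """A is rank x in_dim and B is out_dim x rank."""
    return rank * (in_dim + out_dim)

def total_lora_params(dim, kv_dim, ffn_dim, layers, rank, adapters):
    """Sum adapter sizes over the chosen projections and all layers."""
    shapes = {
        "Q": (dim, dim), "K": (dim, kv_dim), "V": (dim, kv_dim), "O": (dim, dim),
        "Gate": (dim, ffn_dim), "Up": (dim, ffn_dim), "Down": (ffn_dim, dim),
    }
    per_layer = sum(lora_params(*shapes[name], rank) for name in adapters)
    return per_layer * layers

# Assumed LLaMA-7B-like sizes; not taken from this project.
cfg = dict(dim=4096, kv_dim=4096, ffn_dim=11008, layers=32, rank=16)
for adapters in (["Q", "V"], ["Q", "K", "V", "O"],
                 ["Q", "K", "V", "O", "Gate", "Up", "Down"]):
    n = total_lora_params(adapters=adapters, **cfg)
    print(adapters, n, f"{n * 4 / 1e6:.1f} MB in FP32")
```

Under these assumptions, adapters on Q and V alone come to about 8.4 million parameters, and every configuration stays well below 1% of a 7B-parameter base model.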
Implementation Status
| Implementation | LoRA Status | Features |
| --- | --- | --- |
| `facaded_transformer.cu` | ✅ Full support | CUDA kernels, Adam optimizer |
| `facaded-transformer-opencl.cpp` | ✅ Full support | CPU functions, backward pass |
| `facaded_rust_cuda/` | ✅ Full support | Kani-verified, CISA compliant |
CUDA Kernels (facaded_transformer.cu)
// LoRA CUDA Kernels
loraInitAKernel() // Initialize A with small random values
loraInitBKernel() // Initialize B to zeros
loraForwardAKernel() // temp = A @ input
loraForwardBKernel() // out += scaling * B @ temp
loraDropoutKernel() // Apply inverted dropout
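The roles of these kernels can be mirrored in a CPU reference sketch. The Python below follows the comments above (A random, B zero, `temp = A @ input`, `out += scaling * B @ temp`, inverted dropout). The initialization scale `std=0.01` and the choice to apply dropout to the adapter input are assumptions, since the kernel list does not state them.

```python
import random

def lora_init(rank, in_dim, out_dim, std=0.01, rng=random):
    """A gets small random values, B starts at zero, so B @ A is zero at first."""
    A = [[rng.gauss(0.0, std) for _ in range(in_dim)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(out_dim)]
    return A, B

def inverted_dropout(x, p, rng=random):
    """Zero each element with probability p and scale survivors by 1/(1-p)."""
    if p <= 0.0:
        return list(x)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

def lora_forward(out, A, B, x, scaling):
    temp = [sum(a * xi for a, xi in zip(row, x)) for row in A]     # temp = A @ input
    return [o + scaling * sum(b * t for b, t in zip(row, temp))    # out += scaling * B @ temp
            for o, row in zip(out, B)]

rng = random.Random(0)
A, B = lora_init(rank=2, in_dim=3, out_dim=4, rng=rng)
x = inverted_dropout([1.0, 2.0, 3.0], p=0.1, rng=rng)
print(lora_forward([1.0] * 4, A, B, x, scaling=2.0))  # B is zero, so out is unchanged
```

Starting with `B = 0` means a freshly initialized adapter leaves the base model's output untouched until training updates `B`.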
# Run CUDA tests
./transformer_tests_cuda.sh
# Run OpenCL tests
./transformer_tests_opencl.sh
# Run Rust tests (subshells keep each cd relative to the repository root)
(cd rust_cuda && cargo test)
(cd facaded_rust_cuda && cargo test)
Test Categories
Each test suite covers:
| Category | Tests |
| --- | --- |
| Help & Usage | Command-line interface verification |
| Model Loading | GGUF parsing and validation |
| Quantization | Dequantization accuracy |
| Tokenization | BPE encoding/decoding |
| Generation | End-to-end inference |
| Introspection | Facade API functionality |
| Error Handling | Invalid input handling |
Test Output Example
=========================================
Transformer CUDA Comprehensive Test Suite
=========================================
Group: Quantization Tests
Test: FP16 conversion
✓ FP16 to FP32 conversion passed
Test: FP16 zero conversion
✓ FP16 zero conversion passed
Test: Quantization type enum
✓ Quantization type enum values correct
Group: Facade Tests
Test: Facade initialization
✓ Facade starts unloaded
Test: Facade getters
✓ Unloaded facade returns zeros
=== Test Results ===
Passed: 15
Failed: 0
Total: 15
====================
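The "FP16 to FP32 conversion" checks in the output above exercise a decode of the IEEE 754 half-precision format. The sketch below shows what such a conversion involves, decoding the bit fields by hand and cross-checking against Python's built-in half-precision codec. It illustrates the format and is not the project's test code.

```python
import struct

def fp16_bits_to_float(bits):
    """Decode a 16-bit IEEE 754 half-precision pattern by hand."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0:                       # zero or subnormal
        return sign * fraction * 2.0 ** -24
    if exponent == 0x1F:                    # infinity or NaN
        return sign * float("inf") if fraction == 0 else float("nan")
    return sign * (1.0 + fraction / 1024.0) * 2.0 ** (exponent - 15)

# Cross-check against the standard library's half-precision codec ('e' format).
for bits in (0x0000, 0x3C00, 0xC000, 0x7BFF, 0x0001):
    expected = struct.unpack("<e", struct.pack("<H", bits))[0]
    assert fp16_bits_to_float(bits) == expected
print(fp16_bits_to_float(0x3C00))   # 1.0
```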
Formal Verification with Kani
Overview
The Rust Facade implementation includes 196 Kani proof harnesses. Kani is a bounded model checker, so each harness proves the absence of certain classes of bugs for all inputs within the bounds it declares. This goes beyond example-based testing, which can only cover the specific cases it runs.
Verification Categories
The test suite covers 12 of 15 CISA security verification requirements:
# Run all proofs
cd facaded_rust_cuda
cargo kani
# Run specific proof
cargo kani --harness verify_q8_0_dequant_bounds
# Run proofs for a specific module
cargo kani --harness "verify_q*"
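As an illustration of the kind of property a harness named like `verify_q8_0_dequant_bounds` might establish, the sketch below dequantizes a Q8_0 block and checks an output bound for every possible int8 value. It assumes the GGML Q8_0 layout (one little-endian FP16 scale followed by 32 int8 values). The property and the Python check are hypothetical stand-ins, since the actual harness is not shown in this section.

```python
import struct

BLOCK = 32  # values per Q8_0 block in the GGML layout (assumed here)

def dequant_q8_0(block_bytes):
    """Decode one block: a little-endian FP16 scale followed by 32 int8 values."""
    if len(block_bytes) != 2 + BLOCK:
        raise ValueError("a Q8_0 block is 34 bytes")
    (scale,) = struct.unpack_from("<e", block_bytes, 0)
    quants = struct.unpack_from(f"<{BLOCK}b", block_bytes, 2)
    return [scale * q for q in quants]

# Bounded exhaustive check in the spirit of a proof harness: for every int8
# value and a few scales, the output magnitude never exceeds 128 * |scale|.
for scale in (0.0, 0.5, 1.0, 2.0):
    for q in range(-128, 128):
        block = struct.pack("<e", scale) + struct.pack(f"<{BLOCK}b", *([q] * BLOCK))
        assert all(abs(v) <= 128 * abs(scale) for v in dequant_q8_0(block))
print("bound holds for all int8 values at the sampled scales")
```

A model checker differs from this loop in that it reasons about all inputs symbolically rather than enumerating a sample of scales.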
Why Formal Verification Matters
Traditional testing can only verify specific test cases. Formal verification with Kani:
Exhaustively checks all possible inputs within defined bounds
Mathematically proves absence of panics, buffer overflows, and undefined behavior
Catches edge cases that random testing might miss
Provides proof-based assurance for safety-critical code, within the bounds each harness declares
CISA/NSA Compliance
Secure by Design
This project follows CISA (Cybersecurity and Infrastructure Security Agency) and NSA (National Security Agency) Secure by Design principles:
| Principle | Implementation |
| --- | --- |
| Memory Safety | Rust ownership model eliminates buffer overflows, use-after-free, and data races |
| Formal Verification | 196 Kani proofs mathematically verify absence of critical bugs |
| Input Validation | All CLI inputs validated before processing |
| Defense in Depth | Multiple layers of safety (language, compiler, runtime checks) |
This codebase has been developed following secure software development lifecycle (SSDLC) practices and demonstrates:
196 formal verification proofs passed (Kani proofs across 11 modules including LoRA)
Zero warnings compilation across all implementations
Consistent API across all language/backend combinations
Production-ready code quality
License
MIT License
Copyright (c) 2025 Matthew Abbott
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.