Boat: Deep Learning Framework in C

Boat is a lightweight, high-performance deep learning framework written in pure C, with optional CUDA GPU acceleration. It is designed for inference, training, and fine-tuning of neural networks and supports common model formats.

Key Features

Pure C Implementation: Minimal dependencies, easy integration into existing C/C++ projects
Automatic Differentiation: Computational graph-based autodiff with gradient tracking
Comprehensive Data Type Support:
- Floating point: FP64, FP32, FP16, FP8, FP4, BFLOAT16
- Integer: INT64, INT32, INT8, UINT8
- Low-bit quantization: BITS2 (2-bit packed), BITS1 (1-bit binary networks)
- Boolean: BOOL type
Quantization Pipeline: UINT8/INT8 affine quantization, BITS2 (2-bit), FLOAT4 (4-bit), per-channel, and QAT fake quantization
Model Format Support: ONNX (load/export/runtime executor), PyTorch (via LibTorch), HuggingFace Safetensors, GGUF (Q4_0, Q4_1, Q5_0, Q8_0)
Data Pipeline: Dataset/DataLoader abstraction with batching, shuffling, multi-threaded prefetch, and transforms
Performance Optimizations: SIMD (AVX2/NEON), SGEMM micro-kernel (hand-tuned with packing), OpenBLAS backend for accelerated matrix multiplication, OpenMP parallelism, memory pooling
CUDA GPU Acceleration: cuBLAS matmul, cuDNN conv/batchnorm, fused attention kernels (flash attention, GQA decode), FP8/BF16 inference and training kernels, custom CUDA kernels for element-wise ops, activations, pooling, normalization, and optimizers
Memory Efficient: Explicit memory management with reference counting
Cross-Platform: Works on Linux, macOS, and Windows
Extensible Architecture: Modular design for adding new operations and layers

Design Principles

Minimal Dependencies: Pure C with optional CUDA backend
Memory Efficient: Explicit memory management with reference counting
Extensible: Modular architecture for adding new operations and layers
Portable: Works on Linux, macOS, and Windows
Performance: Optimized for both CPU and GPU computation
Quantization Ready: Native support for low-bit networks
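Reference counting means each object records how many owners it has and is freed when the last owner releases it. This is why the examples pair every tensor they obtain with a `boat_tensor_unref` call.

The sketch below shows the pattern in minimal form. It uses a hypothetical `rc_buffer_t` type, not Boat's actual tensor struct. The counter is a plain `int`, so sharing a buffer across threads would need C11 atomics.

```c
#include <stdlib.h>

/* Hypothetical refcounted buffer; Boat's real tensor layout may differ. */
typedef struct {
    int refcount;   /* number of live owners */
    size_t size;    /* number of float elements */
    float *data;    /* heap storage, released together with the buffer */
} rc_buffer_t;

/* Allocate a zero-filled buffer owned by the caller (refcount starts at 1). */
rc_buffer_t *rc_buffer_create(size_t size) {
    rc_buffer_t *buf = malloc(sizeof *buf);
    if (!buf) return NULL;
    buf->data = calloc(size, sizeof *buf->data);
    if (!buf->data) { free(buf); return NULL; }
    buf->refcount = 1;
    buf->size = size;
    return buf;
}

/* Register an additional owner and return the same pointer. */
rc_buffer_t *rc_buffer_ref(rc_buffer_t *buf) {
    if (buf) buf->refcount++;
    return buf;
}

/* Release one owner; returns 1 if this call freed the buffer, else 0. */
int rc_buffer_unref(rc_buffer_t *buf) {
    if (!buf) return 0;
    if (--buf->refcount > 0) return 0;
    free(buf->data);
    free(buf);
    return 1;
}
```

A caller that stores the buffer in a second place calls `rc_buffer_ref` once, and each owner later calls `rc_buffer_unref`. Only the final call frees the memory.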

Installation

Prerequisites

C compiler (GCC, Clang, or MSVC)
CMake 3.10+ (recommended)
Git

Building from Source

Using CMake (Recommended)

# Clone the repository
git clone https://github.com/xiaoshaoning/boat.git
cd boat

# Create build directory
mkdir build
cd build

# Configure with CMake
cmake ..

# Build the library
make

# (Optional) Install system-wide
sudo make install

Using Makefile

The project also includes a traditional Makefile for simpler builds:

# Clone the repository
git clone https://github.com/xiaoshaoning/boat.git
cd boat

# Build the library
make all

# Build with debug symbols
make dev

# Build optimized release version
make release

# Run tests
make test

# Clean build artifacts
make clean

The Makefile automatically compiles all source files and creates a shared library libboat.so (or boat.dll on Windows) in the build/lib/ directory.

Build Options

-DBOAT_WITH_TESTS=ON: Build test suite
-DBOAT_WITH_EXAMPLES=ON: Build example programs
-DBOAT_WITH_ONNX=ON: Enable ONNX support (requires protobuf)
-DBOAT_WITH_CUDA=ON: Enable CUDA GPU acceleration (requires CUDA Toolkit and NVIDIA GPU)
-DBOAT_WITH_CUDNN=ON: Enable cuDNN integration (requires cuDNN)
-DBOAT_WITH_OPENBLAS=ON: Enable OpenBLAS backend for accelerated matrix multiplication
- Set -DBOAT_OPENBLAS_ROOT=/path/to/openblas if not in a standard location
-DBOAT_WITH_OPENMP=ON: Enable OpenMP parallelism
-DBOAT_WITH_SIMD=ON: Enable SIMD vectorization (AVX2/NEON)
-DBOAT_WITH_ONNXRUNTIME=ON: Enable ONNX Runtime executor

Build Configurations

Debug: Default build with debug symbols and assertions
Release: Optimized build (-O2 -DNDEBUG)
MinSizeRel: Size-optimized build
RelWithDebInfo: Release with debug symbols

Tests are enabled by default in Debug builds and disabled in Release/MinSizeRel builds.

Quick Start

Basic Tensor Operations

#include <stdio.h>
#include <boat/boat.h>
#include <boat/tensor.h>

int main(void) {
    boat_init();

    // Create a 2x3 FP32 tensor
    int64_t shape[] = {2, 3};
    boat_tensor_t* tensor = boat_tensor_create(shape, 2, BOAT_DTYPE_FLOAT32);

    // Access tensor properties
    size_t ndim = boat_tensor_ndim(tensor);
    int64_t* tensor_shape = boat_tensor_shape(tensor);
    boat_dtype_t dtype = boat_tensor_dtype(tensor);
    printf("ndim=%zu, shape=(%lld, %lld), dtype=%d\n", ndim,
           (long long)tensor_shape[0], (long long)tensor_shape[1], (int)dtype);

    // Perform operations
    boat_tensor_t* transposed = boat_tensor_transpose(tensor, NULL, 0);

    // Cleanup: drop our references so the tensors can be freed
    boat_tensor_unref(tensor);
    boat_tensor_unref(transposed);
    boat_cleanup();

    return 0;
}

Neural Network Training Example

#include <stdbool.h>
#include <stdio.h>
#include <boat/boat.h>
#include <boat/layers.h>
#include <boat/optimizers.h>
#include <boat/loss.h>

int main(void) {
    boat_init();

    // Create a simple feedforward network (784 -> 128 -> 10)
    boat_sequential_model_t* model = boat_sequential_create();

    // Add layers
    boat_layer_t* dense1 = boat_dense_layer_create(784, 128, true);
    boat_layer_t* relu1 = boat_relu_layer_create();
    boat_layer_t* dense2 = boat_dense_layer_create(128, 10, true);
    boat_layer_t* softmax = boat_softmax_layer_create();

    boat_sequential_add(model, dense1);
    boat_sequential_add(model, relu1);
    boat_sequential_add(model, dense2);
    boat_sequential_add(model, softmax);

    // Create optimizer
    boat_optimizer_t* optimizer = boat_adam_optimizer_create(0.001f, 0.9f, 0.999f, 1e-8f);

    // Create loss function
    boat_loss_t* loss = boat_cross_entropy_loss_create();

    // Training loop (simplified). `input` (shape {N, 784}) and `target` are
    // assumed to be tensors prepared elsewhere; data loading is omitted here.
    for (int epoch = 0; epoch < 10; epoch++) {
        // Forward pass
        boat_tensor_t* output = boat_model_forward(model, input);

        // Compute loss
        float loss_value = boat_loss_compute(loss, output, target);

        // Backward pass
        boat_tensor_t* grad = boat_loss_backward(loss);
        boat_model_backward(model, grad);

        // Update parameters, then clear gradients for the next iteration
        boat_optimizer_step(optimizer);
        boat_optimizer_zero_grad(optimizer);

        printf("Epoch %d, Loss: %f\n", epoch, loss_value);

        // Release per-iteration tensors
        boat_tensor_unref(output);
        boat_tensor_unref(grad);
    }

    // Cleanup
    boat_optimizer_free(optimizer);
    boat_loss_free(loss);
    boat_model_free(model);
    boat_cleanup();

    return 0;
}

Automatic Differentiation Example

#include <boat/boat.h>
#include <boat/autodiff.h>

int main() {
    boat_init();

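    // Input data for two 2x2 FP32 tensors (row-major)
    float data_a[] = {1.0f, 2.0f, 3.0f, 4.0f};
    float data_b[] = {0.5f, -1.0f, 2.0f, -6.0f};

    // Create tensors from the data, then variables with gradient tracking
    boat_tensor_t* tensor_a = boat_tensor_from_data((int64_t[]){2, 2}, 2, BOAT_DTYPE_FLOAT32, data_a);
    boat_tensor_t* tensor_b = boat_tensor_from_data((int64_t[]){2, 2}, 2, BOAT_DTYPE_FLOAT32, data_b);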

    boat_variable_t* a = boat_variable_create(tensor_a, true);
    boat_variable_t* b = boat_variable_create(tensor_b, true);

    // Perform operations with gradient tracking
    boat_variable_t* c = boat_add(a, b);
    boat_variable_t* d = boat_mul(c, a);
    boat_variable_t* e = boat_relu(d);

    // Compute gradients
    boat_backward(e);

    // Access gradients
    boat_tensor_t* grad_a = boat_variable_grad(a);
    boat_tensor_t* grad_b = boat_variable_grad(b);

    // Cleanup
    boat_variable_free(a);
    boat_variable_free(b);
    boat_variable_free(c);
    boat_variable_free(d);
    boat_variable_free(e);
    boat_cleanup();

    return 0;
}
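In the graph above, c = a + b, d = c * a and e = relu(d) are all element-wise, so d = a² + ab. The gradients `boat_backward` should produce can therefore be worked out by hand:

- de/da = 2a + b wherever d > 0, and 0 elsewhere.
- de/db = a wherever d > 0, and 0 elsewhere.

This assumes the output gradient is seeded with ones, which is the usual convention for a non-scalar output; the README does not state Boat's convention. The standalone C sketch below computes those values and checks them against central finite differences. It does not call Boat, and the function names and input values are illustrative.

```c
#include <stddef.h>

/* e = relu((a + b) * a) for a single element. */
static double forward_e(double a, double b) {
    double d = (a + b) * a;
    return d > 0.0 ? d : 0.0;
}

/* Hand-derived gradients: d = a*a + a*b, so dd/da = 2a + b and dd/db = a;
   relu passes them through only where d > 0. */
void grad_e(const double *a, const double *b, double *grad_a, double *grad_b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double mask = (a[i] + b[i]) * a[i] > 0.0 ? 1.0 : 0.0;  /* relu'(d) */
        grad_a[i] = mask * (2.0 * a[i] + b[i]);
        grad_b[i] = mask * a[i];
    }
}

/* Central finite-difference estimates, for checking the analytic values. */
double numeric_grad_a(double a, double b, double h) {
    return (forward_e(a + h, b) - forward_e(a - h, b)) / (2.0 * h);
}

double numeric_grad_b(double a, double b, double h) {
    return (forward_e(a, b + h) - forward_e(a, b - h)) / (2.0 * h);
}
```

Comparing `boat_variable_grad(a)` element by element against `grad_e` is a quick sanity check when adding new differentiable ops.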

MNIST Example

Boat includes a complete MNIST digit recognition example that demonstrates the framework's capabilities for computer vision tasks.

Model Architecture

A convolutional neural network (CNN) for MNIST classification:

Input: 1x28x28 (channels x height x width)
├── Conv2D(32, kernel_size=3x3, padding=1)
├── ReLU()
├── MaxPool2D(kernel_size=2x2, stride=2)
├── Conv2D(64, kernel_size=3x3, padding=1)
├── ReLU()
├── MaxPool2D(kernel_size=2x2, stride=2)
├── Flatten()
├── Dense(128)
├── ReLU()
├── Dense(10)
└── Softmax()
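The input size of the first Dense layer (`7*7*64` in the model-creation snippet) follows from the standard output-size formula for convolution and pooling windows: out = (in + 2*padding - kernel) / stride + 1.

The small C sketch below traces the spatial size through the architecture above. It assumes stride 1 for the convolutions, which with padding 1 keeps the 3x3 convolutions size-preserving. The function names are illustrative.

```c
/* Output size of a conv/pool window along one spatial dimension. */
int window_out(int in, int kernel, int stride, int padding) {
    return (in + 2 * padding - kernel) / stride + 1;
}

/* Trace a 1x28x28 input through the MNIST CNN; return the flattened size. */
int mnist_flatten_size(void) {
    int hw = 28;
    hw = window_out(hw, 3, 1, 1);  /* Conv2D(32, 3x3, pad 1): 28 -> 28 */
    hw = window_out(hw, 2, 2, 0);  /* MaxPool2D(2x2, stride 2): 28 -> 14 */
    hw = window_out(hw, 3, 1, 1);  /* Conv2D(64, 3x3, pad 1): 14 -> 14 */
    hw = window_out(hw, 2, 2, 0);  /* MaxPool2D(2x2, stride 2): 14 -> 7 */
    return 64 * hw * hw;           /* Flatten: 64 channels x 7 x 7 = 3136 */
}
```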

Running the MNIST Example

# Navigate to the MNIST example directory
cd examples/mnist

# Prepare the data (requires Python 3.x)
python mnist_data.py

# Build and run via CMake (from project root)
cd ../..
mkdir -p build && cd build
cmake .. -DBOAT_WITH_EXAMPLES=ON
make
./examples/mnist/mnist

Automatic Differentiation Version

Boat also includes an advanced MNIST example using automatic differentiation (mnist_autodiff.c) that demonstrates:

Dynamic computation graph with gradient tracking
Learning rate schedulers (cosine annealing, step LR)
Gradient clipping and monitoring
Memory optimization with pooling strategies
Auto-tuning of hyperparameters during training
Comprehensive logging and progress tracking

The autodiff version is built automatically alongside the rest of the framework via CMake. Both mnist and mnist_autodiff are compiled when building with examples enabled:

mkdir build && cd build
cmake .. -DBOAT_WITH_EXAMPLES=ON
make
# Run either version:
./examples/mnist/mnist
./examples/mnist/mnist_autodiff

The autodiff version provides more detailed training metrics and automatic hyperparameter tuning capabilities.

Key Code Snippets

Model Creation:

// Create a convolutional neural network for MNIST
boat_sequential_model_t* model = boat_sequential_create();

// Add convolutional layers
boat_layer_t* conv1 = boat_conv_layer_create(1, 32, 3, 1, 1, 1);
boat_layer_t* relu1 = boat_relu_layer_create();
boat_layer_t* pool1 = boat_pool_layer_create(2, 2, 0);

boat_layer_t* conv2 = boat_conv_layer_create(32, 64, 3, 1, 1, 1);
boat_layer_t* relu2 = boat_relu_layer_create();
boat_layer_t* pool2 = boat_pool_layer_create(2, 2, 0);

// Add fully connected layers
boat_layer_t* flatten = boat_flatten_layer_create();
boat_layer_t* fc1 = boat_dense_layer_create(7*7*64, 128, true);  // 28x28 -> 14x14 -> 7x7 after two 2x2 pools, 64 channels
boat_layer_t* relu3 = boat_relu_layer_create();
boat_layer_t* fc2 = boat_dense_layer_create(128, 10, true);
boat_layer_t* softmax = boat_softmax_layer_create(-1);

// Build the sequential model
boat_sequential_add(model, conv1);
boat_sequential_add(model, relu1);
boat_sequential_add(model, pool1);
boat_sequential_add(model, conv2);
boat_sequential_add(model, relu2);
boat_sequential_add(model, pool2);
boat_sequential_add(model, flatten);
boat_sequential_add(model, fc1);
boat_sequential_add(model, relu3);
boat_sequential_add(model, fc2);
boat_sequential_add(model, softmax);

Training Loop:

// Create optimizer and loss function
boat_optimizer_t* optimizer = boat_adam_optimizer_create(0.001f, 0.9f, 0.999f, 1e-8f);
boat_loss_t* loss = boat_cross_entropy_loss_create();

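// Training loop. num_epochs, num_batches, batch_size and the helpers
// get_batch_images(), get_batch_labels() and compute_correct_predictions()
// come from the example code and are not shown here.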
for (int epoch = 0; epoch < num_epochs; epoch++) {
    float epoch_loss = 0.0f;
    int correct = 0;

    for (int batch = 0; batch < num_batches; batch++) {
        // Get batch data
        boat_tensor_t* batch_images = get_batch_images(batch);
        boat_tensor_t* batch_labels = get_batch_labels(batch);

        // Forward pass
        boat_tensor_t* predictions = boat_model_forward(model, batch_images);

        // Compute loss
        float batch_loss = boat_loss_compute(loss, predictions, batch_labels);
        epoch_loss += batch_loss;

        // Compute accuracy
        correct += compute_correct_predictions(predictions, batch_labels);

        // Backward pass
        boat_tensor_t* grad = boat_loss_backward(loss);
        boat_model_backward(model, grad);

        // Update parameters
        boat_optimizer_step(optimizer);
        boat_optimizer_zero_grad(optimizer);

        // Cleanup
        boat_tensor_unref(predictions);
        boat_tensor_unref(grad);
    }

    // Compute epoch statistics
    float accuracy = (float)correct / (num_batches * batch_size);
    printf("Epoch %d: Loss = %.4f, Accuracy = %.2f%%\n",
           epoch + 1, epoch_loss / num_batches, accuracy * 100.0f);
}
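The loop above relies on example helpers (`get_batch_images`, `get_batch_labels`) to produce mini-batches, while Boat's own DataLoader provides batching and shuffling. As a standalone illustration of the usual approach, the C sketch below reshuffles sample indices once per epoch with a Fisher-Yates shuffle and slices them into fixed-size batches.

A few notes on the sketch:

- All names are hypothetical.
- `rand()` is used only for brevity; it has a slight modulo bias and is not thread-safe.
- Counting batches with floor division drops a final partial batch. That keeps the accuracy denominator `num_batches * batch_size` exact.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fill idx with 0..n-1 and shuffle it in place (Fisher-Yates). */
void shuffle_indices(size_t *idx, size_t n, unsigned seed) {
    srand(seed);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i */
        size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Number of full batches; a trailing partial batch is dropped. */
size_t full_batch_count(size_t n, size_t batch_size) {
    return n / batch_size;
}

/* Pointer to the batch_size sample indices that make up batch b. */
const size_t *batch_indices(const size_t *idx, size_t b, size_t batch_size) {
    return idx + b * batch_size;
}
```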

Expected Results

With the Adam optimizer and proper data standardization, the MNIST example achieves:

Training accuracy: >99% (converges within 10 epochs with default settings)
Test accuracy: >96% (verified on held-out test set)
Training time: ~11 minutes on CPU (1000 samples, 10 epochs, batch size 32)

Both the manual gradient and automatic differentiation (mnist_autodiff) versions achieve comparable results.

Data Preparation

The mnist_data.py script downloads and preprocesses the MNIST dataset:

import mnist
import numpy as np
import struct

# Load MNIST data
train_images = mnist.train_images()
train_labels = mnist.train_labels()
test_images = mnist.test_images()
test_labels = mnist.test_labels()

# Normalize to [0, 1] range
train_images = train_images.astype(np.float32) / 255.0
test_images = test_images.astype(np.float32) / 255.0

# Reshape for Boat (N, C, H, W format)
train_images = train_images.reshape(-1, 1, 28, 28)
test_images = test_images.reshape(-1, 1, 28, 28)

# Save as binary files for C consumption (save_tensor_binary is a helper
# defined in mnist_data.py; its definition is omitted here)
save_tensor_binary("train_images.bin", train_images)
save_tensor_binary("train_labels.bin", train_labels.reshape(-1, 1))
save_tensor_binary("test_images.bin", test_images)
save_tensor_binary("test_labels.bin", test_labels.reshape(-1, 1))

For more details, see the MNIST example documentation.

NanoChat Example

NanoChat is a GPT-style LLM example (the d34 configuration, about 2.2B parameters) with CUDA-accelerated inference and training, plus an OpenAI-compatible HTTP server.

Chat CLI

# Build with CUDA enabled
mkdir build && cd build
cmake .. -DBOAT_WITH_CUDA=ON -DBOAT_WITH_EXAMPLES=ON
make

# Run interactive chat
./examples/nanochat/nanochat_cli <model_dir>

The chat CLI supports token-by-token streaming, markdown rendering (Windows console), and conversation history.

HTTP Server

# Start the server
./examples/nanochat/server <model_dir>

# Query via curl (OpenAI-compatible API)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'

Training

# Run training (supports pretraining, SFT, and GRPO)
./examples/nanochat/nanochat_train <model_dir> <data_dir>

The training pipeline includes:

Muon and AdamW optimizers
FP8 dynamic tensorwise scaling
BOS-aligned best-fit batching
GRPO (Group Relative Policy Optimization) for RL fine-tuning
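FP8 dynamic tensorwise scaling computes one scale per tensor from its current absolute maximum, so the largest value lands at the top of the FP8 range before casting. The C sketch below shows the scale computation, assuming the E4M3 format, whose largest finite value is 448; the README does not say which FP8 variant Boat uses.

It only scales and saturates and does not round to real 8-bit values. The function names are hypothetical.

```c
#include <math.h>
#include <stddef.h>

#define E4M3_MAX 448.0f  /* largest finite value in FP8 E4M3 */

/* Per-tensor scale so that the tensor's absolute maximum maps to E4M3_MAX. */
float fp8_tensorwise_scale(const float *x, size_t n) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }
    return amax > 0.0f ? E4M3_MAX / amax : 1.0f;  /* guard all-zero tensors */
}

/* Scale and saturate into the E4M3 range; the FP8 cast itself is omitted.
   Dequantize later by dividing by the same scale. */
void fp8_scale_saturate(const float *x, float *out, size_t n, float scale) {
    for (size_t i = 0; i < n; i++) {
        float v = x[i] * scale;
        out[i] = v > E4M3_MAX ? E4M3_MAX : (v < -E4M3_MAX ? -E4M3_MAX : v);
    }
}
```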

Architecture

NanoChat implements the d34 architecture: 34 transformer layers, GQA (16 heads, 2 KV heads), RoPE, ReLU² activation, sliding window attention, value residual, and logit softcap. All attention and FFN operations are accelerated with custom fused CUDA kernels.

For detailed design, see docs/nanochat_plan.md.

Core Components

Tensor Operations

Creation and manipulation of multi-dimensional arrays
Reshape, transpose, slice operations
Arithmetic operations (add, sub, mul, div)
Linear algebra operations (matmul, dot product)
Reduction operations (sum, mean, max, min)

Neural Network Layers

Dense: Fully connected layer
Conv2D: 2D convolutional layer
Pooling: MaxPool2D, AvgPool2D
Normalization: BatchNorm, LayerNorm
Activation: ReLU, PReLU, Sigmoid, Tanh, Softmax
Attention: Multi-head self-attention
RNN Layers: LSTM, GRU

Optimization Algorithms

Stochastic Gradient Descent (SGD)
Adam optimizer
RMSprop optimizer
Adagrad optimizer

Loss Functions

Mean Squared Error (MSE)
Cross Entropy Loss
Huber Loss

Model Management

Sequential model API
Graph-based model definition
Model serialization and loading

Data Types

The framework supports a comprehensive range of data types for efficient computation:

Floating Point Types

FP64 (double): 64-bit double precision floating point
FP32 (float): 32-bit single precision floating point
FP16: 16-bit half precision floating point
BFLOAT16: 16-bit brain floating point (same exponent range as FP32)
FP8: 8-bit custom floating point format
FP4: 4-bit custom floating point format
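BFLOAT16 keeps FP32's sign bit and 8-bit exponent and shortens the mantissa to 7 bits, so a BF16 value is simply the upper 16 bits of an FP32 value. The C sketch below shows the conversion with round-to-nearest-even, as it is commonly implemented. NaN handling is simplified, and this is not necessarily how Boat converts.

```c
#include <stdint.h>
#include <string.h>

/* FP32 -> BF16, rounding the dropped 16 bits to nearest, ties to even. */
uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);               /* type-pun without UB */
    if ((bits & 0x7f800000u) == 0x7f800000u && (bits & 0x007fffffu))
        return (uint16_t)((bits >> 16) | 0x0040u); /* keep NaN a quiet NaN */
    uint32_t lsb = (bits >> 16) & 1u;             /* ties round to even */
    bits += 0x7fffu + lsb;
    return (uint16_t)(bits >> 16);
}

/* BF16 -> FP32 is exact: put the 16 bits in the high half. */
float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, 3.14159265f converts to 0x4049, which reads back as 3.140625.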

Integer Types

INT64: 64-bit signed integer
INT32: 32-bit signed integer
INT8: 8-bit signed integer
UINT8: 8-bit unsigned integer

Low-Bit Quantization Types

BITS2: 2-bit packed values (4 values per byte)
BITS1: 1-bit packed values (8 values per byte, binary networks)
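Packing 2-bit values four to a byte, or 1-bit values eight to a byte, is plain bit manipulation. The C sketch below assumes the first element goes in the lowest-order bits. Boat's actual bit order within a byte is not documented here and may differ, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pack n 2-bit codes (each 0..3) into ceil(n/4) bytes, low bits first. */
void pack_bits2(const uint8_t *codes, size_t n, uint8_t *out) {
    memset(out, 0, (n + 3) / 4);
    for (size_t i = 0; i < n; i++)
        out[i / 4] |= (uint8_t)((codes[i] & 0x3u) << (2 * (i % 4)));
}

/* Extract the i-th 2-bit code. */
uint8_t unpack_bits2(const uint8_t *packed, size_t i) {
    return (uint8_t)((packed[i / 4] >> (2 * (i % 4))) & 0x3u);
}

/* Pack n binary values (zero or nonzero) into ceil(n/8) bytes, low bit first. */
void pack_bits1(const uint8_t *bits, size_t n, uint8_t *out) {
    memset(out, 0, (n + 7) / 8);
    for (size_t i = 0; i < n; i++)
        if (bits[i]) out[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Extract the i-th bit. */
uint8_t unpack_bits1(const uint8_t *packed, size_t i) {
    return (uint8_t)((packed[i / 8] >> (i % 8)) & 1u);
}
```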

Special Types

BOOL: Boolean values (1 byte per element)

Quantization Capabilities

Feature                 Bit-width   Type or scheme
Per-tensor affine       8-bit       UINT8, INT8
Per-channel affine      8-bit       UINT8, INT8
BITS2 packed            2-bit       Asymmetric affine
FLOAT4 custom float     4-bit       Direct float encoding (no affine mapping)
QAT fake quantization   Any         Simulates quantization noise during training
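Per-tensor affine quantization maps a float x to q = clamp(round(x / scale) + zero_point, qmin, qmax) and back with x ≈ (q - zero_point) * scale. Per-channel quantization applies the same formulas with a separate scale and zero point for each channel. QAT fake quantization runs the quantize-dequantize round trip in the forward pass, so training sees the rounding error.

The C sketch below implements the UINT8 case, choosing the scale and zero point from a min/max range. That is a common calibration, but Boat's exact scheme may differ. The names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    float scale;        /* float step per integer step */
    int32_t zero_point; /* integer code that represents 0.0 */
} affine_params_t;

/* Round half away from zero without needing libm. */
static long round_to_long(float x) {
    return x >= 0.0f ? (long)(x + 0.5f) : -(long)(-x + 0.5f);
}

/* Pick UINT8 parameters so [min, max], widened to include 0, maps onto [0, 255]. */
affine_params_t uint8_params_from_range(float min, float max) {
    affine_params_t p;
    if (min > 0.0f) min = 0.0f;   /* keep 0.0 exactly representable */
    if (max < 0.0f) max = 0.0f;
    p.scale = (max > min) ? (max - min) / 255.0f : 1.0f;
    p.zero_point = (int32_t)round_to_long(-min / p.scale);
    return p;
}

/* q = clamp(round(x / scale) + zero_point, 0, 255) */
uint8_t quantize_u8(float x, affine_params_t p) {
    float r = x / p.scale;
    if (r < -1024.0f) r = -1024.0f;  /* clamp early so the integer */
    if (r > 1024.0f) r = 1024.0f;    /* conversion cannot overflow */
    long q = round_to_long(r) + p.zero_point;
    if (q < 0) q = 0;
    if (q > 255) q = 255;
    return (uint8_t)q;
}

/* x ~= (q - zero_point) * scale */
float dequantize_u8(uint8_t q, affine_params_t p) {
    return (float)((int32_t)q - p.zero_point) * p.scale;
}

/* QAT-style fake quantization: round-trip through UINT8, stay in float. */
float fake_quantize_u8(float x, affine_params_t p) {
    return dequantize_u8(quantize_u8(x, p), p);
}
```

For example, the range [-1, 3] gives a scale of 4/255 and a zero point of 64, so 0.0 quantizes to 64 and dequantizes back to exactly 0.0.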

Examples

The repository includes several comprehensive examples:

MNIST Classification: Complete training pipeline for digit recognition
CIFAR-10: CNN image classification with data pipeline and transforms
Transformer: End-to-end transformer with tokenization, training, and autoregressive decoding
Translator: English-to-French MarianMT (Helsinki-NLP) inference engine using Safetensors weights
InsightFace: Face recognition model (ResNet50-based) inference via ONNX runtime executor, producing 512-dim embeddings
Automatic Differentiation: Gradient computation with dynamic computation graphs
Scheduler Usage: Learning rate scheduling with cosine annealing, step LR, and lambda LR
ONNX Export: Export trained boat models to ONNX format
NanoChat: GPT LLM inference and training (d34 2.2B) with CUDA acceleration
- Interactive chat CLI with token streaming
- OpenAI-compatible HTTP server (JSON API)
- Training loop with Muon/AdamW optimizers and FP8 support
- Fused GQA attention kernels for fast decode

Project Structure

boat/
├── include/                  # Public headers
│   ├── boat/                # Framework headers
│   │   ├── tensor.h         # Tensor operations
│   │   ├── ops.h            # Mathematical operations
│   │   ├── autodiff.h       # Automatic differentiation
│   │   ├── graph.h          # Computational graph
│   │   ├── layers.h         # Neural network layers
│   │   ├── optimizers.h     # Optimization algorithms
│   │   ├── loss.h           # Loss functions
│   │   ├── model.h          # Model definition and serialization
│   │   ├── data.h           # Data loading and preprocessing
│   │   ├── prune.h          # Model pruning
│   │   ├── quantize.h       # Quantization
│   │   ├── sampling.h       # Token sampling utilities
│   │   ├── cuda_runtime.h   # CUDA runtime API
│   │   └── format/          # Model format loaders
│   │       ├── onnx.h       # ONNX format support
│   │       ├── onnxruntime.h # ONNX Runtime executor
│   │       ├── pytorch.h    # PyTorch format support
│   │       ├── tensorflow.h # TensorFlow format support
│   │       └── huggingface.h # HuggingFace format support
│   └── boat.h               # Main include file
├── src/                     # Implementation
│   ├── core/               # Core functionality
│   ├── ops/                # Operations (with device dispatch)
│   ├── graph/              # Computational graph
│   ├── layers/             # Neural network layers
│   ├── optimizers/         # Optimization algorithms (with CUDA paths)
│   ├── schedulers/         # Learning rate schedulers
│   ├── loss/               # Loss functions (with CUDA paths)
│   ├── model/              # Model management
│   └── format/             # Model format loaders
├── cuda/                   # CUDA backend
│   ├── kernels/            # CUDA kernels (basic, conv, dense, fused, norm, pool, optimizer, FP8, BF16)
│   ├── ops/                # CUDA ops (activation, arithmetic, linear)
│   ├── tensor.cu           # CUDA tensor copy
│   ├── cublas_handle.cu    # cuBLAS handle manager
│   ├── cudnn_handle.cu     # cuDNN handle manager
│   ├── graph/              # CUDA graph executor
│   └── autodiff/           # CUDA autodiff
├── bindings/js/            # Node.js N-API bindings
├── examples/               # Example programs
│   ├── mnist/             # MNIST classification
│   ├── cifar10/           # CIFAR-10 image classification
│   ├── common/            # Shared utilities (JSON, safetensors)
│   ├── nanochat/          # NanoChat GPT LLM (inference, training, server)
│   ├── transformer/       # Transformer end-to-end example
│   └── translator/        # English-French MarianMT translator
├── tests/                 # Test suite
│   ├── unit/              # Unit tests
│   └── archive/           # Archived/legacy tests
├── benchmarks/            # Performance benchmarks
├── docs/                  # Documentation
└── scripts/               # Utility scripts

For detailed API documentation and development guidelines, see CLAUDE.md.

Development Status

Current Features (Implemented)

Core tensor operations with multiple data types
Automatic differentiation with computational graph
Neural network layers (dense, conv, attention, LSTM, GRU, etc.)
Optimizers (Adam, RMSprop, SGD, Adagrad) with CUDA update paths
Learning rate schedulers (cosine annealing, step LR, lambda LR)
Loss functions (MSE, cross-entropy, Huber) with CUDA backward paths
Data pipeline (Dataset, DataLoader with multi-threaded prefetch)
Post-training quantization (UINT8, INT8, BITS2, FLOAT4, per-channel)
Quantization-aware training (QAT) with fake quantization
Model pruning (magnitude-based, structured channel/filter pruning)
Model format loaders (ONNX, PyTorch, TensorFlow, HuggingFace, GGUF)
ONNX Runtime executor (graph-based direct inference for complex ONNX models)
CUDA GPU acceleration (cuBLAS matmul, cuDNN conv/batchnorm, fused attention kernels, FP8/BF16 inference and training, custom kernels for element-wise ops, activations, pooling, normalization, and optimizers)
Group/depthwise convolution with cuDNN acceleration
PReLU activation layer (Parametric ReLU for modern CNN architectures)
InsightFace face recognition model inference (ResNet50, 512-dim embeddings)
Model serialization (custom binary format, v3 with per-channel metadata)
Performance optimizations (SIMD, SGEMM with optional OpenBLAS backend, OpenMP, memory pool)
Node.js N-API bindings (Tensor and Model operations)
Cross-platform build with CMake
Comprehensive test suite with CI (GitHub Actions: CPU matrix + CUDA build)
MNIST training example (manual and autodiff, both >96% test accuracy)
CIFAR-10 CNN training example
Transformer end-to-end example
English-French MarianMT translator (Safetensors-based inference)
InsightFace face recognition (ONNX Runtime, 130-node graph executor)
ONNX export (boat → ONNX serialization)
NanoChat GPT LLM (d34 2.2B):
- Interactive chat CLI with token streaming
- OpenAI-compatible HTTP server with JSON API
- Training pipeline (pretraining, SFT, GRPO) with Muon/AdamW optimizers
- FP8 dynamic tensorwise scaling for training
- Fused GQA decode attention with KV cache
- BF16 inference (avoids FP16 overflow)

Planned Features

WebAssembly backend for in-browser inference
Distributed training support

Code Quality

Boat follows strict code quality standards with comprehensive static analysis and const-correctness guidelines.

Const Correctness

The framework enforces const correctness throughout its API to improve safety, readability, and compiler optimization. See the Const Usage Guide for detailed guidelines on:

Function parameter constness
Return value constness
Structure field constness
Common patterns and examples

Static Analysis

The project uses cppcheck for static analysis to detect potential issues. Run the analysis with:

cppcheck --enable=warning,style --suppress=missingInclude -I include src

Recent Improvements: The codebase has been extensively analyzed and refined to achieve zero cppcheck warnings across all source files. This includes fixes for:

Const correctness issues (parameter and pointer constness)
Unused variables and functions
Variable shadowing
Memory management patterns
Type consistency and format strings

Static analysis reports are maintained in the repository (cppcheck_*.txt) to track code quality improvements over time.

Automated Testing

All code changes are validated by unit and integration tests. These run in CI on GitHub Actions, with a CPU build matrix plus a CUDA build.

Testing

Run the test suite to verify the installation:

cd build
make test

Or run specific tests:

ctest -R test_tensor                   # Run tensor tests
ctest -R test_quantize                 # Run quantization tests
ctest -R test_serialization_integration # Run serialization roundtrip tests
ctest -R test_autodiff                 # Run autodiff tests
ctest -R test_layers                   # Run layer tests

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Fork the repository
Create a feature branch
Follow the code style guidelines
Write tests for new functionality
Submit a pull request

Coding Standards

Use clang-format with the provided .clang-format file
Write descriptive commit messages
Add documentation for public APIs
Include unit tests for new features
Ensure no memory leaks (use Valgrind or AddressSanitizer)

License

Apache License 2.0. See LICENSE for details.

Acknowledgments

This framework is inspired by:

PyTorch: Dynamic computation graphs
TensorFlow: Strong production deployment
ONNX: Model interoperability
Caffe: C++ implementation simplicity

Contact

For questions, issues, or contributions, please use the GitHub Issues page.
