Neural Network in x86-64 Assembly

A fully hand-written, from-scratch implementation of a multilayer perceptron trained on the MNIST handwritten digit dataset. Every component, from forward pass to backpropagation, is implemented directly in x86-64 NASM assembly, linking against libm solely for the exp function. No ML libraries are used. All other functionality uses Linux syscalls directly.

Architecture

Input       Hidden 1    Hidden 2    Hidden 3    Output
784    ->   256    ->   128    ->   64     ->   10
       ReLU        ReLU        ReLU        Softmax

Four fully connected layers. ReLU activations on the first three layers. Numerically stable Softmax with categorical cross-entropy loss on the output layer.

Training configuration

Parameter	Value
Optimizers	SGD, Momentum, Adam
Adam params	LR: 0.0005, β1: 0.9, β2: 0.999, ε: 1e-7
SGD params	LR: 0.005
Momentum params	LR: 0.001, β: 0.95
Batch size	64
Epochs	25
LR Decay	0.97 per epoch
Weight init	He (Kaiming) uniform via deterministic LCG
Loss	Negative log-likelihood
Validation set	MNIST test samples 9000-9999

Results

The assembly model fully matches a ground-truth Python numpy reference model on this architecture.

Optimizer	Assembly Test Accuracy	Python Reference Test Accuracy
Adam	`98.35%`	`98.06%`
Momentum	`97.56%`	~`97.50%`
SGD	`94.83%`	`94.81%`

Repository structure

.
├── main.asm              # Entry point: training loop, validation, final test
├── forward_pass.asm      # layer_forward: linear transform + optional ReLU
├── backward_pass.asm     # accumulate_gradients, relu_backward, softmax_ce_backward
├── activation.asm        # ReLU activation
├── softmax.asm           # Numerically stable softmax (with C ABI register preservation)
├── loss.asm              # Negative log-likelihood (neg_log)
├── optimizer.asm         # SGD, Momentum, and Adam weight updates
├── init_weights.asm      # He initialization via LCG random number generator
├── weights.asm           # Static weight and bias buffers (W1-W4, b1-b4)
├── matrix_ops.asm        # outer_product_add, matrix_vector_multiply (SIMD optimized)
├── dot_product.asm       # SIMD Dot product over float arrays
├── exp_double.asm        # Double-precision exp approximation wrapper
├── argmax.asm            # Argmax over float array
├── memory.asm            # Heap allocator / BSS buffers for activations and Adam moments
├── dataset.asm           # MNIST binary file parser (IDX format)
├── loader.asm            # load_mnist_image, load_mnist_label
├── logger.asm            # print_loss, print_epoch, print_accuracy
├── dataset/              # MNIST binary files (not included, see below)
├── reference_model.py    # NumPy reference implementation for cross-validation
├── build.sh              # NASM + ld build script
└── Dockerfile            # Reproducible build environment

Requirements

To build from source:

NASM 2.15 or later
GNU ld (binutils)
Linux x86-64 (ELF binary, uses Linux syscalls directly)

To run the reference model:

Python 3.8+
NumPy

Or use Docker (recommended, especially for macOS Apple Silicon / Windows):

docker buildx build --platform linux/amd64 --load -t neural-network-asm .
docker run --rm -it --platform linux/amd64 -v "$PWD":/app -w /app neural-network-asm /bin/bash

Dataset

dataset/
├── train-images.idx3-ubyte
├── train-labels.idx1-ubyte
├── t10k-images.idx3-ubyte
└── t10k-labels.idx1-ubyte

Build and run

# Assemble and link (choose your optimizer: sgd, momentum, or adam)
./build.sh adam

# Train on MNIST (outputs loss per batch, val accuracy per epoch, final test accuracy)
./mnist_deep

Training prints batch loss, per-epoch validation accuracy on the last 1000 test samples, and final test accuracy over all 10,000 test samples.

Reference model

reference_model.py is a NumPy implementation of the identical architecture. It supports three optimizers for comparison against the assembly implementation.

python3 reference_model.py --optimizer sgd
python3 reference_model.py --optimizer momentum
python3 reference_model.py --optimizer adam

Use this to verify gradient correctness: if the Python model converges and the assembly model does not, the discrepancy is in the assembly backprop or weight update.

Implementation notes

Calling convention: All assembly routines follow the System V AMD64 ABI. Integer arguments in rdi, rsi, rdx, rcx, r8, r9. Additional arguments on the stack. XMM0-XMM7 for floating-point arguments and return values. Caller-saved registers are preserved where required. A critical fix during development involved manually pushing/popping xmm6 and xmm7 around the C exp() function call in Softmax to prevent silent gradient corruption.

Floating-point: All activations, weights, gradients, and loss values are 32-bit single precision. We utilize 128-bit SSE packed instructions (xmm registers) for dot products and outer products. The exp approximation in exp_double.asm uses 64-bit intermediate precision to reduce rounding error.

Memory: Static buffers in the .bss section cover all weight matrices, bias vectors, intermediate activations (z-buffers and h-buffers), and Adam moment tracking accumulators (m and v). The entire MNIST dataset is loaded into RAM upfront to remove file I/O bottlenecks.

He initialization: Weights are drawn from Uniform(-bound, bound) where bound = sqrt(6 / fan_in), implemented with a 32-bit linear congruential generator (LCG) seeded at startup rather than the rdrand instruction to ensure the binary safely executes via Rosetta emulation on Apple Silicon.

Backpropagation: The gradient chain correctly gates ReLU backwards using pre-activation values (z-buffers). The chain is:

grad_o  (softmax + CE)
  -> dW4, db4, grad_h3
  -> relu_backward(z3) -> grad_z3
  -> dW3, db3, grad_h2
  -> relu_backward(z2) -> grad_z2
  -> dW2, db2, grad_h1
  -> relu_backward(z1) -> grad_z1
  -> dW1, db1

Gradients are accumulated across the batch in accumulate_gradients. The total gradient sum is correctly normalized by multiplying by 1/batch_size before being applied in update_weights via the chosen optimizer (SGD, Momentum, or Adam).

Known limitations

No dropout or batch normalization.
Training is strictly CPU-bound. While it utilizes 128-bit SSE SIMD vectorization for hot paths, it does not currently extend to AVX2 256-bit wide-vector instructions for further throughput gains.

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Neural Network in x86-64 Assembly

Architecture

Training configuration

Results

Repository structure

Requirements

Dataset

Build and run

Reference model

Implementation notes

Known limitations

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
test		test
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
activation.asm		activation.asm
argmax.asm		argmax.asm
backward_pass.asm		backward_pass.asm
build.sh		build.sh
dataset.asm		dataset.asm
dot_product.asm		dot_product.asm
exp_double.asm		exp_double.asm
forward_pass.asm		forward_pass.asm
init_weights.asm		init_weights.asm
loader.asm		loader.asm
logger.asm		logger.asm
loss.asm		loss.asm
main.asm		main.asm
matrix_ops.asm		matrix_ops.asm
memory.asm		memory.asm
mnist		mnist
optimizer.asm		optimizer.asm
output.log		output.log
output_final.log		output_final.log
reference_model.py		reference_model.py
softmax.asm		softmax.asm
weights.asm		weights.asm

Folders and files

Latest commit

History

Repository files navigation

Neural Network in x86-64 Assembly

Architecture

Training configuration

Results

Repository structure

Requirements

Dataset

Build and run

Reference model

Implementation notes

Known limitations

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages