Split a single neural network into multiple smaller networks using weight splitting.
Even when the total number of operations (FLOPs) decreases, splitting a single large matrix multiplication (matmul) into several smaller ones often runs slower on modern hardware such as GPUs. Large matmuls let highly parallel, well-tuned kernels reach peak throughput; many small calls instead pay repeated kernel launch overhead and leave compute units underutilized.
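As a rough CPU analogue of this effect, here is a minimal timing sketch (assuming NumPy with a BLAS backend; the matrix size and block count are arbitrary choices) that compares one large matmul against the same computation split into many column blocks:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2048, 2048), dtype=np.float32)
W = rng.standard_normal((2048, 2048), dtype=np.float32)

def bench(fn, reps=10):
    fn()  # warm-up call so setup costs don't pollute the timing
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

# One large matmul: the BLAS kernel sees the whole problem at once.
t_large = bench(lambda: x @ W)

# The same FLOPs done as 64 smaller matmuls over column blocks of W.
blocks = np.split(W, 64, axis=1)
t_split = bench(lambda: np.concatenate([x @ b for b in blocks], axis=1))

print(f"one matmul: {t_large * 1e3:.1f} ms, 64 small matmuls: {t_split * 1e3:.1f} ms")
```

On most machines the split version is slower despite identical total FLOPs; on a GPU the gap is typically wider still, because each small call also pays a kernel launch.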
This project explores an approach to improve inference efficiency in neural networks by decomposing a large model into smaller sub-networks based on weight significance.
In many neural network tasks, not all inputs strongly influence all outputs. When certain weights are close to zero, their contribution to the final output becomes negligible.
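A quick numeric illustration of this point, using a hypothetical weight matrix that stands in for trained weights: zeroing every weight below a magnitude threshold leaves the layer's output almost unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for a trained weight matrix: ~5% strong connections, the rest near zero.
W = rng.standard_normal((256, 256)) * 0.01
strong = rng.random((256, 256)) < 0.05
W[strong] = rng.standard_normal(int(strong.sum()))

x = rng.standard_normal(256)

# Drop every weight below a magnitude threshold and compare layer outputs.
threshold = 0.05
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

rel_err = np.linalg.norm(W @ x - W_pruned @ x) / np.linalg.norm(W @ x)
kept = np.mean(np.abs(W) >= threshold)
print(f"kept {kept:.1%} of weights, relative output error {rel_err:.4f}")
```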
The core idea is:
- Identify weights that have minimal impact (near-zero values)
- Split the network into smaller sub-networks by grouping significant weights
- Reduce unnecessary computation during inference by ignoring weak connections
This can reduce inference cost in trained models, especially in scenarios where sparsity emerges naturally during training.
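One way to make this concrete is sketched below (a hypothetical helper, assuming NumPy and SciPy are available): treat significant weights as edges of a bipartite input-output graph, and turn each connected component into its own smaller matmul.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def split_into_subnetworks(W, threshold):
    """Partition one layer into independent sub-networks via its significant weights.

    Nodes 0..n_out-1 are outputs, nodes n_out..n_out+n_in-1 are inputs; an edge
    exists where |W[i, j]| >= threshold. Each connected component becomes a
    smaller weight matrix over only the inputs and outputs it touches.
    """
    n_out, n_in = W.shape
    rows, cols = np.nonzero(np.abs(W) >= threshold)
    n = n_out + n_in
    graph = csr_matrix((np.ones_like(rows), (rows, n_out + cols)), shape=(n, n))
    _, labels = connected_components(graph, directed=False)

    subnets = []
    for c in np.unique(labels):
        outs = np.nonzero(labels[:n_out] == c)[0]
        ins = np.nonzero(labels[n_out:] == c)[0]
        if len(outs) and len(ins):
            # Keep the full sub-matrix, so within-group weights stay exact.
            subnets.append((outs, ins, W[np.ix_(outs, ins)]))
    return subnets

# Example: two independent 4x4 blocks buried in weak cross-connections.
rng = np.random.default_rng(2)
W = rng.standard_normal((8, 8)) * 0.001
W[:4, :4] += rng.standard_normal((4, 4))
W[4:, 4:] += rng.standard_normal((4, 4))

x = rng.standard_normal(8)
y = np.zeros(8)
for outs, ins, W_sub in split_into_subnetworks(W, threshold=0.05):
    y[outs] += W_sub @ x[ins]          # each sub-network is one small matmul

print(np.allclose(y, W @ x, atol=0.02))  # equal up to the pruned weak cross-connections
```

The threshold is the key trade-off knob: raising it yields smaller, more independent sub-networks at the cost of more approximation error.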
The workflow is:
- Train a standard neural network
- Analyze the learned weights
- Identify near-zero weights (low importance connections)
- Partition the network into smaller sub-networks
- Use these sub-networks independently or selectively during inference
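The last step is where the savings appear at inference time. A minimal sketch of the selective case (reusing the hypothetical `split_into_subnetworks` and the two-block example from the sketch above): when downstream code needs only some outputs, only the sub-networks that produce them have to run.

```python
import numpy as np

# Assumes `split_into_subnetworks` from the sketch above is in scope;
# rebuilds the same two-block example weight matrix and input.
rng = np.random.default_rng(2)
W = rng.standard_normal((8, 8)) * 0.001
W[:4, :4] += rng.standard_normal((4, 4))
W[4:, 4:] += rng.standard_normal((4, 4))
x = rng.standard_normal(8)

needed = {0, 2}  # hypothetical: a downstream layer consumes only outputs 0 and 2

for outs, ins, W_sub in split_into_subnetworks(W, threshold=0.05):
    if needed.intersection(outs.tolist()):   # run only the relevant sub-networks
        y_partial = W_sub @ x[ins]
        for i, out in enumerate(outs):
            if out in needed:
                print(f"output {out}: {y_partial[i]:.4f}")
```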
Expected benefits:
- Reduced computation during inference
- Potential speed improvements
- Better utilization of sparsity in trained models
- Modular network structure
Potential use cases:
- Edge devices with limited compute
- Real-time inference systems
- Sparse neural network optimization

