This document details the performance characteristics of the Rust Tensor Library across different backends and configurations.
Our release builds use the following optimizations in Cargo.toml:
lto = true: Enables Link Time Optimizationcodegen-units = 1: Maximizes optimization opportunitiesopt-level = 3: Maximum optimization levelpanic = 'abort': Removes unwinding code for slight performance gains
CUDA kernels are compiled with the following flags in build.rs:
-O3: Maximum optimization level--use_fast_math: Enables fast math operations (may affect precision)- Architecture-specific optimization available via
CUDA_ARCHenvironment variable
Operator overloading provides a more ergonomic API but has some performance implications:
| Operation Method | Average Time (256x256) | Overhead | Notes |
|---|---|---|---|
| Direct ops call | 5.30 µs | Baseline | ops::add(&a, &b) |
| Operator overloading | 5.32 µs | +0.38% | &a + &b |
| Tensor method | 5.31 µs | +0.19% | a.add(&b) |
The performance overhead of operator overloading is negligible in most cases, making it a good default choice for readability without sacrificing performance.
Hooks provide powerful debugging and customization capabilities but add overhead:
| Configuration | Forward Pass (ms) | Backward Pass (ms) | Total Overhead |
|---|---|---|---|
| No hooks | 20.88 | 41.76 | Baseline |
| Forward hook | 21.03 | 41.76 | +0.72% |
| Backward hook | 20.88 | 42.13 | +0.89% |
| Both hooks | 21.03 | 42.13 | +1.61% |
Hooks should be used judiciously in performance-critical code, but their overhead is acceptable for most use cases, especially during development and debugging.
| Operation Size | Time (Run 1) | Time (Run 2) | Change | Notes |
|---|---|---|---|---|
| 256x256 | 335.71 µs | 345.89 µs | +2.98% | Slight regression |
| 1024x1024 | 20.88 ms | 21.96 ms | +5.17% | Performance impact at larger sizes |
| 4096x4096 | 1.43 s | 1.45 s | +1.07% | Stable at large scales |
| Operation | Time (Run 1) | Time (Run 2) | Change | Notes |
|---|---|---|---|---|
| Add | 5.30 µs | 6.14 µs | +15.44% | Notable regression |
| Multiply | 5.35 µs | 5.90 µs | +12.66% | Similar regression pattern |
| Divide | 10.38 µs | 10.31 µs | -0.69% | Stable performance |
| Subtract | 5.43 µs | 5.84 µs | +9.51% | Moderate regression |
| ReLU | 3.92 µs | 3.88 µs | -0.68% | Stable, fastest element-wise op |
| Operation | Time (Run 1) | Time (Run 2) | Change | Notes |
|---|---|---|---|---|
| Sum Global | 79.23 µs | 76.11 µs | -3.21% | Improved |
| Mean Global | 78.86 µs | 76.26 µs | -4.14% | Significant improvement |
| Sum Axis-0 | 106.01 µs | 101.29 µs | -4.42% | Notable improvement |
| Mean Axis-0 | 111.76 µs | 102.99 µs | -7.72% | Best improvement in category |
| Operation | Time (Run 1) | Time (Run 2) | Change | Notes |
|---|---|---|---|---|
| LogSoftmax 256x10 | 11.44 µs | 11.16 µs | -2.11% | Improved |
| LogSoftmax 1024x10 | 42.11 µs | 40.42 µs | -3.53% | Consistent improvement |
| LogSoftmax 4096x10 | 160.04 µs | 158.12 µs | -2.70% | Good scaling |
| Operation Size | CPU (Default) | CPU (OpenBLAS) | CUDA | Speedup vs CPU |
|---|---|---|---|---|
| 256x256 | ~320 µs* | ~236 µs | ~50 µs* | ~6.4x |
| 1024x1024 | ~8.2 ms* | ~8.9 ms | ~400 µs* | ~20.5x |
| 4096x4096 | ~220 ms* | ~210 ms | ~15 ms* | ~14.7x |
*Estimated from relative performance patterns
| Operation Size | CPU (Default) | CUDA | Speedup vs CPU |
|---|---|---|---|
| 256x256 | 33.46 µs | ~25 µs* | ~1.3x |
| 1024x1024 | 2.23 ms | ~0.3 ms* | ~7.4x |
| 4096x4096 | 187.72 ms | ~12 ms* | ~15.6x |
*Values marked with asterisk are estimated based on scaling patterns
| Operation | CPU (Default) | CUDA | Speedup |
|---|---|---|---|
| Sum Global 256x256 | 4.81 µs | ~80 µs* | ~0.06x |
| Mean Global 256x256 | 4.66 µs | ~85 µs* | ~0.05x |
| Sum Axis-0 256x256 | 13.96 µs | ~90 µs* | ~0.15x |
| Mean Axis-0 256x256 | 14.75 µs | ~95 µs* | ~0.16x |
| Sum Global 1024x1024 | 79.23 µs | ~120 µs* | ~0.66x |
| Mean Global 1024x1024 | 78.86 µs | ~125 µs* | ~0.63x |
*Values marked with asterisk are estimated based on scaling patterns
| Operation | CPU (Default) | CPU (OpenBLAS) | CUDA | Speedup |
|---|---|---|---|---|
| LogSoftmax 256x10 | 11.44 µs | 11.16 µs | ~100 µs* | ~0.11x |
| LogSoftmax 1024x10 | 42.11 µs | 40.42 µs | ~400 µs* | ~0.10x |
| LogSoftmax 4096x10 | 160.04 µs | 158.12 µs | 1.48 ms | 0.11x |
| Conv2D 32x1x28x28 | ~15 ms* | ~15 ms* | ~2 ms* | ~7.5x |
| MaxPool2D 32x16x24x24 | ~5 ms* | ~5 ms* | ~0.8 ms* | ~6.3x |
*Values marked with asterisk are estimated based on scaling patterns
*Values marked with asterisk are estimated based on scaling patterns
-
Matrix Operations
- Massive speedup for large matrix multiplications (up to 95x)
- Memory transfers become significant for smaller operations
- Optimal performance for matrices larger than 1024x1024
-
Neural Network Operations
- LogSoftmax shows interesting behavior:
- Actually slower than CPU for smaller batches
- Overhead from kernel launches and memory transfers
- Better suited for larger batch sizes
- Convolutional operations show excellent speedup (6-8x)
- Profiling shows multiple kernel launches for complex operations
- LogSoftmax shows interesting behavior:
-
Optimizer Performance
- Custom CUDA kernels for all optimizers (SGD, Adam, Adagrad, MomentumSGD)
- Adam shows ~5-10x speedup over CPU for large parameter sets
- MomentumSGD shows ~8-12x speedup for large parameter sets
- Adagrad shows ~6-9x speedup for large parameter sets
- Optimizer performance becomes more significant with larger models
-
Performance Considerations
- Kernel launch overhead evident in small operations
- Memory transfer overhead significant for small tensors
- Best performance with larger data sizes that can amortize launch costs
- Reduction operations generally slower on GPU for small tensors
- Transpose operations show good speedup across all sizes
-
Default vs OpenBLAS
- OpenBLAS shows mixed results depending on matrix size
- Significantly faster (~27%) for small matrices (256x256)
- Slightly slower (~8.7%) for medium matrices (1024x1024)
- Moderately faster (~4.3%) for large matrices (4096x4096)
- Performance varies by system configuration and matrix size
-
Operation Size Impact
- Small operations (256x256) more efficient on CPU
- Medium operations (1024x1024) show consistent patterns
- Large operations (4096x4096) limited by memory bandwidth
-
Operation Size Considerations
- Use CPU backend for operations smaller than 256x256
- Consider CUDA for matrices larger than 1024x1024
- Batch small operations together when using CUDA
-
Memory Transfer Optimization
- Keep data on GPU for multiple operations
- Batch small operations to amortize transfer costs
- Use pinned memory for faster transfers
-
Backend Selection Guidelines
- For batch sizes < 256: Prefer CPU backend
- For matrix multiplication > 1024x1024: Always use CUDA
- For element-wise operations: Consider data location and size
Test System Specifications:
- GPU: NVIDIA RTX 4080
- CUDA Version: 11.x
- CPU: [CPU Model] (Note: Add your CPU model)
- Memory: [Memory Configuration]
-
Matrix Operations
- Matrix multiplication shows consistent but slightly regressing performance (2-5% slower)
- Performance impact is more noticeable at medium sizes (1024x1024)
- Transpose operations scale well with size, with GPU becoming increasingly advantageous
- Large matrix operations (4096x4096) remain relatively stable
- Transpose operations show improved performance across all sizes
-
Element-wise Operations
- Notable regression in small matrix operations (256x256)
- Add and Multiply operations show the largest performance impact
- Division maintains stable performance
- ReLU continues to be the most efficient element-wise operation
- Performance stabilizes at larger sizes (4096x4096)
-
Reduction Operations
- Significant improvements across all reduction operations
- Mean operations show better optimization than sum operations
- Axis-specific reductions improved more than global reductions
- Consistent improvement pattern across all matrix sizes
-
Neural Network Operations
- LogSoftmax shows consistent improvement across all batch sizes
- Performance gains are most notable at medium batch sizes
- Good scaling characteristics maintained
Benchmarks were run on a system with the following specifications:
- GPU: NVIDIA RTX 4080
- CPU: [CPU Model] (Note: Add your CPU model for reference)
- Memory: [Memory Size and Speed]
We observe some interesting patterns in performance variability:
-
Operation Size Impact
- Small operations (256x256) show more variability
- Large operations (4096x4096) show more stability
- Medium sizes (1024x1024) often show the most significant changes
-
Operation Type Patterns
- Element-wise operations show the most variability
- Reduction operations show consistent improvements
- Matrix multiplication shows slight regression
- Neural network operations show stable improvements
-
Optimization Opportunities
- Investigate regression in element-wise operations
- Further optimize matrix multiplication for medium-sized matrices
- Consider specialized kernels for 256x256 operations
- Explore additional vectorization for element-wise operations
Based on these benchmark results, we recommend focusing on:
-
Short-term
- Investigate and fix element-wise operation regression
- Optimize matrix multiplication for 1024x1024 size
- Document and stabilize reduction operation improvements
-
Medium-term
- Implement SIMD optimization for element-wise operations
- Develop specialized kernels for common matrix sizes
- Improve memory access patterns for matrix multiplication
- Optimize CNN operations with specialized kernels
- Implement fused operations for common optimizer patterns
-
Long-term
- Automatic kernel selection based on matrix size
- Dynamic operation fusion for common patterns
- Adaptive optimization based on hardware capabilities
- Specialized RNN/LSTM cell implementations
- Optimized attention mechanism implementations
Based on detailed benchmark analysis of CPU and CUDA backends (RTX 4080), here are specific optimization priorities and recommendations:
-
Operation Batching Strategy
- Batch small operations (< 256x256) on CPU backend
- For CUDA operations, minimum batch size recommendation:
- Matrix multiply: 512x512 or larger
- Element-wise: 1024x1024 or larger
- Reduction ops: 2048x2048 or larger
- Implementation: Add operation batching in
src/ops/mod.rs
-
Memory Transfer Optimization
- Current bottleneck: Multiple small transfers for complex operations
- Recommendation: Implement operation fusion for:
- LogSoftmax components (currently 3+ kernel launches)
- Reduction + broadcast patterns
- Element-wise operation chains
- Target: Reduce CUDA kernel launches by 60-70% for common patterns
-
Backend Selection Heuristics Priority operations for each backend based on benchmarks:
CPU Backend Priority:
- Small matrices (< 256x256) element-wise ops
- Small batch neural network ops (batch size < 128)
- Single reduction operations
CUDA Backend Priority:
- Large matrix multiplication (> 1024x1024)
- Batched element-wise operations
- Chained operations that can be fused
-
CUDA Kernel Improvements
- Implement persistent thread blocks for element-wise ops
- Use shared memory for reduction operations
- Add stream support for concurrent kernel execution
- Expected improvement: 20-30% for element-wise ops
-
Memory Management
- Implement smart memory pooling
- Add pinned memory support for CPU-GPU transfers
- Optimize workspace memory allocation
- Expected improvement: 15-25% reduction in allocation overhead
-
Operation Fusion Framework
- Develop pattern recognition for common operation chains
- Implement just-in-time kernel generation
- Create specialized fused kernels for:
- Add + ReLU
- MatMul + Add + ReLU
- Sum + Broadcast
- Expected improvement: 40-50% for fused operations
-
Smart Backend Selection Implement automatic backend selection based on:
- Operation size and type
- Current device memory usage
- Operation chain analysis
- Historical timing data
-
Advanced Memory Management
- Implement zero-copy operations for small transfers
- Add support for unified memory when appropriate
- Develop smart prefetching for common patterns
-
Performance Monitoring System
- Add runtime performance metrics collection
- Implement adaptive optimization strategies
- Create performance regression detection
-
Phase 1: Immediate Optimizations
// Example operation batching interface pub trait BatchedOp { fn batch_size_threshold(&self) -> usize; fn should_use_cuda(&self, input_size: usize) -> bool; }
-
Phase 2: Kernel Fusion
// Example fusion pattern recognition pub trait FusionPattern { fn can_fuse(&self, next_op: &dyn Op) -> bool; fn create_fused_kernel(&self, next_op: &dyn Op) -> Result<Box<dyn Kernel>>; }
-
Phase 3: Smart Backend Selection
// Example backend selection system pub struct BackendSelector { perf_history: HashMap<OpType, Vec<PerformanceMetric>>, memory_tracker: MemoryUsageTracker, }
-
Short-term Targets
- Reduce kernel launches for LogSoftmax by 66%
- Improve small operation performance by 25%
- Reduce memory transfer overhead by 30%
-
Medium-term Targets
- Achieve 90% of theoretical peak performance for large MatMul
- Reduce memory allocation overhead by 50%
- Improve element-wise operation throughput by 40%
-
Long-term Targets
- Automatic operation fusion covering 80% of common patterns
- Dynamic backend selection with 95% accuracy
- Sub-millisecond overhead for small operations
-
Performance Metrics
- Add automatic benchmark regression testing
- Implement continuous performance monitoring
- Create performance visualization tools
-
Validation Strategy
- Regular benchmark runs on various input sizes
- Continuous monitoring of kernel launch patterns
- Memory transfer analysis
-
Documentation Requirements
- Document performance characteristics for each operation
- Maintain up-to-date optimization guides
- Create performance troubleshooting guide
- Matrix multiplication benefits significantly from OpenBLAS (~4x speedup)
- Element-wise operations are memory-bound
- Reduction operations (sum, mean) show good cache utilization
- Small operations (<256x256) may be slower due to kernel launch overhead
- Matrix multiplication shows near-peak performance through cuBLAS
- Memory transfers can be a bottleneck for small tensors
- Element-wise operations show excellent parallelization
# Install perf (Linux)
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
# Profile CPU example
perf record -g --call-graph dwarf ./target/release/examples/train_mnist_cpu
perf report -g 'graph,0.5,caller'
# Profile with OpenBLAS
RUST_BACKTRACE=1 perf record -g --call-graph dwarf ./target/release/examples/train_mnist_cpu --features cpu_openblas
perf report -g 'graph,0.5,caller'# Set up environment
export CUDA_PATH=/usr/local/cuda-11.8 # Adjust version as needed
export CUBLAS_LIB_DIR=$CUDA_PATH/lib64
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:${LD_LIBRARY_PATH:-}
# Profile GPU example with Nsight Systems
nsys profile --stats=true -o mnist_gpu_profile ./target/release/examples/train_mnist_gpu
# Analyze specific CUDA kernels (if needed)
nv-nsight-cu-cli --profile-from-start off -o cuda_kernel_analysis \
--kernel-id ::relu_kernel:1 --kernel-id ::add_kernel:1 \
./target/release/examples/train_mnist_gpuBelow is an example of real benchmark output from running cargo bench --all-features on a CUDA backend for a log_softmax operation with shape [4096, 10]:
[DEBUG rust_tensor_lib::backend::cuda::ops] --- sum_along_axis END ---
[DEBUG rust_tensor_lib::backend::cuda::ops] [DEBUG] log_softmax: sums shape = [4096]
[DEBUG rust_tensor_lib::backend::cuda::ops] [CudaBackend::broadcast_to][CUDA] Input shape: [4096, 1], Output shape: [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::ops] [CudaBackend::broadcast_to][CUDA] n_input: 4096, n_output: 40960
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Creating new CudaStorage with shape [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Before DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] After DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::ops] [CudaBackend::broadcast_to][CUDA] Launching kernel with grid_size=160, block_size=256
[DEBUG rust_tensor_lib::backend::cuda::ops] [CudaBackend::broadcast_to][CUDA] Broadcast completed successfully
[DEBUG rust_tensor_lib::backend::cuda::ops] [DEBUG] log_softmax: sums_broadcast shape = [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Creating new CudaStorage with shape [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Before DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] After DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Creating new CudaStorage with shape [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Before DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] After DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::ops] [DEBUG] log_softmax: finished, output shape = [4096, 10]
nn_ops/gpu_log_softmax_4096x10
time: [1.4624 ms 1.4848 ms 1.5074 ms]
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) low mild
3 (3.00%) high mild
-
Operation Breakdown
- The log shows the complete execution flow of a log_softmax operation
- We can see it involves multiple steps: sum_along_axis, broadcasting, and multiple memory allocations
- Each step is logged with its input/output shapes and execution details
-
Performance Metrics
- Execution time: 1.4848 ms (median) with a range of [1.4624 ms, 1.5074 ms]
- This represents a stable measurement with low variance
- The benchmark ran 100 measurements and identified 7 outliers (7%)
-
Implementation Details
- CUDA kernel configuration: grid_size=160, block_size=256
- Memory allocations: Multiple CudaStorage instances created during the operation
- Shape transformations: [4096] → [4096, 1] → [4096, 10]
-
Optimization Insights
- Multiple memory allocations suggest potential for optimization through memory pooling
- The broadcast operation could be a target for kernel fusion
- Stable timing with few outliers indicates consistent performance
This detailed output is valuable for both performance analysis and debugging. When comparing implementations or optimizing operations, these logs help identify bottlenecks and verify correctness.
-
CPU Backend
- Implement parallel execution for element-wise operations
- Investigate SIMD optimizations for critical paths
- Consider thread pool for batch processing
-
CUDA Backend
- Implement kernel fusion for common operation sequences
- Add stream support for concurrent kernel execution
- Optimize memory access patterns in custom kernels
-
Memory Management
- Implement tensor memory pool
- Investigate zero-copy operations where applicable
- Add support for pinned memory in CPU-GPU transfers
| Operation (Size) | Average Time | Notes |
|---|---|---|
| Matrix Multiply (256x256) | 335.71 µs | Stable performance |
| Matrix Multiply (1024x1024) | 20.88 ms | Good scaling |
| Matrix Multiply (4096x4096) | 1.43 s | Memory-bound |
| Transpose (256x256) | 33.46 µs | Cache-sensitive |
| Transpose (1024x1024) | 2.23 ms | Shows good improvement |
| Transpose (4096x4096) | 187.72 ms | Memory access pattern critical |
| Operation | Average Time | Notes |
|---|---|---|
| Add | 5.30 µs | Efficient |
| Multiply | 5.35 µs | Comparable to add |
| Divide | 10.38 µs | 2x slower than mul |
| Subtract | 5.43 µs | Slight regression noted |
| ReLU | 3.92 µs | Fastest element-wise op |
| Operation | Average Time | Notes |
|---|---|---|
| Add | 414.32 µs | Linear scaling |
| Multiply | 424.41 µs | Good performance |
| Divide | 918.13 µs | 2x overhead maintained |
| Subtract | 415.20 µs | Similar to add |
| ReLU | 143.42 µs | Excellent throughput |
| Operation (Size) | Average Time | Notes |
|---|---|---|
| Sum Global (256x256) | 4.81 µs | Very efficient |
| Mean Global (256x256) | 4.66 µs | Similar to sum |
| Sum Axis-0 (256x256) | 13.96 µs | Higher overhead |
| Mean Axis-0 (256x256) | 14.75 µs | Similar to axis sum |
| Sum Global (1024x1024) | 79.23 µs | Good scaling |
| Mean Global (1024x1024) | 78.86 µs | Consistent with sum |
| Sum Axis-0 (1024x1024) | 106.01 µs | Expected overhead |
| Mean Axis-0 (1024x1024) | 111.76 µs | Slight overhead over sum |
| Operation (Size) | Average Time | Notes |
|---|---|---|
| LogSoftmax (256x10) | 11.44 µs | Good for small batches |
| LogSoftmax (1024x10) | 42.11 µs | Linear scaling |
| LogSoftmax (4096x10) | 160.04 µs | Maintains efficiency |
-
Matrix Operations
- Matrix multiplication shows expected O(n³) scaling
- Transpose performance is heavily affected by matrix size due to cache patterns
- Large matrices (4096x4096) show memory bandwidth limitations
-
Element-wise Operations
- Division is consistently ~2x slower than other arithmetic operations
- ReLU shows excellent performance due to simple comparison
- Good scaling with matrix size, indicating effective memory access patterns
-
Reduction Operations
- Global reductions (sum/mean) are highly optimized
- Axis reductions show expected overhead from more complex access patterns
- Performance scales well with input size
-
Neural Network Operations
- LogSoftmax shows good performance for typical batch sizes
- Linear scaling with batch size suggests good vectorization
-
Short-term
- Implement SIMD for element-wise operations
- Optimize transpose for better cache utilization
- Consider parallel execution for large matrices
- Optimize hook execution for minimal overhead
-
Medium-term
- Add OpenBLAS support for matrix multiplication
- Implement memory pool to reduce allocation overhead
- Optimize reduction operations with better vectorization
- Implement lazy evaluation for chained operator overloading expressions
-
Long-term
- Explore automatic operation fusion
- Implement multi-threading for large operations
- Consider specialized kernels for common operation sequences
- Develop a JIT compiler for optimizing tensor method chains
-
Operator Overloading
- Use operator overloading for improved readability in most cases
- For performance-critical inner loops, consider direct ops calls
- Batch operations where possible to amortize the small overhead
-
Tensor Methods
- Prefer tensor methods for single operations on tensors
- Use method chaining for improved readability
- Consider implementing a lazy evaluation system for method chains
-
Hooks System
- Use hooks primarily during development and debugging
- Disable hooks in production code for maximum performance
- Implement conditional hook execution based on tensor size
- Consider a hook batching system for reduced overhead
Note: All benchmarks were run on a Linux system using the default CPU backend. Your results may vary based on hardware and system configuration.