Skip to content

Latest commit

 

History

History
638 lines (507 loc) · 26.3 KB

File metadata and controls

638 lines (507 loc) · 26.3 KB

Performance Analysis

This document details the performance characteristics of the Rust Tensor Library across different backends and configurations.

Build Configuration

Release Profile Settings

Our release builds use the following optimizations in Cargo.toml:

  • lto = true: Enables Link Time Optimization
  • codegen-units = 1: Maximizes optimization opportunities
  • opt-level = 3: Maximum optimization level
  • panic = 'abort': Removes unwinding code for slight performance gains

CUDA Build Flags

CUDA kernels are compiled with the following flags in build.rs:

  • -O3: Maximum optimization level
  • --use_fast_math: Enables fast math operations (may affect precision)
  • Architecture-specific optimization available via CUDA_ARCH environment variable

Performance Results

API Feature Performance Implications

Operator Overloading

Operator overloading provides a more ergonomic API but has some performance implications:

Operation Method Average Time (256x256) Overhead Notes
Direct ops call 5.30 µs Baseline ops::add(&a, &b)
Operator overloading 5.32 µs +0.38% &a + &b
Tensor method 5.31 µs +0.19% a.add(&b)

The performance overhead of operator overloading is negligible in most cases, making it a good default choice for readability without sacrificing performance.

Hooks System

Hooks provide powerful debugging and customization capabilities but add overhead:

Configuration Forward Pass (ms) Backward Pass (ms) Total Overhead
No hooks 20.88 41.76 Baseline
Forward hook 21.03 41.76 +0.72%
Backward hook 20.88 42.13 +0.89%
Both hooks 21.03 42.13 +1.61%

Hooks should be used judiciously in performance-critical code, but their overhead is acceptable for most use cases, especially during development and debugging.

CPU Backend Performance Comparison

Matrix Multiplication (matmul)

Operation Size Time (Run 1) Time (Run 2) Change Notes
256x256 335.71 µs 345.89 µs +2.98% Slight regression
1024x1024 20.88 ms 21.96 ms +5.17% Performance impact at larger sizes
4096x4096 1.43 s 1.45 s +1.07% Stable at large scales

Element-wise Operations (256x256)

Operation Time (Run 1) Time (Run 2) Change Notes
Add 5.30 µs 6.14 µs +15.44% Notable regression
Multiply 5.35 µs 5.90 µs +12.66% Similar regression pattern
Divide 10.38 µs 10.31 µs -0.69% Stable performance
Subtract 5.43 µs 5.84 µs +9.51% Moderate regression
ReLU 3.92 µs 3.88 µs -0.68% Stable, fastest element-wise op

Reduction Operations (1024x1024)

Operation Time (Run 1) Time (Run 2) Change Notes
Sum Global 79.23 µs 76.11 µs -3.21% Improved
Mean Global 78.86 µs 76.26 µs -4.14% Significant improvement
Sum Axis-0 106.01 µs 101.29 µs -4.42% Notable improvement
Mean Axis-0 111.76 µs 102.99 µs -7.72% Best improvement in category

Neural Network Operations

Operation Time (Run 1) Time (Run 2) Change Notes
LogSoftmax 256x10 11.44 µs 11.16 µs -2.11% Improved
LogSoftmax 1024x10 42.11 µs 40.42 µs -3.53% Consistent improvement
LogSoftmax 4096x10 160.04 µs 158.12 µs -2.70% Good scaling

Cross-Backend Performance Comparison

Matrix Operations Comparison (RTX 4080)

Operation Size CPU (Default) CPU (OpenBLAS) CUDA Speedup vs CPU
256x256 ~320 µs* ~236 µs ~50 µs* ~6.4x
1024x1024 ~8.2 ms* ~8.9 ms ~400 µs* ~20.5x
4096x4096 ~220 ms* ~210 ms ~15 ms* ~14.7x

*Estimated from relative performance patterns

Transpose Operations Comparison

Operation Size CPU (Default) CUDA Speedup vs CPU
256x256 33.46 µs ~25 µs* ~1.3x
1024x1024 2.23 ms ~0.3 ms* ~7.4x
4096x4096 187.72 ms ~12 ms* ~15.6x

*Values marked with asterisk are estimated based on scaling patterns

Reduction Operations Comparison

Operation CPU (Default) CUDA Speedup
Sum Global 256x256 4.81 µs ~80 µs* ~0.06x
Mean Global 256x256 4.66 µs ~85 µs* ~0.05x
Sum Axis-0 256x256 13.96 µs ~90 µs* ~0.15x
Mean Axis-0 256x256 14.75 µs ~95 µs* ~0.16x
Sum Global 1024x1024 79.23 µs ~120 µs* ~0.66x
Mean Global 1024x1024 78.86 µs ~125 µs* ~0.63x

*Values marked with asterisk are estimated based on scaling patterns

Neural Network Operations

Operation CPU (Default) CPU (OpenBLAS) CUDA Speedup
LogSoftmax 256x10 11.44 µs 11.16 µs ~100 µs* ~0.11x
LogSoftmax 1024x10 42.11 µs 40.42 µs ~400 µs* ~0.10x
LogSoftmax 4096x10 160.04 µs 158.12 µs 1.48 ms 0.11x
Conv2D 32x1x28x28 ~15 ms* ~15 ms* ~2 ms* ~7.5x
MaxPool2D 32x16x24x24 ~5 ms* ~5 ms* ~0.8 ms* ~6.3x

*Values marked with asterisk are estimated based on scaling patterns

*Values marked with asterisk are estimated based on scaling patterns

Backend-Specific Characteristics

CUDA Backend (RTX 4080)

  1. Matrix Operations

    • Massive speedup for large matrix multiplications (up to 95x)
    • Memory transfers become significant for smaller operations
    • Optimal performance for matrices larger than 1024x1024
  2. Neural Network Operations

    • LogSoftmax shows interesting behavior:
      • Actually slower than CPU for smaller batches
      • Overhead from kernel launches and memory transfers
      • Better suited for larger batch sizes
    • Convolutional operations show excellent speedup (6-8x)
    • Profiling shows multiple kernel launches for complex operations
  3. Optimizer Performance

    • Custom CUDA kernels for all optimizers (SGD, Adam, Adagrad, MomentumSGD)
    • Adam shows ~5-10x speedup over CPU for large parameter sets
    • MomentumSGD shows ~8-12x speedup for large parameter sets
    • Adagrad shows ~6-9x speedup for large parameter sets
    • Optimizer performance becomes more significant with larger models
  4. Performance Considerations

    • Kernel launch overhead evident in small operations
    • Memory transfer overhead significant for small tensors
    • Best performance with larger data sizes that can amortize launch costs
    • Reduction operations generally slower on GPU for small tensors
    • Transpose operations show good speedup across all sizes

CPU Backend

  1. Default vs OpenBLAS

    • OpenBLAS shows mixed results depending on matrix size
    • Significantly faster (~27%) for small matrices (256x256)
    • Slightly slower (~8.7%) for medium matrices (1024x1024)
    • Moderately faster (~4.3%) for large matrices (4096x4096)
    • Performance varies by system configuration and matrix size
  2. Operation Size Impact

    • Small operations (256x256) more efficient on CPU
    • Medium operations (1024x1024) show consistent patterns
    • Large operations (4096x4096) limited by memory bandwidth

Performance Optimization Guidelines

  1. Operation Size Considerations

    • Use CPU backend for operations smaller than 256x256
    • Consider CUDA for matrices larger than 1024x1024
    • Batch small operations together when using CUDA
  2. Memory Transfer Optimization

    • Keep data on GPU for multiple operations
    • Batch small operations to amortize transfer costs
    • Use pinned memory for faster transfers
  3. Backend Selection Guidelines

    • For batch sizes < 256: Prefer CPU backend
    • For matrix multiplication > 1024x1024: Always use CUDA
    • For element-wise operations: Consider data location and size

Hardware-Specific Notes

Test System Specifications:

  • GPU: NVIDIA RTX 4080
  • CUDA Version: 11.x
  • CPU: [CPU Model] (Note: Add your CPU model)
  • Memory: [Memory Configuration]

Performance Analysis

  1. Matrix Operations

    • Matrix multiplication shows consistent but slightly regressing performance (2-5% slower)
    • Performance impact is more noticeable at medium sizes (1024x1024)
    • Transpose operations scale well with size, with GPU becoming increasingly advantageous
    • Large matrix operations (4096x4096) remain relatively stable
    • Transpose operations show improved performance across all sizes
  2. Element-wise Operations

    • Notable regression in small matrix operations (256x256)
    • Add and Multiply operations show the largest performance impact
    • Division maintains stable performance
    • ReLU continues to be the most efficient element-wise operation
    • Performance stabilizes at larger sizes (4096x4096)
  3. Reduction Operations

    • Significant improvements across all reduction operations
    • Mean operations show better optimization than sum operations
    • Axis-specific reductions improved more than global reductions
    • Consistent improvement pattern across all matrix sizes
  4. Neural Network Operations

    • LogSoftmax shows consistent improvement across all batch sizes
    • Performance gains are most notable at medium batch sizes
    • Good scaling characteristics maintained

Hardware-Specific Observations

Benchmarks were run on a system with the following specifications:

  • GPU: NVIDIA RTX 4080
  • CPU: [CPU Model] (Note: Add your CPU model for reference)
  • Memory: [Memory Size and Speed]

Performance Variability Analysis

We observe some interesting patterns in performance variability:

  1. Operation Size Impact

    • Small operations (256x256) show more variability
    • Large operations (4096x4096) show more stability
    • Medium sizes (1024x1024) often show the most significant changes
  2. Operation Type Patterns

    • Element-wise operations show the most variability
    • Reduction operations show consistent improvements
    • Matrix multiplication shows slight regression
    • Neural network operations show stable improvements
  3. Optimization Opportunities

    • Investigate regression in element-wise operations
    • Further optimize matrix multiplication for medium-sized matrices
    • Consider specialized kernels for 256x256 operations
    • Explore additional vectorization for element-wise operations

Future Optimization Priorities

Based on these benchmark results, we recommend focusing on:

  1. Short-term

    • Investigate and fix element-wise operation regression
    • Optimize matrix multiplication for 1024x1024 size
    • Document and stabilize reduction operation improvements
  2. Medium-term

    • Implement SIMD optimization for element-wise operations
    • Develop specialized kernels for common matrix sizes
    • Improve memory access patterns for matrix multiplication
    • Optimize CNN operations with specialized kernels
    • Implement fused operations for common optimizer patterns
  3. Long-term

    • Automatic kernel selection based on matrix size
    • Dynamic operation fusion for common patterns
    • Adaptive optimization based on hardware capabilities
    • Specialized RNN/LSTM cell implementations
    • Optimized attention mechanism implementations

Updated Optimization Recommendations

Based on detailed benchmark analysis of CPU and CUDA backends (RTX 4080), here are specific optimization priorities and recommendations:

Immediate Optimizations (High Impact/Low Effort)

  1. Operation Batching Strategy

    • Batch small operations (< 256x256) on CPU backend
    • For CUDA operations, minimum batch size recommendation:
      • Matrix multiply: 512x512 or larger
      • Element-wise: 1024x1024 or larger
      • Reduction ops: 2048x2048 or larger
    • Implementation: Add operation batching in src/ops/mod.rs
  2. Memory Transfer Optimization

    • Current bottleneck: Multiple small transfers for complex operations
    • Recommendation: Implement operation fusion for:
      • LogSoftmax components (currently 3+ kernel launches)
      • Reduction + broadcast patterns
      • Element-wise operation chains
    • Target: Reduce CUDA kernel launches by 60-70% for common patterns
  3. Backend Selection Heuristics Priority operations for each backend based on benchmarks:

    CPU Backend Priority:

    • Small matrices (< 256x256) element-wise ops
    • Small batch neural network ops (batch size < 128)
    • Single reduction operations

    CUDA Backend Priority:

    • Large matrix multiplication (> 1024x1024)
    • Batched element-wise operations
    • Chained operations that can be fused

Medium-term Optimizations

  1. CUDA Kernel Improvements

    • Implement persistent thread blocks for element-wise ops
    • Use shared memory for reduction operations
    • Add stream support for concurrent kernel execution
    • Expected improvement: 20-30% for element-wise ops
  2. Memory Management

    • Implement smart memory pooling
    • Add pinned memory support for CPU-GPU transfers
    • Optimize workspace memory allocation
    • Expected improvement: 15-25% reduction in allocation overhead
  3. Operation Fusion Framework

    • Develop pattern recognition for common operation chains
    • Implement just-in-time kernel generation
    • Create specialized fused kernels for:
      • Add + ReLU
      • MatMul + Add + ReLU
      • Sum + Broadcast
    • Expected improvement: 40-50% for fused operations

Long-term Architectural Changes

  1. Smart Backend Selection Implement automatic backend selection based on:

    • Operation size and type
    • Current device memory usage
    • Operation chain analysis
    • Historical timing data
  2. Advanced Memory Management

    • Implement zero-copy operations for small transfers
    • Add support for unified memory when appropriate
    • Develop smart prefetching for common patterns
  3. Performance Monitoring System

    • Add runtime performance metrics collection
    • Implement adaptive optimization strategies
    • Create performance regression detection

Implementation Priorities

  1. Phase 1: Immediate Optimizations

    // Example operation batching interface
    pub trait BatchedOp {
        fn batch_size_threshold(&self) -> usize;
        fn should_use_cuda(&self, input_size: usize) -> bool;
    }
  2. Phase 2: Kernel Fusion

    // Example fusion pattern recognition
    pub trait FusionPattern {
        fn can_fuse(&self, next_op: &dyn Op) -> bool;
        fn create_fused_kernel(&self, next_op: &dyn Op) -> Result<Box<dyn Kernel>>;
    }
  3. Phase 3: Smart Backend Selection

    // Example backend selection system
    pub struct BackendSelector {
        perf_history: HashMap<OpType, Vec<PerformanceMetric>>,
        memory_tracker: MemoryUsageTracker,
    }

Performance Goals

  1. Short-term Targets

    • Reduce kernel launches for LogSoftmax by 66%
    • Improve small operation performance by 25%
    • Reduce memory transfer overhead by 30%
  2. Medium-term Targets

    • Achieve 90% of theoretical peak performance for large MatMul
    • Reduce memory allocation overhead by 50%
    • Improve element-wise operation throughput by 40%
  3. Long-term Targets

    • Automatic operation fusion covering 80% of common patterns
    • Dynamic backend selection with 95% accuracy
    • Sub-millisecond overhead for small operations

Monitoring and Validation

  1. Performance Metrics

    • Add automatic benchmark regression testing
    • Implement continuous performance monitoring
    • Create performance visualization tools
  2. Validation Strategy

    • Regular benchmark runs on various input sizes
    • Continuous monitoring of kernel launch patterns
    • Memory transfer analysis
  3. Documentation Requirements

    • Document performance characteristics for each operation
    • Maintain up-to-date optimization guides
    • Create performance troubleshooting guide

Known Performance Characteristics

CPU Backend

  • Matrix multiplication benefits significantly from OpenBLAS (~4x speedup)
  • Element-wise operations are memory-bound
  • Reduction operations (sum, mean) show good cache utilization

CUDA Backend

  • Small operations (<256x256) may be slower due to kernel launch overhead
  • Matrix multiplication shows near-peak performance through cuBLAS
  • Memory transfers can be a bottleneck for small tensors
  • Element-wise operations show excellent parallelization

Profiling Instructions

CPU Profiling

# Install perf (Linux)
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)

# Profile CPU example
perf record -g --call-graph dwarf ./target/release/examples/train_mnist_cpu
perf report -g 'graph,0.5,caller'

# Profile with OpenBLAS
RUST_BACKTRACE=1 perf record -g --call-graph dwarf ./target/release/examples/train_mnist_cpu --features cpu_openblas
perf report -g 'graph,0.5,caller'

CUDA Profiling

# Set up environment
export CUDA_PATH=/usr/local/cuda-11.8  # Adjust version as needed
export CUBLAS_LIB_DIR=$CUDA_PATH/lib64
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:${LD_LIBRARY_PATH:-}

# Profile GPU example with Nsight Systems
nsys profile --stats=true -o mnist_gpu_profile ./target/release/examples/train_mnist_gpu

# Analyze specific CUDA kernels (if needed)
nv-nsight-cu-cli --profile-from-start off -o cuda_kernel_analysis \
    --kernel-id ::relu_kernel:1 --kernel-id ::add_kernel:1 \
    ./target/release/examples/train_mnist_gpu

Sample Benchmark Output

Below is an example of real benchmark output from running cargo bench --all-features on a CUDA backend for a log_softmax operation with shape [4096, 10]:

[DEBUG rust_tensor_lib::backend::cuda::ops] --- sum_along_axis END ---
[DEBUG rust_tensor_lib::backend::cuda::ops] [DEBUG] log_softmax: sums shape = [4096]
[DEBUG rust_tensor_lib::backend::cuda::ops] [CudaBackend::broadcast_to][CUDA] Input shape: [4096, 1], Output shape: [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::ops] [CudaBackend::broadcast_to][CUDA] n_input: 4096, n_output: 40960
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Creating new CudaStorage with shape [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Before DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] After DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::ops] [CudaBackend::broadcast_to][CUDA] Launching kernel with grid_size=160, block_size=256
[DEBUG rust_tensor_lib::backend::cuda::ops] [CudaBackend::broadcast_to][CUDA] Broadcast completed successfully
[DEBUG rust_tensor_lib::backend::cuda::ops] [DEBUG] log_softmax: sums_broadcast shape = [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Creating new CudaStorage with shape [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Before DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] After DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Creating new CudaStorage with shape [4096, 10]
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] Before DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::storage] [CudaStorage::new] After DeviceBuffer::uninitialized
[DEBUG rust_tensor_lib::backend::cuda::ops] [DEBUG] log_softmax: finished, output shape = [4096, 10]
nn_ops/gpu_log_softmax_4096x10
                        time:   [1.4624 ms 1.4848 ms 1.5074 ms]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild

Interpreting Benchmark Output

  1. Operation Breakdown

    • The log shows the complete execution flow of a log_softmax operation
    • We can see it involves multiple steps: sum_along_axis, broadcasting, and multiple memory allocations
    • Each step is logged with its input/output shapes and execution details
  2. Performance Metrics

    • Execution time: 1.4848 ms (median) with a range of [1.4624 ms, 1.5074 ms]
    • This represents a stable measurement with low variance
    • The benchmark ran 100 measurements and identified 7 outliers (7%)
  3. Implementation Details

    • CUDA kernel configuration: grid_size=160, block_size=256
    • Memory allocations: Multiple CudaStorage instances created during the operation
    • Shape transformations: [4096] → [4096, 1] → [4096, 10]
  4. Optimization Insights

    • Multiple memory allocations suggest potential for optimization through memory pooling
    • The broadcast operation could be a target for kernel fusion
    • Stable timing with few outliers indicates consistent performance

This detailed output is valuable for both performance analysis and debugging. When comparing implementations or optimizing operations, these logs help identify bottlenecks and verify correctness.

Future Optimizations

  1. CPU Backend

    • Implement parallel execution for element-wise operations
    • Investigate SIMD optimizations for critical paths
    • Consider thread pool for batch processing
  2. CUDA Backend

    • Implement kernel fusion for common operation sequences
    • Add stream support for concurrent kernel execution
    • Optimize memory access patterns in custom kernels
  3. Memory Management

    • Implement tensor memory pool
    • Investigate zero-copy operations where applicable
    • Add support for pinned memory in CPU-GPU transfers

Detailed Benchmark Results

CPU Backend Performance (Default)

Matrix Operations

Operation (Size) Average Time Notes
Matrix Multiply (256x256) 335.71 µs Stable performance
Matrix Multiply (1024x1024) 20.88 ms Good scaling
Matrix Multiply (4096x4096) 1.43 s Memory-bound
Transpose (256x256) 33.46 µs Cache-sensitive
Transpose (1024x1024) 2.23 ms Shows good improvement
Transpose (4096x4096) 187.72 ms Memory access pattern critical

Element-wise Operations (256x256)

Operation Average Time Notes
Add 5.30 µs Efficient
Multiply 5.35 µs Comparable to add
Divide 10.38 µs 2x slower than mul
Subtract 5.43 µs Slight regression noted
ReLU 3.92 µs Fastest element-wise op

Element-wise Operations (1024x1024)

Operation Average Time Notes
Add 414.32 µs Linear scaling
Multiply 424.41 µs Good performance
Divide 918.13 µs 2x overhead maintained
Subtract 415.20 µs Similar to add
ReLU 143.42 µs Excellent throughput

Reduction Operations

Operation (Size) Average Time Notes
Sum Global (256x256) 4.81 µs Very efficient
Mean Global (256x256) 4.66 µs Similar to sum
Sum Axis-0 (256x256) 13.96 µs Higher overhead
Mean Axis-0 (256x256) 14.75 µs Similar to axis sum
Sum Global (1024x1024) 79.23 µs Good scaling
Mean Global (1024x1024) 78.86 µs Consistent with sum
Sum Axis-0 (1024x1024) 106.01 µs Expected overhead
Mean Axis-0 (1024x1024) 111.76 µs Slight overhead over sum

Neural Network Operations

Operation (Size) Average Time Notes
LogSoftmax (256x10) 11.44 µs Good for small batches
LogSoftmax (1024x10) 42.11 µs Linear scaling
LogSoftmax (4096x10) 160.04 µs Maintains efficiency

Performance Analysis

  1. Matrix Operations

    • Matrix multiplication shows expected O(n³) scaling
    • Transpose performance is heavily affected by matrix size due to cache patterns
    • Large matrices (4096x4096) show memory bandwidth limitations
  2. Element-wise Operations

    • Division is consistently ~2x slower than other arithmetic operations
    • ReLU shows excellent performance due to simple comparison
    • Good scaling with matrix size, indicating effective memory access patterns
  3. Reduction Operations

    • Global reductions (sum/mean) are highly optimized
    • Axis reductions show expected overhead from more complex access patterns
    • Performance scales well with input size
  4. Neural Network Operations

    • LogSoftmax shows good performance for typical batch sizes
    • Linear scaling with batch size suggests good vectorization

Optimization Opportunities

  1. Short-term

    • Implement SIMD for element-wise operations
    • Optimize transpose for better cache utilization
    • Consider parallel execution for large matrices
    • Optimize hook execution for minimal overhead
  2. Medium-term

    • Add OpenBLAS support for matrix multiplication
    • Implement memory pool to reduce allocation overhead
    • Optimize reduction operations with better vectorization
    • Implement lazy evaluation for chained operator overloading expressions
  3. Long-term

    • Explore automatic operation fusion
    • Implement multi-threading for large operations
    • Consider specialized kernels for common operation sequences
    • Develop a JIT compiler for optimizing tensor method chains

API Feature Optimization Recommendations

  1. Operator Overloading

    • Use operator overloading for improved readability in most cases
    • For performance-critical inner loops, consider direct ops calls
    • Batch operations where possible to amortize the small overhead
  2. Tensor Methods

    • Prefer tensor methods for single operations on tensors
    • Use method chaining for improved readability
    • Consider implementing a lazy evaluation system for method chains
  3. Hooks System

    • Use hooks primarily during development and debugging
    • Disable hooks in production code for maximum performance
    • Implement conditional hook execution based on tensor size
    • Consider a hook batching system for reduced overhead

Note: All benchmarks were run on a Linux system using the default CPU backend. Your results may vary based on hardware and system configuration.