Daily Perf Improver - Research and Plan #4
Replies: 16 comments
-
Daily Perf Improver - Progress Update (2025-10-12)
Summary
I reviewed the current state of performance optimization work and found that 6 open PRs already cover the majority of high-priority optimization opportunities from the performance plan. No new PR was created this time to avoid duplicating existing work.
Existing Open PRs
The following performance improvements are already in progress:
Coverage Analysis
From the original performance plan in Discussion #4:
Phase 1 Items:
Phase 2 Items:
Phase 3 Items:
Observations
Recommendations
Next Steps
Will monitor the existing PRs and wait for maintainer feedback before pursuing additional optimizations, to avoid duplicate effort and ensure alignment with project goals.
🤖 Generated by Daily Perf Improver
-
Daily Perf Improver - Progress Update (2025-10-12)
Summary
After reviewing the current state of performance optimization work, I found that all major optimization opportunities from the performance plan are currently covered by the 6 open PRs. No new PR was created to avoid duplicating existing work.
Current State Assessment
Build Environment: ✅ Working correctly
Existing Open PRs (Performance-Related):
Coverage Analysis
From the original performance plan in Discussion #4:
Phase 1 Items:
Phase 2 Items:
Phase 3 Items:
Key Observations
Recommendations
Next Steps
Will monitor existing PRs and wait for maintainer feedback before pursuing additional performance optimizations to ensure alignment with project goals and avoid duplicate or unwanted work.
🤖 Generated by Daily Perf Improver
-
Daily Perf Improver - Progress Update (2025-10-13)
Reviewed current performance optimization state. Found 6 open PRs covering most Phase 1 & Phase 2 items. No new PR created to avoid duplicating work and allow maintainer review of existing contributions.
-
Daily Perf Improver - Progress Update (2025-10-14)
Reviewed current performance optimization state. Found 6 open PRs covering most Phase 1 & Phase 2 work. No new PR created to allow maintainer review of existing contributions and avoid duplicate effort.
-
Optimized QR decomposition with SIMD Householder transformations, achieving 19-44% speedup. See PR for details.
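For context, a hypothetical sketch of the kind of kernel SIMD accelerates in QR factorization: applying a Householder reflector H = I - tau * v * vᵀ to one column segment, i.e. a dot product followed by an axpy-style update. This is an assumed textbook formulation, not the PR's actual code.

```fsharp
open System.Numerics

// Apply a Householder reflector to column segment x in place (illustrative only).
let applyHouseholder (tau: float) (v: float[]) (x: float[]) =
    let n = x.Length
    let width = Vector<float>.Count
    // dot = v^T x, accumulated blockwise with Vector<'T> where supported
    let mutable accVec = Vector<float>.Zero
    let mutable i = 0
    if Vector.IsHardwareAccelerated then
        while i <= n - width do
            accVec <- accVec + Vector<float>(v, i) * Vector<float>(x, i)
            i <- i + width
    let mutable dot = Vector.Sum(accVec)
    for j = i to n - 1 do
        dot <- dot + v.[j] * x.[j]
    // x <- x - (tau * dot) * v, again blockwise
    let scale = tau * dot
    let scaleVec = Vector<float>(scale)
    let mutable k = 0
    if Vector.IsHardwareAccelerated then
        while k <= n - width do
            (Vector<float>(x, k) - scaleVec * Vector<float>(v, k)).CopyTo(x, k)
            k <- k + width
    for j = k to n - 1 do
        x.[j] <- x.[j] - scale * v.[j]
```

Both the dot product and the update run over full-length contiguous segments, which is why QR responds well to SIMD, in contrast to the Cholesky and EVD findings later in this thread.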
-
Optimized LU decomposition with SIMD, achieving 43-60% speedup for 30×30 and 50×50 matrices. Created draft PR for review.
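For context, a hypothetical sketch (assumed standard Gaussian-elimination form, not the PR's actual code) of the trailing-row update that SIMD accelerates in LU factorization:

```fsharp
open System.Numerics

// After choosing pivot row k, each lower row is updated as
// row_i <- row_i - m * row_k across the remaining columns (illustrative only).
let eliminateRow (m: float) (pivotRow: float[]) (row: float[]) (startCol: int) =
    let n = row.Length
    let width = Vector<float>.Count
    let mVec = Vector<float>(m)
    let mutable j = startCol
    if Vector.IsHardwareAccelerated then
        while j <= n - width do
            (Vector<float>(row, j) - mVec * Vector<float>(pivotRow, j)).CopyTo(row, j)
            j <- j + width
    for c = j to n - 1 do
        row.[c] <- row.[c] - m * pivotRow.[c]
```

Each update touches a contiguous row segment whose length stays large for most of the factorization, matching the access pattern that benefited QR.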
-
Daily Perf Improver - Investigation Report (2025-10-17)
Summary
Investigated SIMD optimization for Cholesky decomposition but found it causes performance regression across all tested matrix sizes. No PR created. This work helps document which optimizations are not beneficial.
Work Performed
Goal Selected: Optimize Cholesky decomposition (Phase 3, Linear Algebra Optimizations)
Rationale: Following the successful optimization of QR (PR #71) and LU (PR #75) decompositions, Cholesky was the next linear algebra target. The algorithm has clear dot product patterns that seemed amenable to SIMD optimization.
Approach Taken:
Performance Measurements
Baseline (Original Scalar Implementation):
After SIMD Optimization (Naive):
After Adding Threshold (simdThreshold = 8):
Why SIMD Doesn't Help Here
Root Causes:
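A minimal scalar sketch of the standard Cholesky-Banachiewicz recurrence (assumed textbook form, not FsMath's actual implementation) illustrates the core issue: the inner dot products at column j have length j, so most of them are shorter than a SIMD register for the matrix sizes benchmarked.

```fsharp
// Illustrative scalar Cholesky factorization of a symmetric positive-definite matrix.
let choleskyLower (a: float[,]) =
    let n = Array2D.length1 a
    let l : float[,] = Array2D.zeroCreate n n
    for j = 0 to n - 1 do
        // diagonal entry: length-j dot product of row j with itself
        let mutable d = 0.0
        for k = 0 to j - 1 do
            d <- d + l.[j, k] * l.[j, k]
        l.[j, j] <- sqrt (a.[j, j] - d)
        // below-diagonal entries: more length-j dot products
        for i = j + 1 to n - 1 do
            let mutable s = 0.0
            for k = 0 to j - 1 do
                s <- s + l.[i, k] * l.[j, k]
            l.[i, j] <- (a.[i, j] - s) / l.[j, j]
    l
```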
Technical Details
Benchmark Environment:
Commands Used:
# Baseline
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*Cholesky*" --job short
# Created branch
git checkout -b "perf/optimize-cholesky-simd-20251017-030639-c59c9847"
# After optimization (tested twice with different thresholds)
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*Cholesky*" --job short
Lessons Learned
Alternative Approaches (Not Pursued)
Possible future optimizations that might work better:
However, given that Cholesky is already quite fast (< 17 μs for 50×50), these would be low priority.
Next Steps
Based on the performance plan from Discussion #4, remaining Phase 3 work includes:
Conclusion
While QR and LU decompositions benefited significantly from SIMD optimization (19-60% speedup), Cholesky decomposition shows that SIMD isn't universally beneficial. The algorithm's incremental vector growth pattern makes it a poor fit for SIMD optimization, and the existing scalar implementation is already quite efficient. This investigation demonstrates the importance of benchmarking and validates the current implementation's approach.
🤖 Generated by Daily Perf Improver
-
Daily Perf Improver - Investigation Report (2025-10-20)
Summary
Investigated SIMD optimization for Eigenvalue Decomposition (EVD) but determined it would likely cause performance regression similar to Cholesky. No PR created. This work helps document which Phase 3 optimizations are not beneficial for the current algorithm implementations.
Work Performed
Goal Selected: Optimize EVD (Phase 3, Linear Algebra Optimizations)
Rationale: Following the successful optimization of QR (PR #71, 19-44% speedup) and LU (PR #75, 43-60% speedup) decompositions, EVD was the next linear algebra target in Phase 3. However, after detailed analysis, the algorithm structure makes it a poor candidate for SIMD optimization.
Approach Taken:
Performance Baseline
Test Environment:
Current Performance (No Changes Made):
Why SIMD Optimization Is Not Beneficial for EVD
Algorithm Structure Analysis: EVD uses a two-phase approach:
Both phases have characteristics that make SIMD optimization problematic:
1. Incrementally Decreasing Loop Bounds
The tred2 phase iterates over working vectors whose length shrinks at every step.
Most iterations work with small vectors where SIMD setup overhead dominates any parallel processing benefit. This is the exact same pattern that caused regression in the Cholesky investigation (see comment from 2025-10-17).
2. Complex Iterative QL Algorithm
The tql2 phase uses an iterative QL algorithm with:
Example from lines 166-219: The recursive
3. Strided Memory Access Patterns
Many operations access the 2D array
4. Limited Full-Vector Operations
Unlike QR and LU where we have consistent row operations on full-length vectors, EVD's operations are mostly:
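To make the short-vector and strided-access points concrete, here is a hypothetical illustration (assumed row-major flat array layout, not FsMath's internal representation) of why column-wise segments resist SIMD while row segments do not:

```fsharp
open System.Numerics

// Contiguous row segment: a single Vector<float> load (caller ensures bounds).
let loadRowSegment (data: float[]) (cols: int) (row: int) (col: int) =
    Vector<float>(data, row * cols + col)

// Strided column segment: a scalar gather into a temporary is needed before any
// Vector<'T> arithmetic, which erases the benefit for the short columns that
// tred2/tql2 work with.
let loadColumnSegment (data: float[]) (cols: int) (row: int) (col: int) =
    let width = Vector<float>.Count
    let tmp = Array.init width (fun k -> data.[(row + k) * cols + col])
    Vector<float>(tmp)
```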
Comparison with Successful Optimizations
Why QR and LU Worked:
Why Cholesky and EVD Don't Work:
Attempted Optimization Points Considered
None of these would provide net benefit given the algorithm structure.
Alternative Approaches (Not Pursued)
Possible future optimizations that might work better:
Current State Assessment
Based on the performance plan from Discussion #4, Phase 3 Linear Algebra work status:
Recommendations
Commands Used
# Baseline benchmarking
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*EVD*" --job short
# Code analysis
cat src/FsMath/Algebra/EVD.fs
# Created and deleted test branch (no changes committed)
git checkout -b perf/optimize-evd-simd-20251020-032150-aaed4fc
git checkout main
git branch -D perf/optimize-evd-simd-20251020-032150-aaed4fc
Conclusion
While QR and LU decompositions benefited significantly from SIMD optimization (19-60% speedup), EVD shows similar structural challenges to Cholesky that make SIMD optimization counter-productive. The algorithm's incrementally decreasing vector lengths, complex branching in the iterative QL algorithm, and strided memory access patterns mean SIMD overhead would dominate any parallel processing benefits. The existing EVD implementation is reasonably efficient for this algorithm. Significant improvements would require either:
This investigation validates the current implementation's approach and helps narrow the focus for future Phase 3 work toward more promising optimization targets.
🤖 Generated by Daily Perf Improver
-
Added SVD benchmarks to the benchmark suite. Established baseline performance metrics without attempting optimization (SVD has similar algorithmic structure to Cholesky/EVD, which showed regressions).
-
Daily Perf Improver - Investigation Report (2025-10-22)
Summary
Investigated the matrix inverse operation for potential optimization opportunities. The inverse operation already benefits from the optimized LU decomposition (PR #75) and is performing well. No PR created, as the current implementation is already efficient.
Work Performed
Goal Selected: Analyze matrix inverse performance (Phase 3 candidate)
Rationale: The matrix inverse operation is fundamental to many linear algebra applications and was not explicitly analyzed in previous optimization efforts.
Approach Taken:
Performance Baseline
Test Environment:
Current Performance:
Implementation Analysis
The current inverse implementation uses a standard LU-based approach:
// 1) Factor A => P, L, U (uses optimized LU from PR #75)
let (P, L, U) = LinearAlgebra.luDecompose A
// 2) Build identity matrix I
let I = Matrix.identity n
// 3) Permute I's rows by P => P*I
let IPerm = I |> Matrix.permuteRowsBy P
// 4) Forward substitution: solve L * Y = IPerm
let Y = LinearAlgebra.solveTriangularLinearSystems L IPerm true
// 5) Back substitution: solve U * X = Y
let X = LinearAlgebra.solveTriangularLinearSystems U Y false
Key Observations:
Potential Optimizations Considered
1. Exploit Identity Matrix Structure
Idea: The permuted identity matrix IPerm is mostly zeros, so some of the multiply-adds in the triangular solves could in principle be skipped.
Analysis: While this could theoretically allow skipping some operations, it would require per-element branching inside the solve loops.
Decision: Not beneficial - the SIMD code paths are likely faster than adding conditional branches.
2. In-Place Identity Permutation
Idea: Avoid allocating both I and IPerm by building the permuted identity directly.
Analysis:
Decision: Minor benefit - would reduce allocations slightly but won't significantly impact runtime.
3. Specialized Small Matrix Kernels
Idea: Hand-optimized kernels for 2×2, 3×3, 4×4 matrices commonly used in graphics/physics.
Analysis:
Decision: Not pursued - specialized small matrix operations would be a separate feature enhancement rather than a general optimization.
4. Parallel Implementation for Large Matrices
Idea: Use parallel execution (e.g., Parallel.For) across the right-hand-side columns for large matrices.
Analysis:
Decision: Not pursued - no evidence of need for matrices this large in current workloads.
Why Current Implementation Is Already Good
Alternative Approaches (Not Implemented)
For future consideration if inverse performance becomes a bottleneck:
Recommendations
Commands Used
# Baseline benchmarking
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*Inverse*" --job short
# Code analysis
cat src/FsMath/Algebra/LinearAlgebra.fs
Conclusion
The matrix inverse operation is already well-optimized, leveraging the SIMD-optimized LU decomposition from PR #75 and using efficient batch triangular solving. No changes were made, as the current implementation represents a good balance of performance, maintainability, and numerical stability for the target use cases.
🤖 Generated by Daily Perf Improver
-
Daily Perf Improver - Status Update (2025-10-23)
Summary
Reviewed current performance optimization state. Found significant progress, with 4 open PRs and extensive investigation work completed. Most Phase 1-3 optimization opportunities have been explored. No new PR created - waiting for maintainer review of existing contributions.
Current Open PRs (Performance-Related)
Recent Investigation Work (from discussion history)
Phase 3 Status Assessment
From the original performance plan:
Linear Algebra Optimizations:
Remaining Phase 3 Work:
Key Observations
Recommendations
Next Steps
Will monitor existing PRs and await maintainer feedback before pursuing additional performance optimizations to ensure alignment with project goals and avoid duplicate or unwanted work.
🤖 Generated by Daily Perf Improver
-
Optimized Vector.sum with hardware-accelerated horizontal reduction, achieving 15-47% speedup. Created draft PR with detailed measurements and benchmarks.
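A minimal sketch of the horizontal-reduction pattern described here, assuming a float input and .NET's Vector.Sum intrinsic (illustrative only, not the PR's exact code):

```fsharp
open System.Numerics

let simdSum (v: float[]) =
    let width = Vector<float>.Count
    let mutable accVec = Vector<float>.Zero
    let mutable i = 0
    if Vector.IsHardwareAccelerated then
        while i <= v.Length - width do
            accVec <- accVec + Vector<float>(v, i)
            i <- i + width
    // one hardware horizontal add replaces a scalar loop over the lanes
    let mutable acc = Vector.Sum(accVec)
    for j = i to v.Length - 1 do
        acc <- acc + v.[j]
    acc
```

The single Vector.Sum call at the end is what distinguishes this from the Vector.product case discussed later in the thread, where the lane-wise reduction has to be done with scalar multiplies.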
-
Daily Perf Improver - Investigation Report (2025-10-27)
Summary
Investigated SIMD optimization for Vector.product.
Work Performed
Goal Selected: Optimize Vector.product operation (Phase 2/3 continuation, following PR #83's pattern)
Rationale: PR #83 successfully optimized Vector.sum, and Vector.product follows a similar accumulation pattern.
Approach Taken:
Performance Measurements
Test Environment:
Results Summary:
Detailed Benchmark Results:
Baseline:
After Optimization:
Why SIMD Doesn't Help Here
The optimization provides no meaningful benefit due to missing hardware intrinsics:
1. No Vector.Product Intrinsic
Unlike Vector.Sum, there is no horizontal-multiply intrinsic, so the lane-wise reduction has to be done with scalar operations:
// Vector.sum - hardware-accelerated horizontal reduction
let acc = Numerics.Vector.Sum(accVec) // Single SIMD instruction
// Vector.product - manual scalar horizontal reduction
let mutable acc = LanguagePrimitives.GenericOne<'T>
for i = 0 to simdWidth - 1 do
    acc <- acc * accVec.[i] // Sequential scalar operations
2. Horizontal Reduction Dominates for Small Vectors
For small vectors (size 10), the SIMD accumulation is fast but the horizontal reduction overhead negates most gains. For size 10 with SIMD width 4:
3. No Benefit for Large Vectors
For larger vectors (100+), the main SIMD loop already processes most elements efficiently, but:
4. Comparison with Vector.sum Success
Why sum worked (PR #83):
Why product doesn't work:
Technical Implementation Details
Optimization Attempted:
static member inline product<'T ...> (v:ReadOnlySpan<'T>) : 'T =
    if v.Length = 0 then
        LanguagePrimitives.GenericOne<'T>
    elif Numerics.Vector.IsHardwareAccelerated && v.Length >= Numerics.Vector<'T>.Count then
        let simdWidth = Numerics.Vector<'T>.Count
        let simdCount = v.Length / simdWidth
        let ceiling = simdWidth * simdCount
        // SIMD accumulation
        let mutable accVec = Numerics.Vector<'T>(LanguagePrimitives.GenericOne<'T>)
        for i = 0 to simdCount - 1 do
            let srcIndex = i * simdWidth
            let vec = Numerics.Vector<'T>(v.Slice(srcIndex, simdWidth))
            accVec <- accVec * vec
        // Manual horizontal reduction (no Vector.Product available)
        let mutable acc = LanguagePrimitives.GenericOne<'T>
        for i = 0 to simdWidth - 1 do
            acc <- acc * accVec.[i]
        // Tail
        for i = ceiling to v.Length - 1 do
            acc <- acc * v.[i]
        acc
    else
        // Scalar fallback
        ...
Alternative Approaches (Not Pursued)
Possible techniques that might provide marginal gains:
Current State Assessment
Based on recent work visible in the discussion:
Phase 3 Linear Algebra Status:
Pattern Emerges:
Recommendations
Commands Used
# Create branch
git checkout -b perf/optimize-product-horizontal-reduction-20251027-032524-56b16599
# Add benchmarks
# (edited benchmarks/FsMath.Benchmarks/Vector.fs)
# Baseline benchmarking
./build.sh
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*Product*" --job short > /tmp/gh-aw/agent/product_baseline.txt
# Implement optimization
# (edited src/FsMath/SpanMath.fs - specialized product implementation with SIMD)
# Build and test
./build.sh
dotnet test tests/FsMath.Tests/FsMath.Tests.fsproj -c Release --no-build
# Performance verification
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*Product*" --job short > /tmp/gh-aw/agent/product_optimized.txt
Conclusion
While Vector.sum benefited from a dedicated hardware horizontal-add intrinsic, Vector.product has no equivalent, so the SIMD version provides no meaningful benefit at the sizes benchmarked. This investigation validates the current Vector.product implementation and documents why it was left unchanged.
🤖 Generated by Daily Perf Improver
-
Daily Perf Improver - Status Update (2025-10-28)
Reviewed current performance optimization state. Found 5 open PRs with substantial performance improvements awaiting maintainer review. No new PR created to allow maintainer feedback on existing contributions before pursuing additional work.
-
Reviewed current performance optimization state. Found 5 open PRs with substantial improvements awaiting maintainer review. No new PR created to allow maintainer feedback on existing contributions before pursuing additional work.
-
Reviewed current performance optimization state. Found 5 open PRs with substantial improvements awaiting maintainer review. No new PR created to allow maintainer feedback on existing contributions before pursuing additional work.
-
Performance Improvement Research and Planning
Executive Summary
FsMath is a SIMD-accelerated numerical library for F# with a focus on performance and zero-friction interop. After comprehensive research, I've identified several performance optimization opportunities and created a phased plan for systematic improvements.
Current State Analysis
1. Performance Testing Infrastructure
Existing Setup:
benchmarks/FsMath.Benchmarks/ - BenchmarkDotNet project with [<MemoryDiagnoser>] for allocation tracking
Running Benchmarks:
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release
2. Architecture Overview
Core Performance Strategy:
System.Numerics.Vector<'T>
SpanPrimitives.fs - Low-level SIMD primitives with hardware detection
SpanMath.fs - Mid-level span operations (dot, add, multiply, etc.)
Vector.fs / Matrix.fs - High-level user-facing API
Key Performance Features:
Numerics.Vector.IsHardwareAccelerated
Numerics.Vector<'T>.Count
3. Typical Workloads
Based on the codebase analysis, typical performance-critical workloads include:
4. Performance Bottlenecks Identified
High Priority:
Outer product (SpanMath.fs:320-353) - Partial SIMD implementation with nested loop issues
Medium Priority:
Low Priority:
5. Build and Development Commands
Standard Workflow:
CI Setup:
Performance Improvement Plan
Round 1: Foundation & Measurement (Quick Wins - 1-2 weeks)
Goals: Establish comprehensive benchmarking, identify low-hanging fruit
Tasks:
Expand benchmark coverage (a sketch of a new benchmark class follows this task list)
Fix outer product SIMD implementation
Reduce allocations in hot paths
stackalloc for small temporary buffers
Optimize dot product
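As referenced in the first task above, expanding coverage amounts to adding BenchmarkDotNet classes to benchmarks/FsMath.Benchmarks. A hypothetical example (class, member, and parameter names are illustrative, not existing project code):

```fsharp
open BenchmarkDotNet.Attributes

[<MemoryDiagnoser>]                          // track allocations alongside timings
type VectorSumBench() =
    let mutable data = Array.empty<float>

    [<Params(10, 100, 1000)>]                // sizes chosen for illustration only
    member val Size = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        data <- Array.init this.Size (fun i -> float i)

    [<Benchmark(Baseline = true)>]
    member _.ScalarSum() = Array.sum data

    [<Benchmark>]
    member _.LinqSum() = System.Linq.Enumerable.Sum data
```

A class like this is then picked up by the same `dotnet run -c Release -- --filter ...` invocation shown in the reports above.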
Success Metrics:
Round 2: SIMD Optimization (Medium Effort - 2-3 weeks)
Goals: Maximize SIMD utilization, optimize memory access patterns
Tasks:
Matrix multiplication optimization
Improve vectorization quality
Vector.Dot() where available instead of manual multiply-add (see the comparison sketch after this task list)
Memory layout optimizations
Benchmark different SIMD strategies
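As flagged in the vectorization-quality task above, here is a hypothetical comparison (not existing FsMath code) of the two dot-product styles; which one wins depends on the target CPU, which is exactly what the benchmarking task should settle. Both assume equal-length inputs.

```fsharp
open System.Numerics

let dotManual (a: float[]) (b: float[]) =
    let width = Vector<float>.Count
    let mutable acc = Vector<float>.Zero
    let mutable i = 0
    while i <= a.Length - width do
        acc <- acc + Vector<float>(a, i) * Vector<float>(b, i)   // multiply, then add
        i <- i + width
    let mutable s = Vector.Sum(acc)
    for j = i to a.Length - 1 do
        s <- s + a.[j] * b.[j]
    s

let dotIntrinsic (a: float[]) (b: float[]) =
    let width = Vector<float>.Count
    let mutable s = 0.0
    let mutable i = 0
    while i <= a.Length - width do
        s <- s + Vector.Dot(Vector<float>(a, i), Vector<float>(b, i))   // horizontal dot per block
        i <- i + width
    for j = i to a.Length - 1 do
        s <- s + a.[j] * b.[j]
    s
```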
Success Metrics:
Round 3: Advanced Optimizations (Longer Term - 3-4 weeks)
Goals: Optimize complex algorithms, exploit advanced CPU features
Tasks:
Linear algebra optimizations
Exploit modern CPU instructions
System.Runtime.Intrinsics for critical kernels
Parallel implementations
Parallel.For with work stealing for good load balance (see the sketch after this list)
Specialized fast paths
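As referenced in the parallel-implementations task, a hypothetical sketch (assumed row-major float[] layout, not FsMath's internal representation) of parallelizing matrix multiplication over independent output rows:

```fsharp
open System.Threading.Tasks

// a is m x k, b is k x n, result is m x n, all row-major.
let matMulParallel (a: float[]) (b: float[]) (m: int) (k: int) (n: int) =
    let c = Array.zeroCreate<float> (m * n)
    Parallel.For(0, m, fun i ->
        for p = 0 to k - 1 do
            let aip = a.[i * k + p]
            for j = 0 to n - 1 do
                c.[i * n + j] <- c.[i * n + j] + aip * b.[p * n + j])
    |> ignore
    c
```

Rows share no mutable state, so the default work-stealing scheduler balances the load without locks; the matrix size at which thread overhead pays off is something the benchmark suite should establish.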
Success Metrics:
Performance Engineering Process
Development Workflow:
Measurement Guidelines:
Before/After Template:
Repository-Specific Notes
Testing:
./build.sh runtests
Code Style:
Use SpanPrimitives.fs for new SIMD code
PR Guidelines:
Open Questions for Maintainers
References
Next Steps: Awaiting maintainer feedback before proceeding with Round 1 optimizations. Ready to start with expanded benchmark coverage and outer product fix as initial targets.
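Since the outer product fix is named as an initial target, here is a hypothetical sketch (assumed flat row-major output and signature, not the existing SpanMath.fs code) of the row-wise formulation that avoids the nested-loop issue: each result row is simply y scaled by x.[i].

```fsharp
open System.Numerics

let outerProduct (x: float[]) (y: float[]) =
    let rows, cols = x.Length, y.Length
    let result = Array.zeroCreate<float> (rows * cols)   // row-major
    let width = Vector<float>.Count
    for i = 0 to rows - 1 do
        let xi = Vector<float>(x.[i])                     // broadcast x.[i] to all lanes
        let rowBase = i * cols
        let mutable j = 0
        if Vector.IsHardwareAccelerated then
            while j <= cols - width do
                (xi * Vector<float>(y, j)).CopyTo(result, rowBase + j)
                j <- j + width
        for k = j to cols - 1 do
            result.[rowBase + k] <- x.[i] * y.[k]
    result
```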