Description
We see a performance regression when moving from the Swift 6.1 compiler to 6.2. My guess is that 6.2 misses some inlining opportunities, so the core loop no longer ends up in a form that LLVM's loop vectoriser can handle.
Steps to reproduce
This reproduces easily on a single machine using docker/podman. Here I am using my M4 Mac and the official swift:6.2.2-noble and swift:6.1.3-noble images. For example, from the base directory of a fresh checkout (substitute swift:6.2.2-noble for the second run):
podman run --rm -it -v $PWD:/(basename $PWD) -w /(basename $PWD) swift:6.1.3-noble
We can filter the benchmarks down to one representative case that highlights the difference. Disabling jemalloc in package-benchmark is not strictly necessary but recommended, as it will otherwise distort the results. For example, in the container just launched:
cd Benchmarks
env BENCHMARK_DISABLE_JEMALLOC=true swift package benchmark --filter '.*/move/1000000'
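The two steps above can also be combined into one non-interactive invocation per image. The following is only a sketch: the `run_bench` helper and the `ENGINE` and `DRY_RUN` variables are names made up for illustration, it uses POSIX `$(...)` command substitution, and it assumes that docker accepts the same arguments as podman. By default it only prints the commands; set `DRY_RUN=0` to actually run them.

```shell
#!/bin/sh
# Sketch: run the same filtered benchmark under both toolchain images.
# ENGINE, DRY_RUN and run_bench are illustrative names, not part of any tool.
set -eu

ENGINE="${ENGINE:-podman}"      # assumption: docker takes the same arguments
FILTER='.*/move/1000000'

run_bench() {
  tag="$1"
  dir="$(basename "$PWD")"
  # Mount the checkout, start in Benchmarks/, and disable jemalloc via -e.
  set -- "$ENGINE" run --rm -v "$PWD:/$dir" -w "/$dir/Benchmarks" \
    -e BENCHMARK_DISABLE_JEMALLOC=true "swift:$tag" \
    swift package benchmark --filter "$FILTER"
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"                   # default: only show what would be run
  else
    "$@"
  fi
}

for t in 6.1.3-noble 6.2.2-noble; do
  run_bench "$t"
done
```

Run it from the base directory of the checkout so that the bind mount and working directory match the manual steps.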
Expected behaviour
The multiarray benchmark should be faster than the array benchmark under both 6.1 and 6.2. Instead, 6.2 shows a regression specifically for multiarray (which no longer vectorises): its p50 rises from 489 μs to 1120 μs, making it slower than array, whose p50 is roughly unchanged (907 μs under 6.1, 879 μs under 6.2).
Results on my machine:
swift-6.1.3
root@3dca6f6b25c7:/swift-multiarray/Benchmarks# swift --version
Swift version 6.1.3 (swift-6.1.3-RELEASE)
Target: aarch64-unknown-linux-gnu
root@3dca6f6b25c7:/swift-multiarray/Benchmarks# export BENCHMARK_DISABLE_JEMALLOC=true
root@3dca6f6b25c7:/swift-multiarray/Benchmarks# swift package benchmark --filter '.*/move/1000000'
Building for debugging...
[13/13] Linking BenchmarkTool-tool
Build of product 'BenchmarkTool' complete! (2.54s)
Build complete!
Building BenchmarkTool in release mode...
Building benchmark targets in release mode for benchmark run...
Building Benchmarks
==================
Running Benchmarks
==================
100% [------------------------------------------------------------] ETA: 00:00:00 | Benchmarks:array/move/1000000
100% [------------------------------------------------------------] ETA: 00:00:00 | Benchmarks:multiarray/move/1000000
===================================================
Baseline 'Current_run'
===================================================
Host '3dca6f6b25c7' with 16 'aarch64' processors with 58 GB memory, running:
#1 SMP PREEMPT_DYNAMIC Sat Feb 8 20:30:50 UTC 2025
==========
Benchmarks
==========
array/move/1000000
╒════════════════════════════════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╕
│ Metric │ p0 │ p25 │ p50 │ p75 │ p90 │ p99 │ p100 │ Samples │
╞════════════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ Memory (allocated resident) │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 10000 │
├────────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ Time (wall clock) (μs) * │ 804 │ 903 │ 907 │ 914 │ 927 │ 996 │ 1173 │ 10000 │
╘════════════════════════════════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╛
multiarray/move/1000000
╒════════════════════════════════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╕
│ Metric │ p0 │ p25 │ p50 │ p75 │ p90 │ p99 │ p100 │ Samples │
╞════════════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ Memory (allocated resident) │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 10000 │
├────────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ Time (wall clock) (μs) * │ 435 │ 483 │ 489 │ 495 │ 503 │ 540 │ 645 │ 10000 │
╘════════════════════════════════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╛
swift-6.2.2
root@ed1fec80eca3:/swift-multiarray/Benchmarks# swift --version
Swift version 6.2.2 (swift-6.2.2-RELEASE)
Target: aarch64-unknown-linux-gnu
root@ed1fec80eca3:/swift-multiarray/Benchmarks# export BENCHMARK_DISABLE_JEMALLOC=true
root@ed1fec80eca3:/swift-multiarray/Benchmarks# swift package benchmark --filter '.*/move/1000000'
Building for debugging...
[7/7] Linking BenchmarkTool-tool
Build of product 'BenchmarkTool' complete! (1.24s)
Build complete!
Building BenchmarkTool in release mode...
Building benchmark targets in release mode for benchmark run...
Building Benchmarks
==================
Running Benchmarks
==================
100% [------------------------------------------------------------] ETA: 00:00:00 | Benchmarks:array/move/1000000
100% [------------------------------------------------------------] ETA: 00:00:00 | Benchmarks:multiarray/move/1000000
===================================================
Baseline 'Current_run'
===================================================
Host 'ed1fec80eca3' with 16 'aarch64' processors with 58 GB memory, running:
#1 SMP PREEMPT_DYNAMIC Sat Feb 8 20:30:50 UTC 2025
==========
Benchmarks
==========
array/move/1000000
╒════════════════════════════════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╕
│ Metric │ p0 │ p25 │ p50 │ p75 │ p90 │ p99 │ p100 │ Samples │
╞════════════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ Memory (allocated resident) │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 10000 │
├────────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ Time (wall clock) (μs) * │ 803 │ 868 │ 879 │ 890 │ 913 │ 1003 │ 1226 │ 10000 │
╘════════════════════════════════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╛
multiarray/move/1000000
╒════════════════════════════════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╕
│ Metric │ p0 │ p25 │ p50 │ p75 │ p90 │ p99 │ p100 │ Samples │
╞════════════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ Memory (allocated resident) │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 8890 │
├────────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ Time (wall clock) (μs) * │ 1020 │ 1109 │ 1120 │ 1133 │ 1158 │ 1290 │ 1526 │ 8890 │
╘════════════════════════════════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╛
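To put a number on the difference, the p50 wall-clock times from the tables above can be turned into 6.2/6.1 ratios. This is just the arithmetic on the figures already shown; the `ratio` helper is a made-up name.

```shell
# p50 wall-clock times (μs) copied from the tables above.
# ratio OLD NEW prints NEW/OLD, so values above 1 mean 6.2 is slower.
ratio() { awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", new / old }'; }

echo "array/move/1000000:      $(ratio 907 879)x"    # 0.97x
echo "multiarray/move/1000000: $(ratio 489 1120)x"   # 2.29x
```

In other words, array is essentially unchanged between the two toolchains, while multiarray takes roughly 2.3 times as long under 6.2.2.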
Environment
- M4 Max, macOS 26.0.1
- podman 5.7.0
Additional information