Shnatsel (Collaborator) commented Jan 18, 2026

Helps x86 a lot by switching away from unrolled impls. Fixes #49

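Also papers over whatever memory subsystem quirk we're hitting at f32/8388608 and f64/4194304, possibly something to do with cache associativity. In the benchmarks below those two sizes get roughly 3x the throughput, while the next size up (f32/16777216 and f64/8388608) regresses by about 29%.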
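For reference, the two problem sizes named above have the same per-buffer footprint, which would fit a size-dependent memory effect. A quick sanity check of that arithmetic, assuming 4-byte `f32` and 8-byte `f64` elements and counting a single buffer (`buffer_bytes` is an illustrative helper, not part of PhastFT):

```rust
use std::mem::size_of;

/// Bytes occupied by one buffer holding `n` elements of type `T`.
/// Hypothetical helper for illustration only.
fn buffer_bytes<T>(n: usize) -> usize {
    n * size_of::<T>()
}

fn main() {
    let f32_bytes = buffer_bytes::<f32>(8_388_608);
    let f64_bytes = buffer_bytes::<f64>(4_194_304);
    // Both work out to 2^25 bytes, i.e. 32 MiB per buffer.
    println!("f32/8388608: {} MiB", f32_bytes / (1024 * 1024));
    println!("f64/4194304: {} MiB", f64_bytes / (1024 * 1024));
}
```

This only shows that the two sizes coincide in bytes; it does not confirm the cache associativity guess.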

Preliminary benchmarks from Zen 4:
$ RUSTFLAGS='-C target-cpu=native' cargo bench --profile=profiling --bench=bench 'PhastFT DIT' -- --baseline=main-native-rayonless
   Compiling phastft v0.3.0 (/home/shnatsel/Code/PhastFT)
    Finished `profiling` profile [optimized + debuginfo] target(s) in 15.19s
     Running benches/bench.rs (target/profiling/deps/bench-50215b1ccd36228b)
Forward f32/PhastFT DIT/64
                        time:   [104.01 ns 107.85 ns 114.78 ns]
                        thrpt:  [557.59 Melem/s 593.39 Melem/s 615.33 Melem/s]
                        thrpt:  [2.0772 GiB/s 2.2105 GiB/s 2.2923 GiB/s]
                 change:
                        time:   [−26.209% −24.947% −22.917%] (p = 0.00 < 0.05)
                        thrpt:  [+29.730% +33.240% +35.518%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
Forward f32/PhastFT DIT/128
                        time:   [173.05 ns 187.54 ns 197.35 ns]
                        thrpt:  [648.60 Melem/s 682.50 Melem/s 739.66 Melem/s]
                        thrpt:  [2.4162 GiB/s 2.5425 GiB/s 2.7554 GiB/s]
                 change:
                        time:   [−22.524% −17.649% −12.193%] (p = 0.00 < 0.05)
                        thrpt:  [+13.887% +21.431% +29.073%]
                        Performance has improved.
Found 5 outliers among 20 measurements (25.00%)
  5 (25.00%) high mild
Forward f32/PhastFT DIT/256
                        time:   [350.00 ns 380.65 ns 399.54 ns]
                        thrpt:  [640.73 Melem/s 672.54 Melem/s 731.42 Melem/s]
                        thrpt:  [2.3869 GiB/s 2.5054 GiB/s 2.7248 GiB/s]
                 change:
                        time:   [−19.793% −12.819% −5.0184%] (p = 0.01 < 0.05)
                        thrpt:  [+5.2835% +14.703% +24.677%]
                        Performance has improved.
Forward f32/PhastFT DIT/512
                        time:   [795.28 ns 851.12 ns 883.40 ns]
                        thrpt:  [579.58 Melem/s 601.56 Melem/s 643.80 Melem/s]
                        thrpt:  [2.1591 GiB/s 2.2410 GiB/s 2.3983 GiB/s]
                 change:
                        time:   [−43.789% −40.029% −36.265%] (p = 0.00 < 0.05)
                        thrpt:  [+56.899% +66.746% +77.900%]
                        Performance has improved.
Forward f32/PhastFT DIT/1024
                        time:   [1.7906 µs 1.9352 µs 2.0186 µs]
                        thrpt:  [507.28 Melem/s 529.16 Melem/s 571.87 Melem/s]
                        thrpt:  [1.8898 GiB/s 1.9713 GiB/s 2.1304 GiB/s]
                 change:
                        time:   [−54.231% −50.475% −46.439%] (p = 0.00 < 0.05)
                        thrpt:  [+86.702% +101.92% +118.49%]
                        Performance has improved.
Forward f32/PhastFT DIT/2048
                        time:   [3.9192 µs 4.2363 µs 4.4196 µs]
                        thrpt:  [463.40 Melem/s 483.44 Melem/s 522.55 Melem/s]
                        thrpt:  [1.7263 GiB/s 1.8010 GiB/s 1.9467 GiB/s]
                 change:
                        time:   [−7.8121% +2.6225% +15.071%] (p = 0.65 > 0.05)
                        thrpt:  [−13.097% −2.5555% +8.4741%]
                        No change in performance detected.
Forward f32/PhastFT DIT/4096
                        time:   [7.9912 µs 8.6506 µs 9.0482 µs]
                        thrpt:  [452.69 Melem/s 473.49 Melem/s 512.56 Melem/s]
                        thrpt:  [1.6864 GiB/s 1.7639 GiB/s 1.9094 GiB/s]
                 change:
                        time:   [−9.4705% +0.2252% +11.027%] (p = 0.97 > 0.05)
                        thrpt:  [−9.9319% −0.2247% +10.461%]
                        No change in performance detected.
Forward f32/PhastFT DIT/8192
                        time:   [16.535 µs 17.807 µs 18.597 µs]
                        thrpt:  [440.49 Melem/s 460.03 Melem/s 495.42 Melem/s]
                        thrpt:  [1.6410 GiB/s 1.7137 GiB/s 1.8456 GiB/s]
                 change:
                        time:   [−6.9012% +1.6380% +10.871%] (p = 0.72 > 0.05)
                        thrpt:  [−9.8054% −1.6116% +7.4127%]
                        No change in performance detected.
Forward f32/PhastFT DIT/16384
                        time:   [51.341 µs 53.628 µs 55.178 µs]
                        thrpt:  [296.93 Melem/s 305.51 Melem/s 319.12 Melem/s]
                        thrpt:  [1.1062 GiB/s 1.1381 GiB/s 1.1888 GiB/s]
                 change:
                        time:   [+0.4621% +4.5827% +8.6429%] (p = 0.04 < 0.05)
                        thrpt:  [−7.9553% −4.3819% −0.4599%]
                        Change within noise threshold.
Forward f32/PhastFT DIT/32768
                        time:   [131.29 µs 135.53 µs 139.47 µs]
                        thrpt:  [234.95 Melem/s 241.79 Melem/s 249.58 Melem/s]
                        thrpt:  [896.28 MiB/s 922.34 MiB/s 952.06 MiB/s]
                 change:
                        time:   [+2.7519% +5.4840% +8.1965%] (p = 0.00 < 0.05)
                        thrpt:  [−7.5756% −5.1989% −2.6782%]
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high severe
Forward f32/PhastFT DIT/65536
                        time:   [264.21 µs 273.41 µs 280.12 µs]
                        thrpt:  [233.96 Melem/s 239.70 Melem/s 248.04 Melem/s]
                        thrpt:  [892.48 MiB/s 914.39 MiB/s 946.20 MiB/s]
                 change:
                        time:   [+1.4187% +4.7923% +7.9493%] (p = 0.01 < 0.05)
                        thrpt:  [−7.3639% −4.5731% −1.3988%]
                        Performance has regressed.
Forward f32/PhastFT DIT/131072
                        time:   [545.97 µs 570.53 µs 587.84 µs]
                        thrpt:  [222.97 Melem/s 229.74 Melem/s 240.07 Melem/s]
                        thrpt:  [850.57 MiB/s 876.38 MiB/s 915.81 MiB/s]
                 change:
                        time:   [+0.1627% +3.8568% +7.7238%] (p = 0.06 > 0.05)
                        thrpt:  [−7.1700% −3.7136% −0.1624%]
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high mild
Forward f32/PhastFT DIT/262144
                        time:   [1.2303 ms 1.2746 ms 1.3068 ms]
                        thrpt:  [200.60 Melem/s 205.67 Melem/s 213.07 Melem/s]
                        thrpt:  [765.24 MiB/s 784.56 MiB/s 812.79 MiB/s]
                 change:
                        time:   [+4.4773% +8.2356% +12.218%] (p = 0.00 < 0.05)
                        thrpt:  [−10.888% −7.6089% −4.2854%]
                        Performance has regressed.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high mild
Forward f32/PhastFT DIT/524288
                        time:   [2.4730 ms 2.5893 ms 2.6728 ms]
                        thrpt:  [196.15 Melem/s 202.48 Melem/s 212.01 Melem/s]
                        thrpt:  [748.27 MiB/s 772.42 MiB/s 808.75 MiB/s]
                 change:
                        time:   [−11.961% −8.0509% −4.0079%] (p = 0.00 < 0.05)
                        thrpt:  [+4.1753% +8.7559% +13.586%]
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
Forward f32/PhastFT DIT/1048576
                        time:   [5.6683 ms 5.7808 ms 5.8622 ms]
                        thrpt:  [178.87 Melem/s 181.39 Melem/s 184.99 Melem/s]
                        thrpt:  [682.34 MiB/s 691.95 MiB/s 705.68 MiB/s]
                 change:
                        time:   [−3.8881% +0.8480% +5.7067%] (p = 0.75 > 0.05)
                        thrpt:  [−5.3986% −0.8409% +4.0454%]
                        No change in performance detected.
Forward f32/PhastFT DIT/2097152
                        time:   [12.284 ms 12.354 ms 12.422 ms]
                        thrpt:  [168.83 Melem/s 169.75 Melem/s 170.72 Melem/s]
                        thrpt:  [644.03 MiB/s 647.56 MiB/s 651.25 MiB/s]
                 change:
                        time:   [−1.0003% +3.4249% +7.8727%] (p = 0.14 > 0.05)
                        thrpt:  [−7.2981% −3.3115% +1.0104%]
                        No change in performance detected.
Found 5 outliers among 20 measurements (25.00%)
  5 (25.00%) low mild
Forward f32/PhastFT DIT/4194304
                        time:   [27.632 ms 27.892 ms 28.200 ms]
                        thrpt:  [148.73 Melem/s 150.38 Melem/s 151.79 Melem/s]
                        thrpt:  [567.38 MiB/s 573.65 MiB/s 579.04 MiB/s]
                 change:
                        time:   [+4.3555% +5.5461% +6.8994%] (p = 0.00 < 0.05)
                        thrpt:  [−6.4541% −5.2546% −4.1737%]
                        Performance has regressed.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Forward f32/PhastFT DIT/8388608
                        time:   [61.795 ms 62.064 ms 62.342 ms]
                        thrpt:  [134.56 Melem/s 135.16 Melem/s 135.75 Melem/s]
                        thrpt:  [513.30 MiB/s 515.59 MiB/s 517.84 MiB/s]
                 change:
                        time:   [−66.985% −66.815% −66.675%] (p = 0.00 < 0.05)
                        thrpt:  [+200.08% +201.35% +202.89%]
                        Performance has improved.
Benchmarking Forward f32/PhastFT DIT/16777216: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 5.7s, or reduce sample count to 10.
Forward f32/PhastFT DIT/16777216
                        time:   [172.11 ms 172.82 ms 173.55 ms]
                        thrpt:  [96.672 Melem/s 97.080 Melem/s 97.478 Melem/s]
                        thrpt:  [368.77 MiB/s 370.33 MiB/s 371.85 MiB/s]
                 change:
                        time:   [+28.896% +29.582% +30.279%] (p = 0.00 < 0.05)
                        thrpt:  [−23.242% −22.829% −22.418%]
                        Performance has regressed.
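A note on reading the `change:` blocks above: the throughput change is not simply the negated time change. The center estimates are consistent with throughput change = 1 / (1 + time change) - 1, so a time change of about -67% corresponds to roughly 3x throughput. A minimal sketch of that conversion (`thrpt_change` is an illustrative name, not a Criterion API):

```rust
/// Convert a relative change in time (e.g. -0.66815 for -66.815%)
/// into the corresponding relative change in throughput.
/// Hypothetical helper for illustration only.
fn thrpt_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

fn main() {
    // Center time-change estimates copied from the Forward f32 results above.
    for (size, time) in [(64, -0.24947), (8_388_608, -0.66815), (16_777_216, 0.29582)] {
        println!(
            "{size}: time {:+.3}% -> thrpt {:+.3}%",
            time * 100.0,
            thrpt_change(time) * 100.0
        );
    }
}
```

The converted values should agree with the reported `thrpt` center estimates up to rounding.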

Inverse f32/PhastFT DIT/64
                        time:   [120.27 ns 121.51 ns 122.86 ns]
                        thrpt:  [520.94 Melem/s 526.72 Melem/s 532.12 Melem/s]
                        thrpt:  [1.9406 GiB/s 1.9622 GiB/s 1.9823 GiB/s]
                 change:
                        time:   [−23.900% −23.038% −22.163%] (p = 0.00 < 0.05)
                        thrpt:  [+28.474% +29.934% +31.407%]
                        Performance has improved.
Inverse f32/PhastFT DIT/128
                        time:   [178.52 ns 179.41 ns 180.22 ns]
                        thrpt:  [710.25 Melem/s 713.45 Melem/s 716.99 Melem/s]
                        thrpt:  [2.6459 GiB/s 2.6578 GiB/s 2.6710 GiB/s]
                 change:
                        time:   [−20.029% −19.248% −18.515%] (p = 0.00 < 0.05)
                        thrpt:  [+22.722% +23.836% +25.045%]
                        Performance has improved.
Inverse f32/PhastFT DIT/256
                        time:   [345.17 ns 371.86 ns 396.76 ns]
                        thrpt:  [645.23 Melem/s 688.44 Melem/s 741.66 Melem/s]
                        thrpt:  [2.4037 GiB/s 2.5646 GiB/s 2.7629 GiB/s]
                 change:
                        time:   [−16.864% −13.161% −8.4709%] (p = 0.00 < 0.05)
                        thrpt:  [+9.2549% +15.155% +20.284%]
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high severe
Inverse f32/PhastFT DIT/512
                        time:   [781.00 ns 844.19 ns 889.99 ns]
                        thrpt:  [575.29 Melem/s 606.50 Melem/s 655.57 Melem/s]
                        thrpt:  [2.1431 GiB/s 2.2594 GiB/s 2.4422 GiB/s]
                 change:
                        time:   [−42.144% −39.820% −37.185%] (p = 0.00 < 0.05)
                        thrpt:  [+59.197% +66.169% +72.844%]
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) high mild
  2 (10.00%) high severe
Inverse f32/PhastFT DIT/1024
                        time:   [1.7895 µs 1.9418 µs 2.0527 µs]
                        thrpt:  [498.85 Melem/s 527.34 Melem/s 572.23 Melem/s]
                        thrpt:  [1.8584 GiB/s 1.9645 GiB/s 2.1317 GiB/s]
                 change:
                        time:   [−53.200% −50.379% −47.750%] (p = 0.00 < 0.05)
                        thrpt:  [+91.388% +101.53% +113.67%]
                        Performance has improved.
Inverse f32/PhastFT DIT/2048
                        time:   [3.7963 µs 4.0930 µs 4.3074 µs]
                        thrpt:  [475.46 Melem/s 500.37 Melem/s 539.47 Melem/s]
                        thrpt:  [1.7712 GiB/s 1.8640 GiB/s 2.0097 GiB/s]
                 change:
                        time:   [−4.5160% +1.6052% +8.2630%] (p = 0.65 > 0.05)
                        thrpt:  [−7.6324% −1.5798% +4.7296%]
                        No change in performance detected.
Inverse f32/PhastFT DIT/4096
                        time:   [8.5431 µs 9.2371 µs 9.7503 µs]
                        thrpt:  [420.09 Melem/s 443.43 Melem/s 479.45 Melem/s]
                        thrpt:  [1.5650 GiB/s 1.6519 GiB/s 1.7861 GiB/s]
                 change:
                        time:   [+1.7468% +9.1440% +17.241%] (p = 0.03 < 0.05)
                        thrpt:  [−14.706% −8.3779% −1.7168%]
                        Performance has regressed.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Inverse f32/PhastFT DIT/8192
                        time:   [17.243 µs 18.598 µs 19.600 µs]
                        thrpt:  [417.96 Melem/s 440.47 Melem/s 475.08 Melem/s]
                        thrpt:  [1.5570 GiB/s 1.6409 GiB/s 1.7698 GiB/s]
                 change:
                        time:   [+1.8568% +7.4913% +14.323%] (p = 0.02 < 0.05)
                        thrpt:  [−12.529% −6.9692% −1.8229%]
                        Performance has regressed.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high severe
Inverse f32/PhastFT DIT/16384
                        time:   [51.464 µs 52.138 µs 52.658 µs]
                        thrpt:  [311.14 Melem/s 314.24 Melem/s 318.36 Melem/s]
                        thrpt:  [1.1591 GiB/s 1.1706 GiB/s 1.1860 GiB/s]
                 change:
                        time:   [+0.7268% +2.4691% +4.2250%] (p = 0.01 < 0.05)
                        thrpt:  [−4.0537% −2.4096% −0.7216%]
                        Change within noise threshold.
Inverse f32/PhastFT DIT/32768
                        time:   [135.59 µs 137.12 µs 138.10 µs]
                        thrpt:  [237.28 Melem/s 238.98 Melem/s 241.68 Melem/s]
                        thrpt:  [905.13 MiB/s 911.62 MiB/s 921.93 MiB/s]
                 change:
                        time:   [+2.5436% +4.3121% +6.0425%] (p = 0.00 < 0.05)
                        thrpt:  [−5.6982% −4.1339% −2.4805%]
                        Performance has regressed.
Inverse f32/PhastFT DIT/65536
                        time:   [266.20 µs 269.29 µs 271.47 µs]
                        thrpt:  [241.41 Melem/s 243.36 Melem/s 246.20 Melem/s]
                        thrpt:  [920.90 MiB/s 928.36 MiB/s 939.16 MiB/s]
                 change:
                        time:   [+1.2225% +2.8567% +4.5155%] (p = 0.00 < 0.05)
                        thrpt:  [−4.3204% −2.7774% −1.2077%]
                        Performance has regressed.
Inverse f32/PhastFT DIT/131072
                        time:   [555.92 µs 562.56 µs 566.18 µs]
                        thrpt:  [231.50 Melem/s 232.99 Melem/s 235.77 Melem/s]
                        thrpt:  [883.12 MiB/s 888.80 MiB/s 899.41 MiB/s]
                 change:
                        time:   [+0.6189% +2.5428% +4.4570%] (p = 0.01 < 0.05)
                        thrpt:  [−4.2669% −2.4798% −0.6151%]
                        Change within noise threshold.
Inverse f32/PhastFT DIT/262144
                        time:   [1.2962 ms 1.3113 ms 1.3207 ms]
                        thrpt:  [198.49 Melem/s 199.92 Melem/s 202.25 Melem/s]
                        thrpt:  [757.19 MiB/s 762.63 MiB/s 771.51 MiB/s]
                 change:
                        time:   [+15.051% +17.138% +19.222%] (p = 0.00 < 0.05)
                        thrpt:  [−16.123% −14.631% −13.082%]
                        Performance has regressed.
Inverse f32/PhastFT DIT/524288
                        time:   [2.4838 ms 2.5088 ms 2.5236 ms]
                        thrpt:  [207.75 Melem/s 208.98 Melem/s 211.08 Melem/s]
                        thrpt:  [792.51 MiB/s 797.19 MiB/s 805.22 MiB/s]
                 change:
                        time:   [−3.1532% −1.1005% +0.9699%] (p = 0.29 > 0.05)
                        thrpt:  [−0.9606% +1.1128% +3.2558%]
                        No change in performance detected.
Inverse f32/PhastFT DIT/1048576
                        time:   [5.7179 ms 5.8918 ms 5.9951 ms]
                        thrpt:  [174.91 Melem/s 177.97 Melem/s 183.39 Melem/s]
                        thrpt:  [667.21 MiB/s 678.91 MiB/s 699.56 MiB/s]
                 change:
                        time:   [+8.6001% +12.342% +16.188%] (p = 0.00 < 0.05)
                        thrpt:  [−13.932% −10.986% −7.9190%]
                        Performance has regressed.
Inverse f32/PhastFT DIT/2097152
                        time:   [12.667 ms 12.737 ms 12.812 ms]
                        thrpt:  [163.69 Melem/s 164.65 Melem/s 165.56 Melem/s]
                        thrpt:  [624.42 MiB/s 628.09 MiB/s 631.55 MiB/s]
                 change:
                        time:   [+5.8758% +9.0914% +12.085%] (p = 0.00 < 0.05)
                        thrpt:  [−10.782% −8.3337% −5.5497%]
                        Performance has regressed.
Found 5 outliers among 20 measurements (25.00%)
  5 (25.00%) low mild
Inverse f32/PhastFT DIT/4194304
                        time:   [28.021 ms 28.185 ms 28.350 ms]
                        thrpt:  [147.95 Melem/s 148.81 Melem/s 149.69 Melem/s]
                        thrpt:  [564.37 MiB/s 567.67 MiB/s 571.01 MiB/s]
                 change:
                        time:   [+7.4487% +9.3054% +11.287%] (p = 0.00 < 0.05)
                        thrpt:  [−10.143% −8.5132% −6.9323%]
                        Performance has regressed.
Inverse f32/PhastFT DIT/8388608
                        time:   [63.387 ms 63.611 ms 63.858 ms]
                        thrpt:  [131.36 Melem/s 131.87 Melem/s 132.34 Melem/s]
                        thrpt:  [501.11 MiB/s 503.06 MiB/s 504.84 MiB/s]
                 change:
                        time:   [−67.890% −67.633% −67.373%] (p = 0.00 < 0.05)
                        thrpt:  [+206.50% +208.96% +211.43%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking Inverse f32/PhastFT DIT/16777216: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 5.8s, or reduce sample count to 10.
Inverse f32/PhastFT DIT/16777216
                        time:   [178.80 ms 179.62 ms 180.44 ms]
                        thrpt:  [92.980 Melem/s 93.406 Melem/s 93.832 Melem/s]
                        thrpt:  [354.69 MiB/s 356.32 MiB/s 357.94 MiB/s]
                 change:
                        time:   [+28.029% +28.817% +29.600%] (p = 0.00 < 0.05)
                        thrpt:  [−22.839% −22.370% −21.893%]
                        Performance has regressed.
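The two `thrpt` lines in each block can be cross-checked against the `time` line. The numbers are consistent with element throughput being n / time, and byte throughput counting 4 bytes per `f32` element (8 per `f64`). A small sketch under that assumption, using the Inverse f32/8388608 center estimate from above (helper names are illustrative):

```rust
/// Elements per second for `n` elements processed in `seconds`.
fn elems_per_sec(n: usize, seconds: f64) -> f64 {
    n as f64 / seconds
}

/// Byte throughput in MiB/s, assuming `bytes_per_elem` bytes are
/// counted per element (an assumption inferred from the log).
fn mib_per_sec(n: usize, bytes_per_elem: usize, seconds: f64) -> f64 {
    (n * bytes_per_elem) as f64 / seconds / (1024.0 * 1024.0)
}

fn main() {
    // Inverse f32, n = 8388608, center time estimate 63.611 ms.
    let (n, t) = (8_388_608usize, 63.611e-3);
    println!("{:.2} Melem/s", elems_per_sec(n, t) / 1e6);
    println!("{:.2} MiB/s", mib_per_sec(n, 4, t));
}
```

Small differences in the last digit against the log are expected, since the time estimate above is itself rounded.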

Forward f64/PhastFT DIT/64
                        time:   [136.49 ns 151.38 ns 162.55 ns]
                        thrpt:  [393.72 Melem/s 422.78 Melem/s 468.91 Melem/s]
                        thrpt:  [2.9334 GiB/s 3.1499 GiB/s 3.4936 GiB/s]
                 change:
                        time:   [−39.151% −35.673% −31.831%] (p = 0.00 < 0.05)
                        thrpt:  [+46.695% +55.455% +64.341%]
                        Performance has improved.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high severe
Forward f64/PhastFT DIT/128
                        time:   [257.90 ns 285.15 ns 302.23 ns]
                        thrpt:  [423.52 Melem/s 448.89 Melem/s 496.32 Melem/s]
                        thrpt:  [3.1555 GiB/s 3.3445 GiB/s 3.6979 GiB/s]
                 change:
                        time:   [−32.149% −26.660% −20.821%] (p = 0.00 < 0.05)
                        thrpt:  [+26.296% +36.352% +47.382%]
                        Performance has improved.
Forward f64/PhastFT DIT/256
                        time:   [611.18 ns 672.80 ns 710.34 ns]
                        thrpt:  [360.39 Melem/s 380.50 Melem/s 418.86 Melem/s]
                        thrpt:  [2.6851 GiB/s 2.8350 GiB/s 3.1208 GiB/s]
                 change:
                        time:   [−26.188% −19.895% −12.570%] (p = 0.00 < 0.05)
                        thrpt:  [+14.378% +24.836% +35.479%]
                        Performance has improved.
Forward f64/PhastFT DIT/512
                        time:   [1.4074 µs 1.5468 µs 1.6329 µs]
                        thrpt:  [313.55 Melem/s 331.01 Melem/s 363.78 Melem/s]
                        thrpt:  [2.3361 GiB/s 2.4662 GiB/s 2.7104 GiB/s]
                 change:
                        time:   [−41.003% −36.494% −30.923%] (p = 0.00 < 0.05)
                        thrpt:  [+44.767% +57.464% +69.499%]
                        Performance has improved.
Forward f64/PhastFT DIT/1024
                        time:   [3.1119 µs 3.4416 µs 3.6421 µs]
                        thrpt:  [281.16 Melem/s 297.54 Melem/s 329.06 Melem/s]
                        thrpt:  [2.0948 GiB/s 2.2168 GiB/s 2.4517 GiB/s]
                 change:
                        time:   [−45.041% −38.919% −33.482%] (p = 0.00 < 0.05)
                        thrpt:  [+50.335% +63.717% +81.952%]
                        Performance has improved.
Forward f64/PhastFT DIT/2048
                        time:   [6.8000 µs 7.4400 µs 7.8293 µs]
                        thrpt:  [261.58 Melem/s 275.27 Melem/s 301.18 Melem/s]
                        thrpt:  [1.9489 GiB/s 2.0509 GiB/s 2.2439 GiB/s]
                 change:
                        time:   [−5.2906% +5.0273% +16.895%] (p = 0.40 > 0.05)
                        thrpt:  [−14.453% −4.7867% +5.5861%]
                        No change in performance detected.
Forward f64/PhastFT DIT/4096
                        time:   [13.925 µs 15.197 µs 15.970 µs]
                        thrpt:  [256.48 Melem/s 269.53 Melem/s 294.15 Melem/s]
                        thrpt:  [1.9109 GiB/s 2.0082 GiB/s 2.1916 GiB/s]
                 change:
                        time:   [−5.4318% +4.6283% +15.454%] (p = 0.38 > 0.05)
                        thrpt:  [−13.385% −4.4236% +5.7438%]
                        No change in performance detected.
Forward f64/PhastFT DIT/8192
                        time:   [35.544 µs 37.839 µs 39.346 µs]
                        thrpt:  [208.20 Melem/s 216.49 Melem/s 230.48 Melem/s]
                        thrpt:  [1.5512 GiB/s 1.6130 GiB/s 1.7172 GiB/s]
                 change:
                        time:   [−1.3039% +3.8095% +9.6240%] (p = 0.18 > 0.05)
                        thrpt:  [−8.7791% −3.6697% +1.3211%]
                        No change in performance detected.
Forward f64/PhastFT DIT/16384
                        time:   [92.006 µs 96.557 µs 99.863 µs]
                        thrpt:  [164.06 Melem/s 169.68 Melem/s 178.08 Melem/s]
                        thrpt:  [1.2224 GiB/s 1.2642 GiB/s 1.3268 GiB/s]
                 change:
                        time:   [+2.6522% +5.8911% +9.4664%] (p = 0.00 < 0.05)
                        thrpt:  [−8.6478% −5.5633% −2.5837%]
                        Performance has regressed.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high mild
Forward f64/PhastFT DIT/32768
                        time:   [218.95 µs 230.73 µs 239.35 µs]
                        thrpt:  [136.90 Melem/s 142.02 Melem/s 149.66 Melem/s]
                        thrpt:  [1.0200 GiB/s 1.0581 GiB/s 1.1150 GiB/s]
                 change:
                        time:   [+8.4952% +12.178% +16.528%] (p = 0.00 < 0.05)
                        thrpt:  [−14.184% −10.856% −7.8300%]
                        Performance has regressed.
Found 4 outliers among 20 measurements (20.00%)
  1 (5.00%) low mild
  3 (15.00%) high mild
Forward f64/PhastFT DIT/65536
                        time:   [458.79 µs 482.02 µs 498.75 µs]
                        thrpt:  [131.40 Melem/s 135.96 Melem/s 142.84 Melem/s]
                        thrpt:  [1002.5 MiB/s 1.0130 GiB/s 1.0643 GiB/s]
                 change:
                        time:   [+4.6303% +8.0889% +11.640%] (p = 0.00 < 0.05)
                        thrpt:  [−10.426% −7.4835% −4.4254%]
                        Performance has regressed.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high mild
Forward f64/PhastFT DIT/131072
                        time:   [1.1157 ms 1.1613 ms 1.1950 ms]
                        thrpt:  [109.68 Melem/s 112.87 Melem/s 117.48 Melem/s]
                        thrpt:  [836.79 MiB/s 861.12 MiB/s 896.31 MiB/s]
                 change:
                        time:   [+16.966% +20.573% +24.493%] (p = 0.00 < 0.05)
                        thrpt:  [−19.674% −17.062% −14.505%]
                        Performance has regressed.
Found 4 outliers among 20 measurements (20.00%)
  3 (15.00%) high mild
  1 (5.00%) high severe
Forward f64/PhastFT DIT/262144
                        time:   [2.1408 ms 2.2421 ms 2.3054 ms]
                        thrpt:  [113.71 Melem/s 116.92 Melem/s 122.45 Melem/s]
                        thrpt:  [867.54 MiB/s 892.02 MiB/s 934.22 MiB/s]
                 change:
                        time:   [−1.4045% +2.2981% +6.0806%] (p = 0.28 > 0.05)
                        thrpt:  [−5.7321% −2.2465% +1.4245%]
                        No change in performance detected.
Forward f64/PhastFT DIT/524288
                        time:   [4.7630 ms 4.9210 ms 5.0146 ms]
                        thrpt:  [104.55 Melem/s 106.54 Melem/s 110.07 Melem/s]
                        thrpt:  [797.67 MiB/s 812.85 MiB/s 839.80 MiB/s]
                 change:
                        time:   [+5.5029% +8.9526% +12.714%] (p = 0.00 < 0.05)
                        thrpt:  [−11.280% −8.2169% −5.2159%]
                        Performance has regressed.
Forward f64/PhastFT DIT/1048576
                        time:   [11.151 ms 11.261 ms 11.353 ms]
                        thrpt:  [92.359 Melem/s 93.116 Melem/s 94.036 Melem/s]
                        thrpt:  [704.64 MiB/s 710.42 MiB/s 717.44 MiB/s]
                 change:
                        time:   [+8.3798% +12.112% +15.467%] (p = 0.00 < 0.05)
                        thrpt:  [−13.395% −10.804% −7.7319%]
                        Performance has regressed.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) low mild
Benchmarking Forward f64/PhastFT DIT/2097152: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 10.
Forward f64/PhastFT DIT/2097152
                        time:   [23.765 ms 23.844 ms 23.922 ms]
                        thrpt:  [87.667 Melem/s 87.952 Melem/s 88.244 Melem/s]
                        thrpt:  [668.85 MiB/s 671.02 MiB/s 673.25 MiB/s]
                 change:
                        time:   [+9.4874% +11.569% +13.626%] (p = 0.00 < 0.05)
                        thrpt:  [−11.992% −10.369% −8.6653%]
                        Performance has regressed.
Found 3 outliers among 20 measurements (15.00%)
  2 (10.00%) low mild
  1 (5.00%) high severe
Forward f64/PhastFT DIT/4194304
                        time:   [52.615 ms 53.163 ms 53.800 ms]
                        thrpt:  [77.961 Melem/s 78.895 Melem/s 79.716 Melem/s]
                        thrpt:  [594.80 MiB/s 601.92 MiB/s 608.19 MiB/s]
                 change:
                        time:   [−67.725% −67.312% −66.885%] (p = 0.00 < 0.05)
                        thrpt:  [+201.97% +205.93% +209.84%]
                        Performance has improved.
Forward f64/PhastFT DIT/8388608
                        time:   [145.74 ms 146.10 ms 146.42 ms]
                        thrpt:  [57.290 Melem/s 57.418 Melem/s 57.559 Melem/s]
                        thrpt:  [437.09 MiB/s 438.06 MiB/s 439.14 MiB/s]
                 change:
                        time:   [+28.734% +29.097% +29.464%] (p = 0.00 < 0.05)
                        thrpt:  [−22.758% −22.539% −22.320%]
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) low mild
Benchmarking Forward f64/PhastFT DIT/16777216: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 13.7s, or reduce sample count to 10.
Forward f64/PhastFT DIT/16777216
                        time:   [489.91 ms 491.78 ms 493.61 ms]
                        thrpt:  [33.989 Melem/s 34.115 Melem/s 34.245 Melem/s]
                        thrpt:  [259.31 MiB/s 260.28 MiB/s 261.27 MiB/s]
                 change:
                        time:   [−0.8762% +0.0717% +0.9975%] (p = 0.89 > 0.05)
                        thrpt:  [−0.9876% −0.0717% +0.8839%]
                        No change in performance detected.

Inverse f64/PhastFT DIT/64
                        time:   [145.67 ns 158.18 ns 171.20 ns]
                        thrpt:  [373.83 Melem/s 404.60 Melem/s 439.34 Melem/s]
                        thrpt:  [2.7852 GiB/s 3.0145 GiB/s 3.2734 GiB/s]
                 change:
                        time:   [−39.775% −37.624% −34.916%] (p = 0.00 < 0.05)
                        thrpt:  [+53.648% +60.317% +66.043%]
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high severe
Inverse f64/PhastFT DIT/128
                        time:   [258.78 ns 294.04 ns 319.57 ns]
                        thrpt:  [400.54 Melem/s 435.32 Melem/s 494.62 Melem/s]
                        thrpt:  [2.9843 GiB/s 3.2434 GiB/s 3.6852 GiB/s]
                 change:
                        time:   [−33.484% −29.364% −24.635%] (p = 0.00 < 0.05)
                        thrpt:  [+32.688% +41.572% +50.340%]
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high severe
Inverse f64/PhastFT DIT/256
                        time:   [620.50 ns 681.90 ns 727.24 ns]
                        thrpt:  [352.02 Melem/s 375.42 Melem/s 412.57 Melem/s]
                        thrpt:  [2.6227 GiB/s 2.7971 GiB/s 3.0739 GiB/s]
                 change:
                        time:   [−28.264% −23.357% −17.929%] (p = 0.00 < 0.05)
                        thrpt:  [+21.846% +30.475% +39.400%]
                        Performance has improved.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high severe
Inverse f64/PhastFT DIT/512
                        time:   [1.5021 µs 1.6777 µs 1.8026 µs]
                        thrpt:  [284.03 Melem/s 305.19 Melem/s 340.85 Melem/s]
                        thrpt:  [2.1162 GiB/s 2.2738 GiB/s 2.5395 GiB/s]
                 change:
                        time:   [−40.167% −34.638% −28.708%] (p = 0.00 < 0.05)
                        thrpt:  [+40.268% +52.994% +67.132%]
                        Performance has improved.
Inverse f64/PhastFT DIT/1024
                        time:   [3.2132 µs 3.5359 µs 3.7680 µs]
                        thrpt:  [271.76 Melem/s 289.60 Melem/s 318.69 Melem/s]
                        thrpt:  [2.0248 GiB/s 2.1577 GiB/s 2.3744 GiB/s]
                 change:
                        time:   [−43.297% −38.816% −33.167%] (p = 0.00 < 0.05)
                        thrpt:  [+49.627% +63.442% +76.359%]
                        Performance has improved.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high mild
Inverse f64/PhastFT DIT/2048
                        time:   [6.9447 µs 7.5618 µs 8.0063 µs]
                        thrpt:  [255.80 Melem/s 270.83 Melem/s 294.90 Melem/s]
                        thrpt:  [1.9058 GiB/s 2.0179 GiB/s 2.1972 GiB/s]
                 change:
                        time:   [−5.9359% +3.1932% +13.160%] (p = 0.52 > 0.05)
                        thrpt:  [−11.629% −3.0943% +6.3105%]
                        No change in performance detected.
Inverse f64/PhastFT DIT/4096
                        time:   [14.619 µs 15.663 µs 16.398 µs]
                        thrpt:  [249.79 Melem/s 261.50 Melem/s 280.18 Melem/s]
                        thrpt:  [1.8611 GiB/s 1.9484 GiB/s 2.0875 GiB/s]
                 change:
                        time:   [−8.4093% −0.8239% +7.1276%] (p = 0.84 > 0.05)
                        thrpt:  [−6.6534% +0.8307% +9.1814%]
                        No change in performance detected.
Inverse f64/PhastFT DIT/8192
                        time:   [35.053 µs 37.174 µs 39.138 µs]
                        thrpt:  [209.31 Melem/s 220.37 Melem/s 233.70 Melem/s]
                        thrpt:  [1.5595 GiB/s 1.6419 GiB/s 1.7412 GiB/s]
                 change:
                        time:   [−3.5825% +0.1025% +4.1164%] (p = 0.96 > 0.05)
                        thrpt:  [−3.9536% −0.1023% +3.7156%]
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
Inverse f64/PhastFT DIT/16384
                        time:   [91.294 µs 96.448 µs 100.28 µs]
                        thrpt:  [163.38 Melem/s 169.87 Melem/s 179.46 Melem/s]
                        thrpt:  [1.2173 GiB/s 1.2657 GiB/s 1.3371 GiB/s]
                 change:
                        time:   [+0.7499% +3.2110% +6.3583%] (p = 0.03 < 0.05)
                        thrpt:  [−5.9782% −3.1111% −0.7443%]
                        Change within noise threshold.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high severe
Inverse f64/PhastFT DIT/32768
                        time:   [228.19 µs 231.93 µs 234.24 µs]
                        thrpt:  [139.89 Melem/s 141.28 Melem/s 143.60 Melem/s]
                        thrpt:  [1.0423 GiB/s 1.0526 GiB/s 1.0699 GiB/s]
                 change:
                        time:   [+8.4481% +11.696% +14.757%] (p = 0.00 < 0.05)
                        thrpt:  [−12.859% −10.471% −7.7900%]
                        Performance has regressed.
Inverse f64/PhastFT DIT/65536
                        time:   [475.78 µs 489.15 µs 505.75 µs]
                        thrpt:  [129.58 Melem/s 133.98 Melem/s 137.75 Melem/s]
                        thrpt:  [988.63 MiB/s 1022.2 MiB/s 1.0263 GiB/s]
                 change:
                        time:   [+5.3146% +8.3279% +11.568%] (p = 0.00 < 0.05)
                        thrpt:  [−10.369% −7.6877% −5.0464%]
                        Performance has regressed.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Inverse f64/PhastFT DIT/131072
                        time:   [1.1383 ms 1.1517 ms 1.1602 ms]
                        thrpt:  [112.98 Melem/s 113.81 Melem/s 115.15 Melem/s]
                        thrpt:  [861.94 MiB/s 868.28 MiB/s 878.52 MiB/s]
                 change:
                        time:   [+17.372% +20.131% +23.133%] (p = 0.00 < 0.05)
                        thrpt:  [−18.787% −16.757% −14.800%]
                        Performance has regressed.
Inverse f64/PhastFT DIT/262144
                        time:   [2.1471 ms 2.2646 ms 2.3506 ms]
                        thrpt:  [111.52 Melem/s 115.76 Melem/s 122.09 Melem/s]
                        thrpt:  [850.84 MiB/s 883.14 MiB/s 931.48 MiB/s]
                 change:
                        time:   [+0.0172% +3.0944% +6.4116%] (p = 0.09 > 0.05)
                        thrpt:  [−6.0253% −3.0015% −0.0172%]
                        No change in performance detected.
Found 4 outliers among 20 measurements (20.00%)
  1 (5.00%) low mild
  3 (15.00%) high severe
Inverse f64/PhastFT DIT/524288
                        time:   [4.7772 ms 4.9671 ms 5.1079 ms]
                        thrpt:  [102.64 Melem/s 105.55 Melem/s 109.75 Melem/s]
                        thrpt:  [783.11 MiB/s 805.31 MiB/s 837.32 MiB/s]
                 change:
                        time:   [+4.0879% +7.3573% +10.628%] (p = 0.00 < 0.05)
                        thrpt:  [−9.6073% −6.8531% −3.9273%]
                        Performance has regressed.
Found 7 outliers among 20 measurements (35.00%)
  3 (15.00%) low mild
  1 (5.00%) high mild
  3 (15.00%) high severe
Inverse f64/PhastFT DIT/1048576
                        time:   [11.105 ms 11.269 ms 11.372 ms]
                        thrpt:  [92.210 Melem/s 93.052 Melem/s 94.424 Melem/s]
                        thrpt:  [703.51 MiB/s 709.93 MiB/s 720.40 MiB/s]
                 change:
                        time:   [+6.3804% +9.2888% +12.439%] (p = 0.00 < 0.05)
                        thrpt:  [−11.063% −8.4993% −5.9977%]
                        Performance has regressed.
Inverse f64/PhastFT DIT/2097152
                        time:   [25.376 ms 25.638 ms 25.945 ms]
                        thrpt:  [80.830 Melem/s 81.800 Melem/s 82.645 Melem/s]
                        thrpt:  [616.69 MiB/s 624.08 MiB/s 630.53 MiB/s]
                 change:
                        time:   [+13.173% +15.356% +17.884%] (p = 0.00 < 0.05)
                        thrpt:  [−15.171% −13.312% −11.640%]
                        Performance has regressed.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Inverse f64/PhastFT DIT/4194304
                        time:   [54.806 ms 55.323 ms 55.887 ms]
                        thrpt:  [75.050 Melem/s 75.815 Melem/s 76.530 Melem/s]
                        thrpt:  [572.58 MiB/s 578.42 MiB/s 583.88 MiB/s]
                 change:
                        time:   [−66.887% −66.563% −66.263%] (p = 0.00 < 0.05)
                        thrpt:  [+196.41% +199.07% +202.00%]
                        Performance has improved.
Benchmarking Inverse f64/PhastFT DIT/8388608: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 5.1s, or reduce sample count to 10.
Inverse f64/PhastFT DIT/8388608
                        time:   [152.82 ms 153.62 ms 154.61 ms]
                        thrpt:  [54.256 Melem/s 54.607 Melem/s 54.891 Melem/s]
                        thrpt:  [413.94 MiB/s 416.62 MiB/s 418.78 MiB/s]
                 change:
                        time:   [+30.250% +30.902% +31.681%] (p = 0.00 < 0.05)
                        thrpt:  [−24.059% −23.607% −23.225%]
                        Performance has regressed.
Found 3 outliers among 20 measurements (15.00%)
  2 (10.00%) high mild
  1 (5.00%) high severe
Benchmarking Inverse f64/PhastFT DIT/16777216: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 14.1s, or reduce sample count to 10.
Inverse f64/PhastFT DIT/16777216
                        time:   [507.88 ms 509.55 ms 511.43 ms]
                        thrpt:  [32.804 Melem/s 32.925 Melem/s 33.034 Melem/s]
                        thrpt:  [250.28 MiB/s 251.20 MiB/s 252.03 MiB/s]
                 change:
                        time:   [+0.1438% +0.6710% +1.1566%] (p = 0.02 < 0.05)
                        thrpt:  [−1.1433% −0.6665% −0.1436%]
                        Change within noise threshold.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild

On Apple M4 this is a big improvement for f32/32768, which I assume is also due to some implementations hitting cache associativity issues or a similar hardware quirk. Oddly enough, there is no equivalent change for f64/16384. There are also +7% gains for all f64 sizes up to and including 1024, with no change at larger sizes.
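
To illustrate the kind of cache associativity issue suspected here, the sketch below counts how many cache sets a strided walk touches. The cache geometry (64-byte lines, 64 sets) is a hypothetical example, not a measured property of the M4 or Zen 4, and `set_index` and `distinct_sets` are illustrative names that do not exist in PhastFT.

```rust
use std::collections::HashSet;

/// Hypothetical cache geometry (an assumption for illustration only).
const LINE_BYTES: usize = 64;
const NUM_SETS: usize = 64;

/// Cache set that a byte address maps to under simple modulo indexing.
fn set_index(addr: usize) -> usize {
    (addr / LINE_BYTES) % NUM_SETS
}

/// Number of distinct sets touched by `count` accesses spaced `stride_bytes` apart.
fn distinct_sets(stride_bytes: usize, count: usize) -> usize {
    (0..count)
        .map(|i| set_index(i * stride_bytes))
        .collect::<HashSet<_>>()
        .len()
}

fn main() {
    // A power-of-two stride equal to LINE_BYTES * NUM_SETS maps every access to one set,
    let pow2 = distinct_sets(4096, 64);
    // while padding the stride by one cache line spreads the accesses over all sets.
    let padded = distinct_sets(4096 + 64, 64);
    println!("stride 4096: {pow2} set(s), stride 4160: {padded} set(s)");
}
```

With a limited number of ways per set, accesses that all land in the same set evict each other even when the rest of the cache is empty. Whether that is what actually happens at these FFT sizes is unverified.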

TODO:

  • more careful benchmarks on more hardware
  • understand the regressions at higher sizes
  • integrate the unrolled COBRA version from Optimising cobra_apply #47

@codecov-commenter

codecov-commenter commented Jan 18, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.43%. Comparing base (0f47ea1) to head (13d8b5b).

Files with missing lines    Patch %    Lines
src/bencher.rs              85.71%     5 Missing ⚠️
src/planner.rs              89.47%     4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #60      +/-   ##
==========================================
- Coverage   99.82%   99.43%   -0.40%     
==========================================
  Files          13       14       +1     
  Lines        2261     2289      +28     
==========================================
+ Hits         2257     2276      +19     
- Misses          4       13       +9     

@Shnatsel
Collaborator Author

I've re-run the benchmarks against the latest main on zen4, using the same versions of the compiler and all dependencies. The largest regression is about 4%, while many smaller sizes improve by a lot, some by more than 2x. I think this is good to go.
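
As a sanity check on the claim that some sizes improve more than twofold, a relative change in run time can be converted into a throughput change with 1 / (1 + time_change) - 1. The sketch below uses point estimates copied from the Forward f32 results of this run. The helper names are made up for illustration, and Criterion already reports the same figures itself.

```rust
/// Speedup factor implied by a relative time change: old time divided by new time.
fn speedup(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change)
}

/// Relative throughput change implied by a relative change in run time.
fn throughput_change(time_change: f64) -> f64 {
    speedup(time_change) - 1.0
}

fn main() {
    // (size, time change as a fraction), taken from the Forward f32 results.
    for (size, time_change) in [(1024, -0.53244), (4194304, 0.037901)] {
        println!(
            "f32/{size}: time {:+.2}% -> thrpt {:+.2}% ({:.2}x)",
            time_change * 100.0,
            throughput_change(time_change) * 100.0,
            speedup(time_change),
        );
    }
}
```

A time reduction of just over 50% therefore corresponds to slightly more than double the throughput, while a time increase of about 4% costs a little less than 4% of throughput.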

benchmarks on desktop zen4 vs main on commit 0f47ea1
> RUSTFLAGS='-C target-cpu=native' cargo bench --profile=profiling --bench=bench 'PhastFT DIT' -- --baseline=main-native-rayonless-0f47
   Compiling phastft v0.3.0 (/home/shnatsel/Code/PhastFT)
    Finished `profiling` profile [optimized + debuginfo] target(s) in 15.04s
     Running benches/bench.rs (target/profiling/deps/bench-50215b1ccd36228b)
Forward f32/PhastFT DIT/64
                        time:   [103.30 ns 107.08 ns 113.71 ns]
                        thrpt:  [562.84 Melem/s 597.67 Melem/s 619.56 Melem/s]
                        thrpt:  [2.0968 GiB/s 2.2265 GiB/s 2.3080 GiB/s]
                 change:
                        time:   [−32.334% −29.860% −26.937%] (p = 0.00 < 0.05)
                        thrpt:  [+36.868% +42.572% +47.784%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
Forward f32/PhastFT DIT/128
                        time:   [170.06 ns 183.67 ns 192.88 ns]
                        thrpt:  [663.63 Melem/s 696.89 Melem/s 752.70 Melem/s]
                        thrpt:  [2.4722 GiB/s 2.5961 GiB/s 2.8040 GiB/s]
                 change:
                        time:   [−27.626% −23.075% −18.144%] (p = 0.00 < 0.05)
                        thrpt:  [+22.166% +29.996% +38.171%]
                        Performance has improved.
Found 5 outliers among 20 measurements (25.00%)
  4 (20.00%) high mild
  1 (5.00%) high severe
Forward f32/PhastFT DIT/256
                        time:   [333.46 ns 360.17 ns 376.93 ns]
                        thrpt:  [679.16 Melem/s 710.78 Melem/s 767.70 Melem/s]
                        thrpt:  [2.5301 GiB/s 2.6479 GiB/s 2.8599 GiB/s]
                 change:
                        time:   [−26.238% −20.704% −13.902%] (p = 0.00 < 0.05)
                        thrpt:  [+16.146% +26.110% +35.572%]
                        Performance has improved.
Forward f32/PhastFT DIT/512
                        time:   [787.30 ns 842.95 ns 875.56 ns]
                        thrpt:  [584.77 Melem/s 607.39 Melem/s 650.32 Melem/s]
                        thrpt:  [2.1784 GiB/s 2.2627 GiB/s 2.4226 GiB/s]
                 change:
                        time:   [−46.070% −42.265% −38.024%] (p = 0.00 < 0.05)
                        thrpt:  [+61.352% +73.205% +85.426%]
                        Performance has improved.
Forward f32/PhastFT DIT/1024
                        time:   [1.7059 µs 1.8643 µs 1.9605 µs]
                        thrpt:  [522.32 Melem/s 549.26 Melem/s 600.26 Melem/s]
                        thrpt:  [1.9458 GiB/s 2.0461 GiB/s 2.2361 GiB/s]
                 change:
                        time:   [−56.635% −53.244% −49.241%] (p = 0.00 < 0.05)
                        thrpt:  [+97.008% +113.88% +130.60%]
                        Performance has improved.
Forward f32/PhastFT DIT/2048
                        time:   [3.7394 µs 4.0943 µs 4.3071 µs]
                        thrpt:  [475.50 Melem/s 500.21 Melem/s 547.68 Melem/s]
                        thrpt:  [1.7714 GiB/s 1.8634 GiB/s 2.0403 GiB/s]
                 change:
                        time:   [−15.341% −6.0744% +4.1521%] (p = 0.25 > 0.05)
                        thrpt:  [−3.9866% +6.4673% +18.120%]
                        No change in performance detected.
Forward f32/PhastFT DIT/4096
                        time:   [7.8218 µs 8.4779 µs 8.8745 µs]
                        thrpt:  [461.55 Melem/s 483.14 Melem/s 523.66 Melem/s]
                        thrpt:  [1.7194 GiB/s 1.7998 GiB/s 1.9508 GiB/s]
                 change:
                        time:   [−13.965% −6.0131% +3.7289%] (p = 0.22 > 0.05)
                        thrpt:  [−3.5948% +6.3979% +16.232%]
                        No change in performance detected.
Forward f32/PhastFT DIT/8192
                        time:   [16.275 µs 17.438 µs 18.146 µs]
                        thrpt:  [451.46 Melem/s 469.78 Melem/s 503.36 Melem/s]
                        thrpt:  [1.6818 GiB/s 1.7501 GiB/s 1.8751 GiB/s]
                 change:
                        time:   [−13.167% −5.5692% +2.9072%] (p = 0.19 > 0.05)
                        thrpt:  [−2.8251% +5.8976% +15.163%]
                        No change in performance detected.
Forward f32/PhastFT DIT/16384
                        time:   [48.239 µs 50.301 µs 51.800 µs]
                        thrpt:  [316.29 Melem/s 325.72 Melem/s 339.64 Melem/s]
                        thrpt:  [1.1783 GiB/s 1.2134 GiB/s 1.2653 GiB/s]
                 change:
                        time:   [−8.6895% −5.3938% −1.7353%] (p = 0.01 < 0.05)
                        thrpt:  [+1.7660% +5.7013% +9.5164%]
                        Performance has improved.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high mild
Forward f32/PhastFT DIT/32768
                        time:   [128.29 µs 132.97 µs 137.02 µs]
                        thrpt:  [239.15 Melem/s 246.43 Melem/s 255.42 Melem/s]
                        thrpt:  [912.29 MiB/s 940.07 MiB/s 974.37 MiB/s]
                 change:
                        time:   [−3.4335% −0.3304% +3.2654%] (p = 0.84 > 0.05)
                        thrpt:  [−3.1621% +0.3315% +3.5556%]
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
Forward f32/PhastFT DIT/65536
                        time:   [254.25 µs 265.54 µs 273.77 µs]
                        thrpt:  [239.38 Melem/s 246.80 Melem/s 257.76 Melem/s]
                        thrpt:  [913.17 MiB/s 941.47 MiB/s 983.29 MiB/s]
                 change:
                        time:   [−4.4315% −1.0881% +2.4385%] (p = 0.55 > 0.05)
                        thrpt:  [−2.3804% +1.1001% +4.6369%]
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high mild
Forward f32/PhastFT DIT/131072
                        time:   [539.17 µs 560.35 µs 575.39 µs]
                        thrpt:  [227.80 Melem/s 233.91 Melem/s 243.10 Melem/s]
                        thrpt:  [868.97 MiB/s 892.30 MiB/s 927.36 MiB/s]
                 change:
                        time:   [−3.1189% +0.2416% +4.0749%] (p = 0.90 > 0.05)
                        thrpt:  [−3.9153% −0.2411% +3.2193%]
                        No change in performance detected.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high mild
Forward f32/PhastFT DIT/262144
                        time:   [1.1533 ms 1.1986 ms 1.2309 ms]
                        thrpt:  [212.97 Melem/s 218.72 Melem/s 227.29 Melem/s]
                        thrpt:  [812.40 MiB/s 834.34 MiB/s 867.04 MiB/s]
                 change:
                        time:   [−9.9469% −6.7249% −3.1160%] (p = 0.00 < 0.05)
                        thrpt:  [+3.2162% +7.2097% +11.046%]
                        Performance has improved.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high mild
Forward f32/PhastFT DIT/524288
                        time:   [2.4640 ms 2.5966 ms 2.6957 ms]
                        thrpt:  [194.49 Melem/s 201.92 Melem/s 212.78 Melem/s]
                        thrpt:  [741.91 MiB/s 770.25 MiB/s 811.68 MiB/s]
                 change:
                        time:   [−4.6768% −0.1694% +4.5634%] (p = 0.94 > 0.05)
                        thrpt:  [−4.3643% +0.1697% +4.9063%]
                        No change in performance detected.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high severe
Forward f32/PhastFT DIT/1048576
                        time:   [5.4337 ms 5.5563 ms 5.6237 ms]
                        thrpt:  [186.46 Melem/s 188.72 Melem/s 192.98 Melem/s]
                        thrpt:  [711.28 MiB/s 719.91 MiB/s 736.14 MiB/s]
                 change:
                        time:   [−8.4722% −4.1169% +0.6093%] (p = 0.09 > 0.05)
                        thrpt:  [−0.6056% +4.2937% +9.2564%]
                        No change in performance detected.
Forward f32/PhastFT DIT/2097152
                        time:   [11.800 ms 11.867 ms 11.955 ms]
                        thrpt:  [175.42 Melem/s 176.72 Melem/s 177.72 Melem/s]
                        thrpt:  [669.18 MiB/s 674.14 MiB/s 677.97 MiB/s]
                 change:
                        time:   [−5.5604% −2.5554% +0.1635%] (p = 0.10 > 0.05)
                        thrpt:  [−0.1632% +2.6224% +5.8878%]
                        No change in performance detected.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) low mild
Forward f32/PhastFT DIT/4194304
                        time:   [26.150 ms 26.294 ms 26.440 ms]
                        thrpt:  [158.64 Melem/s 159.52 Melem/s 160.39 Melem/s]
                        thrpt:  [605.15 MiB/s 608.51 MiB/s 611.85 MiB/s]
                 change:
                        time:   [+3.2572% +3.7901% +4.3385%] (p = 0.00 < 0.05)
                        thrpt:  [−4.1581% −3.6517% −3.1545%]
                        Performance has regressed.
Forward f32/PhastFT DIT/8388608
                        time:   [59.331 ms 59.369 ms 59.405 ms]
                        thrpt:  [141.21 Melem/s 141.30 Melem/s 141.39 Melem/s]
                        thrpt:  [538.67 MiB/s 539.01 MiB/s 539.34 MiB/s]
                 change:
                        time:   [+3.1756% +3.3002% +3.4275%] (p = 0.00 < 0.05)
                        thrpt:  [−3.3139% −3.1947% −3.0778%]
                        Performance has regressed.
Found 3 outliers among 20 measurements (15.00%)
  2 (10.00%) low mild
  1 (5.00%) high mild
Benchmarking Forward f32/PhastFT DIT/16777216: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 5.6s, or reduce sample count to 10.
Forward f32/PhastFT DIT/16777216
                        time:   [168.71 ms 169.59 ms 170.52 ms]
                        thrpt:  [98.387 Melem/s 98.929 Melem/s 99.442 Melem/s]
                        thrpt:  [375.32 MiB/s 377.38 MiB/s 379.34 MiB/s]
                 change:
                        time:   [−19.155% −18.518% −17.807%] (p = 0.00 < 0.05)
                        thrpt:  [+21.664% +22.727% +23.694%]
                        Performance has improved.

Inverse f32/PhastFT DIT/64
                        time:   [119.75 ns 120.98 ns 122.33 ns]
                        thrpt:  [523.18 Melem/s 529.03 Melem/s 534.43 Melem/s]
                        thrpt:  [1.9490 GiB/s 1.9708 GiB/s 1.9909 GiB/s]
                 change:
                        time:   [−23.794% −22.869% −21.985%] (p = 0.00 < 0.05)
                        thrpt:  [+28.181% +29.650% +31.223%]
                        Performance has improved.
Inverse f32/PhastFT DIT/128
                        time:   [175.18 ns 181.90 ns 193.95 ns]
                        thrpt:  [659.96 Melem/s 703.69 Melem/s 730.69 Melem/s]
                        thrpt:  [2.4585 GiB/s 2.6214 GiB/s 2.7220 GiB/s]
                 change:
                        time:   [−18.615% −16.942% −14.244%] (p = 0.00 < 0.05)
                        thrpt:  [+16.610% +20.398% +22.873%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
Inverse f32/PhastFT DIT/256
                        time:   [331.89 ns 358.37 ns 384.60 ns]
                        thrpt:  [665.62 Melem/s 714.35 Melem/s 771.35 Melem/s]
                        thrpt:  [2.4796 GiB/s 2.6612 GiB/s 2.8735 GiB/s]
                 change:
                        time:   [−17.761% −13.315% −8.4191%] (p = 0.00 < 0.05)
                        thrpt:  [+9.1930% +15.360% +21.596%]
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high severe
Inverse f32/PhastFT DIT/512
                        time:   [756.87 ns 821.23 ns 867.95 ns]
                        thrpt:  [589.89 Melem/s 623.46 Melem/s 676.47 Melem/s]
                        thrpt:  [2.1975 GiB/s 2.3226 GiB/s 2.5201 GiB/s]
                 change:
                        time:   [−41.398% −38.979% −36.319%] (p = 0.00 < 0.05)
                        thrpt:  [+57.032% +63.877% +70.642%]
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high severe
Inverse f32/PhastFT DIT/1024
                        time:   [1.7195 µs 1.8757 µs 1.9881 µs]
                        thrpt:  [515.07 Melem/s 545.92 Melem/s 595.52 Melem/s]
                        thrpt:  [1.9188 GiB/s 2.0337 GiB/s 2.2185 GiB/s]
                 change:
                        time:   [−54.071% −51.336% −48.342%] (p = 0.00 < 0.05)
                        thrpt:  [+93.581% +105.49% +117.73%]
                        Performance has improved.
Inverse f32/PhastFT DIT/2048
                        time:   [3.7056 µs 4.0188 µs 4.2435 µs]
                        thrpt:  [482.63 Melem/s 509.61 Melem/s 552.67 Melem/s]
                        thrpt:  [1.7979 GiB/s 1.8984 GiB/s 2.0589 GiB/s]
                 change:
                        time:   [−8.9721% −0.8636% +7.8486%] (p = 0.83 > 0.05)
                        thrpt:  [−7.2774% +0.8711% +9.8564%]
                        No change in performance detected.
Inverse f32/PhastFT DIT/4096
                        time:   [7.8308 µs 8.4852 µs 8.9576 µs]
                        thrpt:  [457.27 Melem/s 482.72 Melem/s 523.06 Melem/s]
                        thrpt:  [1.7035 GiB/s 1.7983 GiB/s 1.9486 GiB/s]
                 change:
                        time:   [−7.3510% −0.2609% +8.1731%] (p = 0.95 > 0.05)
                        thrpt:  [−7.5556% +0.2616% +7.9343%]
                        No change in performance detected.
Inverse f32/PhastFT DIT/8192
                        time:   [16.199 µs 17.265 µs 18.178 µs]
                        thrpt:  [450.65 Melem/s 474.48 Melem/s 505.72 Melem/s]
                        thrpt:  [1.6788 GiB/s 1.7676 GiB/s 1.8839 GiB/s]
                 change:
                        time:   [−6.7377% −0.8235% +5.4072%] (p = 0.81 > 0.05)
                        thrpt:  [−5.1298% +0.8304% +7.2244%]
                        No change in performance detected.
Inverse f32/PhastFT DIT/16384
                        time:   [49.232 µs 49.865 µs 50.277 µs]
                        thrpt:  [325.88 Melem/s 328.57 Melem/s 332.79 Melem/s]
                        thrpt:  [1.2140 GiB/s 1.2240 GiB/s 1.2397 GiB/s]
                 change:
                        time:   [−1.9654% −0.6160% +0.7743%] (p = 0.39 > 0.05)
                        thrpt:  [−0.7684% +0.6199% +2.0048%]
                        No change in performance detected.
Inverse f32/PhastFT DIT/32768
                        time:   [131.23 µs 133.19 µs 134.48 µs]
                        thrpt:  [243.67 Melem/s 246.03 Melem/s 249.70 Melem/s]
                        thrpt:  [929.53 MiB/s 938.53 MiB/s 952.54 MiB/s]
                 change:
                        time:   [+1.1993% +2.7264% +4.3050%] (p = 0.00 < 0.05)
                        thrpt:  [−4.1273% −2.6541% −1.1851%]
                        Performance has regressed.
Inverse f32/PhastFT DIT/65536
                        time:   [262.36 µs 266.28 µs 268.84 µs]
                        thrpt:  [243.77 Melem/s 246.11 Melem/s 249.80 Melem/s]
                        thrpt:  [929.91 MiB/s 938.85 MiB/s 952.90 MiB/s]
                 change:
                        time:   [+1.8522% +3.3648% +4.9070%] (p = 0.00 < 0.05)
                        thrpt:  [−4.6775% −3.2553% −1.8185%]
                        Performance has regressed.
Inverse f32/PhastFT DIT/131072
                        time:   [542.92 µs 552.73 µs 559.16 µs]
                        thrpt:  [234.41 Melem/s 237.14 Melem/s 241.42 Melem/s]
                        thrpt:  [894.19 MiB/s 904.60 MiB/s 920.94 MiB/s]
                 change:
                        time:   [+1.2146% +3.1059% +5.1118%] (p = 0.00 < 0.05)
                        thrpt:  [−4.8632% −3.0123% −1.2000%]
                        Performance has regressed.
Inverse f32/PhastFT DIT/262144
                        time:   [1.1801 ms 1.1929 ms 1.2002 ms]
                        thrpt:  [218.42 Melem/s 219.74 Melem/s 222.13 Melem/s]
                        thrpt:  [833.20 MiB/s 838.26 MiB/s 847.36 MiB/s]
                 change:
                        time:   [−4.5700% −2.8677% −1.0830%] (p = 0.01 < 0.05)
                        thrpt:  [+1.0949% +2.9524% +4.7888%]
                        Performance has improved.
Inverse f32/PhastFT DIT/524288
                        time:   [2.4618 ms 2.4861 ms 2.5007 ms]
                        thrpt:  [209.66 Melem/s 210.89 Melem/s 212.97 Melem/s]
                        thrpt:  [799.78 MiB/s 804.47 MiB/s 812.40 MiB/s]
                 change:
                        time:   [−2.1332% +1.2153% +4.4939%] (p = 0.48 > 0.05)
                        thrpt:  [−4.3006% −1.2007% +2.1797%]
                        No change in performance detected.
Inverse f32/PhastFT DIT/1048576
                        time:   [5.3891 ms 5.5869 ms 5.7057 ms]
                        thrpt:  [183.78 Melem/s 187.68 Melem/s 194.57 Melem/s]
                        thrpt:  [701.06 MiB/s 715.96 MiB/s 742.24 MiB/s]
                 change:
                        time:   [−2.6181% +1.5208% +6.2865%] (p = 0.53 > 0.05)
                        thrpt:  [−5.9147% −1.4980% +2.6885%]
                        No change in performance detected.
Inverse f32/PhastFT DIT/2097152
                        time:   [11.978 ms 12.038 ms 12.098 ms]
                        thrpt:  [173.35 Melem/s 174.21 Melem/s 175.09 Melem/s]
                        thrpt:  [661.26 MiB/s 664.57 MiB/s 667.92 MiB/s]
                 change:
                        time:   [−2.3795% +1.8420% +6.1592%] (p = 0.41 > 0.05)
                        thrpt:  [−5.8018% −1.8087% +2.4375%]
                        No change in performance detected.
Found 5 outliers among 20 measurements (25.00%)
  5 (25.00%) low mild
Inverse f32/PhastFT DIT/4194304
                        time:   [26.876 ms 27.025 ms 27.176 ms]
                        thrpt:  [154.34 Melem/s 155.20 Melem/s 156.06 Melem/s]
                        thrpt:  [588.76 MiB/s 592.05 MiB/s 595.32 MiB/s]
                 change:
                        time:   [+3.3368% +4.0776% +4.8861%] (p = 0.00 < 0.05)
                        thrpt:  [−4.6585% −3.9179% −3.2291%]
                        Performance has regressed.
Inverse f32/PhastFT DIT/8388608
                        time:   [61.850 ms 61.893 ms 61.935 ms]
                        thrpt:  [135.44 Melem/s 135.54 Melem/s 135.63 Melem/s]
                        thrpt:  [516.67 MiB/s 517.03 MiB/s 517.38 MiB/s]
                 change:
                        time:   [+4.5030% +4.6130% +4.7204%] (p = 0.00 < 0.05)
                        thrpt:  [−4.5076% −4.4095% −4.3090%]
                        Performance has regressed.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking Inverse f32/PhastFT DIT/16777216: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 5.7s, or reduce sample count to 10.
Inverse f32/PhastFT DIT/16777216
                        time:   [175.16 ms 176.06 ms 177.04 ms]
                        thrpt:  [94.764 Melem/s 95.292 Melem/s 95.780 Melem/s]
                        thrpt:  [361.50 MiB/s 363.51 MiB/s 365.37 MiB/s]
                 change:
                        time:   [−18.630% −17.956% −17.317%] (p = 0.00 < 0.05)
                        thrpt:  [+20.944% +21.885% +22.896%]
                        Performance has improved.

Forward f64/PhastFT DIT/64
                        time:   [134.38 ns 148.45 ns 159.01 ns]
                        thrpt:  [402.49 Melem/s 431.13 Melem/s 476.26 Melem/s]
                        thrpt:  [2.9988 GiB/s 3.2122 GiB/s 3.5484 GiB/s]
                 change:
                        time:   [−42.087% −38.712% −34.892%] (p = 0.00 < 0.05)
                        thrpt:  [+53.591% +63.164% +72.673%]
                        Performance has improved.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high severe
Forward f64/PhastFT DIT/128
                        time:   [254.40 ns 281.61 ns 298.57 ns]
                        thrpt:  [428.71 Melem/s 454.53 Melem/s 503.15 Melem/s]
                        thrpt:  [3.1941 GiB/s 3.3865 GiB/s 3.7488 GiB/s]
                 change:
                        time:   [−36.280% −29.943% −22.939%] (p = 0.00 < 0.05)
                        thrpt:  [+29.767% +42.741% +56.937%]
                        Performance has improved.
Forward f64/PhastFT DIT/256
                        time:   [587.35 ns 647.73 ns 683.87 ns]
                        thrpt:  [374.34 Melem/s 395.23 Melem/s 435.86 Melem/s]
                        thrpt:  [2.7891 GiB/s 2.9447 GiB/s 3.2474 GiB/s]
                 change:
                        time:   [−31.811% −25.563% −17.975%] (p = 0.00 < 0.05)
                        thrpt:  [+21.913% +34.342% +46.652%]
                        Performance has improved.
Forward f64/PhastFT DIT/512
                        time:   [1.3590 µs 1.4899 µs 1.5702 µs]
                        thrpt:  [326.08 Melem/s 343.64 Melem/s 376.76 Melem/s]
                        thrpt:  [2.4295 GiB/s 2.5603 GiB/s 2.8071 GiB/s]
                 change:
                        time:   [−45.711% −40.233% −34.572%] (p = 0.00 < 0.05)
                        thrpt:  [+52.839% +67.315% +84.201%]
                        Performance has improved.
Forward f64/PhastFT DIT/1024
                        time:   [3.0070 µs 3.3063 µs 3.4911 µs]
                        thrpt:  [293.32 Melem/s 309.71 Melem/s 340.54 Melem/s]
                        thrpt:  [2.1854 GiB/s 2.3075 GiB/s 2.5372 GiB/s]
                 change:
                        time:   [−47.257% −42.016% −35.852%] (p = 0.00 < 0.05)
                        thrpt:  [+55.889% +72.462% +89.599%]
                        Performance has improved.
Forward f64/PhastFT DIT/2048
                        time:   [6.4591 µs 7.0791 µs 7.4552 µs]
                        thrpt:  [274.71 Melem/s 289.30 Melem/s 317.07 Melem/s]
                        thrpt:  [2.0467 GiB/s 2.1555 GiB/s 2.3624 GiB/s]
                 change:
                        time:   [−13.216% −1.7555% +11.454%] (p = 0.79 > 0.05)
                        thrpt:  [−10.277% +1.7868% +15.228%]
                        No change in performance detected.
Forward f64/PhastFT DIT/4096
                        time:   [13.359 µs 14.601 µs 15.364 µs]
                        thrpt:  [266.60 Melem/s 280.53 Melem/s 306.62 Melem/s]
                        thrpt:  [1.9863 GiB/s 2.0901 GiB/s 2.2845 GiB/s]
                 change:
                        time:   [−12.364% −2.0764% +8.4036%] (p = 0.71 > 0.05)
                        thrpt:  [−7.7521% +2.1204% +14.108%]
                        No change in performance detected.
Forward f64/PhastFT DIT/8192
                        time:   [34.293 µs 36.470 µs 37.904 µs]
                        thrpt:  [216.12 Melem/s 224.63 Melem/s 238.88 Melem/s]
                        thrpt:  [1.6102 GiB/s 1.6736 GiB/s 1.7798 GiB/s]
                 change:
                        time:   [−7.2394% −1.4840% +4.7740%] (p = 0.63 > 0.05)
                        thrpt:  [−4.5565% +1.5063% +7.8044%]
                        No change in performance detected.
Forward f64/PhastFT DIT/16384
                        time:   [86.758 µs 90.696 µs 93.589 µs]
                        thrpt:  [175.06 Melem/s 180.65 Melem/s 188.85 Melem/s]
                        thrpt:  [1.3043 GiB/s 1.3459 GiB/s 1.4070 GiB/s]
                 change:
                        time:   [−4.5745% −0.8321% +2.7077%] (p = 0.68 > 0.05)
                        thrpt:  [−2.6363% +0.8391% +4.7938%]
                        No change in performance detected.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high severe
Forward f64/PhastFT DIT/32768
                        time:   [202.00 µs 214.94 µs 224.29 µs]
                        thrpt:  [146.10 Melem/s 152.45 Melem/s 162.22 Melem/s]
                        thrpt:  [1.0885 GiB/s 1.1358 GiB/s 1.2086 GiB/s]
                 change:
                        time:   [−2.4491% +2.2836% +7.3793%] (p = 0.37 > 0.05)
                        thrpt:  [−6.8722% −2.2326% +2.5106%]
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high mild
Forward f64/PhastFT DIT/65536
                        time:   [437.07 µs 459.04 µs 474.94 µs]
                        thrpt:  [137.99 Melem/s 142.77 Melem/s 149.95 Melem/s]
                        thrpt:  [1.0281 GiB/s 1.0637 GiB/s 1.1172 GiB/s]
                 change:
                        time:   [−3.5204% +0.6838% +5.1750%] (p = 0.76 > 0.05)
                        thrpt:  [−4.9204% −0.6792% +3.6488%]
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high mild
Forward f64/PhastFT DIT/131072
                        time:   [960.30 µs 1.0064 ms 1.0405 ms]
                        thrpt:  [125.97 Melem/s 130.23 Melem/s 136.49 Melem/s]
                        thrpt:  [961.11 MiB/s 993.60 MiB/s 1.0169 GiB/s]
                 change:
                        time:   [−9.7153% −6.2487% −2.5406%] (p = 0.00 < 0.05)
                        thrpt:  [+2.6069% +6.6652% +10.761%]
                        Performance has improved.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high mild
Forward f64/PhastFT DIT/262144
                        time:   [2.1639 ms 2.2573 ms 2.3153 ms]
                        thrpt:  [113.22 Melem/s 116.13 Melem/s 121.15 Melem/s]
                        thrpt:  [863.82 MiB/s 886.02 MiB/s 924.27 MiB/s]
                 change:
                        time:   [−1.0394% +3.6317% +8.7743%] (p = 0.16 > 0.05)
                        thrpt:  [−8.0665% −3.5044% +1.0503%]
                        No change in performance detected.
Forward f64/PhastFT DIT/524288
                        time:   [4.6037 ms 4.6942 ms 4.7440 ms]
                        thrpt:  [110.52 Melem/s 111.69 Melem/s 113.89 Melem/s]
                        thrpt:  [843.16 MiB/s 852.11 MiB/s 868.87 MiB/s]
                 change:
                        time:   [−4.2952% +0.1551% +4.7179%] (p = 0.94 > 0.05)
                        thrpt:  [−4.5053% −0.1548% +4.4880%]
                        No change in performance detected.
Forward f64/PhastFT DIT/1048576
                        time:   [10.285 ms 10.337 ms 10.394 ms]
                        thrpt:  [100.88 Melem/s 101.44 Melem/s 101.96 Melem/s]
                        thrpt:  [769.68 MiB/s 773.90 MiB/s 777.86 MiB/s]
                 change:
                        time:   [−0.4721% +3.3727% +7.8886%] (p = 0.13 > 0.05)
                        thrpt:  [−7.3118% −3.2627% +0.4743%]
                        No change in performance detected.
Found 5 outliers among 20 measurements (25.00%)
  5 (25.00%) low mild
Benchmarking Forward f64/PhastFT DIT/2097152: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 9.5s, enable flat sampling, or reduce sample count to 10.
Forward f64/PhastFT DIT/2097152
                        time:   [22.406 ms 22.479 ms 22.556 ms]
                        thrpt:  [92.975 Melem/s 93.293 Melem/s 93.597 Melem/s]
                        thrpt:  [709.34 MiB/s 711.77 MiB/s 714.09 MiB/s]
                 change:
                        time:   [+1.1303% +2.8452% +4.3892%] (p = 0.00 < 0.05)
                        thrpt:  [−4.2046% −2.7665% −1.1177%]
                        Performance has regressed.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
Forward f64/PhastFT DIT/4194304
                        time:   [50.571 ms 50.616 ms 50.662 ms]
                        thrpt:  [82.790 Melem/s 82.866 Melem/s 82.938 Melem/s]
                        thrpt:  [631.64 MiB/s 632.22 MiB/s 632.77 MiB/s]
                 change:
                        time:   [+2.4846% +2.7307% +2.9220%] (p = 0.00 < 0.05)
                        thrpt:  [−2.8391% −2.6581% −2.4243%]
                        Performance has regressed.
Forward f64/PhastFT DIT/8388608
                        time:   [141.72 ms 141.79 ms 141.87 ms]
                        thrpt:  [59.128 Melem/s 59.160 Melem/s 59.191 Melem/s]
                        thrpt:  [451.11 MiB/s 451.36 MiB/s 451.59 MiB/s]
                 change:
                        time:   [−20.331% −19.801% −19.298%] (p = 0.00 < 0.05)
                        thrpt:  [+23.913% +24.690% +25.519%]
                        Performance has improved.
Benchmarking Forward f64/PhastFT DIT/16777216: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 13.3s, or reduce sample count to 10.
Forward f64/PhastFT DIT/16777216
                        time:   [474.62 ms 476.32 ms 478.64 ms]
                        thrpt:  [35.052 Melem/s 35.223 Melem/s 35.349 Melem/s]
                        thrpt:  [267.42 MiB/s 268.73 MiB/s 269.69 MiB/s]
                 change:
                        time:   [−2.3199% −1.7128% −1.0233%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0339% +1.7426% +2.3750%]
                        Performance has improved.
Found 4 outliers among 20 measurements (20.00%)
  1 (5.00%) high mild
  3 (15.00%) high severe

Inverse f64/PhastFT DIT/64
                        time:   [144.16 ns 156.35 ns 168.90 ns]
                        thrpt:  [378.93 Melem/s 409.33 Melem/s 443.95 Melem/s]
                        thrpt:  [2.8232 GiB/s 3.0497 GiB/s 3.3077 GiB/s]
                 change:
                        time:   [−39.171% −37.139% −34.087%] (p = 0.00 < 0.05)
                        thrpt:  [+51.716% +59.081% +64.396%]
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high severe
Inverse f64/PhastFT DIT/128
                        time:   [257.91 ns 290.23 ns 314.24 ns]
                        thrpt:  [407.33 Melem/s 441.03 Melem/s 496.30 Melem/s]
                        thrpt:  [3.0348 GiB/s 3.2859 GiB/s 3.6977 GiB/s]
                 change:
                        time:   [−34.564% −29.400% −23.601%] (p = 0.00 < 0.05)
                        thrpt:  [+30.892% +41.643% +52.822%]
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high severe
Inverse f64/PhastFT DIT/256
                        time:   [614.04 ns 677.82 ns 724.93 ns]
                        thrpt:  [353.14 Melem/s 377.68 Melem/s 416.91 Melem/s]
                        thrpt:  [2.6311 GiB/s 2.8139 GiB/s 3.1062 GiB/s]
                 change:
                        time:   [−29.252% −23.623% −17.899%] (p = 0.00 < 0.05)
                        thrpt:  [+21.802% +30.929% +41.346%]
                        Performance has improved.
Found 4 outliers among 20 measurements (20.00%)
  4 (20.00%) high mild
Inverse f64/PhastFT DIT/512
                        time:   [1.3976 µs 1.5386 µs 1.6402 µs]
                        thrpt:  [312.15 Melem/s 332.77 Melem/s 366.33 Melem/s]
                        thrpt:  [2.3257 GiB/s 2.4793 GiB/s 2.7294 GiB/s]
                 change:
                        time:   [−45.002% −40.480% −35.246%] (p = 0.00 < 0.05)
                        thrpt:  [+54.430% +68.011% +81.824%]
                        Performance has improved.
Inverse f64/PhastFT DIT/1024
                        time:   [3.1107 µs 3.4249 µs 3.6462 µs]
                        thrpt:  [280.84 Melem/s 298.99 Melem/s 329.19 Melem/s]
                        thrpt:  [2.0924 GiB/s 2.2276 GiB/s 2.4526 GiB/s]
                 change:
                        time:   [−46.280% −41.187% −35.543%] (p = 0.00 < 0.05)
                        thrpt:  [+55.141% +70.031% +86.151%]
                        Performance has improved.
Inverse f64/PhastFT DIT/2048
                        time:   [6.6631 µs 7.2842 µs 7.7168 µs]
                        thrpt:  [265.39 Melem/s 281.16 Melem/s 307.36 Melem/s]
                        thrpt:  [1.9773 GiB/s 2.0948 GiB/s 2.2900 GiB/s]
                 change:
                        time:   [−11.399% −1.4946% +9.4139%] (p = 0.78 > 0.05)
                        thrpt:  [−8.6039% +1.5173% +12.865%]
                        No change in performance detected.
Inverse f64/PhastFT DIT/4096
                        time:   [13.713 µs 14.963 µs 15.847 µs]
                        thrpt:  [258.46 Melem/s 273.74 Melem/s 298.69 Melem/s]
                        thrpt:  [1.9257 GiB/s 2.0395 GiB/s 2.2254 GiB/s]
                 change:
                        time:   [−10.570% −1.4203% +8.7281%] (p = 0.77 > 0.05)
                        thrpt:  [−8.0275% +1.4407% +11.819%]
                        No change in performance detected.
Inverse f64/PhastFT DIT/8192
                        time:   [34.895 µs 37.231 µs 38.927 µs]
                        thrpt:  [210.45 Melem/s 220.03 Melem/s 234.76 Melem/s]
                        thrpt:  [1.5679 GiB/s 1.6394 GiB/s 1.7491 GiB/s]
                 change:
                        time:   [−6.4331% −1.6959% +3.5729%] (p = 0.52 > 0.05)
                        thrpt:  [−3.4497% +1.7251% +6.8754%]
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high mild
Inverse f64/PhastFT DIT/16384
                        time:   [89.090 µs 92.454 µs 95.587 µs]
                        thrpt:  [171.40 Melem/s 177.21 Melem/s 183.90 Melem/s]
                        thrpt:  [1.2771 GiB/s 1.3203 GiB/s 1.3702 GiB/s]
                 change:
                        time:   [−4.2343% −1.5444% +1.1739%] (p = 0.31 > 0.05)
                        thrpt:  [−1.1603% +1.5686% +4.4215%]
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high severe
Inverse f64/PhastFT DIT/32768
                        time:   [209.47 µs 217.53 µs 225.52 µs]
                        thrpt:  [145.30 Melem/s 150.64 Melem/s 156.44 Melem/s]
                        thrpt:  [1.0826 GiB/s 1.1223 GiB/s 1.1655 GiB/s]
                 change:
                        time:   [−1.7314% +1.7303% +5.6896%] (p = 0.37 > 0.05)
                        thrpt:  [−5.3834% −1.7009% +1.7619%]
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Inverse f64/PhastFT DIT/65536
                        time:   [450.81 µs 463.22 µs 478.02 µs]
                        thrpt:  [137.10 Melem/s 141.48 Melem/s 145.37 Melem/s]
                        thrpt:  [1.0215 GiB/s 1.0541 GiB/s 1.0831 GiB/s]
                 change:
                        time:   [−1.4427% +1.5670% +4.8878%] (p = 0.34 > 0.05)
                        thrpt:  [−4.6600% −1.5428% +1.4638%]
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Inverse f64/PhastFT DIT/131072
                        time:   [985.77 µs 1.0234 ms 1.0592 ms]
                        thrpt:  [123.75 Melem/s 128.08 Melem/s 132.96 Melem/s]
                        thrpt:  [944.13 MiB/s 977.14 MiB/s 1014.4 MiB/s]
                 change:
                        time:   [−7.1533% −4.6837% −2.1651%] (p = 0.00 < 0.05)
                        thrpt:  [+2.2130% +4.9139% +7.7044%]
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
Inverse f64/PhastFT DIT/262144
                        time:   [2.1361 ms 2.2562 ms 2.3440 ms]
                        thrpt:  [111.84 Melem/s 116.19 Melem/s 122.72 Melem/s]
                        thrpt:  [853.25 MiB/s 886.43 MiB/s 936.29 MiB/s]
                 change:
                        time:   [+0.2344% +3.9368% +8.1584%] (p = 0.06 > 0.05)
                        thrpt:  [−7.5430% −3.7876% −0.2339%]
                        No change in performance detected.
Found 4 outliers among 20 measurements (20.00%)
  1 (5.00%) low mild
  3 (15.00%) high severe
Inverse f64/PhastFT DIT/524288
                        time:   [4.4492 ms 4.6686 ms 4.8337 ms]
                        thrpt:  [108.46 Melem/s 112.30 Melem/s 117.84 Melem/s]
                        thrpt:  [827.52 MiB/s 856.78 MiB/s 899.04 MiB/s]
                 change:
                        time:   [−3.6781% +0.2970% +4.6067%] (p = 0.89 > 0.05)
                        thrpt:  [−4.4038% −0.2961% +3.8185%]
                        No change in performance detected.
Found 7 outliers among 20 measurements (35.00%)
  3 (15.00%) low mild
  4 (20.00%) high severe
Inverse f64/PhastFT DIT/1048576
                        time:   [10.298 ms 10.510 ms 10.625 ms]
                        thrpt:  [98.688 Melem/s 99.765 Melem/s 101.83 Melem/s]
                        thrpt:  [752.93 MiB/s 761.15 MiB/s 776.88 MiB/s]
                 change:
                        time:   [−1.7326% +2.5562% +6.8997%] (p = 0.26 > 0.05)
                        thrpt:  [−6.4544% −2.4924% +1.7632%]
                        No change in performance detected.
Benchmarking Inverse f64/PhastFT DIT/2097152: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 9.6s, enable flat sampling, or reduce sample count to 10.
Inverse f64/PhastFT DIT/2097152
                        time:   [23.185 ms 23.256 ms 23.331 ms]
                        thrpt:  [89.888 Melem/s 90.175 Melem/s 90.454 Melem/s]
                        thrpt:  [685.79 MiB/s 687.98 MiB/s 690.11 MiB/s]
                 change:
                        time:   [+1.2200% +2.6328% +3.9372%] (p = 0.00 < 0.05)
                        thrpt:  [−3.7880% −2.5653% −1.2053%]
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
Inverse f64/PhastFT DIT/4194304
                        time:   [53.259 ms 53.316 ms 53.378 ms]
                        thrpt:  [78.577 Melem/s 78.669 Melem/s 78.753 Melem/s]
                        thrpt:  [599.49 MiB/s 600.20 MiB/s 600.84 MiB/s]
                 change:
                        time:   [+4.7409% +4.8757% +5.0096%] (p = 0.00 < 0.05)
                        thrpt:  [−4.7706% −4.6490% −4.5263%]
                        Performance has regressed.
Inverse f64/PhastFT DIT/8388608
                        time:   [150.32 ms 150.41 ms 150.51 ms]
                        thrpt:  [55.735 Melem/s 55.770 Melem/s 55.804 Melem/s]
                        thrpt:  [425.23 MiB/s 425.49 MiB/s 425.75 MiB/s]
                 change:
                        time:   [−19.862% −19.325% −18.813%] (p = 0.00 < 0.05)
                        thrpt:  [+23.172% +23.954% +24.784%]
                        Performance has improved.
Benchmarking Inverse f64/PhastFT DIT/16777216: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 13.8s, or reduce sample count to 10.
Inverse f64/PhastFT DIT/16777216
                        time:   [497.70 ms 499.55 ms 502.10 ms]
                        thrpt:  [33.414 Melem/s 33.585 Melem/s 33.709 Melem/s]
                        thrpt:  [254.93 MiB/s 256.23 MiB/s 257.18 MiB/s]
                 change:
                        time:   [−1.0127% −0.3975% +0.1480%] (p = 0.24 > 0.05)
                        thrpt:  [−0.1478% +0.3991% +1.0231%]
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high severe

There is more we can do here. I'd like to integrate the LUT variant of COBRA from #47. We could also constrain which sizes each transform applies to, so that we wouldn't need to bundle all unrolled variants into one function, wouldn't have to measure COBRA at smaller sizes, etc.
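A minimal sketch of what per-transform size constraints could look like. Every name here (`BitRevKind`, `supports`, `applicable_kinds`) is hypothetical and not taken from PhastFT. The only constraint encoded is the structural one that a blocked reversal needs at least one full `block x block` tile, i.e. `log_n >= 2 * log_block`; any real thresholds would have to come from benchmarks.

```rust
/// Hypothetical catalogue of bit-reversal strategies (not PhastFT's API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BitRevKind {
    /// Plain index-by-index reversal; valid for every power-of-two size.
    Naive,
    /// Blocked (COBRA-style) reversal with a tile of width 2^log_block.
    Cobra { log_block: u32 },
}

impl BitRevKind {
    /// Whether this strategy can be applied to an input of 2^log_n elements.
    fn supports(self, log_n: u32) -> bool {
        match self {
            BitRevKind::Naive => true,
            // A blocked pass needs the index to split into high, middle and
            // low fields, with the high and low fields log_block bits each.
            BitRevKind::Cobra { log_block } => log_n >= 2 * log_block,
        }
    }
}

/// Candidate list used for illustration only.
const ALL_KINDS: [BitRevKind; 3] = [
    BitRevKind::Naive,
    BitRevKind::Cobra { log_block: 6 },
    BitRevKind::Cobra { log_block: 7 },
];

/// Strategies worth benchmarking for a given size, in declaration order.
fn applicable_kinds(log_n: u32) -> Vec<BitRevKind> {
    ALL_KINDS
        .iter()
        .copied()
        .filter(|kind| kind.supports(log_n))
        .collect()
}
```

With a filter like this, a benchmark harness would only iterate over `applicable_kinds(log_n)` instead of running every variant at every size.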

@Shnatsel
Collaborator Author

I've messed around with COBRA block sizes. On Zen4, switching block size from 128 to 64 significantly helps f64 benchmarks.

I'll add the LUT variant from #47 and add variants for different block sizes.
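For readers unfamiliar with what the block size controls, here is a simplified out-of-place sketch of a COBRA-style blocked bit reversal with the block size as a const generic. It is an illustration under assumptions, not the PhastFT implementation: the function names are made up, it assumes "block size 128 / 64" means a tile width of 2^7 / 2^6 elements, and it allocates its tile on the heap for brevity.

```rust
/// Reverse the lowest `bits` bits of `x`.
fn rev_bits(x: usize, bits: u32) -> usize {
    if bits == 0 { 0 } else { x.reverse_bits() >> (usize::BITS - bits) }
}

/// Reference permutation: dst[rev(i)] = src[i].
fn naive_bit_reverse<T: Copy>(src: &[T], dst: &mut [T]) {
    assert!(src.len().is_power_of_two());
    assert_eq!(src.len(), dst.len());
    let log_n = src.len().trailing_zeros();
    for (i, &v) in src.iter().enumerate() {
        dst[rev_bits(i, log_n)] = v;
    }
}

/// Blocked bit reversal with a tile of 2^LOG_BLOCK x 2^LOG_BLOCK elements.
/// Each index is split into fields (a, b, c): high LOG_BLOCK bits, middle
/// bits, low LOG_BLOCK bits. Its reversal is (rev(c), rev(b), rev(a)).
fn cobra_bit_reverse<T: Copy + Default, const LOG_BLOCK: u32>(src: &[T], dst: &mut [T]) {
    assert!(src.len().is_power_of_two());
    assert_eq!(src.len(), dst.len());
    let log_n = src.len().trailing_zeros();
    if log_n < 2 * LOG_BLOCK {
        // Too small to fill one tile; fall back to the simple loop.
        return naive_bit_reverse(src, dst);
    }
    let block = 1usize << LOG_BLOCK;
    let mid_bits = log_n - 2 * LOG_BLOCK;
    let hi_shift = mid_bits + LOG_BLOCK;
    let mut tile = vec![T::default(); block * block];

    for b in 0..(1usize << mid_bits) {
        let b_rev = rev_bits(b, mid_bits);
        // Gather: copy contiguous runs of src into tile row rev(a).
        for a in 0..block {
            let a_rev = rev_bits(a, LOG_BLOCK);
            let base = (a << hi_shift) | (b << LOG_BLOCK);
            tile[a_rev * block..(a_rev + 1) * block]
                .copy_from_slice(&src[base..base + block]);
        }
        // Scatter: read tile columns, write contiguous runs of dst.
        for c in 0..block {
            let base = (rev_bits(c, LOG_BLOCK) << hi_shift) | (b_rev << LOG_BLOCK);
            for a_rev in 0..block {
                dst[base | a_rev] = tile[a_rev * block + c];
            }
        }
    }
}
```

A call such as `cobra_bit_reverse::<f64, 6>(&src, &mut dst)` uses a 64 x 64 tile (32 KiB of `f64`), while `cobra_bit_reverse::<f64, 7>` uses a 128 x 128 tile (128 KiB). The tile footprint is the main thing the block size changes. Whether that is what explains the f64 results on Zen4 is not established in this thread.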

@Shnatsel Shnatsel mentioned this pull request Jan 20, 2026
@Shnatsel
Collaborator Author

Closing in favor of #62: CO-BRAVO is fastest for all benchmarked sizes.

Development

Successfully merging this pull request may close these issues.

Unrolled COBRA is slower than the generic implementation on desktop Zen 4
