@PSeitz-dd commented on Jan 6, 2026

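Use NEON registers (uint32x4_t) instead of scalar [u32; 4] arrays.

Very large gains for delta decompression (relevant for tantivy): in the benchmarks below, decompress-delta and decompress-strict-delta take roughly 80% to 87% less time. Plain decompression is unchanged, and compression improves by up to about 14% depending on bit width.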
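
To illustrate why delta decoding is the case that benefits most: integrating deltas is a running sum, and once the four values sit in one NEON register the sum can be formed with two shift-and-add steps instead of a lane-by-lane loop. The following is a minimal sketch, not the code from this PR. The function names `integrate_delta_scalar` and `integrate_delta` are hypothetical, and it assumes each group of four deltas is seeded with the last decoded value of the previous group.

```rust
/// Portable reference: running sum of four deltas, seeded with the last
/// decoded value of the previous group. (Hypothetical name, for illustration.)
fn integrate_delta_scalar(prev_last: u32, delta: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    let mut acc = prev_last;
    for (o, d) in out.iter_mut().zip(delta) {
        acc = acc.wrapping_add(d);
        *o = acc;
    }
    out
}

/// NEON variant: the same running sum, computed inside one 128-bit register.
#[cfg(target_arch = "aarch64")]
fn integrate_delta(prev_last: u32, delta: [u32; 4]) -> [u32; 4] {
    use std::arch::aarch64::*;
    let mut out = [0u32; 4];
    // SAFETY: NEON is a baseline feature on aarch64 targets, and both
    // pointers refer to exactly four u32 values.
    unsafe {
        let zero = vdupq_n_u32(0);
        let d = vld1q_u32(delta.as_ptr());
        // Shift lanes up by one: [0, d0, d1, d2]
        let s1 = vextq_u32::<3>(zero, d);
        // a = [d0, d0+d1, d1+d2, d2+d3]
        let a = vaddq_u32(d, s1);
        // Shift lanes up by two: [0, 0, a0, a1]
        let s2 = vextq_u32::<2>(zero, a);
        // b = [d0, d0+d1, d0+d1+d2, d0+d1+d2+d3]
        let b = vaddq_u32(a, s2);
        // Add the seed to every lane.
        let r = vaddq_u32(b, vdupq_n_u32(prev_last));
        vst1q_u32(out.as_mut_ptr(), r);
    }
    out
}

/// Fallback so the sketch also builds on non-aarch64 targets.
#[cfg(not(target_arch = "aarch64"))]
fn integrate_delta(prev_last: u32, delta: [u32; 4]) -> [u32; 4] {
    integrate_delta_scalar(prev_last, delta)
}
```

With a seed of 10 and deltas `[1, 2, 3, 4]`, both versions produce `[11, 13, 16, 20]`. Additions wrap on overflow in both paths, so the two stay in agreement for any input.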

BitPacker4x/decompress-1                                                                            
                        time:   [48.542 ns 48.658 ns 48.831 ns]
                        thrpt:  [26.213 Gelem/s 26.306 Gelem/s 26.369 Gelem/s]
                 change:
                        time:   [-0.5449% -0.2760% +0.0015%] (p = 0.15 > 0.05)
                        thrpt:  [-0.0015% +0.2767% +0.5479%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

BitPacker4x/decompress-delta-1                                                                           
                        time:   [165.27 ns 165.69 ns 166.08 ns]
                        thrpt:  [7.7070 Gelem/s 7.7252 Gelem/s 7.7447 Gelem/s]
                 change:
                        time:   [-87.384% -87.310% -87.230%] (p = 0.00 < 0.05)
                        thrpt:  [+683.07% +688.00% +692.67%]
                        Performance has improved.

BitPacker4x/decompress-strict-delta-1                                                                           
                        time:   [184.46 ns 185.04 ns 185.57 ns]
                        thrpt:  [6.8978 Gelem/s 6.9176 Gelem/s 6.9392 Gelem/s]
                 change:
                        time:   [-85.351% -85.288% -85.230%] (p = 0.00 < 0.05)
                        thrpt:  [+577.05% +579.73% +582.63%]
                        Performance has improved.

BitPacker4x/compress-1  time:   [68.262 ns 68.459 ns 68.628 ns]                                  
                        thrpt:  [18.651 Gelem/s 18.697 Gelem/s 18.751 Gelem/s]
                 change:
                        time:   [-10.788% -10.473% -10.168%] (p = 0.00 < 0.05)
                        thrpt:  [+11.319% +11.698% +12.092%]
                        Performance has improved.

BitPacker4x/compress-delta-1                                                                           
                        time:   [104.93 ns 105.08 ns 105.32 ns]
                        thrpt:  [12.153 Gelem/s 12.182 Gelem/s 12.199 Gelem/s]
                 change:
                        time:   [-0.5067% -0.0505% +0.4030%] (p = 0.92 > 0.05)
                        thrpt:  [-0.4014% +0.0505% +0.5092%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

BitPacker4x/compress-strict-delta-1                                                                           
                        time:   [133.65 ns 133.86 ns 134.14 ns]
                        thrpt:  [9.5420 Gelem/s 9.5623 Gelem/s 9.5774 Gelem/s]
                 change:
                        time:   [-0.4209% -0.1702% +0.0917%] (p = 0.45 > 0.05)
                        thrpt:  [-0.0916% +0.1705% +0.4226%]
                        No change in performance detected.
Found 3 outliers among 10 measurements (30.00%)
  1 (10.00%) low mild
  2 (20.00%) high mild

BitPacker4x/decompress-2                                                                            
                        time:   [48.707 ns 48.749 ns 48.806 ns]
                        thrpt:  [26.226 Gelem/s 26.257 Gelem/s 26.280 Gelem/s]
                 change:
                        time:   [-0.0381% +0.0927% +0.2267%] (p = 0.20 > 0.05)
                        thrpt:  [-0.2261% -0.0926% +0.0381%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

BitPacker4x/decompress-delta-2                                                                           
                        time:   [167.96 ns 168.28 ns 168.86 ns]
                        thrpt:  [7.5804 Gelem/s 7.6065 Gelem/s 7.6209 Gelem/s]
                 change:
                        time:   [-87.153% -87.106% -87.062%] (p = 0.00 < 0.05)
                        thrpt:  [+672.92% +675.56% +678.37%]
                        Performance has improved.

BitPacker4x/decompress-strict-delta-2                                                                           
                        time:   [184.04 ns 184.28 ns 184.46 ns]
                        thrpt:  [6.9390 Gelem/s 6.9461 Gelem/s 6.9548 Gelem/s]
                 change:
                        time:   [-85.176% -85.145% -85.113%] (p = 0.00 < 0.05)
                        thrpt:  [+571.75% +573.15% +574.57%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

BitPacker4x/compress-2  time:   [67.967 ns 68.124 ns 68.353 ns]                                  
                        thrpt:  [18.726 Gelem/s 18.789 Gelem/s 18.833 Gelem/s]
                 change:
                        time:   [-10.102% -9.7868% -9.4293%] (p = 0.00 < 0.05)
                        thrpt:  [+10.411% +10.848% +11.237%]
                        Performance has improved.

BitPacker4x/compress-delta-2                                                                           
                        time:   [105.41 ns 105.59 ns 105.86 ns]
                        thrpt:  [12.091 Gelem/s 12.122 Gelem/s 12.143 Gelem/s]
                 change:
                        time:   [-0.2046% +0.0443% +0.3099%] (p = 0.65 > 0.05)
                        thrpt:  [-0.3089% -0.0443% +0.2050%]
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high severe

BitPacker4x/compress-strict-delta-2                                                                           
                        time:   [131.83 ns 131.97 ns 132.15 ns]
                        thrpt:  [9.6862 Gelem/s 9.6989 Gelem/s 9.7092 Gelem/s]
                 change:
                        time:   [-0.4409% -0.1669% +0.0865%] (p = 0.49 > 0.05)
                        thrpt:  [-0.0864% +0.1672% +0.4428%]
                        No change in performance detected.

BitPacker4x/decompress-24                                                                            
                        time:   [48.628 ns 48.698 ns 48.775 ns]
                        thrpt:  [26.243 Gelem/s 26.285 Gelem/s 26.322 Gelem/s]
                 change:
                        time:   [+0.1369% +0.3018% +0.4655%] (p = 0.00 < 0.05)
                        thrpt:  [-0.4633% -0.3009% -0.1367%]
                        Change within noise threshold.

BitPacker4x/decompress-delta-24                                                                           
                        time:   [160.63 ns 161.15 ns 161.64 ns]
                        thrpt:  [7.9189 Gelem/s 7.9430 Gelem/s 7.9685 Gelem/s]
                 change:
                        time:   [-84.301% -84.228% -84.151%] (p = 0.00 < 0.05)
                        thrpt:  [+530.96% +534.03% +536.99%]
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

BitPacker4x/decompress-strict-delta-24                                                                           
                        time:   [199.51 ns 200.14 ns 200.58 ns]
                        thrpt:  [6.3814 Gelem/s 6.3954 Gelem/s 6.4157 Gelem/s]
                 change:
                        time:   [-80.709% -80.637% -80.558%] (p = 0.00 < 0.05)
                        thrpt:  [+414.35% +416.46% +418.37%]
                        Performance has improved.

BitPacker4x/compress-24 time:   [73.814 ns 74.021 ns 74.355 ns]                                   
                        thrpt:  [17.215 Gelem/s 17.292 Gelem/s 17.341 Gelem/s]
                 change:
                        time:   [-6.1920% -5.6643% -5.0146%] (p = 0.00 < 0.05)
                        thrpt:  [+5.2793% +6.0044% +6.6007%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

BitPacker4x/compress-delta-24                                                                           
                        time:   [91.694 ns 92.188 ns 92.491 ns]
                        thrpt:  [13.839 Gelem/s 13.885 Gelem/s 13.960 Gelem/s]
                 change:
                        time:   [-10.684% -10.098% -9.5526%] (p = 0.00 < 0.05)
                        thrpt:  [+10.561% +11.233% +11.963%]
                        Performance has improved.

BitPacker4x/compress-strict-delta-24                                                                           
                        time:   [101.01 ns 101.24 ns 101.49 ns]
                        thrpt:  [12.612 Gelem/s 12.644 Gelem/s 12.672 Gelem/s]
                 change:
                        time:   [-5.4486% -5.1514% -4.8557%] (p = 0.00 < 0.05)
                        thrpt:  [+5.1035% +5.4311% +5.7626%]
                        Performance has improved.

BitPacker4x/decompress-31                                                                           
                        time:   [58.273 ns 58.643 ns 59.041 ns]
                        thrpt:  [21.680 Gelem/s 21.827 Gelem/s 21.966 Gelem/s]
                 change:
                        time:   [-0.1638% +0.1288% +0.5314%] (p = 0.45 > 0.05)
                        thrpt:  [-0.5286% -0.1286% +0.1641%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

BitPacker4x/decompress-delta-31                                                                           
                        time:   [179.40 ns 179.64 ns 180.12 ns]
                        thrpt:  [7.1064 Gelem/s 7.1255 Gelem/s 7.1347 Gelem/s]
                 change:
                        time:   [-85.798% -85.679% -85.568%] (p = 0.00 < 0.05)
                        thrpt:  [+592.93% +598.28% +604.14%]
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

BitPacker4x/decompress-strict-delta-31                                                                           
                        time:   [198.67 ns 199.15 ns 199.58 ns]
                        thrpt:  [6.4134 Gelem/s 6.4274 Gelem/s 6.4429 Gelem/s]
                 change:
                        time:   [-84.239% -84.056% -83.893%] (p = 0.00 < 0.05)
                        thrpt:  [+520.86% +527.20% +534.47%]
                        Performance has improved.

BitPacker4x/compress-31 time:   [82.631 ns 82.806 ns 82.943 ns]                                   
                        thrpt:  [15.432 Gelem/s 15.458 Gelem/s 15.491 Gelem/s]
                 change:
                        time:   [-14.486% -14.192% -13.886%] (p = 0.00 < 0.05)
                        thrpt:  [+16.125% +16.539% +16.940%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

BitPacker4x/compress-delta-31                                                                           
                        time:   [84.768 ns 85.435 ns 86.181 ns]
                        thrpt:  [14.852 Gelem/s 14.982 Gelem/s 15.100 Gelem/s]
                 change:
                        time:   [-14.516% -13.814% -13.231%] (p = 0.00 < 0.05)
                        thrpt:  [+15.249% +16.028% +16.982%]
                        Performance has improved.

BitPacker4x/compress-strict-delta-31                                                                           
                        time:   [93.315 ns 93.687 ns 93.978 ns]
                        thrpt:  [13.620 Gelem/s 13.662 Gelem/s 13.717 Gelem/s]
                 change:
                        time:   [-9.5296% -8.8432% -8.1792%] (p = 0.00 < 0.05)
                        thrpt:  [+8.9078% +9.7011% +10.533%]
                        Performance has improved.

Use NEON registers (uint32x4_t) instead of scalar [u32;4] arrays.
faster benches

Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>
Comment on lines +181 to +185
#[inline]
unsafe fn op_or(left: DataType, right: DataType) -> DataType {
    // Bitwise OR of two vectors
    vorrq_u32(left, right)
}

There are places where we'd write `use std::arch::aarch64::vorrq_u32 as op_or` instead of a wrapper function. I'm ambivalent on whether that's better or not.
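
For comparison, the two styles could look roughly like this. This is a sketch under assumptions rather than code from the PR: `op_or_alias` and `or_both_styles` are hypothetical names introduced here so both variants can coexist in one file, and the non-aarch64 branch exists only so the example builds on other targets.

```rust
#[cfg(target_arch = "aarch64")]
type DataType = std::arch::aarch64::uint32x4_t;

// Style A: a thin wrapper, as in the diff under review. It gives a place
// for a comment and keeps a uniform `unsafe fn` signature.
#[cfg(target_arch = "aarch64")]
#[inline]
unsafe fn op_or(left: DataType, right: DataType) -> DataType {
    // Bitwise OR of two vectors
    std::arch::aarch64::vorrq_u32(left, right)
}

/// Hypothetical helper: ORs two 4-lane values with both styles and returns
/// both results so they can be compared.
#[cfg(target_arch = "aarch64")]
fn or_both_styles(a: [u32; 4], b: [u32; 4]) -> ([u32; 4], [u32; 4]) {
    use std::arch::aarch64::{vld1q_u32, vst1q_u32};
    // Style B: import the intrinsic under the generic name, no wrapper.
    use std::arch::aarch64::vorrq_u32 as op_or_alias;

    let (mut wrapped, mut aliased) = ([0u32; 4], [0u32; 4]);
    // SAFETY: NEON is a baseline feature on aarch64 targets, and every
    // pointer refers to exactly four u32 values.
    unsafe {
        let (va, vb) = (vld1q_u32(a.as_ptr()), vld1q_u32(b.as_ptr()));
        vst1q_u32(wrapped.as_mut_ptr(), op_or(va, vb));
        vst1q_u32(aliased.as_mut_ptr(), op_or_alias(va, vb));
    }
    (wrapped, aliased)
}

/// Scalar fallback so the sketch also builds on non-aarch64 targets.
#[cfg(not(target_arch = "aarch64"))]
fn or_both_styles(a: [u32; 4], b: [u32; 4]) -> ([u32; 4], [u32; 4]) {
    let r = [a[0] | b[0], a[1] | b[1], a[2] | b[2], a[3] | b[3]];
    (r, r)
}
```

Both styles call the same intrinsic, so the choice is mostly about readability. The alias is shorter, while the wrapper documents the operation at the definition site.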

@fulmicoton-dd

awesome

@fulmicoton-dd merged commit 7e8a009 into quickwit-oss:master on Jan 8, 2026
2 checks passed