Choose fastest bit reversal method at runtime #60
Conversation
…ecting the fastest one
…rsions, same as Dif one
…ents and include it in the planner
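The core idea is to time each available bit-reversal routine for the requested size when a plan is built and keep the fastest one. A minimal sketch of that plan-time selection follows; the names and signatures here are hypothetical stand-ins, not the crate's actual API.

```rust
use std::time::{Duration, Instant};

/// Hypothetical signature for a bit-reversal permutation over interleaved
/// complex values; the crate's real routines may look different.
type BitRevFn = fn(&mut [(f64, f64)]);

/// Plain textbook bit reversal, used here as one stand-in candidate.
fn bit_rev_plain(buf: &mut [(f64, f64)]) {
    let n = buf.len();
    if n <= 1 {
        return;
    }
    let shift = usize::BITS - n.trailing_zeros();
    for i in 0..n {
        let j = i.reverse_bits() >> shift;
        if i < j {
            buf.swap(i, j);
        }
    }
}

/// Stand-in for a cache-friendly COBRA-style routine; it reuses the plain
/// implementation so the sketch stays short and correct.
fn bit_rev_cobra(buf: &mut [(f64, f64)]) {
    bit_rev_plain(buf);
}

/// Run every candidate once on a scratch buffer of the planned size and
/// return the fastest. A planner would do this once per size and store the
/// chosen function pointer in the plan.
fn choose_bit_reversal(n: usize) -> BitRevFn {
    let candidates: [(&str, BitRevFn); 2] =
        [("plain", bit_rev_plain), ("cobra", bit_rev_cobra)];
    let mut scratch = vec![(0.0_f64, 0.0_f64); n];

    let mut best: BitRevFn = candidates[0].1;
    let mut best_time: Option<Duration> = None;
    for (name, f) in candidates {
        let start = Instant::now();
        f(&mut scratch);
        let elapsed = start.elapsed();
        println!("{name}: {elapsed:?}");
        if best_time.map_or(true, |t| elapsed < t) {
            best_time = Some(elapsed);
            best = f;
        }
    }
    best
}

fn main() {
    // Plan once, then reuse the selected routine for every transform of
    // this size.
    let bit_rev = choose_bit_reversal(1 << 16);
    let mut data = vec![(1.0_f64, 0.0_f64); 1 << 16];
    bit_rev(&mut data);
}
```

A real planner would likely run each candidate several times and take the minimum to reduce measurement noise, and would cache the choice per transform size.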
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main      #60      +/-   ##
==========================================
- Coverage   99.82%   99.43%   -0.40%
==========================================
  Files          13       14       +1
  Lines        2261     2289      +28
==========================================
+ Hits         2257     2276      +19
- Misses          4       13       +9
```

☔ View full report in Codecov by Sentry.
I've re-run the benchmarks against the latest main on zen4 with the same versions of the compiler and all dependencies. The largest regression is 4%, while many smaller sizes improve by a lot, some by over 2x. I think this is good to go.

benchmarks on desktop zen4 vs main on commit 0f47ea1

There is more we can do here. I'd like to integrate the LUT variant of COBRA from #47. We could also add constraints on the applicable sizes for each transform, so that we wouldn't need to bundle all unrolled variants into one function, wouldn't have to measure COBRA for smaller sizes, etc.
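One possible shape for those per-candidate size constraints, sketched with hypothetical names and placeholder thresholds rather than anything taken from the crate:

```rust
use std::ops::RangeInclusive;

/// Hypothetical candidate descriptor pairing a bit-reversal routine with the
/// transform sizes it is worth measuring for, so the planner can skip e.g.
/// COBRA for small buffers and unrolled variants for large ones.
struct Candidate {
    name: &'static str,
    sizes: RangeInclusive<usize>,
    apply: fn(&mut [(f64, f64)]),
}

/// Plain bit reversal used as a stand-in body for every candidate.
fn bit_rev_plain(buf: &mut [(f64, f64)]) {
    let n = buf.len();
    if n <= 1 {
        return;
    }
    let shift = usize::BITS - n.trailing_zeros();
    for i in 0..n {
        let j = i.reverse_bits() >> shift;
        if i < j {
            buf.swap(i, j);
        }
    }
}

fn candidates() -> Vec<Candidate> {
    vec![
        // Unrolled variants: only worth considering up to some small size
        // (the cutoff here is a placeholder).
        Candidate { name: "unrolled", sizes: 2..=(1 << 10), apply: bit_rev_plain },
        // COBRA: only worth measuring once buffers exceed its block size
        // (again, a placeholder cutoff).
        Candidate { name: "cobra", sizes: (1 << 11)..=usize::MAX, apply: bit_rev_plain },
    ]
}

fn main() {
    let n: usize = 1 << 16;
    let mut data: Vec<(f64, f64)> = vec![(0.0, 0.0); n];
    // The planner only measures candidates whose size range contains n.
    for c in candidates().iter().filter(|c| c.sizes.contains(&n)) {
        (c.apply)(&mut data);
        println!("measured {}", c.name);
    }
}
```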
I've messed around with COBRA block sizes. On Zen4, switching the block size from 128 to 64 helps significantly. I'll add the LUT variant from #47 and add variants for different block sizes.
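One way to get separately measurable block-size variants is to make the block size a const generic parameter, so each value monomorphizes into its own candidate. A sketch under that assumption; the kernel body is a simplified stand-in, not a real COBRA implementation:

```rust
/// Hypothetical function-pointer type for an f32 bit-reversal routine.
type BitRevF32 = fn(&mut [(f32, f32)]);

/// Hypothetical COBRA-style kernel parameterized by block size in elements.
/// The body is a simplified stand-in (plain bit reversal); the point is that
/// each BLOCK value becomes a distinct function the planner can time on its
/// own, so 64 vs 128 can be compared directly at plan time.
fn cobra_bit_rev<const BLOCK: usize>(buf: &mut [(f32, f32)]) {
    debug_assert!(BLOCK.is_power_of_two());
    let n = buf.len();
    if n <= 1 {
        return;
    }
    let shift = usize::BITS - n.trailing_zeros();
    for i in 0..n {
        let j = i.reverse_bits() >> shift;
        if i < j {
            buf.swap(i, j);
        }
    }
}

/// Block-size variants the planner would register and benchmark like any
/// other candidate.
fn block_size_variants() -> Vec<(usize, BitRevF32)> {
    vec![
        (64, cobra_bit_rev::<64> as BitRevF32),
        (128, cobra_bit_rev::<128> as BitRevF32),
    ]
}

fn main() {
    let mut data = vec![(0.0_f32, 0.0_f32); 1 << 15];
    for (block, f) in block_size_variants() {
        f(&mut data);
        println!("ran variant with block size {block}");
    }
}
```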
Closing in favor of #62 - CO-BRAVO is fastest for all benchmarked sizes.
Helps x86 a lot by switching away from unrolled impls. Fixes #49
Also papers over whatever memory subsystem quirk we're hitting at f32/8388608 and f64/4194304, possibly something to do with cache associativity.
preliminary benchmarks from zen4
On Apple M4 this is a big improvement for f32/32768, which I assume is also due to some implementations hitting cache associativity issues or a similar hardware quirk. There is no equivalent change for f64/16384, oddly enough. There are also +7% gains for all f64 sizes up to and including 1024, with no change on larger sizes.
TODO:
cobra_apply (#47)