The current implementations produce large binaries because each bit width has its own specialized implementation and the loops are unrolled.
Add a more compact scalar implementation that is enabled by a flag. This would be useful where binary size matters, for instance when targeting WebAssembly.
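
To illustrate the trade-off, here is a minimal sketch of what a compact scalar path could look like. The bit width is a runtime parameter, so one loop covers every width and there is no per-width code. The function names `pack_generic` and `unpack_generic`, the choice of Rust, and the LSB-first byte layout are all assumptions made for illustration. The actual implementation would have to reproduce the existing on-disk format.

```rust
/// Hypothetical compact packer: one loop for every bit width.
/// Each value is truncated to its low `num_bits` bits and appended
/// LSB-first to `out`. The layout is an assumption for this sketch.
pub fn pack_generic(values: &[u32], num_bits: u8, out: &mut Vec<u8>) {
    assert!(num_bits <= 32);
    let mask: u64 = (1u64 << num_bits) - 1;
    let mut acc: u64 = 0; // bit accumulator
    let mut filled: u32 = 0; // number of valid bits currently in `acc`
    for &v in values {
        acc |= ((v as u64) & mask) << filled;
        filled += num_bits as u32;
        // Flush every complete byte.
        while filled >= 8 {
            out.push(acc as u8);
            acc >>= 8;
            filled -= 8;
        }
    }
    // Flush the trailing partial byte, if any.
    if filled > 0 {
        out.push(acc as u8);
    }
}

/// Hypothetical inverse of `pack_generic`: reads `count` values of
/// `num_bits` bits each from `packed`.
pub fn unpack_generic(packed: &[u8], num_bits: u8, count: usize) -> Vec<u32> {
    assert!(num_bits <= 32);
    let mask: u64 = (1u64 << num_bits) - 1;
    let mut out = Vec::with_capacity(count);
    let mut acc: u64 = 0;
    let mut filled: u32 = 0;
    let mut bytes = packed.iter();
    for _ in 0..count {
        // Refill until at least one full value is available.
        while filled < num_bits as u32 {
            let b = *bytes.next().expect("packed input too short");
            acc |= (b as u64) << filled;
            filled += 8;
        }
        out.push((acc & mask) as u32);
        acc >>= num_bits;
        filled -= num_bits as u32;
    }
    out
}
```

Because the width is not known at compile time, the compiler cannot unroll or specialize these loops per width. This is what keeps the code small, and it is also why this path would likely be slower than the specialized one.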