fix(build): only enable SIMDE_BACKEND for non-x86 architectures #254

Merged
darvid merged 6 commits into main from fix/simde-backend-x86-perf-253 on Feb 11, 2026
darvid commented Feb 11, 2026

Summary

Fixes #253: a performance regression in v0.8.0 (and v0.7.23+) caused by unconditionally enabling SIMDE_BACKEND=ON for all vectorscan builds.

  • SIMDE_BACKEND=ON replaces vectorscan's native x86 CPU detection with a stub that reports zero CPU features, disabling all SSE4.2/AVX2/AVX512 code paths and capping performance at SSE2 level
  • This caused a ~2.5-13x throughput regression depending on workload complexity
  • Now only enables SIMDE_BACKEND on ARM and other non-x86 architectures where vectorscan genuinely needs the SIMD emulation layer
  • x86-64 builds use the native backend with runtime CPU feature detection, restoring full performance
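The architecture gate described in the bullets above can be sketched in a few lines. This is only an illustration in Python: the PR's actual check lives in the CMake build, and the `X86_ARCHS` set and the `use_simde_backend` helper are hypothetical names chosen here, not identifiers from the repository.

```python
import platform

# Assumption: these spellings cover the x86 family as reported by
# platform.machine() and by CMake; the real build may match differently.
X86_ARCHS = {"x86_64", "amd64", "x86", "i386", "i686"}


def use_simde_backend(arch: str) -> bool:
    """Return True when the SIMD emulation layer should be enabled.

    x86 targets keep the native backend (runtime CPU feature detection);
    everything else (ARM and other architectures) falls back to SIMDe.
    """
    return arch.strip().lower() not in X86_ARCHS


if __name__ == "__main__":
    host = platform.machine()
    setting = "ON" if use_simde_backend(host) else "OFF"
    print(f"{host}: -DSIMDE_BACKEND={setting}")
```

Normalizing case before the lookup matters because Windows reports `AMD64` while Linux and macOS report `x86_64`.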

Benchmark Results (50 patterns, 500KB documents, Ryzen 7 5800X)

| Build | Avg Time/Scan | Throughput |
|---|---|---|
| Before (SIMDE_BACKEND=ON) | 6.7 ms | 70.8 MB/s |
| After (SIMDE_BACKEND=OFF on x86) | 2.6 ms | 182.2 MB/s |
| Reporter's v0.7.19 baseline | 3.2 ms | 154.3 MB/s |
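
For this particular workload, the speedup implied by the benchmark figures can be checked with a little arithmetic. The numbers below are copied from the results above; the variable names are just for illustration.

```python
# Throughput figures (MB/s) copied from the benchmark results above.
before_mbps = 70.8     # SIMDE_BACKEND=ON
after_mbps = 182.2     # SIMDE_BACKEND=OFF on x86
baseline_mbps = 154.3  # reporter's v0.7.19 baseline

speedup = after_mbps / before_mbps
vs_baseline = after_mbps / baseline_mbps

print(f"speedup over the SIMDE build: {speedup:.2f}x")        # 2.57x
print(f"relative to the v0.7.19 baseline: {vs_baseline:.2f}x")  # 1.18x
```

The 2.57x figure sits at the low end of the ~2.5-13x range quoted in the summary, which is consistent with the note that the regression depends on workload complexity.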

Root Cause

Commit 8df0fcd (v0.7.23) added -DSIMDE_BACKEND=ON to maximize wheel compatibility across CPU variants. However, SIMDE_BACKEND on x86-64:

  1. Replaces src/util/arch/x86/cpuid_flags.c with a SIMDE stub returning 0 (no features)
  2. Disables all higher ISA dispatch (AVX2, AVX512, SSE4.2 string instructions)
  3. Disables __builtin_constant_p() optimizations in supervector operations
  4. Forces HS_TUNE_FAMILY_GENERIC instead of CPU-specific tuning
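
To see why a feature stub that returns 0 disables every optimized path, consider a toy model of runtime dispatch. This is a simplified sketch, not vectorscan's code: the `CpuFeature` bit layout, the function names, and the "baseline" label are all hypothetical.

```python
from enum import IntFlag


class CpuFeature(IntFlag):
    # Hypothetical bit layout, for illustration only.
    NONE = 0
    SSE42 = 1
    AVX2 = 2
    AVX512 = 4


# Candidate code paths, ordered from most to least capable.
DISPATCH_ORDER = [
    (CpuFeature.AVX512, "avx512"),
    (CpuFeature.AVX2, "avx2"),
    (CpuFeature.SSE42, "sse4.2"),
]


def select_code_path(features: CpuFeature) -> str:
    """Pick the most capable path whose feature bit is set."""
    for flag, name in DISPATCH_ORDER:
        if features & flag:
            return name
    return "baseline"  # nothing detected: lowest common denominator


def native_cpuid_flags() -> CpuFeature:
    # Stand-in for real CPUID detection; pretends the CPU supports AVX2.
    return CpuFeature.SSE42 | CpuFeature.AVX2


def simde_stub_cpuid_flags() -> CpuFeature:
    # Models the stub described above: it always reports no features.
    return CpuFeature.NONE


if __name__ == "__main__":
    print(select_code_path(native_cpuid_flags()))      # avx2
    print(select_code_path(simde_stub_cpuid_flags()))  # baseline
```

With the stub in place, the dispatcher never sees a set bit, so it falls through to the baseline path no matter what the hardware supports.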

The "compatibility" benefit is negligible on x86-64 since SSE4.2 (vectorscan's minimum requirement) has been available since Intel Nehalem (2008).

Test plan

  • All 32 existing tests pass
  • Benchmark confirms throughput restored to v0.7.19 levels
  • CI passes on all platforms (x86-64 Linux, macOS, ARM)
  • Verify ARM wheels still build correctly with SIMDE_BACKEND=ON

- SIMDE_BACKEND was unconditionally enabled for all vectorscan builds,
  which disables native x86 CPU feature detection and caps performance
  at SSE2 level
- on x86-64, this caused a ~2.5-13x throughput regression vs v0.7.21
  because vectorscan's runtime dispatch to SSE4.2/AVX2/AVX512 code
  paths was completely bypassed
- now only enables SIMDE_BACKEND on ARM and other non-x86 architectures
  where vectorscan genuinely needs the SIMD emulation layer
- add benchmark script for reproducing and validating the regression
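
The benchmark script itself is not shown on this page. As a rough idea of how average time per scan and throughput might be measured, here is a minimal harness. It is an assumption-laden sketch: the `benchmark` helper is hypothetical, and the workload uses the standard-library `re` module as a placeholder rather than hyperscan, so its numbers are not comparable to the table in the PR description.

```python
import re
import time
from typing import Callable


def benchmark(
    scan: Callable[[bytes], object],
    document: bytes,
    iterations: int = 20,
) -> tuple[float, float]:
    """Return (average milliseconds per scan, throughput in MB/s)."""
    start = time.perf_counter()
    for _ in range(iterations):
        scan(document)
    elapsed = time.perf_counter() - start
    avg_ms = elapsed / iterations * 1000
    mb_per_s = len(document) * iterations / elapsed / 1e6
    return avg_ms, mb_per_s


if __name__ == "__main__":
    # Placeholder workload: stdlib regex over a roughly 500 KB document.
    pattern = re.compile(rb"foo\d+|bar[a-z]{3}")
    document = b"lorem ipsum foo123 barxyz " * 20_000
    avg_ms, mb_per_s = benchmark(pattern.findall, document)
    print(f"{avg_ms:.2f} ms/scan, {mb_per_s:.1f} MB/s")
```

Running the same harness against a build with SIMDE_BACKEND on and one with it off, with the placeholder swapped for a real scan call, would be one way to reproduce the comparison.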
- GitHub deprecated macos-13 (Intel) runners
- macOS x86_64 wheels are now cross-compiled on ARM runners via
  Rosetta 2, which cibuildwheel handles natively

- vectorscan 5.4.12 uses -march=x86-64-v2 in cflags-x86.cmake and
  archdetect.cmake, but GCC <11 (manylinux2014 devtoolset) does not
  recognize this value
- patch source at build time to use -march=nehalem which provides the
  same SSE4.2 baseline and is supported by all GCC versions
- only applied when using native x86 backend (not SIMDE_BACKEND)
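
The source patch amounts to a token replacement in the two CMake files. A minimal Python equivalent is sketched below; it is a stand-in for whatever the build actually runs, and the `patch_march` and `patch_file` helpers, as well as the sample CMake line in the usage example, are hypothetical.

```python
import re
from pathlib import Path

# Match the flag as a whole token so that, for example, x86-64-v3 is untouched.
MARCH_V2 = re.compile(r"-march=x86-64-v2\b")


def patch_march(text: str) -> str:
    """Replace -march=x86-64-v2 with -march=nehalem in the given text."""
    return MARCH_V2.sub("-march=nehalem", text)


def patch_file(path: Path) -> bool:
    """Patch a file in place; return True if anything changed."""
    original = path.read_text()
    patched = patch_march(original)
    if patched != original:
        path.write_text(patched)
    return patched != original


if __name__ == "__main__":
    # Hypothetical line, not copied from vectorscan's sources.
    sample = 'set(ARCH_C_FLAGS "-march=x86-64-v2 -mtune=generic")'
    print(patch_march(sample))
```

Returning a changed/unchanged flag makes it easy for a build step to fail loudly if the expected flag is no longer present in a future vectorscan release.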
- use CMAKE_OSX_ARCHITECTURES (target arch) instead of
  CMAKE_SYSTEM_PROCESSOR (host arch) for SIMDE_BACKEND decision
  on macOS, so cross-compiling x86_64 on ARM correctly disables
  SIMDE and builds native x86 vectorscan
- forward CMAKE_OSX_ARCHITECTURES to ExternalProject_Add so
  vectorscan builds for the correct target architecture
- handle BSD sed -i syntax difference on macOS for the
  x86-64-v2 → nehalem patch
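
The host-versus-target distinction can be modeled as follows. This is an illustrative Python sketch of the decision, not the CMake logic from the PR; `target_arch` and `simde_needed` are hypothetical helpers, and taking only the first entry of a multi-architecture list is a simplification made here.

```python
# Assumption: spellings of x86 targets that should use the native backend.
X86_ARCHS = {"x86_64", "amd64", "x86", "i386", "i686"}


def target_arch(osx_architectures: str, system_processor: str) -> str:
    """Prefer the explicit macOS target arch over the host processor.

    CMAKE_OSX_ARCHITECTURES can hold a semicolon-separated list; this
    sketch only looks at the first entry and ignores universal builds.
    """
    archs = [a for a in osx_architectures.split(";") if a]
    return archs[0] if archs else system_processor


def simde_needed(osx_architectures: str, system_processor: str) -> bool:
    """Decide SIMDE_BACKEND from the target, not the build host."""
    return target_arch(osx_architectures, system_processor).lower() not in X86_ARCHS


if __name__ == "__main__":
    # Cross-compiling x86_64 on an ARM runner: native x86 backend.
    print(simde_needed("x86_64", "arm64"))   # False
    # No explicit target on a Linux ARM host: SIMDe backend.
    print(simde_needed("", "aarch64"))       # True
```

Keying the decision on the host processor alone would return True in the first case and build the emulated backend into an x86_64 wheel.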
- CMake's list handling drops empty string in sed -i "" causing
  BSD sed to fail with "rename(): No such file or directory"
- perl -pi -e works identically on Linux and macOS

- uv 0.10.2 leaks host Python 3.12 stdlib into cibuildwheel
  venvs on Windows, causing SRE module mismatch and import
  errors for non-3.12 Python targets
@darvid darvid merged commit 5bc8cbe into main Feb 11, 2026
60 checks passed
Linked issue: #253 (Performance regression in hyperscan 0.8.0 vs 0.7.x)