Problem
forward_step_inner (metal_qwen35.rs:6014-6019) performs embedding lookup via a scalar loop:
for i in 0..hidden { *dst.add(i) = f16_to_f32(*src.add(i)); }
The f16_to_f32 function (line 3418) is hand-written IEEE-754 bit manipulation — branch-heavy, 10+ operations per element. For hidden=1024, this executes 1024 iterations per decode token.
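For readers without the file open, here is a minimal sketch of what a hand-written binary16 to binary32 conversion typically looks like. It is not the code at line 3418; the name f16_bits_to_f32 and the u16 input type are assumptions made for illustration. It shows where the branches and per-element operations come from: zero, subnormal, infinity, NaN, and normal inputs each take a different path.

```rust
/// Hypothetical stand-in for a hand-written converter (not the code from
/// metal_qwen35.rs). Takes raw IEEE-754 binary16 bits and returns the f32 value.
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16; // move sign to bit 31
    let exp = ((h >> 10) & 0x1f) as u32; // 5-bit biased exponent
    let frac = (h & 0x03ff) as u32; // 10-bit fraction

    let bits = match (exp, frac) {
        // Signed zero.
        (0, 0) => sign,
        // Subnormal: value is frac * 2^-24. Shift the leading 1 up to
        // bit 10 (the implicit-one position) and adjust the exponent.
        (0, _) => {
            let shift = frac.leading_zeros() - 21; // 1..=10
            let mant = (frac << shift) & 0x03ff;
            sign | ((113 - shift) << 23) | (mant << 13)
        }
        // Infinity.
        (0x1f, 0) => sign | 0x7f80_0000,
        // NaN: keep the payload bits, widened to the f32 fraction field.
        (0x1f, _) => sign | 0x7f80_0000 | (frac << 13),
        // Normal: rebias the exponent from 15 to 127 (difference 112).
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits)
}
```

Every element pays for the field extraction plus a data-dependent branch, which is the work the hardware conversion does in one instruction.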
Why
On aarch64, hardware f16↔f32 conversion is a single instruction (FCVT). NEON provides vcvt_f32_f16, which converts 4 half→4 float in one instruction. The scalar bit-manipulation approach is ~10× slower than the hardware path.
The same function is used at 4 other call sites (lines 3477, 4844, 5529, 6018).
Options (in priority order)
1. GPU kernel (best): Dispatch a trivial Metal kernel that reads embed_tokens[token_id * hidden .. +hidden] as half and writes to the hidden buffer as float. Adds 1 dispatch but eliminates all CPU work. Especially beneficial when the embedding is already in Private memory (#179, perf(metal): migrate weight buffers to StorageModePrivate).
2. NEON vectorized (good): Replace the loop with vcvt_f32_f16 intrinsics processing 4 elements per iteration (256 NEON iterations instead of 1024 scalar).
3. Use f16::to_f32() from the half crate (minimal): At least gets compiler auto-vectorization, but adds a dependency.
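Whichever CPU option is picked, the row being read is embed_tokens[token_id * hidden .. +hidden]. The sketch below routes that lookup through one bounds-checked helper that takes the per-element converter as a parameter, so a NEON or half-crate implementation could be swapped in at a single place. The helper name embed_row_to_f32, the u16 storage type, and the closure parameter are assumptions for illustration, not the existing API in metal_qwen35.rs.

```rust
/// Hypothetical helper (not from the source): converts one embedding row
/// from raw f16 bits to f32. Returns None if the row or the output slice
/// is out of range instead of reading through raw pointers.
fn embed_row_to_f32(
    embed_tokens: &[u16],
    token_id: usize,
    hidden: usize,
    out: &mut [f32],
    convert: impl Fn(u16) -> f32,
) -> Option<()> {
    // Row bounds: [token_id * hidden, token_id * hidden + hidden).
    let start = token_id.checked_mul(hidden)?;
    let end = start.checked_add(hidden)?;
    let row = embed_tokens.get(start..end)?;
    let out = out.get_mut(..hidden)?;

    // Element-wise conversion; the converter is the only part that
    // changes between the scalar, NEON, and half-crate variants.
    for (dst, &src) in out.iter_mut().zip(row) {
        *dst = convert(src);
    }
    Some(())
}
```

With the slice bounds established up front, the loop body has no per-element bounds checks, which also gives the compiler a better chance to vectorize a simple converter.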
Impact
Small but real — currently in the "other_us" bucket of the profiler. At ~10ns per scalar conversion × 1024 = ~10µs/token. Not the bottleneck, but free performance and code cleanliness.
Related: #179 (StorageModePrivate), #152 (decompose metal_qwen35.rs)