perf(metal): vectorize f16→f32 embedding lookup (eliminate scalar bit-manipulation loop) #180

@ohdearquant

Problem

forward_step_inner (metal_qwen35.rs:6014-6019) performs embedding lookup via a scalar loop:

```rust
for i in 0..hidden {
    *dst.add(i) = f16_to_f32(*src.add(i));
}
```

The f16_to_f32 function (line 3418) is hand-written IEEE-754 bit manipulation — branch-heavy, 10+ operations per element. For hidden=1024, this executes 1024 iterations per decode token.
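
For reference, a conversion of this style typically looks like the sketch below. This is an illustrative reconstruction, not the code at line 3418: the name `f16_to_f32_scalar` is hypothetical, and the input is assumed to be the raw 16 bits of the half value. It shows where the branches and per-element operations come from.

```rust
/// Hypothetical sketch of a bit-manipulation f16 -> f32 conversion.
/// `bits` is assumed to hold the raw IEEE-754 binary16 encoding.
fn f16_to_f32_scalar(bits: u16) -> f32 {
    let sign = ((bits as u32) & 0x8000) << 16; // sign moves from bit 15 to bit 31
    let exp = ((bits >> 10) & 0x1f) as u32; // 5-bit biased exponent
    let frac = (bits & 0x03ff) as u32; // 10-bit fraction

    let out = match (exp, frac) {
        // Signed zero.
        (0, 0) => sign,
        // Subnormal half: renormalize so the leading 1 becomes implicit.
        (0, _) => {
            let shift = frac.leading_zeros() - 21; // moves the top set bit to bit 10
            let frac = (frac << shift) & 0x03ff; // drop the now-implicit leading 1
            let exp = 113 - shift; // 127 - 15 + 1, minus the normalization shift
            sign | (exp << 23) | (frac << 13)
        }
        // Infinity or NaN: all-ones exponent, fraction carried over.
        (0x1f, _) => sign | 0x7f80_0000 | (frac << 13),
        // Normal number: rebias the exponent from 15 to 127.
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(out)
}
```

Every element pays for the field extraction plus a data-dependent branch, which is the work a hardware conversion folds into one instruction.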

Why

On aarch64, hardware f16↔f32 conversion is a single instruction (FCVT). NEON provides vcvt_f32_f16, which converts 4 half→4 float in one instruction (FCVTL). The scalar bit-manipulation approach is estimated to be ~10× slower than the hardware path.

The same function is used at 4 call sites in total (lines 3477, 4844, 5529, 6018); line 6018 is the embedding lookup above.

Options (in priority order)

  1. GPU kernel (best): Dispatch a trivial Metal kernel that reads embed_tokens[token_id * hidden .. token_id * hidden + hidden] as half and writes to the hidden buffer as float. Adds 1 dispatch but eliminates all CPU work. Especially beneficial once the embedding lives in Private memory (#179), which the CPU cannot read directly.

  2. NEON vectorized (good): Replace the loop with vcvt_f32_f16 intrinsics processing 4 elements per iteration (256 NEON iterations instead of 1024 scalar).

  3. Use f16::to_f32() from the half crate (minimal): Replaces the hand-written conversion with a maintained one and at least gives the compiler a chance to auto-vectorize, at the cost of a new dependency.
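
A rough sketch of what option 2 could look like, under several assumptions: the embedding table is visible to the CPU as raw half bits (`&[u16]`), and all names (`embed_lookup`, `convert4`, `f16_bits_to_f32_portable`) are hypothetical. As far as I know, the `vcvt_f32_f16` intrinsic is not available on stable Rust, so the aarch64 path issues `fcvtl` through inline assembly instead. Other targets, and any tail elements, use a portable multiply-based conversion, which is a swapped-in technique and not the existing f16_to_f32.

```rust
/// Portable fallback: place the half exponent/fraction in f32 position, then
/// rescale by 2^112. Assumes denormals are not flushed to zero.
fn f16_bits_to_f32_portable(h: u16) -> f32 {
    let magic = f32::from_bits((254 - 15) << 23); // 2^112
    let inf_nan_threshold = f32::from_bits((127 + 16) << 23); // 65536.0
    let sign = ((h as u32) & 0x8000) << 16;
    let mut v = f32::from_bits(((h as u32) & 0x7fff) << 13) * magic;
    if v >= inf_nan_threshold {
        // Half exponent was all ones: force the f32 exponent to all ones too.
        v = f32::from_bits(v.to_bits() | (255 << 23));
    }
    f32::from_bits(v.to_bits() | sign)
}

/// aarch64: convert 4 halves with one FCVTL instruction.
#[cfg(target_arch = "aarch64")]
fn convert4(src: [u16; 4]) -> [f32; 4] {
    use core::arch::aarch64::{float32x4_t, uint16x4_t, vld1_u16, vst1q_f32};
    use core::arch::asm;
    let mut out = [0.0f32; 4];
    unsafe {
        let v: uint16x4_t = vld1_u16(src.as_ptr());
        let r: float32x4_t;
        asm!(
            "fcvtl {r:v}.4s, {v:v}.4h",
            r = out(vreg) r,
            v = in(vreg) v,
            options(pure, nomem, nostack)
        );
        vst1q_f32(out.as_mut_ptr(), r);
    }
    out
}

/// Other targets: same signature, scalar conversion per lane.
#[cfg(not(target_arch = "aarch64"))]
fn convert4(src: [u16; 4]) -> [f32; 4] {
    src.map(f16_bits_to_f32_portable)
}

/// Copies row `token_id` of a row-major [vocab, hidden] f16 table into `dst` as f32.
fn embed_lookup(embed_tokens: &[u16], token_id: usize, hidden: usize, dst: &mut [f32]) {
    assert_eq!(dst.len(), hidden);
    let row = &embed_tokens[token_id * hidden..(token_id + 1) * hidden];
    let mut src_chunks = row.chunks_exact(4);
    let mut dst_chunks = dst.chunks_exact_mut(4);
    // Main loop: 4 elements per iteration (256 iterations for hidden=1024).
    for (s, d) in (&mut src_chunks).zip(&mut dst_chunks) {
        d.copy_from_slice(&convert4([s[0], s[1], s[2], s[3]]));
    }
    // Tail: at most 3 leftover elements when hidden is not a multiple of 4.
    for (s, d) in src_chunks.remainder().iter().zip(dst_chunks.into_remainder()) {
        *d = f16_bits_to_f32_portable(*s);
    }
}
```

The slice-based signature is a design choice for the sketch: it keeps bounds checks at the row level and leaves the inner loop free of raw pointer arithmetic. The real call site works on raw pointers, so an actual patch would need adapting.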

Impact

Small but real. The cost currently lands in the "other_us" bucket of the profiler. At ~10ns per scalar conversion × 1024 = ~10µs/token. Not the bottleneck, but it is cheap performance and cleaner code.

Related: #179 (StorageModePrivate), #152 (decompose metal_qwen35.rs)

Labels: enhancement, lattice-inference
