perf(metal): vectorize f16→f32 embedding lookup (eliminate scalar bit-manipulation loop)

## Problem

`forward_step_inner` (metal_qwen35.rs:6014-6019) performs embedding lookup via a scalar loop:
```rust
for i in 0..hidden {
    *dst.add(i) = f16_to_f32(*src.add(i));
}
```

The `f16_to_f32` function (line 3418) is hand-written IEEE-754 bit manipulation: branch-heavy, with 10+ operations per element. For `hidden=1024`, the loop runs 1024 iterations per decoded token.
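For context, a scalar converter of this kind typically looks like the sketch below. The body of the real `f16_to_f32` at line 3418 is not reproduced in this issue, so `f16_bits_to_f32` is a hypothetical stand-in that illustrates the branches (zero, subnormal, Inf/NaN, normal) and the shift/mask work such a function does per element.

```rust
/// Hypothetical stand-in for the project's `f16_to_f32`: converts the raw
/// bits of an IEEE-754 binary16 value to f32 using integer operations only.
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16; // sign moves from bit 15 to bit 31
    let exp = ((h as u32) >> 10) & 0x1F;    // 5-bit biased exponent
    let frac = (h as u32) & 0x03FF;         // 10-bit fraction

    let bits = match (exp, frac) {
        // Signed zero.
        (0, 0) => sign,
        // Subnormal half: value = frac * 2^-24, renormalized for f32.
        (0, _) => {
            let p = 31 - frac.leading_zeros();           // index of top set bit (0..=9)
            let mant = (frac << (23 - p)) & 0x007F_FFFF; // drop the implicit leading 1
            sign | ((p + 103) << 23) | mant              // exponent: p - 24 + 127
        }
        // Infinity.
        (0x1F, 0) => sign | 0x7F80_0000,
        // NaN: keep the payload bits.
        (0x1F, _) => sign | 0x7F80_0000 | (frac << 13),
        // Normal: rebias the exponent (127 - 15 = 112), widen the fraction.
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits)
}
```

Every element goes through the `match`, which is the per-element branching cost that the hardware conversion avoids.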

## Why

On aarch64, f16↔f32 conversion is a single hardware instruction (`FCVT`). NEON provides `vcvt_f32_f16`, which converts 4 half values to 4 floats in one instruction (`FCVTL`). The scalar bit-manipulation approach is roughly 10× slower than the hardware path.

The same function is called at four sites (lines 3477, 4844, 5529, 6018); line 6018 falls inside the embedding loop shown above.

## Options (in priority order)

1. **GPU kernel** (best): Dispatch a trivial Metal kernel that reads embed_tokens[token_id * hidden .. +hidden] as `half` and writes to hidden buffer as `float`. Adds 1 dispatch but eliminates all CPU work. Especially beneficial when embedding is already in Private memory (#179).

2. **NEON vectorized** (good): Replace the loop with `vcvt_f32_f16` intrinsics processing 4 elements per iteration (256 NEON iterations instead of 1024 scalar).

3. **Use `f16::to_f32()` from the `half` crate** (minimal): May let the compiler auto-vectorize the loop or use the hardware conversion, but adds a dependency.
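Option 2 could be sketched roughly as follows. This is a minimal illustration under several assumptions, not the project's code: the names `f16_slice_to_f32`, `convert4`, and `f16_bits_to_f32_scalar` are hypothetical, the embedding row is treated as a `&[u16]` of raw half bits, and the aarch64 path issues `FCVTL` through inline assembly, since the `vcvt_f32_f16` intrinsic may not be available in `core::arch::aarch64` on a stable toolchain. Other targets fall back to the scalar path, which also handles a tail when the length is not a multiple of 4.

```rust
/// Scalar fallback (hypothetical): binary16 bits -> f32 via integer ops.
fn f16_bits_to_f32_scalar(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h as u32) >> 10) & 0x1F;
    let frac = (h as u32) & 0x03FF;
    let bits = match (exp, frac) {
        (0, 0) => sign,
        (0, _) => {
            let p = 31 - frac.leading_zeros(); // top set bit of the subnormal
            sign | ((p + 103) << 23) | ((frac << (23 - p)) & 0x007F_FFFF)
        }
        (0x1F, _) => sign | 0x7F80_0000 | (frac << 13), // Inf or NaN
        _ => sign | ((exp + 112) << 23) | (frac << 13), // normal
    };
    f32::from_bits(bits)
}

/// aarch64: convert exactly 4 halves with one FCVTL instruction.
#[cfg(target_arch = "aarch64")]
fn convert4(src: &[u16], dst: &mut [f32]) {
    use core::arch::aarch64::{float32x4_t, uint16x4_t, vld1_u16, vst1q_f32};
    use core::arch::asm;
    assert!(src.len() == 4 && dst.len() == 4);
    // SAFETY: both slices hold exactly 4 elements, so the 64-bit load and
    // the 128-bit store stay in bounds.
    unsafe {
        let v: uint16x4_t = vld1_u16(src.as_ptr());
        let r: float32x4_t;
        asm!(
            "fcvtl {0:v}.4s, {1:v}.4h",
            out(vreg) r,
            in(vreg) v,
            options(pure, nomem, nostack)
        );
        vst1q_f32(dst.as_mut_ptr(), r);
    }
}

/// Other targets: same signature, scalar conversion.
#[cfg(not(target_arch = "aarch64"))]
fn convert4(src: &[u16], dst: &mut [f32]) {
    for (h, o) in src.iter().zip(dst.iter_mut()) {
        *o = f16_bits_to_f32_scalar(*h);
    }
}

/// Convert a row of half-precision bits to f32, 4 elements per step.
pub fn f16_slice_to_f32(src: &[u16], dst: &mut [f32]) {
    assert_eq!(src.len(), dst.len());
    let mut s = src.chunks_exact(4);
    let mut d = dst.chunks_exact_mut(4);
    for (sc, dc) in (&mut s).zip(&mut d) {
        convert4(sc, dc);
    }
    // Tail of 0..=3 elements when the length is not a multiple of 4.
    for (h, o) in s.remainder().iter().zip(d.into_remainder()) {
        *o = f16_bits_to_f32_scalar(*h);
    }
}
```

With `hidden=1024` the main loop runs 256 times and the tail loop is empty. Working on slices instead of raw `src.add(i)` / `dst.add(i)` pointers is a design choice of this sketch; the call site would need to build the two slices from its existing pointers.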

## Impact

Small but real: this work currently lands in the "other_us" bucket of the profiler. At ~10 ns per scalar conversion × 1024 elements, that is ~10 µs/token. Not the bottleneck, but free performance and cleaner code.

Related: #179 (StorageModePrivate), #152 (decompose metal_qwen35.rs)

perf(metal): vectorize f16→f32 embedding lookup (eliminate scalar bit-manipulation loop) #180

Problem

Why

Options (in priority order)

Impact

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

perf(metal): vectorize f16→f32 embedding lookup (eliminate scalar bit-manipulation loop) #180

Description

Problem

Why

Options (in priority order)

Impact

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions