Motivation
metal_qwen35.rs is 15,484 lines. Debugging the lm_head throughput regression (#151) required grepping through the entire file, because dispatch paths, kernel definitions, and constructor logic are interleaved. This is unsustainable as we add more features (MoE, MTP, vision, grammar).
Proposed Structure
crates/inference/src/forward/metal_qwen35/
├── mod.rs ~ 250 lines (cfg gate + pub use inner::*)
├── shaders.rs ~ 2300 lines (MSL_SOURCE, MSL_Q4_TILED_SOURCE)
├── types.rs ~ 1300 lines (structs, enums, data shapes)
├── engine.rs ~ 1100 lines (MetalQwen35Engine::new, buffer utils, KV cache)
├── constructors.rs ~ 2200 lines (MetalQwen35State::new, from_q4_dir, LoRA lifecycle)
├── forward.rs ~ 2900 lines (encode_gdn_layer, encode_gqa_layer, generate, prefill)
├── dispatch.rs ~ 900 lines (all dispatch_* helpers)
├── sampling.rs ~ 480 lines (chat_completion, generate_streaming, PPL)
└── tests.rs ~ 3200 lines (#[cfg(test)] mod tests)
Wiring: mod inner { use shaders::*; use types::*; ... } preserves the flat namespace so existing cross-references compile without path qualification.
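A minimal single-file sketch of how such wiring could look, under stated assumptions: the submodules are declared inside inner, each one pulls in its siblings with use super::*, and inner re-exports them with pub use (a plain use would keep the glob imports private, so pub use inner::* would not forward them). LayerShape, THREADS_PER_GROUP, and dispatch_threadgroups are hypothetical placeholders, not items from the real file, and the cfg gate is left out so the sketch compiles on any platform.

```rust
// Single-file stand-in for the proposed directory layout.
// In the real crate each inline `mod x { ... }` would be `mod x;` backed by x.rs.
// The real gate, #[cfg(all(target_os = "macos", feature = "metal-gpu"))],
// would sit on `mod inner`; it is omitted here on purpose.
mod inner {
    // Re-export every submodule's public items so `inner::*` stays flat.
    pub use self::dispatch::*;
    pub use self::shaders::*;
    pub use self::types::*;

    mod shaders {
        // Placeholder for the real MSL source string.
        pub const MSL_SOURCE: &str = "kernel void placeholder() {}";
    }

    mod types {
        // Hypothetical data shape, for illustration only.
        #[derive(Debug, Default, PartialEq)]
        pub struct LayerShape {
            pub hidden: usize,
        }

        // Hypothetical constant, for illustration only.
        pub const THREADS_PER_GROUP: usize = 32;
    }

    mod dispatch {
        // Brings sibling items into scope without path qualification,
        // which is what lets code moved out of the old file compile as-is.
        use super::*;

        // Hypothetical helper: number of threadgroups needed to cover `hidden`.
        pub fn dispatch_threadgroups(shape: &LayerShape) -> usize {
            debug_assert!(!MSL_SOURCE.is_empty());
            (shape.hidden + THREADS_PER_GROUP - 1) / THREADS_PER_GROUP
        }
    }
}

// Callers keep importing from the parent module exactly as before the split.
pub use inner::*;
```

With this shape, a caller outside the module can still write LayerShape or dispatch_threadgroups without knowing which file defines it, which is the property the refactor relies on.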
Constraints
- Zero behavioral change — pure structural refactor
- pub use inner::* re-exports unchanged
- All existing tests pass without modification
- Feature gate #[cfg(all(target_os = "macos", feature = "metal-gpu"))] stays in mod.rs
Acceptance
- make ci passes
- bench_decode_ab shows no throughput change (within noise)