refactor(inference): decompose metal_qwen35.rs (15K→8 files) #152

@ohdearquant

Motivation

metal_qwen35.rs is 15,484 lines. Debugging the lm_head throughput regression (#151) required grepping through the entire file, because dispatch paths, kernel definitions, and constructor logic are interleaved. This is unsustainable as we add more features (MoE, MTP, vision, grammar).

Proposed Structure

crates/inference/src/forward/metal_qwen35/
├── mod.rs              ~  250 lines  (cfg gate + pub use inner::*)
├── shaders.rs          ~ 2300 lines  (MSL_SOURCE, MSL_Q4_TILED_SOURCE)
├── types.rs            ~ 1300 lines  (structs, enums, data shapes)
├── engine.rs           ~ 1100 lines  (MetalQwen35Engine::new, buffer utils, KV cache)
├── constructors.rs     ~ 2200 lines  (MetalQwen35State::new, from_q4_dir, LoRA lifecycle)
├── forward.rs          ~ 2900 lines  (encode_gdn_layer, encode_gqa_layer, generate, prefill)
├── dispatch.rs         ~  900 lines  (all dispatch_* helpers)
├── sampling.rs         ~  480 lines  (chat_completion, generate_streaming, PPL)
└── tests.rs            ~ 3200 lines  (#[cfg(test)] mod tests)

Wiring: mod inner { use shaders::*; use types::*; ... } preserves the flat namespace so existing cross-references compile without path qualification.
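
A minimal single-file sketch of the flat-namespace wiring described above, assuming each submodule pulls in its siblings with `use super::*`. Inline modules stand in for the separate files, and `LayerShape` and `head_dim` are hypothetical names used only for illustration; they are not taken from metal_qwen35.rs.

```rust
// Sketch only: inline modules stand in for types.rs, dispatch.rs, etc.
mod inner {
    mod types {
        // Hypothetical data shape, standing in for the real structs.
        pub struct LayerShape {
            pub hidden: usize,
            pub heads: usize,
        }
    }

    mod dispatch {
        // The glob import of the parent picks up everything `inner`
        // re-exports, so `LayerShape` needs no `types::` qualifier.
        use super::*;

        // Hypothetical helper, standing in for a dispatch_* function.
        pub fn head_dim(shape: &LayerShape) -> usize {
            shape.hidden / shape.heads
        }
    }

    // Flatten the submodules into one namespace.
    pub use dispatch::*;
    pub use types::*;
}

// Same public surface as before the split.
pub use inner::*;

fn main() {
    let shape = LayerShape { hidden: 4096, heads: 32 };
    println!("head_dim = {}", head_dim(&shape));
}
```

Because every submodule sees the same flattened namespace, moving an item from one file to another should not require touching its call sites.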

Constraints

  • Zero behavioral change — pure structural refactor
  • pub use inner::* re-exports unchanged
  • All existing tests pass without modification
  • Feature gate #[cfg(all(target_os = "macos", feature = "metal-gpu"))] stays in mod.rs
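
One way the feature-gate constraint could look in mod.rs, as a hedged sketch: the gate is applied once to the module and once to its re-export, so no submodule has to repeat it. `backend_name` and `METAL_ENABLED` are hypothetical names added for illustration.

```rust
// Sketch only: the gate lives on `inner` and on its re-export.
#[cfg(all(target_os = "macos", feature = "metal-gpu"))]
mod inner {
    // Hypothetical item, standing in for the real engine types.
    pub fn backend_name() -> &'static str {
        "metal"
    }
}

// The re-export carries the same gate; otherwise the build breaks on
// targets where `inner` is compiled out.
#[cfg(all(target_os = "macos", feature = "metal-gpu"))]
pub use inner::*;

/// Hypothetical constant: true only when the gated module is compiled in.
pub const METAL_ENABLED: bool =
    cfg!(all(target_os = "macos", feature = "metal-gpu"));

fn main() {
    println!("metal backend compiled in: {}", METAL_ENABLED);
}
```

On targets without the feature this builds with `inner` absent, which is the behavior the constraint is meant to preserve.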

Acceptance

  • make ci passes
  • bench_decode_ab shows no throughput change (within noise)
  • PPL golden test passes
  • No public API changes (same re-exports)
