Remote MoE sharding fails before expert batch POST: decode_token_with_moe returned None during prefill #146

@faridlesosibirskfinland

Description

I reproduced the Gemma-4-26B-A4B MoE sharding setup.

Built full vindex from google/gemma-4-26B-A4B-it:

larql extract /mnt/2T/models/hf/gemma-4-26B-A4B-it \
  -o output/gemma4-26b-a4b-q4k-fmd.vindex \
  --quant q4k \
  --feature-major-down

Local inference works:

larql run gemma4-26b-a4b-fmd --max-tokens 8 "The capital of France is"
=> Paris

Two expert servers start correctly:

larql serve gemma4-26b-a4b-fmd --port 8081 --experts 0-63
larql serve gemma4-26b-a4b-fmd --port 8082 --experts 64-127

Both servers show:

Down features Q4K: loaded
Endpoints: POST /v1/expert/batch
GET /v1/health succeeds

But remote MoE inference fails:

larql run gemma4-26b-a4b-fmd \
  --moe-shards "0-63=http://127.0.0.1:8081,64-127=http://127.0.0.1:8082" \
  --max-tokens 1 \
  "The capital of France is"
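
To rule out a malformed shard map on the client side, the `--moe-shards` value can be checked independently of larql. The following is a minimal sketch, not larql's own parser: the function names are hypothetical, the `LO-HI=URL,...` format is inferred only from the command above, and the total of 128 experts is an assumption based on the two ranges in this report.

```python
from urllib.request import urlopen


def parse_moe_shards(spec):
    """Parse 'LO-HI=URL,LO-HI=URL' into a sorted list of (lo, hi, url) tuples.

    The format is inferred from the --moe-shards value in this report.
    """
    shards = []
    for entry in spec.split(","):
        expert_range, url = entry.strip().split("=", 1)
        lo, hi = (int(part) for part in expert_range.split("-", 1))
        if lo > hi:
            raise ValueError(f"empty expert range: {expert_range}")
        shards.append((lo, hi, url.rstrip("/")))
    return sorted(shards)


def coverage_problems(shards, num_experts):
    """Return human-readable descriptions of gaps or overlaps in coverage."""
    problems = []
    expected = 0  # lowest expert index not yet covered by any shard
    for lo, hi, _url in shards:
        if lo > expected:
            problems.append(f"experts {expected}-{lo - 1} are not served")
        elif lo < expected:
            problems.append(
                f"experts {lo}-{min(hi, expected - 1)} are served twice"
            )
        expected = max(expected, hi + 1)
    if expected < num_experts:
        problems.append(f"experts {expected}-{num_experts - 1} are not served")
    return problems


def health_status(url, timeout=5.0):
    """Return the HTTP status code of GET <url>/v1/health for one shard."""
    with urlopen(f"{url}/v1/health", timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    spec = "0-63=http://127.0.0.1:8081,64-127=http://127.0.0.1:8082"
    shards = parse_moe_shards(spec)
    # 128 experts is an assumption taken from the ranges 0-63 and 64-127.
    for problem in coverage_problems(shards, num_experts=128):
        print("coverage problem:", problem)
```

Calling `health_status(url)` for each parsed shard repeats the GET /v1/health check from the client's point of view. If the map covers every expert exactly once and both shards respond, the shard map itself is unlikely to be the cause.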

Error:

decode_token_with_moe returned None during prefill

With --moe-dispatch batch:

decode returned None during prefill

Server logs show only GET /v1/health.
No POST /v1/expert/batch is emitted.
