I reproduced the Gemma-4-26B-A4B MoE sharding setup.
Built full vindex from google/gemma-4-26B-A4B-it:
larql extract /mnt/2T/models/hf/gemma-4-26B-A4B-it \
  -o output/gemma4-26b-a4b-q4k-fmd.vindex \
  --quant q4k \
  --feature-major-down
Local inference works:
larql run gemma4-26b-a4b-fmd --max-tokens 8 "The capital of France is"
=> Paris
Two expert servers start correctly:
larql serve gemma4-26b-a4b-fmd --port 8081 --experts 0-63
larql serve gemma4-26b-a4b-fmd --port 8082 --experts 64-127
Both servers show:
Down features Q4K: loaded
Endpoints: POST /v1/expert/batch
GET /v1/health succeeds
But remote MoE inference fails:
larql run gemma4-26b-a4b-fmd \
  --moe-shards "0-63=http://127.0.0.1:8081,64-127=http://127.0.0.1:8082" \
  --max-tokens 1 \
  "The capital of France is"
Error:
decode_token_with_moe returned None during prefill
With --moe-dispatch batch:
decode returned None during prefill
Server logs show only GET /v1/health.
No POST /v1/expert/batch request is ever received, which suggests the client fails before dispatching to any expert shard.
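To rule out a malformed shard map on my side, here is a minimal sketch that checks the --moe-shards string independently of larql. It is not larql's parser: the START-END=URL syntax is inferred only from the command above, parse_moe_shards and check_coverage are hypothetical helper names, and the total of 128 experts is assumed from the two ranges I served.

```python
from urllib.parse import urlparse


def parse_moe_shards(spec: str) -> list[tuple[int, int, str]]:
    """Parse 'START-END=URL,...' into sorted (start, end, url) tuples.

    Hypothetical helper; the syntax is assumed from the CLI invocation,
    not taken from larql's source.
    """
    shards = []
    for entry in spec.split(","):
        expert_range, sep, url = entry.strip().partition("=")
        if not sep:
            raise ValueError(f"missing '=' in shard entry: {entry!r}")
        start_text, dash, end_text = expert_range.partition("-")
        if not dash:
            raise ValueError(f"missing '-' in expert range: {expert_range!r}")
        start, end = int(start_text), int(end_text)
        if start > end:
            raise ValueError(f"empty expert range: {expert_range!r}")
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
        shards.append((start, end, url))
    return sorted(shards)


def check_coverage(shards: list[tuple[int, int, str]], num_experts: int) -> list[str]:
    """Return a list of problems; an empty list means the ranges tile
    experts 0..num_experts-1 with no gaps and no overlaps."""
    problems = []
    expected = 0  # next expert id that still needs a shard
    for start, end, url in shards:
        if start > expected:
            problems.append(f"gap: experts {expected}-{start - 1} have no shard")
        elif start < expected:
            problems.append(f"overlap: {url} starts at {start}, expected {expected}")
        expected = max(expected, end + 1)
    if expected < num_experts:
        problems.append(f"gap: experts {expected}-{num_experts - 1} have no shard")
    elif expected > num_experts:
        problems.append(f"ranges extend past expert {num_experts - 1}")
    return problems


if __name__ == "__main__":
    spec = "0-63=http://127.0.0.1:8081,64-127=http://127.0.0.1:8082"
    shards = parse_moe_shards(spec)
    # num_experts=128 is an assumption based on the served ranges 0-63 and 64-127.
    print(check_coverage(shards, num_experts=128) or "shard map covers all experts")
```

Under those assumptions the shard string I passed parses into two contiguous ranges with no gap or overlap, so the map itself does not look like the cause.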