
feat(server meta): expose SWA geometry in /v1/models meta#132

Merged
marksverdhei merged 1 commit into ht from feat/swa-geometry-meta on Jun 17, 2026

@marksverdhei

Closes #123.

What

Adds 3 new fields to the /v1/models meta object so clients (heierchat VRAM panel) can compute exact KV bytes/token for SWA-hybrid models (gemma-4 12B/31B, gemma-4-12b-256k@gem) instead of the current safe-upper-bound estimate.

New meta fields (per loaded model)

Field          Source                                    Example (gemma-4 12B)
n_swa          llama_model_n_swa(model)                  1024
n_swa_layers   llama_model_n_swa_layers(model) (new)     40
swa_type       llama_model_swa_type_name(model) (new)    "standard"

New public API

LLAMA_API int32_t    llama_model_n_swa_layers(const struct llama_model * model);
LLAMA_API const char * llama_model_swa_type_name(const struct llama_model * model);

Exact KV formula the FE will compute

kv_per_tok(ctx) = (k_gqa + v_gqa) * kv_type_bytes * (
                    n_full_layers
                  + n_swa_layers * min(ctx, n_swa) / ctx )

where n_full_layers = n_layer - n_swa_layers.
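A minimal C sketch of the formula above, as a client might evaluate it. The `kv_geometry` struct and `kv_per_tok` function are hypothetical names. n_layer, k_gqa, v_gqa and the KV element size are assumed to be known to the client from elsewhere, since this PR only adds n_swa, n_swa_layers and swa_type to the meta. The sketch reproduces the formula as written and nothing more.

```c
#include <stdint.h>

/* Hypothetical client-side geometry; field names mirror the formula. */
typedef struct {
    int32_t n_layer;        /* total layers (assumed known to the client)     */
    int32_t n_swa_layers;   /* from /v1/models meta                           */
    int32_t n_swa;          /* sliding window size, from /v1/models meta      */
    int64_t k_gqa;          /* K width per layer (assumed known)              */
    int64_t v_gqa;          /* V width per layer (assumed known)              */
    double  kv_type_bytes;  /* bytes per KV element, e.g. 2.0 for an f16 cache */
} kv_geometry;

/* Average KV bytes per token at context length ctx. */
static double kv_per_tok(const kv_geometry *g, int64_t ctx) {
    if (ctx <= 0) {
        return 0.0; /* no tokens: nothing to average over */
    }
    const int64_t n_full_layers = g->n_layer - g->n_swa_layers;
    /* SWA layers hold at most n_swa tokens; full layers hold all ctx tokens */
    const int64_t swa_cells = ctx < g->n_swa ? ctx : g->n_swa;
    const double eff_layers = (double) n_full_layers
                            + (double) g->n_swa_layers * (double) swa_cells / (double) ctx;
    return (double) (g->k_gqa + g->v_gqa) * g->kv_type_bytes * eff_layers;
}
```

With made-up geometry (n_layer = 32, n_swa_layers = 24, n_swa = 1024, k_gqa = v_gqa = 1024, f16 cache), `kv_per_tok` at ctx = 8192 gives 2048 * 2 * (8 + 24 * 1024 / 8192) = 45056 bytes/token; multiplying by ctx gives the total cache size.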

Verification

  • CPU-only build (-DGGML_CUDA=OFF): compiles clean, zero warnings
  • Purely additive — no behavior change until consumers read the fields

Files changed (4, +36 lines)

  • include/llama.h — 2 new API declarations
  • src/llama-model.cpp — implementations (layer-count loop + enum→string)
  • tools/server/server-context.h — 3 new fields in server_context_meta
  • tools/server/server-context.cpp — populate + emit in /v1/models JSON
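To illustrate what a client would read, here is a hypothetical C sketch that renders the three new fields as a JSON fragment with `snprintf`. `swa_meta` and `format_swa_meta` are illustrative names, not the server's code; server-context.cpp builds its JSON with its own helpers, and the rest of the meta object is omitted here. The values used are the gemma-4 12B example from the table.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of the three new server_context_meta fields. */
typedef struct {
    int32_t     n_swa;
    int32_t     n_swa_layers;
    const char *swa_type;   /* one of "none", "standard", "chunked", "symmetric" */
} swa_meta;

/* Writes the fields as a JSON fragment (no enclosing braces).
   Returns snprintf's result: the length that would have been written,
   or a negative value on error. */
static int format_swa_meta(char *buf, size_t len, const swa_meta *m) {
    /* swa_type is inserted unescaped; safe only because it is a fixed name */
    return snprintf(buf, len,
                    "\"n_swa\":%d,\"n_swa_layers\":%d,\"swa_type\":\"%s\"",
                    (int) m->n_swa, (int) m->n_swa_layers, m->swa_type);
}
```

For `{1024, 40, "standard"}` this writes `"n_swa":1024,"n_swa_layers":40,"swa_type":"standard"`.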

Add three new fields to the /v1/models meta object so clients can
compute exact KV bytes/token for SWA-hybrid models (gemma-4 etc.):

  n_swa        — sliding window size (already in API, now in meta)
  n_swa_layers — count of SWA layers in the model
  swa_type     — "none" | "standard" | "chunked" | "symmetric"

New public API:
  llama_model_n_swa_layers(model) — loops hparams.is_swa(il)
  llama_model_swa_type_name(model) — enum-to-string for swa_type
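The commit message describes both functions only briefly. Below is a standalone C sketch of that logic: a loop that counts layers for which is_swa(il) holds, and an enum-to-string switch. `mock_hparams`, `swa_type_t` and its enumerators are hypothetical stand-ins for llama.cpp's internal hparams and SWA-type enum, so the sketch compiles without the library. The `"unknown"` fallback for out-of-range values is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the internal SWA type enum. */
typedef enum {
    SWA_TYPE_NONE,
    SWA_TYPE_STANDARD,
    SWA_TYPE_CHUNKED,
    SWA_TYPE_SYMMETRIC,
} swa_type_t;

/* Hypothetical stand-in for model hparams: one SWA flag per layer. */
typedef struct {
    uint32_t    n_layer;
    const bool *swa_layers;   /* swa_layers[il] is true for SWA layers */
    swa_type_t  swa_type;
} mock_hparams;

static bool is_swa(const mock_hparams *hp, uint32_t il) {
    return il < hp->n_layer && hp->swa_layers[il];
}

/* Counterpart of llama_model_n_swa_layers: count layers where is_swa(il). */
static int32_t n_swa_layers(const mock_hparams *hp) {
    int32_t count = 0;
    for (uint32_t il = 0; il < hp->n_layer; ++il) {
        if (is_swa(hp, il)) {
            count++;
        }
    }
    return count;
}

/* Counterpart of llama_model_swa_type_name: enum to string. */
static const char *swa_type_name(swa_type_t t) {
    switch (t) {
        case SWA_TYPE_NONE:      return "none";
        case SWA_TYPE_STANDARD:  return "standard";
        case SWA_TYPE_CHUNKED:   return "chunked";
        case SWA_TYPE_SYMMETRIC: return "symmetric";
    }
    return "unknown"; /* assumed fallback for out-of-range values */
}
```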

With these, clients compute the exact formula:
  kv/tok = (k_gqa + v_gqa) * kv_bytes * (
             n_full_layers
           + n_swa_layers * min(ctx, n_swa) / ctx )

where n_full_layers = n_layer - n_swa_layers.

Previously the best a client could do was assume that all layers scale
with ctx (correct for dense models, a 2-8x overestimate for SWA-hybrid ones).
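A small C sketch that quantifies the overestimate. Because the formula applies a single (k_gqa + v_gqa) * kv_bytes factor to every layer, that factor cancels, and the ratio depends only on n_layer, n_swa_layers, n_swa and ctx. `dense_overestimate_ratio` is a hypothetical helper, not part of the PR.

```c
#include <stdint.h>

/* Ratio of the dense estimate (every layer scales with ctx) to the
   SWA-aware estimate from the formula above. */
static double dense_overestimate_ratio(int32_t n_layer, int32_t n_swa_layers,
                                       int32_t n_swa, int64_t ctx) {
    if (ctx <= 0) {
        return 1.0; /* both estimates are zero: report no overestimate */
    }
    const int64_t swa_cells = ctx < n_swa ? ctx : n_swa;
    const double exact_layers = (double) (n_layer - n_swa_layers)
                              + (double) n_swa_layers * (double) swa_cells / (double) ctx;
    return (double) n_layer / exact_layers;
}
```

With made-up geometry (32 layers, 24 of them SWA, n_swa = 1024), the ratio is 1.0 up to ctx = 1024 and 32/11 ≈ 2.9 at ctx = 8192. As ctx grows, it approaches n_layer / n_full_layers = 4.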

Purely additive — no behavior change until consumers read the fields.

Closes #123.
marksverdhei merged commit 765d7fb into ht on Jun 17, 2026
9 checks passed
marksverdhei deleted the feat/swa-geometry-meta branch on June 17, 2026 at 19:14
Development

Linked issue: feat(server meta): expose SWA geometry (n_swa + SWA-layer count) for exact gemma-4 KV prediction