feat(dflash): add /props introspection endpoint #190

Closed
easel wants to merge 2 commits into Luce-Org:main from easel:feat/props-endpoint

easel commented May 13, 2026

Collaborator

Motivation

I’m working on a set of agent workflow benchmarks and want to capture enough runtime metadata per run to make benchmark cells self-describing: model path, context window, sampling defaults, speculative mode, KV cache settings, pflash/cache state, and daemon liveness. This adds a llama.cpp-style /props endpoint for that purpose, with lucebox-specific details kept as structured extensions.

Summary

Adds a read-only GET /props endpoint returning a JSON snapshot of the live Python-server state for bench-time capture and diagnostics.

The endpoint now uses the cross-server / llama.cpp-compatible shape expected by downstream runtime-props capture:

  • top-level default_generation_settings.{n_ctx, temperature, top_p, top_k, min_p, repeat_penalty}
  • top-level model_alias, model_path, and build_info
  • top-level speculative_mode with off, dflash, or pflash
  • runtime.backend plus lucebox runtime extensions such as KV cache types, FA window, lazy draft, and target sharding
  • reasoning.{supported, default, supported_efforts}
  • sampling.capabilities for operator-visible request parameter support flags

Old aliases are intentionally not emitted: runtime.max_ctx, model.id, model.target_path, flat sampling.supports_*, and reasoning.default_enabled.

Response Shape

Top-level keys:

{ default_generation_settings, model_alias, model_path, build_info, speculative_mode, server, model, runtime, reasoning, speculative, sampling, pflash, prefix_cache, full_cache, tool_replay, daemon, api }

Each section is scoped to Python-server state. Daemon build identity, request-rate metrics, and per-endpoint parameter schemas remain v1 non-goals.
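
A downstream capture tool might check a decoded /props payload against the shape described above before storing it. The following is a minimal sketch under that assumption: the key names come from this description, while `check_props_shape` and the two constants are hypothetical names introduced only for illustration.

```python
# Top-level keys listed in the PR description.
EXPECTED_TOP_LEVEL_KEYS = {
    "default_generation_settings", "model_alias", "model_path", "build_info",
    "speculative_mode", "server", "model", "runtime", "reasoning",
    "speculative", "sampling", "pflash", "prefix_cache", "full_cache",
    "tool_replay", "daemon", "api",
}

# Old aliases the description says are no longer emitted, as (section, key).
REMOVED_ALIASES = [
    ("runtime", "max_ctx"),
    ("model", "id"),
    ("model", "target_path"),
    ("reasoning", "default_enabled"),
]


def _section(props: dict, name: str) -> dict:
    """Return props[name] if it is a dict, otherwise an empty dict."""
    value = props.get(name)
    return value if isinstance(value, dict) else {}


def check_props_shape(props: dict) -> list:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = []
    missing = EXPECTED_TOP_LEVEL_KEYS - set(props)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    for section, key in REMOVED_ALIASES:
        if key in _section(props, section):
            problems.append(f"removed alias still present: {section}.{key}")
    # Flat sampling.supports_* flags were replaced by sampling.capabilities.
    for key in _section(props, "sampling"):
        if key.startswith("supports_"):
            problems.append(f"removed alias still present: sampling.{key}")
    return problems
```

Returning a list of problems rather than raising lets a benchmark harness record a partially valid snapshot and still flag what was off.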

Design Notes

  • server.version is read from dflash/pyproject.toml via stdlib tomllib; malformed pyproject logs a warning and falls back to "0.0.0+unknown".
  • server.props_schema = 1 remains the compatibility marker for /props parsers.
  • runtime.kv_cache_k/v report effective daemon KV types via _effective_kv_type().
  • runtime.backend is best-effort: DFLASH_RUNTIME_BACKEND / DFLASH27B_GPU_BACKEND, then nearby CMakeCache.txt, then cuda fallback.
  • speculative_mode is pflash when pflash is enabled, otherwise dflash when DDTree speculative decode is supported, otherwise off.
  • prefix_cache and full_cache expose cumulative lifetime hit counters.
  • full_cache.disk_bytes is snapshotted on mutation so /props does not walk the filesystem on read.
  • tool_replay reports the exact tool-call replay memory counters from ToolMemory.
  • api.endpoints is hand-curated with a drift test against FastAPI routes.

Tests

Focused server suite passes:

  • uv run --extra dev pytest dflash/scripts/test_server.py -q — 75 passed

Covered areas include endpoint shape, removed old aliases, default generation settings, reasoning shape, speculative mode selection, backend resolution, arch gating, pflash toggle, target-sharding cache behavior, endpoint-list drift, KV type resolution, cache counters, full-cache disk-byte snapshots, and ToolMemory.stats().

Open Items

  • Run curl http://localhost:8000/props | jq . against a live server after deployment/restart and sanity-check runtime values.
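
For scripted capture, the same check could be done from Python with only the standard library. This is a rough sketch: `fetch_props` and `runtime_summary` are hypothetical helpers, the base URL is taken from the curl command above, and the fields picked out are just the ones this description names.

```python
import json
import urllib.request


def fetch_props(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> dict:
    """GET /props from a running server and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/props", timeout=timeout) as resp:
        return json.load(resp)


def runtime_summary(props: dict) -> dict:
    """Pick out the fields worth eyeballing after a deployment or restart."""
    settings = props.get("default_generation_settings") or {}
    runtime = props.get("runtime") or {}
    return {
        "model_path": props.get("model_path"),
        "n_ctx": settings.get("n_ctx"),
        "speculative_mode": props.get("speculative_mode"),
        "backend": runtime.get("backend"),
    }
```

With a server running, `print(json.dumps(runtime_summary(fetch_props()), indent=2))` would show the condensed view.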

🤖 Generated with Claude Code

cubic-dev-ai Bot left a comment

Contributor

2 issues found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/server.py">

<violation number="1" location="dflash/scripts/server.py:150">
P2: Unvalidated `float()` parsing of `DFLASH_FP_ALPHA` can crash `/props` on malformed env values</violation>

<violation number="2" location="dflash/scripts/server.py:926">
P2: /props misreports target_sharding for laguna by checking requested extra_daemon_args instead of effective daemon args</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@easel easel force-pushed the feat/props-endpoint branch from 5600c93 to d492983 Compare May 14, 2026 00:20
@easel easel marked this pull request as draft May 14, 2026 00:21
@easel easel force-pushed the feat/props-endpoint branch 2 times, most recently from b93232c to 87888dc Compare May 14, 2026 00:39
GET /props returns a single read-only JSON document describing the live
Python-server state — model arch, KV/FA config, pflash mode, cache
occupancy, daemon liveness — for bench-time capture and diagnostics.
Matches llama.cpp's /props convention; modeled after antirez/ds4 PR Luce-Org#81.

Shape sections: server / model / runtime / reasoning / speculative /
sampling / pflash / prefix_cache / full_cache / tool_replay / daemon /
api. Field-by-field rationale lives in dflash/docs/props_endpoint_plan.md.

Implementation notes:

  - server.version is read from dflash/pyproject.toml via stdlib tomllib;
    importlib.metadata is skipped because the workspace declares
    [tool.uv] package=false (never installed as a wheel).

  - props_schema=1 is a separate compat marker for clients that parse
    /props programmatically. Bump rules live in a comment by the constant.

  - Arch-gated capability booleans (reasoning_supported, speculative_
    supported, tools_supported) flow through a single _capabilities()
    helper so /props and the Codex /v1/models variant cannot drift.

  - runtime.kv_cache_k/v come from a new _effective_kv_type() that
    mirrors the C++ resolve_kv_types() rules (qwen35 default Q4_0,
    laguna default Q8_0, per-arch precedence chains). Distinct from
    _resolve_kv_k_type(), which remains a stable hash salt for the
    prefix cache.

  - prefix_cache and full_cache now carry cumulative _lifetime_hits
    counters incremented at the existing hit sites; they survive
    eviction unlike per-entry hit counts.

  - full_cache.disk_bytes is snapshotted on every mutation
    (confirm_full_snap, _retire_full_entry, rehydrate_full_cache) so
    /props never has to walk the filesystem on read.

  - ToolMemory.stats() returns counters under no lock; cross-field tear
    is acceptable for introspection, documented in a comment.
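
The lifetime-hit counter idea can be illustrated with a toy LRU cache. This is a sketch only: `TinyPrefixCache` is a hypothetical stand-in, not the project's PrefixCache, and only the counter behavior mirrors the note above.

```python
from collections import OrderedDict


class TinyPrefixCache:
    """Toy LRU cache showing a cumulative hit counter that outlives eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries = OrderedDict()
        self._lifetime_hits = 0  # cumulative; never decremented or reset

    def put(self, key, value) -> None:
        self._entries[key] = {"value": value, "hits": 0}
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            # Evict the least recently used entry; its per-entry hits vanish.
            self._entries.popitem(last=False)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        entry["hits"] += 1
        self._lifetime_hits += 1  # incremented at the same site as the entry hit
        self._entries.move_to_end(key)
        return entry["value"]

    def stats(self) -> dict:
        return {
            "entries": len(self._entries),
            "entry_hits": sum(e["hits"] for e in self._entries.values()),
            "lifetime_hits": self._lifetime_hits,
        }
```

After an eviction, the sum of per-entry hits drops while lifetime_hits keeps its value, which is why the cumulative counter is the more useful one for a benchmark snapshot.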

Tests (17 new, all passing alongside the 54 existing baseline tests):

  - Shape / version / version fallback
  - Arch gating (qwen35, laguna)
  - pflash enabled/disabled toggle
  - target-sharding disables both cache layers
  - api.endpoints drift detector vs actual FastAPI routes
  - _capabilities helper
  - _effective_kv_type per-arch + per-axis behavior
  - PrefixCache lifetime_hits survives eviction
  - full_cache disk_bytes refreshes on add and on retire
  - ToolMemory.stats() reflects current entries/bytes

Explicit v1 non-goals (see plan doc): no /metrics, no daemon build
identity, no per-endpoint param schemas, no daemon PID/uptime/bin_path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@easel easel force-pushed the feat/props-endpoint branch from 87888dc to f6f8e97 Compare May 14, 2026 00:41

easel commented May 14, 2026

Collaborator Author

@cubic-dev-ai thanks for the review — both flagged against commit 5600c93. Status on current HEAD (f6f8e97):

Issue 1 (server.py:150, unvalidated float(DFLASH_FP_ALPHA)) — already addressed in the first amendment (d492983). The parse goes through a new _parse_optional_float() helper that returns None and logs a warning on non-numeric values rather than raising at request time.

Issue 2 (server.py:926, target_sharding misreport on laguna) — valid bug, fixed in f6f8e97. Root cause: the laguna daemon-spawn path (if arch in _LAGUNA_ARCHES: cmd = [...]) doesn't call cmd.extend(extra_daemon_args) — that branch is qwen35-only. So on arch=laguna with --target-gpus=... passed, the flag is silently dropped at spawn but /props was still reporting target_sharding: True. Now gated:

"target_sharding": (
    arch not in _LAGUNA_ARCHES
    and _extra_daemon_has_target_sharding(extra_daemon_args)
),

Added test_props_target_sharding_false_on_laguna_even_when_args_passed to lock the behavior in. 72/72 tests passing.
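
The gate can be exercised in isolation with stand-in definitions. In this sketch the contents of `_LAGUNA_ARCHES` and the body of `_extra_daemon_has_target_sharding` are assumptions (only `--target-gpus` is named above), and `reported_target_sharding` is a hypothetical wrapper around the expression from the fix.

```python
_LAGUNA_ARCHES = frozenset({"laguna"})  # assumed membership, for illustration


def _extra_daemon_has_target_sharding(extra_daemon_args) -> bool:
    """Assumed detection: matches '--target-gpus' and '--target-gpus=...'."""
    return any(
        arg == "--target-gpus" or arg.startswith("--target-gpus=")
        for arg in extra_daemon_args
    )


def reported_target_sharding(arch: str, extra_daemon_args) -> bool:
    """Report sharding only when the spawn path actually forwards the flag."""
    return (
        arch not in _LAGUNA_ARCHES
        and _extra_daemon_has_target_sharding(extra_daemon_args)
    )
```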

cubic-dev-ai Bot commented May 14, 2026

Contributor

@easel Both fixes look correct in the current diff — nice work.

Issue 1: _parse_optional_float is clean. It wraps the bare float() call in a try/except that logs a warning instead of crashing, and returns None for empty, missing, or non-numeric values. pflash.bsa_alpha properly shows null in the disabled path too.

Issue 2 — the arch not in _LAGUNA_ARCHES guard is the right fix. Since the laguna branch discards extra_daemon_args entirely, reflecting that in /props avoids a misleading report. The test locking this in (test_props_target_sharding_false_on_laguna_even_when_args_passed) is good — it'd catch a future refactor that accidentally wires extra_daemon_args into the laguna spawn without updating the reporting logic.

No further concerns from my side. 72/72 ✅

easel commented May 22, 2026

Collaborator Author

Superseded by integration PR #226 via 5b67cf2 (server props + thinking controls reworked into the integration line).

@easel easel closed this May 22, 2026