feat(dflash): add /props introspection endpoint #190
2 issues found across 5 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/scripts/server.py">
<violation number="1" location="dflash/scripts/server.py:150">
P2: Unvalidated `float()` parsing of `DFLASH_FP_ALPHA` can crash `/props` on malformed env values</violation>
<violation number="2" location="dflash/scripts/server.py:926">
P2: /props misreports target_sharding for laguna by checking requested extra_daemon_args instead of effective daemon args</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
5600c93 to d492983
b93232c to 87888dc
GET /props returns a single read-only JSON document describing the live Python-server state (model arch, KV/FA config, pflash mode, cache occupancy, daemon liveness) for bench-time capture and diagnostics. Matches llama.cpp's /props convention; modeled after antirez/ds4 PR Luce-Org#81.

Shape sections: server / model / runtime / reasoning / speculative / sampling / pflash / prefix_cache / full_cache / tool_replay / daemon / api. Field-by-field rationale lives in dflash/docs/props_endpoint_plan.md.

Implementation notes:
- server.version is read from dflash/pyproject.toml via stdlib tomllib; importlib.metadata is skipped because the workspace declares [tool.uv] package=false (never installed as a wheel).
- props_schema=1 is a separate compat marker for clients that parse /props programmatically. Bump rules live in a comment by the constant.
- Arch-gated capability booleans (reasoning_supported, speculative_supported, tools_supported) flow through a single _capabilities() helper so /props and the Codex /v1/models variant cannot drift.
- runtime.kv_cache_k/v come from a new _effective_kv_type() that mirrors the C++ resolve_kv_types() rules (qwen35 default Q4_0, laguna default Q8_0, per-arch precedence chains). Distinct from _resolve_kv_k_type(), which remains a stable hash salt for the prefix cache.
- prefix_cache and full_cache now carry cumulative _lifetime_hits counters incremented at the existing hit sites; they survive eviction, unlike per-entry hit counts.
- full_cache.disk_bytes is snapshotted on every mutation (confirm_full_snap, _retire_full_entry, rehydrate_full_cache) so /props never has to walk the filesystem on read.
- ToolMemory.stats() returns counters under no lock; cross-field tear is acceptable for introspection, documented in a comment.

Tests (17 new, all passing alongside the 54 existing baseline tests):
- Shape / version / version fallback
- Arch gating (qwen35, laguna)
- pflash enabled/disabled toggle
- target-sharding disables both cache layers
- api.endpoints drift detector vs actual FastAPI routes
- _capabilities helper
- _effective_kv_type per-arch + per-axis behavior
- PrefixCache lifetime_hits survives eviction
- full_cache disk_bytes refreshes on add and on retire
- ToolMemory.stats() reflects current entries/bytes

Explicit v1 non-goals (see plan doc): no /metrics, no daemon build identity, no per-endpoint param schemas, no daemon PID/uptime/bin_path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
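
The "lifetime_hits survives eviction" property listed in the tests above can be illustrated with a toy cache. This is a minimal sketch, not the project's PrefixCache: the class `TinyLRU` and all of its methods are hypothetical and only show why a cache-wide counter outlives per-entry counts.

```python
from collections import OrderedDict


class TinyLRU:
    """Illustrative LRU cache; NOT the project's PrefixCache."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries = OrderedDict()
        self.lifetime_hits = 0  # cumulative; never decremented on eviction

    def put(self, key: str, value: object) -> None:
        self._entries[key] = {"value": value, "hits": 0}
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        entry["hits"] += 1       # per-entry count, lost when the entry is evicted
        self.lifetime_hits += 1  # cache-wide count, survives eviction
        self._entries.move_to_end(key)
        return entry["value"]

    def per_entry_hits(self) -> int:
        """Sum of hit counts over entries that are still resident."""
        return sum(e["hits"] for e in self._entries.values())
```

After an eviction, `per_entry_hits()` drops because the evicted entry's count goes with it, while `lifetime_hits` keeps the full history, which is the quantity a benchmark snapshot usually wants.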
87888dc to f6f8e97
@cubic-dev-ai thanks for the review; both flagged against commit
Issue 1 (server.py:150, unvalidated
Issue 2 (server.py:926,
    "target_sharding": (
        arch not in _LAGUNA_ARCHES
        and _extra_daemon_has_target_sharding(extra_daemon_args)
    ),
Added
@easel Both fixes look correct in the current diff. Nice work.
Issue 1
Issue 2: the
No further concerns from my side. 72/72 ✅
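
For readers following Issue 1, one common way to keep a malformed environment value from crashing a request handler is to validate at parse time and fall back to a default. The sketch below is an assumption about the general technique, not the fix that landed in server.py: the helper name `env_float` is hypothetical, and the default passed in the usage line is an arbitrary placeholder rather than the project's real value.

```python
import logging
import os

log = logging.getLogger(__name__)


def env_float(name: str, default: float) -> float:
    """Parse a float from the environment, falling back on bad input.

    Hypothetical helper: unset, empty, or unparseable values yield `default`.
    """
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return float(raw)
    except ValueError:
        log.warning("ignoring malformed %s=%r; using %s", name, raw, default)
        return default


# Usage: 0.5 is a placeholder default, not the project's actual setting.
alpha = env_float("DFLASH_FP_ALPHA", 0.5)
```

Note that `float()` also accepts strings such as `nan` and `inf`; a stricter variant could reject those as well.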
Motivation
I’m working on a set of agent workflow benchmarks and want to capture enough runtime metadata per run to make benchmark cells self-describing: model path, context window, sampling defaults, speculative mode, KV cache settings, pflash/cache state, and daemon liveness. This adds a llama.cpp-style `/props` endpoint for that purpose, with lucebox-specific details kept as structured extensions.
Summary
Adds a read-only `GET /props` endpoint returning a JSON snapshot of the live Python-server state for bench-time capture and diagnostics.
The endpoint now uses the cross-server / llama.cpp-compatible shape expected by downstream runtime-props capture:
- `default_generation_settings.{n_ctx, temperature, top_p, top_k, min_p, repeat_penalty}`
- `model_alias`, `model_path`, and `build_info`
- `speculative_mode` with `off`, `dflash`, or `pflash`
- `runtime.backend` plus lucebox runtime extensions such as KV cache types, FA window, lazy draft, and target sharding
- `reasoning.{supported, default, supported_efforts}`
- `sampling.capabilities` for operator-visible request parameter support flags
Old aliases are intentionally not emitted: `runtime.max_ctx`, `model.id`, `model.target_path`, flat `sampling.supports_*`, and `reasoning.default_enabled`.
Response Shape
Top-level keys:
{ default_generation_settings, model_alias, model_path, build_info, speculative_mode, server, model, runtime, reasoning, speculative, sampling, pflash, prefix_cache, full_cache, tool_replay, daemon, api }
Each section is scoped to Python-server state. Daemon build identity, request-rate metrics, and per-endpoint parameter schemas remain v1 non-goals.
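
A client that parses `/props` programmatically might guard on the schema marker before trusting the rest of the document. The following is a sketch under assumptions: `check_props` and `REQUIRED_KEYS` are hypothetical names, the choice of required keys is arbitrary, and the marker is assumed to sit at `server.props_schema` with value 1 as this PR describes it.

```python
SUPPORTED_PROPS_SCHEMA = 1  # value this PR gives for server.props_schema

# Arbitrary subset of the top-level keys listed above that this
# hypothetical client happens to read.
REQUIRED_KEYS = (
    "default_generation_settings",
    "model_path",
    "speculative_mode",
    "server",
)


def check_props(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the document looks usable."""
    problems = [f"missing top-level key: {k}" for k in REQUIRED_KEYS if k not in doc]
    server = doc.get("server")
    schema = server.get("props_schema") if isinstance(server, dict) else None
    if schema != SUPPORTED_PROPS_SCHEMA:
        problems.append(f"unsupported props_schema: {schema!r}")
    return problems
```

Returning a list of problems rather than raising lets a benchmark harness record why a snapshot was rejected and carry on with the remaining cells.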
Design Notes
- `server.version` is read from `dflash/pyproject.toml` via stdlib `tomllib`; a malformed pyproject logs a warning and falls back to `"0.0.0+unknown"`.
- `server.props_schema = 1` remains the compatibility marker for `/props` parsers.
- `runtime.kv_cache_k/v` report effective daemon KV types via `_effective_kv_type()`.
- `runtime.backend` is best-effort: `DFLASH_RUNTIME_BACKEND` / `DFLASH27B_GPU_BACKEND`, then nearby `CMakeCache.txt`, then `cuda` fallback.
- `speculative_mode` is `pflash` when pflash is enabled, otherwise `dflash` when DDTree speculative decode is supported, otherwise `off`.
- `prefix_cache` and `full_cache` expose cumulative lifetime hit counters.
- `full_cache.disk_bytes` is snapshotted on mutation so `/props` does not walk the filesystem on read.
- `tool_replay` reports the exact tool-call replay memory counters from `ToolMemory`.
- `api.endpoints` is hand-curated with a drift test against FastAPI routes.
Tests
Focused server suite passes:
`uv run --extra dev pytest dflash/scripts/test_server.py -q`: 75 passed
Covered areas include endpoint shape, removed old aliases, default generation settings, reasoning shape, speculative mode selection, backend resolution, arch gating, pflash toggle, target-sharding cache behavior, endpoint-list drift, KV type resolution, cache counters, full-cache disk-byte snapshots, and `ToolMemory.stats()`.
Open Items
- `curl http://localhost:8000/props | jq .` against a live server after deployment/restart and sanity-check runtime values.
🤖 Generated with Claude Code
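
For the benchmark-capture use case from the motivation, the same request can be made from Python and stored next to each run. This is a minimal sketch using only the standard library: the URL is the one from the open item above, while `fetch_props`, `benchmark_record`, and the record layout are hypothetical and not part of this PR.

```python
import json
import urllib.request
from typing import Callable


def fetch_props(url: str = "http://localhost:8000/props", timeout: float = 5.0) -> dict:
    """GET /props and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def benchmark_record(run_id: str, fetch: Callable[[], dict] = fetch_props) -> dict:
    """Bundle a run identifier with the /props snapshot taken for that run.

    `fetch` is injectable so the function can be exercised without a live server.
    """
    return {"run_id": run_id, "props": fetch()}
```

Taking the snapshot once per run, rather than once per session, keeps each benchmark cell self-describing even if the server is restarted with different settings between runs.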