chore(repo): rename dflash→server, group pflash+megakernel under optimizations/ by davide221 · Pull Request #281 · Luce-Org/lucebox-hub

davide221 · 2026-05-26T21:39:08Z

What

Repo layout cleanup. Treat the project as an inference engine.

dflash/ → server/ (the C++/CUDA inference server lives here)
pflash/ → optimizations/pflash/
megakernel/ → optimizations/megakernel/

Why

dflash is an implementation detail of how the server does spec-decode. The directory at the root of the repo is the inference server as a product surface. pflash and megakernel are perf optimizations of that server, so they live under optimizations/.

What changed

318 files renamed via git mv (full rename history preserved)
.gitmodules — submodule paths → server/deps/llama.cpp, server/deps/Block-Sparse-Attention
pyproject.toml — [tool.uv.workspace] members → ["server", "optimizations/megakernel", "optimizations/pflash"]
.github/workflows/ci.yml — all dflash/... build paths → server/...
scripts/check_uv_workspace.sh — workspace verification paths
harness/clients/*.sh, harness/benchmarks/*.sh — bench script paths
All README/RESULTS/ARCHITECTURE/SPEC_PREFILL/laguna_integration_plan markdown — doc cross-refs updated

What did NOT change

Python package names: lucebox-dflash, pflash, qwen35-megakernel-bf16 (less downstream breakage; only directory layout moved)
C++ #include paths (src-relative via CMake -Isrc, unaffected by move)
share/model_cards/ runtime lookup (uses self_bin_dir() — relative)
Submodule binding names (kept as dflash/deps/... identifiers — arbitrary, only paths matter)

Tested on lucebox2 (RTX 3090, CUDA 12.6, sm_86)

cmake --build server/build -j32   # clean: dflash_server, test_qwen35moe_*, test_server_unit
                                  # only test_flash_attn_sparse fails (pre-existing, unrelated)

# AR-only smoke
POST /v1/chat/completions  "What is 7*8?"   → "56", HTTP 200, decode 33 tok/s

# Full dflash+ddtree+draft
POST /v1/chat/completions  300-token essay  → HTTP 200, 22.9 tok/s decode,
[spec-decode] accepted=213/1392 (15.3%) avg_commit=3.45

Server boots, model loads, both AR and dflash spec-decode paths work, telemetry intact, model output sensible.
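The acceptance figure in the telemetry line can be cross-checked by parsing it. The sketch below assumes the line format is exactly as printed above; the function name `parse_spec_decode` and the interpretation of the denominator as a total token count are assumptions for illustration, not taken from the server source.

```python
import re

# Pattern for the telemetry line as it appears in the test log above.
SPEC_DECODE_RE = re.compile(
    r"\[spec-decode\] accepted=(\d+)/(\d+) \(([\d.]+)%\) avg_commit=([\d.]+)"
)


def parse_spec_decode(line: str) -> dict:
    """Extract counters from a [spec-decode] line and recompute the percentage."""
    match = SPEC_DECODE_RE.search(line)
    if match is None:
        raise ValueError(f"not a spec-decode telemetry line: {line!r}")
    accepted, total = int(match.group(1)), int(match.group(2))
    return {
        "accepted": accepted,
        "total": total,  # assumed meaning: tokens considered for acceptance
        "reported_pct": float(match.group(3)),
        "computed_pct": round(100.0 * accepted / total, 1),
        "avg_commit": float(match.group(4)),
    }


if __name__ == "__main__":
    stats = parse_spec_decode(
        "[spec-decode] accepted=213/1392 (15.3%) avg_commit=3.45"
    )
    print(stats["computed_pct"])  # 15.3
```

The recomputed value (213 / 1392 ≈ 0.153) agrees with the 15.3% the server reported.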

Coordination note

This conflicts with every open PR touching dflash/* (PRs #226, #75, #48, plus weicj's open work). Suggest landing during a coordinated freeze window so contributors can rebase in one pass. Once merged, downstream rebases are mechanical: replace dflash/ with server/ in the changed paths.
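The prefix mapping a rebase has to apply follows directly from the three moves listed under "What". A minimal sketch, where the helper name `rewrite_path` and the example file names are hypothetical:

```python
# Old top-level prefix -> new prefix, as listed in the "What" section.
RENAMES = (
    ("dflash/", "server/"),
    ("pflash/", "optimizations/pflash/"),
    ("megakernel/", "optimizations/megakernel/"),
)


def rewrite_path(path: str) -> str:
    """Map a pre-rename repo-relative path to its post-rename location."""
    for old, new in RENAMES:
        if path.startswith(old):
            return new + path[len(old):]
    return path  # paths outside the moved directories are unchanged


if __name__ == "__main__":
    # Example file names are illustrative only.
    for p in ("dflash/src/main.cpp", "pflash/README.md", "harness/clients/run.sh"):
        print(p, "->", rewrite_path(p))
```

Matching on the leading directory only means already-migrated paths such as optimizations/pflash/... are left alone, so the mapping is safe to apply twice.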

🧙 Built with WOZCODE


…name)

Brings in PR Luce-Org#281 (chore: rename dflash→server, pflash+megakernel → optimizations/) + small docs polish 080f89b.

Our lucebox/ Python package (added by us in 2560086, never upstream) is untouched. Our docs additions under dflash/docs/* are migrated to server/docs/*. Our deletions of bench scripts confirmed against the new server/scripts/* paths.

Workspace members in pyproject.toml: ["server", "lucebox", "optimizations/megakernel", "optimizations/pflash"], preserving our lucebox member alongside upstream's renamed paths.

# Conflicts:
#   README.md
#   pyproject.toml
#   server/docs/BENCHMARK_SNAPSHOT_SPEC.md
#   server/docs/experiments/cache-impact-2026-05-24.md
#   server/docs/experiments/gemma4-26b-thinking-control-2026-05-25.md
#   server/docs/experiments/kv-cache-q4-vs-tq3-2026-05-25.md
#   server/docs/experiments/thinking-control-protocol.md
#   server/docs/experiments/thinking-mechanism-explainer.md
#   server/docs/run-requests/area-swe-bench-integration.md
#   server/docs/run-requests/bragi-gemma4-laguna-config-issues.md
#   server/docs/run-requests/forge-vs-vidar-ds4f.md
#   server/docs/run-requests/luce-dflash-think-92.md
#   server/docs/run-requests/qwen36-budget-signaling-overhaul.md
#   server/docs/run-requests/qwen36-hard-limit-reply-budget-bump.md
#   server/docs/run-requests/sindri-rtx3090ti-qwen36-nothink-92.md
#   server/scripts/bench_agent.py
#   server/scripts/bench_agent_loop.py
#   server/scripts/bench_daemon.py
#   server/scripts/bench_he.py
#   server/scripts/bench_he_http.py
#   server/scripts/bench_llm.py
#   server/scripts/bench_server.py
#   server/scripts/entrypoint.sh
#   server/scripts/fixtures/agent_cases/cases.json
#   server/scripts/server.py
#   server/scripts/test_prefix_cache.py
#   server/scripts/test_server.py

PR Luce-Org#281 moved dflash/ → server/. The pull_request `paths:` filter still targeted dflash/* — so PRs touching the C++ server code wouldn't trigger the Docker prebuild sanity check. Repoint to server/ so CI catches Dockerfile / source regressions before merge.
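Why the stale filter stopped firing can be illustrated with a toy matcher. This is a simplified model, not GitHub's actual path-filter implementation: it supports only `*` (any characters except `/`) and `**` (any characters), and the function names and the changed-file path are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern using only '*' and '**' into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # '**' crosses directory separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # '*' stays within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def triggers(patterns: list[str], changed_files: list[str]) -> bool:
    """True if any changed file fully matches any filter pattern."""
    regexes = [glob_to_regex(p) for p in patterns]
    return any(r.fullmatch(f) for r in regexes for f in changed_files)


if __name__ == "__main__":
    changed = ["server/src/server.cpp"]  # illustrative post-rename path
    print(triggers(["dflash/**"], changed))  # False: the old filter never matches
    print(triggers(["server/**"], changed))  # True: the repointed filter matches
```

Under this model, a pull request that touches only post-rename paths matches none of the old patterns, so the workflow is skipped.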

…nch in-tree

Collapses 134 commits of `integration/props-uv-squared-clean` onto current main as one reviewable change. Most of the underlying server-side work already landed via separate PRs: thinking-budget v2 + multi-dialect reasoning + sidecars (Luce-Org#269), dflash→server rename + optimizations/ grouping (Luce-Org#281), qwen35moe hybrid CPU/CUDA expert split (Luce-Org#262), and a stream of smaller fixes from bragi over the last week.

What remained in integration is everything *above* the server: the host-side runner, the container image, the benchmark/profile evidence pipeline, the harness for driving real clients, and the luce-bench framework itself.

## What changed

### Docker + host wrapper

- Dockerfile (CUDA 12.8 base; copies server/, lucebox/, harness/, luce-bench/ into one image; wires `python -m lucebench.cli` as the `benchmark` entrypoint subcommand).
- `lucebox.sh` (~470 lines of host bash, zero deps beyond docker + nvidia-smi): `check`, `configure`, `pull`, `download-models`, `serve`, `install`/`start`/`status`/`logs` (user-systemd), `print-run`, `benchmark`, `profile`.
- `.github/workflows/docker.yml` builds + pushes `ghcr.io/luce-org/lucebox-hub` tags (`:cuda12`, `:vX.Y.Z-cuda12`, `:X.Y-cuda12`, `:sha-<short>-cuda12`).
- `server/scripts/entrypoint.sh` resolves draft GGUF by target architecture (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6); warns when multiple targets are present in models/.

### lucebox Python package (in-container CLI)

- `lucebox/` workspace member: `cli.py`, `autotune.py` (VRAM-tiered tier selection + per-host config writeback), `config.py` (typed TOML), `download.py`, `docker_run.py`, `host_check.py`, `host_facts.py`, `profile.py` (profile sweep across DFLASH_MAX_CTX × DFLASH_BUDGET, KV cache types, pFlash modes, lazy-draft, prefix-cache slots), `smoke.py`, `types.py`.
- `lucebox/tests/` for the typed surfaces.
- Level1/Level2/Level3 profile gates; sweep results merged back into `~/.lucebox/config.toml` only after capability + ds4-eval/agentic-tools/agentic-session validation gates pass.

### luce-bench in monorepo

- `luce-bench/` workspace member at v0.2.4: the standalone bench framework (areas: ds4-eval, code, longctx, agent, forge; sweep + per-host snapshot output; v0.2.4 includes the forge area's EvalConfig + run_scenario signature realignment).
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the prior git tag pin.
- `.github/workflows/release-luce-bench.yml` publishes to PyPI from the monorepo on `luce-bench-v*` tags (trusted publisher, `pypi` environment).

### harness workspace

- `harness/` workspace member: client adapters (`claude_code`, `codex`, `opencode`, `hermes`, `pi`, `openclaw`), `client_test_runner.py`, `benchmarks/run_lucebox_vs_llamacpp.sh`, prompts. `lucebox profile` delegates the actual bench runs to harness.

### Bench + profile evidence

- `server/docs/BENCHMARK_SNAPSHOT_SPEC.md`: schema for tuning/profile artifacts. Snapshots themselves live in the standalone `luce-bench-baselines` repo (out of this tree).

### Misc

- Updated CI workflow path filters for `server/` (post-rename).
- README's "Quick start" section, hardware coverage table, env var reference table; minor edits to optimizations READMEs.
- model card sidecar updates landed alongside Luce-Org#269 but kept here at current values (qwen3.6, gemma-4-26b-a4b, gemma-4-31b, laguna-xs.2, `_schema.json`).

## Out of scope / follow-ups

- 31b backend wiring beyond what `share/model_cards/gemma-4-31b-it.json` shipped (working empirically @ 24GB on sindri AR-only; 26b spec-decode path already proven).
- gemma4 MoE expert split (howard0su's PR Luce-Org#262 territory; merged but not applied to gemma4 yet).
- Multi-Token Prediction (upstream PR #23398, draft).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(repo): rename dflash→server, group pflash+megakernel under optimizations/

39fe251

davide221 merged commit 6aec735 into main May 26, 2026
1 of 3 checks passed

This was referenced May 27, 2026

docker, cli, smoke, bench and autotune first-run dx #226

Closed

feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285

Closed

davide221 deleted the chore/rename-server-optimizations branch May 27, 2026 21:39

davide221 merged 1 commit into main from chore/rename-server-optimizations
