chore(repo): rename dflash→server, group pflash+megakernel under optimizations/ #281

Merged: davide221 merged 1 commit into main from chore/rename-server-optimizations on May 26, 2026

@davide221

Contributor

What

Repo layout cleanup. Treat the project as an inference engine.

  • dflash/ → server/ (the C++/CUDA inference server lives here)
  • pflash/ → optimizations/pflash/
  • megakernel/ → optimizations/megakernel/

Why

dflash is an implementation detail of how the server does spec-decode. The directory at the root of the repo is the inference server as a product surface. pflash and megakernel are perf optimizations of that server, so they live under optimizations/.

What changed

  • 318 files renamed via git mv (full rename history preserved)
  • .gitmodules — submodule paths → server/deps/llama.cpp, server/deps/Block-Sparse-Attention
  • pyproject.toml — [tool.uv.workspace] members → ["server", "optimizations/megakernel", "optimizations/pflash"]
  • .github/workflows/ci.yml — all dflash/... build paths → server/...
  • scripts/check_uv_workspace.sh — workspace verification paths
  • harness/clients/*.sh, harness/benchmarks/*.sh — bench script paths
  • All README/RESULTS/ARCHITECTURE/SPEC_PREFILL/laguna_integration_plan markdown — doc cross-refs updated

What did NOT change

  • Python package names: lucebox-dflash, pflash, qwen35-megakernel-bf16 (less downstream breakage; only directory layout moved)
  • C++ #include paths (src-relative via CMake -Isrc, unaffected by move)
  • share/model_cards/ runtime lookup (uses self_bin_dir() — relative)
  • Submodule binding names (kept as dflash/deps/... identifiers — arbitrary, only paths matter)

Tested on lucebox2 (RTX 3090, CUDA 12.6, sm_86)

cmake --build server/build -j32   # clean: dflash_server, test_qwen35moe_*, test_server_unit
                                  # only test_flash_attn_sparse fails (pre-existing, unrelated)

# AR-only smoke
POST /v1/chat/completions  "What is 7*8?"   → "56", HTTP 200, decode 33 tok/s

# Full dflash+ddtree+draft
POST /v1/chat/completions  300-token essay  → HTTP 200, 22.9 tok/s decode,
[spec-decode] accepted=213/1392 (15.3%) avg_commit=3.45

Server boots, model loads, both AR and dflash spec-decode paths work, telemetry intact, model output sensible.
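The acceptance percentage in the telemetry line can be re-derived from the two counts it reports. The sketch below parses the line exactly as quoted above; the regular expression and the field names in the returned dict are illustrative assumptions, since the server's log format is not specified anywhere in this PR.

```python
import re

# Telemetry line copied from the test output above.
LINE = "[spec-decode] accepted=213/1392 (15.3%) avg_commit=3.45"

# Assumed pattern, derived only from the single example line.
PATTERN = re.compile(
    r"\[spec-decode\] accepted=(\d+)/(\d+) \(([\d.]+)%\) avg_commit=([\d.]+)"
)


def parse_spec_decode(line: str) -> dict:
    """Extract the counts and recompute the acceptance percentage."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a spec-decode line: {line!r}")
    accepted, total = int(match.group(1)), int(match.group(2))
    return {
        "accepted": accepted,
        "total": total,
        "computed_pct": round(100 * accepted / total, 1),
        "reported_pct": float(match.group(3)),
        "avg_commit": float(match.group(4)),
    }
```

For the line above, the recomputed value (213 / 1392) agrees with the reported 15.3%.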

Coordination note

This conflicts with every open PR touching dflash/* (PRs #226, #75, #48, plus weicj's open work). Suggest landing this in a coordinated freeze window so contributors can rebase in one pass. Once merged, downstream rebases are mechanical: replace dflash/ with server/ in their changed paths.
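The path replacement described above can be sketched as a small lookup. The three prefix pairs come from the layout list in this PR; the helper name and the example file names in the usage note are hypothetical.

```python
# Old prefix -> new prefix, taken from the "What" section of this PR.
RENAMES = (
    ("dflash/", "server/"),
    ("pflash/", "optimizations/pflash/"),
    ("megakernel/", "optimizations/megakernel/"),
)


def remap(path: str) -> str:
    """Map a pre-rename repo path to its post-rename location.

    Paths outside the three moved directories are returned unchanged.
    """
    for old, new in RENAMES:
        if path.startswith(old):
            return new + path[len(old):]
    return path
```

For example, remap("dflash/src/example.cpp") gives "server/src/example.cpp", while a path under harness/ is returned as-is. Matching on the leading prefix only means already-migrated paths such as optimizations/pflash/... are not rewritten a second time.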

🧙 Built with WOZCODE

@davide221 davide221 merged commit 6aec735 into main May 26, 2026
1 of 3 checks passed
easel added a commit to easel/lucebox-hub that referenced this pull request May 26, 2026
…name)

Brings in PR Luce-Org#281 (chore: rename dflash→server, pflash+megakernel
→ optimizations/) + small docs polish 080f89b.

Our lucebox/ Python package (added by us in 2560086, never upstream)
is untouched. Our docs additions under dflash/docs/* are migrated to
server/docs/*. Our deletions of bench scripts confirmed against the
new server/scripts/* paths.

Workspace members in pyproject.toml: ["server", "lucebox",
"optimizations/megakernel", "optimizations/pflash"] — preserving
our lucebox member alongside upstream's renamed paths.

# Conflicts:
#	README.md
#	pyproject.toml
#	server/docs/BENCHMARK_SNAPSHOT_SPEC.md
#	server/docs/experiments/cache-impact-2026-05-24.md
#	server/docs/experiments/gemma4-26b-thinking-control-2026-05-25.md
#	server/docs/experiments/kv-cache-q4-vs-tq3-2026-05-25.md
#	server/docs/experiments/thinking-control-protocol.md
#	server/docs/experiments/thinking-mechanism-explainer.md
#	server/docs/run-requests/area-swe-bench-integration.md
#	server/docs/run-requests/bragi-gemma4-laguna-config-issues.md
#	server/docs/run-requests/forge-vs-vidar-ds4f.md
#	server/docs/run-requests/luce-dflash-think-92.md
#	server/docs/run-requests/qwen36-budget-signaling-overhaul.md
#	server/docs/run-requests/qwen36-hard-limit-reply-budget-bump.md
#	server/docs/run-requests/sindri-rtx3090ti-qwen36-nothink-92.md
#	server/scripts/bench_agent.py
#	server/scripts/bench_agent_loop.py
#	server/scripts/bench_daemon.py
#	server/scripts/bench_he.py
#	server/scripts/bench_he_http.py
#	server/scripts/bench_llm.py
#	server/scripts/bench_server.py
#	server/scripts/entrypoint.sh
#	server/scripts/fixtures/agent_cases/cases.json
#	server/scripts/server.py
#	server/scripts/test_prefix_cache.py
#	server/scripts/test_server.py
easel added a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
PR Luce-Org#281 moved dflash/ → server/. The pull_request `paths:` filter
still targeted dflash/* — so PRs touching the C++ server code
wouldn't trigger the Docker prebuild sanity check. Repoint to
server/ so CI catches Dockerfile / source regressions before merge.
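Why the stale filter silently skipped the check can be shown with a simplified model. GitHub's real `paths:` matching uses glob patterns; the sketch below assumes plain directory prefixes only, and the changed file name is hypothetical.

```python
def would_trigger(changed_files: list[str], path_prefixes: list[str]) -> bool:
    """Simplified stand-in for a workflow `paths:` filter.

    Returns True if any changed file sits under any of the given prefixes.
    """
    return any(f.startswith(p) for f in changed_files for p in path_prefixes)


# A PR that touches the C++ server after the rename (hypothetical file).
changed = ["server/src/example.cpp"]

stale_filter = would_trigger(changed, ["dflash/"])   # filter left at old path
fixed_filter = would_trigger(changed, ["server/"])   # filter repointed
```

With the old prefix nothing matches, so the workflow never runs and no failure is reported; that is why the gap was easy to miss.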
easel added a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
…nch in-tree

Collapses 134 commits of `integration/props-uv-squared-clean` onto current
main as one reviewable change. Most of the underlying server-side work
already landed via separate PRs: thinking-budget v2 + multi-dialect
reasoning + sidecars (Luce-Org#269), dflash→server rename + optimizations/ grouping
(Luce-Org#281), qwen35moe hybrid CPU/CUDA expert split (Luce-Org#262), and a stream of
smaller fixes from bragi over the last week. What remained in integration
is everything *above* the server: the host-side runner, the container
image, the benchmark/profile evidence pipeline, the harness for driving
real clients, and the luce-bench framework itself.

## What changed

### Docker + host wrapper
- Dockerfile (CUDA 12.8 base; copies server/, lucebox/, harness/, luce-bench/
  into one image; wires `python -m lucebench.cli` as the `benchmark`
  entrypoint subcommand).
- `lucebox.sh` (~470 lines of host bash, zero deps beyond docker + nvidia-smi):
  `check`, `configure`, `pull`, `download-models`, `serve`, `install`/`start`/
  `status`/`logs` (user-systemd), `print-run`, `benchmark`, `profile`.
- `.github/workflows/docker.yml` builds + pushes `ghcr.io/luce-org/lucebox-hub`
  tags (`:cuda12`, `:vX.Y.Z-cuda12`, `:X.Y-cuda12`, `:sha-<short>-cuda12`).
- `server/scripts/entrypoint.sh` resolves draft GGUF by target architecture
  (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6); warns when multiple
  targets are present in models/.
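The tag scheme listed for docker.yml can be illustrated with a small helper. This is a sketch of the naming pattern only: the function is hypothetical, the 7-character short SHA is an assumption, and which tags the workflow actually pushes for a given event is not stated in the commit message.

```python
def image_tags(version: str, sha: str, variant: str = "cuda12",
               short_len: int = 7) -> list[str]:
    """Build the four tag shapes named above for one release.

    `version` is expected as X.Y.Z with an optional leading "v".
    `short_len` is an assumed short-SHA length.
    """
    major, minor, patch = version.removeprefix("v").split(".")
    return [
        f":{variant}",
        f":v{major}.{minor}.{patch}-{variant}",
        f":{major}.{minor}-{variant}",
        f":sha-{sha[:short_len]}-{variant}",
    ]
```

With a made-up version "v1.2.3" and a made-up commit hash starting 0123456, this yields :cuda12, :v1.2.3-cuda12, :1.2-cuda12 and :sha-0123456-cuda12.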

### lucebox Python package (in-container CLI)
- `lucebox/` workspace member: `cli.py`, `autotune.py` (VRAM-tiered tier
  selection + per-host config writeback), `config.py` (typed TOML),
  `download.py`, `docker_run.py`, `host_check.py`, `host_facts.py`,
  `profile.py` (profile sweep across DFLASH_MAX_CTX × DFLASH_BUDGET, KV
  cache types, pFlash modes, lazy-draft, prefix-cache slots), `smoke.py`,
  `types.py`.
- `lucebox/tests/` for the typed surfaces.
- Level1/Level2/Level3 profile gates; sweep results merged back into
  `~/.lucebox/config.toml` only after capability + ds4-eval/agentic-tools/
  agentic-session validation gates pass.
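The DFLASH_MAX_CTX × DFLASH_BUDGET sweep mentioned for `profile.py` is a Cartesian product over the two settings. The sketch below shows only that shape; it is not the code in `profile.py`, the value lists are made up, and treating the two names as environment variables is an assumption.

```python
from itertools import product


def sweep_grid(max_ctx_values: list[int], budget_values: list[int]) -> list[dict]:
    """Enumerate every (max_ctx, budget) pair as an env-style mapping."""
    return [
        {"DFLASH_MAX_CTX": str(ctx), "DFLASH_BUDGET": str(budget)}
        for ctx, budget in product(max_ctx_values, budget_values)
    ]


# Hypothetical values, chosen only to show the grid size.
grid = sweep_grid([8192, 16384], [16, 32])
```

Two values per axis give four runs; each additional axis (KV cache type, pFlash mode, and so on) multiplies the count, which is one reason to gate the sweep in levels.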

### luce-bench in monorepo
- `luce-bench/` workspace member at v0.2.4 — the standalone bench framework
  (areas: ds4-eval, code, longctx, agent, forge; sweep + per-host snapshot
  output; v0.2.4 includes the forge area's EvalConfig + run_scenario
  signature realignment).
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the prior
  git tag pin.
- `.github/workflows/release-luce-bench.yml` publishes to PyPI from the
  monorepo on `luce-bench-v*` tags (trusted publisher, `pypi` environment).

### harness workspace
- `harness/` workspace member: client adapters (`claude_code`, `codex`,
  `opencode`, `hermes`, `pi`, `openclaw`), `client_test_runner.py`,
  `benchmarks/run_lucebox_vs_llamacpp.sh`, prompts. `lucebox profile`
  delegates the actual bench runs to harness.

### Bench + profile evidence
- `server/docs/BENCHMARK_SNAPSHOT_SPEC.md` — schema for tuning/profile
  artifacts. Snapshots themselves live in the standalone
  `luce-bench-baselines` repo (out of this tree).

### Misc
- Updated CI workflow path filters for `server/` (post-rename).
- README's "Quick start" section, hardware coverage table, env var
  reference table; minor edits to optimizations READMEs.
- model card sidecar updates landed alongside Luce-Org#269 but kept here at
  current values (qwen3.6, gemma-4-26b-a4b, gemma-4-31b, laguna-xs.2,
  `_schema.json`).

## Out of scope / follow-ups
- 31b backend wiring beyond what `share/model_cards/gemma-4-31b-it.json`
  shipped (working empirically @ 24GB on sindri AR-only; 26b spec-decode
  path already proven).
- gemma4 MoE expert split (howard0su's PR Luce-Org#262 territory; merged but not
  applied to gemma4 yet).
- Multi-Token Prediction (upstream PR #23398, draft).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@davide221 davide221 deleted the chore/rename-server-optimizations branch May 27, 2026 21:39