From 023de163d6c1df3ce9afeead8148bdc6d28fc2cb Mon Sep 17 00:00:00 2001
From: MikukuOvO
Date: Tue, 5 May 2026 05:25:32 +0000
Subject: [PATCH 1/2] docs: add single-prompt multi-generation guide

---
 .../single_prompt_multi_generation.md | 235 ++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 docs/advanced/single_prompt_multi_generation.md

diff --git a/docs/advanced/single_prompt_multi_generation.md b/docs/advanced/single_prompt_multi_generation.md
new file mode 100644
index 0000000000..600f9317de
--- /dev/null
+++ b/docs/advanced/single_prompt_multi_generation.md
@@ -0,0 +1,235 @@
+# Single-Prompt Multi-Generation
+
+Single-prompt multi-generation means generating multiple trajectories for the
+same prompt in one rollout group. In Miles Diffusion, this is both an algorithmic
+requirement and a performance knob:
+
+- Algorithmically, Miles Diffusion compares multiple sampled outputs within a
+  prompt group, so each prompt should have multiple sampled outputs.
+- System-wise, packing multiple samples from the same prompt into one SGLang-D
+  request can improve rollout throughput and reduce request overhead.
+
+This document summarizes how miles-diffusion currently supports this path, what
+the main knobs mean, and what still needs validation before treating it as the
+default high-throughput configuration.
+
+## Terminology
+
+| Term | Meaning |
+| --- | --- |
+| Prompt group | One prompt plus its sampled outputs. In code, samples in the same group share `group_index`. |
+| `rollout_batch_size` | Number of prompts per rollout. |
+| `n_samples_per_prompt` | Number of generated samples per prompt. This is the Miles Diffusion group size. |
+| `diffusion_microgroup_size` | Number of same-prompt samples packed into one SGLang-D `/rollout/generate` request. |
+| `num_outputs_per_prompt` | SGLang-D request field used by miles to ask the engine for multiple outputs for one prompt. |
+| `samples_per_rollout` | `rollout_batch_size * n_samples_per_prompt`. |
+
+For example, the 4-GPU OCR recipe uses:
+
+```text
+rollout_batch_size = 32 prompts
+n_samples_per_prompt = 16 samples per prompt
+samples_per_rollout = 32 * 16 = 512 samples
+diffusion_microgroup_size = 16
+```
+
+This means each prompt produces 16 samples, and the rollout path tries to send
+all 16 samples for one prompt in a single SGLang-D request.
+
+## Current Miles Flow
+
+The current implementation is prompt-group first.
+
+1. The data source reads `rollout_batch_size` prompts.
+2. Each prompt is copied `n_samples_per_prompt` times.
+3. The copied samples share the same `group_index` and receive unique sample
+   indices.
+4. The SGLang-D rollout path splits each group into one or more microgroups.
+5. Each microgroup is sent to SGLang-D with `num_outputs_per_prompt` equal to
+   the microgroup length.
+6. Rewards are computed for all samples in the group.
+7. Miles Diffusion reward normalization is applied per prompt group, unless
+   configured otherwise.
+8. The grouped samples are flattened before training.
+
+The key distinction is:
+
+```text
+n_samples_per_prompt controls the algorithmic group size.
+diffusion_microgroup_size controls how aggressively the group is packed for rollout.
+```
+
+Changing `diffusion_microgroup_size` should not change the Miles Diffusion group
+semantics, but it can change memory pressure, latency, and SGLang-D batching
+behavior.
+
+## Accuracy and Reward Semantics
+
+For Miles Diffusion, the important invariant is that rewards from the same prompt
+are normalized together.
+
+By default, miles-diffusion computes:
+
+```text
+advantage = sample_reward - mean(rewards in the same prompt group)
+```
+
+When `--globalize-reward-std` is enabled, the mean remains per prompt group but
+the standard deviation is computed over the full rollout batch. This matches the
+Miles Diffusion Qwen-Image recipes.
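A minimal sketch of the reward semantics described in the guide text above, not the miles-diffusion implementation. The helper name `group_relative_advantages`, the `eps` guard, the use of a population standard deviation, and the flat group-contiguous reward layout are all assumptions made for illustration. The default branch only subtracts the per-group mean, exactly as the formula states; whether the default path also divides by a per-group standard deviation is not stated here, so the sketch leaves that out.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, n_samples_per_prompt, globalize_std=False, eps=1e-6):
    """Hypothetical helper: center rewards on their prompt-group mean.

    Assumes `rewards` is a flat list in which each consecutive run of
    `n_samples_per_prompt` entries belongs to one prompt group.
    """
    if len(rewards) % n_samples_per_prompt != 0:
        raise ValueError("rewards must contain whole prompt groups")
    # Assumed reading of --globalize-reward-std: one std over the full rollout batch.
    scale = pstdev(rewards) + eps if globalize_std else 1.0
    advantages = []
    for start in range(0, len(rewards), n_samples_per_prompt):
        group = rewards[start:start + n_samples_per_prompt]
        group_mean = fmean(group)  # the mean always stays per prompt group
        advantages.extend((r - group_mean) / scale for r in group)
    return advantages


rewards = [1.0, 2.0, 3.0, 4.0, 4.0, 4.0]  # two prompt groups of three samples
adv = group_relative_advantages(rewards, n_samples_per_prompt=3)
adv_global = group_relative_advantages(rewards, n_samples_per_prompt=3, globalize_std=True)
```

In this toy input the second group has identical rewards, so every advantage in it is zero regardless of the scaling choice.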
+
+The rollout path also assigns deterministic seed ranges per prompt group:
+
+```text
+seed_base = rollout_seed + group_index * n_samples_per_prompt
+sample seeds = seed_base, seed_base + 1, ...
+```
+
+SGLang-D currently expands from the first seed inside the request, so miles keeps
+the microgroup seed ranges disjoint. A future SGLang-D seed-list API would make
+this less implicit.
+
+## Current Recipe Settings
+
+| Recipe | `rollout_batch_size` | `n_samples_per_prompt` | `diffusion_microgroup_size` | Interpretation |
+| --- | ---: | ---: | ---: | --- |
+| OCR 2-GPU | 16 | 16 | 16 | Full prompt group packed into one rollout request. |
+| OCR 4-GPU | 32 | 16 | 16 | Same per-prompt packing, larger prompt batch. |
+| PickScore 2-GPU | 16 | 16 | 16 | Full group packing in the scaled-down recipe. |
+| PickScore 4-GPU | 32 | 16 | 8 | One prompt group is split into two rollout requests, reducing per-request memory pressure. |
+
+The training-side global batch is derived from:
+
+```text
+global_batch_size = rollout_batch_size * n_samples_per_prompt / num_steps_per_rollout
+```
+
+So increasing `n_samples_per_prompt` changes both rollout volume and training
+batch math. For pure rollout packing experiments, keep `n_samples_per_prompt`
+fixed and tune only `diffusion_microgroup_size`.
+
+## Performance Expectations
+
+Single-prompt multi-generation is expected to help when rollout GPUs are
+under-utilized. It improves the rollout path by batching multiple same-prompt
+samples into one DiT denoising pass, instead of sending the same prompt as many
+independent single-output requests.
+
+The expected benefits are:
+
+- fewer HTTP requests per rollout;
+- better rollout-side GPU utilization;
+- less duplicated prompt-conditioning overhead;
+- higher samples-per-second when memory headroom is sufficient.
+
+The tradeoff is memory. Larger `diffusion_microgroup_size` increases the
+effective batch size inside SGLang-D. For Qwen-Image, SGLang PR #21988 reports
+strong speedup at 512 x 512 when `num_outputs_per_prompt` is increased, but also
+substantial peak-memory growth. At 1024 x 1024, very large multi-output batches
+can become VAE-decode bound or OOM.
+
+## SGLang-D Dependency
+
+Miles already sends `num_outputs_per_prompt` to SGLang-D, but Qwen-Image needs a
+SGLang-D-side condition-batch fix for this to be reliable.
+
+The relevant upstream SGLang PR is:
+
+```text
+sgl-project/sglang#21988
+```
+
+That PR fixes the mismatch where latent samples are expanded for
+`num_outputs_per_prompt > 1`, but prompt and negative-prompt conditioning remain
+at the original prompt batch size.
+
+Practical implication:
+
+- If the installed SGLang-D revision includes the fix, Qwen-Image multi-output
+  generation can be tested directly.
+- If the installed SGLang-D revision does not include the fix,
+  `diffusion_microgroup_size > 1` may fail or produce invalid Qwen-Image
+  conditioning shapes.
+- For reproducible experiments, pin the exact SGLang-D commit rather than a
+  moving branch tip.
+
+## Known Limitations
+
+| Limitation | Impact |
+| --- | --- |
+| Qwen-Image multi-output depends on the SGLang-D condition-batch fix. | The miles-side request path exists, but the engine version must support it. |
+| SGLang-D does not currently accept an explicit seed list for one multi-output request. | Miles relies on `seed, seed + 1, ...` expansion and keeps microgroup seed ranges disjoint. |
+| SGLang-D rollout does not yet support oversampling plus abort. | `over_sampling_batch_size` must equal `rollout_batch_size` in the current diffusion rollout path. |
+| Reward model throughput can dominate. | PickScore can become slower than OCR; increasing rollout batch alone may not improve end-to-end iteration time. |
+| Larger output tensors increase IO pressure. | For image and especially video rollout, output encoding/compression may become necessary before adding heavier transfer systems. |
+| Batch-invariant behavior still needs per-model validation. | Multi-output batching should be checked against serial generation for reward, trajectory, and log-prob behavior. |
+
+## Validation Checklist
+
+Before enabling large microgroups by default, validate the following for the
+target model and task:
+
+1. `diffusion_microgroup_size=1` and `diffusion_microgroup_size=N` produce
+   compatible rollout tensor shapes.
+2. Samples in a prompt group have distinct seeds and non-identical outputs.
+3. `rollout_log_probs` are finite and have the expected timestep shape.
+4. Train-side log-prob replay remains within the target tolerance.
+5. Reward group statistics are healthy, especially `group_std_avg` and
+   zero-std group counts.
+6. Rollout GPU utilization improves without pushing VAE decode or reward
+   scoring into the bottleneck.
+7. End-to-end iteration time improves, not only rollout generation time.
+
+For incremental testing, use this order:
+
+```text
+diffusion_microgroup_size = 1 -> 2 -> 4 -> 8 -> 16
+```
+
+Keep the rest of the recipe fixed while sweeping this knob.
+
+## Practical Guidance
+
+- Use `n_samples_per_prompt=16` for Miles Diffusion Qwen-Image OCR/PickScore
+  experiments unless intentionally changing the algorithmic group size.
+- Use `diffusion_microgroup_size=1` for correctness debugging.
+- Use `diffusion_microgroup_size=8` or `16` for throughput testing, depending on
+  available memory and reward-model cost.
+- Prefer OCR for early rollout-throughput validation because the reward path is
+  lighter than PickScore.
+- Treat PickScore results as end-to-end system tests, not pure rollout
+  benchmarks, because the reward model can dominate iteration time.
+- For T2V, start with few-frame or single-frame settings first. The same
+  single-prompt multi-generation idea applies, but video rollout has much higher
+  memory and IO pressure.
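The distinct-seed item in the validation checklist can be checked offline with a small sketch of the seed rule stated in this guide (`seed_base = rollout_seed + group_index * n_samples_per_prompt`, with the engine expanding `seed, seed + 1, ...` from the first seed of each request). The function names, the `(first_seed, num_outputs_per_prompt)` tuple layout, and the handling of a group size that is not a multiple of the microgroup size are hypothetical and chosen only for illustration; this is not the miles rollout code.

```python
def plan_microgroup_requests(rollout_seed, group_index, n_samples_per_prompt,
                             diffusion_microgroup_size):
    """Hypothetical planner: one (first_seed, num_outputs_per_prompt) pair per request."""
    if diffusion_microgroup_size < 1:
        raise ValueError("diffusion_microgroup_size must be at least 1")
    seed_base = rollout_seed + group_index * n_samples_per_prompt
    requests = []
    for offset in range(0, n_samples_per_prompt, diffusion_microgroup_size):
        # The last microgroup may be shorter if the sizes do not divide evenly (assumption).
        num_outputs = min(diffusion_microgroup_size, n_samples_per_prompt - offset)
        requests.append((seed_base + offset, num_outputs))
    return requests


def expanded_seeds(requests):
    """Seeds used if each request expands seed, seed + 1, ... from its first seed."""
    return [first + i for first, count in requests for i in range(count)]


# One prompt group of 16 samples split into two requests of 8.
requests = plan_microgroup_requests(rollout_seed=100, group_index=2,
                                    n_samples_per_prompt=16, diffusion_microgroup_size=8)
seeds = expanded_seeds(requests)

# The neighbouring group must not reuse any of these seeds.
next_group_seeds = expanded_seeds(
    plan_microgroup_requests(rollout_seed=100, group_index=3,
                             n_samples_per_prompt=16, diffusion_microgroup_size=8))
assert len(set(seeds)) == len(seeds)
assert set(seeds).isdisjoint(next_group_seeds)
```

Sweeping `diffusion_microgroup_size` over 1, 2, 4, 8, 16 in this sketch changes only how the seed range is cut into requests, not which seeds the group uses.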
+
+## Example Configuration
+
+OCR 4-GPU Miles Diffusion rollout:
+
+```bash
+--rollout-batch-size 32
+--n-samples-per-prompt 16
+--num-steps-per-rollout 2
+--diffusion-microgroup-size 16
+--micro-batch-size-sample 4
+--micro-batch-size-tstep 2
+--globalize-reward-std
+```
+
+PickScore 4-GPU Miles Diffusion rollout:
+
+```bash
+--rollout-batch-size 32
+--n-samples-per-prompt 16
+--num-steps-per-rollout 2
+--diffusion-microgroup-size 8
+--micro-batch-size-sample 8
+--micro-batch-size-tstep 1
+--globalize-reward-std
+```
+
+The OCR setting is a better first target for validating whether single-prompt
+multi-generation raises rollout utilization. The PickScore setting is more
+representative of full system behavior because it includes a heavier reward
+worker.

From 8fe9de3a10ffffeac53d2eeba311cf3267842248 Mon Sep 17 00:00:00 2001
From: MikukuOvO
Date: Sat, 9 May 2026 05:11:47 +0000
Subject: [PATCH 2/2] docs: clarify single-prompt multi-generation

---
 .../single_prompt_multi_generation.md | 29 ++++++++++++-------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/docs/advanced/single_prompt_multi_generation.md b/docs/advanced/single_prompt_multi_generation.md
index 600f9317de..a2e1a00dd2 100644
--- a/docs/advanced/single_prompt_multi_generation.md
+++ b/docs/advanced/single_prompt_multi_generation.md
@@ -4,10 +4,14 @@ Single-prompt multi-generation means generating multiple trajectories for the
 same prompt in one rollout group. In Miles Diffusion, this is both an algorithmic
 requirement and a performance knob:
 
-- Algorithmically, Miles Diffusion compares multiple sampled outputs within a
-  prompt group, so each prompt should have multiple sampled outputs.
-- System-wise, packing multiple samples from the same prompt into one SGLang-D
-  request can improve rollout throughput and reduce request overhead.
+- Algorithmically, GRPO-style RL needs multiple samples from the same prompt so
+  Miles Diffusion can compare rewards within that prompt group and compute a
+  group-relative advantage.
+- System-wise, Miles sends one same-prompt microgroup to SGLang-D with
+  `num_outputs_per_prompt=N`. SGLang-D text-encodes that prompt once, expands the
+  prompt conditioning to the latent batch in the fixed Qwen-Image path, and runs
+  the denoising timestep loop on the expanded batch. This removes repeated
+  encoder/request overhead and makes DiT forwards larger when memory allows.
 
 This document summarizes how miles-diffusion currently supports this path, what
 the main knobs mean, and what still needs validation before treating it as the
@@ -111,15 +115,17 @@ fixed and tune only `diffusion_microgroup_size`.
 ## Performance Expectations
 
 Single-prompt multi-generation is expected to help when rollout GPUs are
-under-utilized. It improves the rollout path by batching multiple same-prompt
-samples into one DiT denoising pass, instead of sending the same prompt as many
-independent single-output requests.
+under-utilized. In the SGLang-D path, Miles turns a microgroup into one request;
+SGLang-D creates `num_outputs_per_prompt` latent samples, reuses the text-encoder
+result and prompt conditioning for that prompt, and iterates the denoising
+timesteps over the expanded batch. With CFG, positive and negative branches are
+still separate, but each branch is batched across the microgroup.
 
 The expected benefits are:
 
 - fewer HTTP requests per rollout;
-- better rollout-side GPU utilization;
-- less duplicated prompt-conditioning overhead;
+- less duplicated text-encoder and prompt-conditioning work;
+- larger DiT forwards and better rollout-side GPU utilization;
 - higher samples-per-second when memory headroom is sufficient.
 
 The tradeoff is memory. Larger `diffusion_microgroup_size` increases the
@@ -140,8 +146,9 @@ sgl-project/sglang#21988
 ```
 
 That PR fixes the mismatch where latent samples are expanded for
-`num_outputs_per_prompt > 1`, but prompt and negative-prompt conditioning remain
-at the original prompt batch size.
+`num_outputs_per_prompt > 1`, but Qwen-Image prompt and negative-prompt
+conditioning remain at the original prompt batch size instead of being repeated
+to the expanded latent batch.
 
 Practical implication:
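For readers who want to picture the condition-batch mismatch described in the last hunk, here is a toy sketch in plain Python. It is not SGLang-D code: `expand_conditioning`, the nested-list stand-in for embedding tensors, and the per-prompt-contiguous ordering of the repeated entries are assumptions made only to illustrate the batch-size bookkeeping.

```python
def expand_conditioning(prompt_conditioning, num_outputs_per_prompt):
    """Hypothetical illustration: repeat each prompt's conditioning per output.

    Keeps the copies for one prompt contiguous, which is an assumed ordering.
    """
    return [cond for cond in prompt_conditioning for _ in range(num_outputs_per_prompt)]


prompt_conditioning = [[0.1, 0.2], [0.3, 0.4]]  # toy embeddings for a batch of 2 prompts
num_outputs_per_prompt = 3
latent_batch_size = len(prompt_conditioning) * num_outputs_per_prompt  # 6 latent samples

# Without expansion the conditioning batch (2) does not match the latent batch (6).
mismatch = len(prompt_conditioning) != latent_batch_size

expanded = expand_conditioning(prompt_conditioning, num_outputs_per_prompt)
```

After expansion, conditioning entry `i` belongs to latent sample `i`, so the two batches line up one to one.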