[Batch 6] Greedy mesh compute shader — GPU-driven meshing #391
Labels: batch-6 (Batch 6: Capstone), documentation, engine, enhancement, perf/gpu-compute, shaders
Summary
Implement a compute shader that reads chunk block data from the GPU storage buffer (#389) and produces vertex data directly on the GPU. This eliminates the CPU meshing bottleneck entirely — no worker thread meshing, no vertex upload, no staging buffer. The GPU builds meshes and immediately draws them.
Depends on: #389 (GPU block data buffer)
This is the capstone rendering optimization. Combined with MDI (#371), GPU culling (#379), and occlusion culling (#387), the entire rendering pipeline becomes GPU-driven.
Current Meshing Pipeline
- Worker thread reads `Chunk.blocks` on CPU
- Greedy mesher processes 16 subchunks, merges adjacent faces
- Produces `[]Vertex` arrays (solid/cutout/fluid)
- Main thread uploads to GPU via staging buffer
- Vertex data lives in the megabuffer, drawn via `drawOffset()`
Bottleneck: CPU meshing at ~2-5ms per chunk. With 1000+ chunks loading, mesh queue is always behind generation queue.
Target: Compute Meshing
Overview
[GPU Block Buffer] → [Compute Mesh Shader] → [Vertex Output Buffer] → [Draw]
↑
[Indirect Draw Commands] ←─┘
- Compute shader reads blocks from `GpuBlockBuffer` for one chunk
- For each block: check 6 face neighbors, generate visible faces
- Merge adjacent same-block-type faces (greedy merge)
- Write vertices to the output buffer
- Write a `DrawIndirectCommand` for draw dispatch
Compute Shader Design
```glsl
// mesh.comp
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;
layout(binding = 0) readonly buffer BlockData { uint blocks[]; };       // chunk blocks
layout(binding = 1) writeonly buffer VertexOutput { Vertex vertices[]; };
layout(binding = 2) buffer DrawCommand { DrawIndirectCommand cmd; };
layout(binding = 3) readonly buffer NeighborData { uint neighbors[]; }; // 4 neighbor chunks

// Each workgroup processes one horizontal slice (16x16 blocks at one Y level)
// Within slice: each thread processes one block column
// For each block: check 6 faces, emit visible quad vertices
```
Greedy Merge on GPU
- Per-slice face mask: `uint face_mask[16][16]` — 1 bit per face per block
- Workgroup shared memory holds the current slice
- Reduction: merge adjacent same-type faces in shared memory
- Output: variable-length vertex stream per workgroup
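A possible shape for the shared-memory stage described above — this is only a sketch; `load_block`, `neighbor_is_air`, and `AIR` are hypothetical helpers, and the exact layout would be settled during implementation:

```glsl
// Sketch: per-slice face visibility in workgroup shared memory.
// Each thread owns one (x, z) cell of the slice, so no atomics are needed here.
shared uint face_mask[16][16];   // [x][z] — bits 0..5 mark visible faces
shared uint block_type[16][16];  // block id per (x, z) in the slice

void classify(uint x, uint z, uint y) {
    uint id = load_block(x, y, z);          // read from BlockData
    uint mask = 0u;
    if (id != AIR) {
        for (uint f = 0u; f < 6u; ++f)
            if (neighbor_is_air(x, y, z, f))
                mask |= 1u << f;
    }
    face_mask[x][z]  = mask;
    block_type[x][z] = id;
    barrier();   // all masks written before the merge reduction reads them
}
```

The `barrier()` is the key correctness point: the merge reduction must not start until every thread has published its mask.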
Output Management
- Allocate output buffer slots atomically: `atomicAdd(vertex_counter, count)`
- Each workgroup reserves space, writes vertices, updates the draw command
- Pipeline barrier between compute dispatch and graphics draw
Implementation Plan
Step 1: Basic face culling compute shader
- No greedy merge initially — just emit 1 quad per visible face
- Verify correctness: same visual output as CPU mesher
- Performance measurement: compare CPU vs GPU mesh time
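The Step 1 kernel body could be as simple as the following sketch (no merging yet; `load_block_or_neighbor`, `face_offset`, `write_quad`, and `AIR` are hypothetical helpers):

```glsl
// Sketch: emit one quad per visible face, no greedy merge.
void emit_block_faces(ivec3 p) {
    uint id = load_block(p);
    if (id == AIR) return;
    for (uint f = 0u; f < 6u; ++f) {
        ivec3 n = p + face_offset(f);              // may cross a chunk boundary
        if (load_block_or_neighbor(n) == AIR) {
            uint base = atomicAdd(cmd.vertexCount, 4u);
            write_quad(base, p, f, id);            // 4 vertices into VertexOutput
        }
    }
}
```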
Step 2: Greedy merge in shared memory
- Face mask generation in shared memory
- Row-by-row greedy merge within each slice
- Column merge across slices
- This is the hard part — may need multiple passes within the workgroup
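Before porting the row merge to shared memory, it helps to pin down the scalar algorithm. A minimal CPU reference for merging one 16-wide row — visibility as a bitmask, one quad per maximal run of same-typed visible faces (the GPU version would do the same scan per row in shared memory):

```c
#include <stdint.h>
#include <assert.h>

/* One merged quad: starting column, run length, block type. */
typedef struct { int start, len; uint16_t type; } Quad;

/* Scan a 16-bit visibility mask left to right, merging consecutive
   visible faces of the same block type into a single quad. Returns
   the number of quads written to out. */
int merge_row(uint16_t mask, const uint16_t type[16], Quad out[16]) {
    int n = 0, x = 0;
    while (x < 16) {
        if (!(mask & (1u << x))) { x++; continue; }
        int start = x;
        uint16_t t = type[x];
        /* extend the run while faces stay visible and same-typed */
        while (x < 16 && (mask & (1u << x)) && type[x] == t) x++;
        out[n++] = (Quad){ start, x - start, t };
    }
    return n;
}
```

Column merge across slices (Step 2's last bullet) is the same idea applied vertically to the quads this pass produces.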
Step 3: Cutout and fluid passes
- Separate dispatch for cutout blocks (alpha-tested) and fluid blocks
- Or: single dispatch with pass tag per vertex
- Draw commands for each pass written to separate buffers
Step 4: Neighbor data
- Block data for 4 cardinal neighbors needed for boundary faces
- Already uploaded in `GpuBlockBuffer` — just need the slot index mapping
- Pass neighbor slot indices as push constants or a uniform
Step 5: Integration with render graph
- `RenderGraph`: add `MeshBuildPass` before `OpaquePass`
- Dispatch compute for all chunks that need remeshing
- Pipeline barrier: `COMPUTE_SHADER → VERTEX_SHADER`
- `OpaquePass` draws from the compute-generated vertex buffer
Step 6: CPU fallback
- Keep CPU mesher for devices without adequate compute capability
- Runtime detection: check `maxComputeWorkGroupSize` and `maxComputeSharedMemorySize`
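The detection could reduce to a check like this sketch. The field names mirror the real `VkPhysicalDeviceLimits` members; the struct here is a self-contained stand-in, and the 4 KiB shared-memory threshold is illustrative, not measured:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Stand-in for the relevant VkPhysicalDeviceLimits fields
   (in real code, read these via vkGetPhysicalDeviceProperties). */
typedef struct {
    uint32_t maxComputeWorkGroupSize[3];
    uint32_t maxComputeSharedMemorySize;
} ComputeLimits;

/* True if the device can run the 16x16 meshing workgroup with the
   per-slice shared arrays (face masks + block types). */
bool can_gpu_mesh(const ComputeLimits *l) {
    return l->maxComputeWorkGroupSize[0] >= 16
        && l->maxComputeWorkGroupSize[1] >= 16
        && l->maxComputeSharedMemorySize >= 4096; /* illustrative floor */
}
```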
Files to Create
- `assets/shaders/vulkan/mesh.comp` — compute meshing shader
- `assets/shaders/vulkan/mesh.comp.spv` — compiled SPIR-V
- `src/world/gpu_mesher.zig` — dispatch management, output buffer lifecycle
Files to Modify
- `src/engine/graphics/render_graph.zig` — add MeshBuildPass
- `src/world/world_renderer.zig` — integrate GPU mesher dispatch
- `src/world/world_streamer.zig` — trigger remesh via GPU instead of CPU job
- `src/engine/graphics/vulkan/pipeline_manager.zig` — compute mesh pipeline
- `build.zig` — glslangValidator check
Testing
- Visual parity with CPU mesher (face-for-face match)
- Greedy merge produces correct merged quads
- Cutout and fluid passes render correctly
- Boundary faces (neighbor chunks) handled correctly
- Performance: GPU mesh time < 1ms per chunk (vs 2-5ms CPU)
- Works at all render distances
- CPU fallback works on devices without compute
Risks
- Greedy merge on GPU is complex — shared memory management, synchronization within workgroups
- May need iterative approach: start with face culling only, add greedy merge incrementally
- Output buffer sizing is tricky — need atomic allocation with worst-case bounds
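The worst-case bound is at least easy to state. For a 16x16x16 subchunk, a solid/air checkerboard maximizes exposed faces (half the blocks solid, every face of each visible), so the atomic allocator's ceiling per subchunk is fixed arithmetic:

```c
#include <stdint.h>
#include <assert.h>

/* Worst-case vertex output for one 16x16x16 subchunk without greedy
   merging. A solid/air checkerboard is the true maximum: half the
   blocks are solid and all 6 faces of each are exposed. */
enum { BLOCKS = 16 * 16 * 16, FACES = 6, VERTS_PER_FACE = 4 };

uint32_t worst_case_vertices(void) {
    return (BLOCKS / 2) * FACES * VERTS_PER_FACE;
}
```

That works out to 49,152 vertices per subchunk; at, say, 32 bytes per vertex that is 1.5 MiB, which argues for sizing the shared output buffer well below worst case and falling back (or re-dispatching) on overflow rather than preallocating the ceiling for every chunk in flight.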
Roadmap: docs/PERFORMANCE_ROADMAP.md — Batch 6, Issue 4A-2