
[Batch 6] Greedy mesh compute shader — GPU-driven meshing #391

@MichaelFisher1997

Description

Summary

Implement a compute shader that reads chunk block data from the GPU storage buffer (#389) and produces vertex data directly on the GPU. This eliminates the CPU meshing bottleneck entirely — no worker thread meshing, no vertex upload, no staging buffer. The GPU builds meshes and immediately draws them.

Depends on: #389 (GPU block data buffer)

This is the capstone rendering optimization. Combined with MDI (#371), GPU culling (#379), and occlusion culling (#387), the entire rendering pipeline becomes GPU-driven.

Current Meshing Pipeline

  1. Worker thread reads Chunk.blocks on CPU
  2. Greedy mesher processes 16 subchunks, merges adjacent faces
  3. Produces []Vertex arrays (solid/cutout/fluid)
  4. Main thread uploads to GPU via staging buffer
  5. Vertex data lives in megabuffer, drawn via drawOffset()

Bottleneck: CPU meshing takes ~2–5 ms per chunk. With 1000+ chunks loading, the mesh queue is always behind the generation queue.

Target: Compute Meshing

Overview

[GPU Block Buffer] → [Compute Mesh Shader] → [Vertex Output Buffer] → [Draw]
                              │                                          ↑
                              └────→ [Indirect Draw Commands] ───────────┘
  1. Compute shader reads blocks from GpuBlockBuffer for one chunk
  2. For each block: check 6 face neighbors, generate visible faces
  3. Merge adjacent same-block-type faces (greedy merge)
  4. Write vertices to output buffer
  5. Write DrawIndirectCommand for draw dispatch
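Step 2 above (per-face visibility) can be prototyped on the CPU before porting it to the shader. A minimal C model, assuming a 16³ block array where type 0 is air and ignoring cross-chunk neighbors for brevity:

```c
#include <stdint.h>

#define N 16

// A face is visible when the block is solid and the neighbor across
// that face is air (or outside the array, in this simplified sketch).
static int is_solid(const uint8_t b[N][N][N], int x, int y, int z) {
    if (x < 0 || x >= N || y < 0 || y >= N || z < 0 || z >= N) return 0;
    return b[x][y][z] != 0;
}

// Count faces the mesher would emit: for every solid block, check
// its 6 neighbors and count each exposed face.
int count_visible_faces(const uint8_t b[N][N][N]) {
    static const int d[6][3] = {
        {1,0,0}, {-1,0,0}, {0,1,0}, {0,-1,0}, {0,0,1}, {0,0,-1}
    };
    int faces = 0;
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++)
            for (int z = 0; z < N; z++) {
                if (b[x][y][z] == 0) continue;
                for (int f = 0; f < 6; f++)
                    if (!is_solid(b, x + d[f][0], y + d[f][1], z + d[f][2]))
                        faces++;
            }
    return faces;
}
```

An isolated block yields 6 faces; two adjacent blocks yield 10, since the shared pair of faces is culled. The shader version performs the same test per thread, reading from the storage buffer instead of an array.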

Compute Shader Design

// mesh.comp
#version 450
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

struct Vertex { ... };  // fields omitted; matches the engine's packed vertex layout
struct DrawIndirectCommand { uint vertexCount; uint instanceCount; uint firstVertex; uint firstInstance; };

layout(std430, binding = 0) readonly buffer BlockData { uint blocks[]; };        // chunk blocks
layout(std430, binding = 1) writeonly buffer VertexOutput { Vertex vertices[]; };
layout(std430, binding = 2) buffer DrawCommand { DrawIndirectCommand cmd; };
layout(std430, binding = 3) readonly buffer NeighborData { uint neighbors[]; };  // 4 neighbor chunks

// Each workgroup processes one horizontal slice (16x16 blocks at one Y level).
// Within a slice, each thread handles one block column.
// For each block: check the 6 face neighbors and emit a quad for each visible face.
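With one workgroup per 16×16 slice, the host-side dispatch size is simple to derive. A sketch, assuming 16×256×16 chunks (i.e. the 16 stacked 16³ subchunks the CPU mesher processes) — the chunk height is an assumption, not confirmed by the issue:

```c
#include <stdint.h>

// Assumed chunk dimensions: 16 x 256 x 16 (16 subchunks of 16^3).
// local_size is 16x16x1, so one workgroup covers one Y-level slice.
enum { CHUNK_X = 16, CHUNK_Y = 256, CHUNK_Z = 16, LOCAL = 16 };

typedef struct { uint32_t x, y, z; } DispatchSize;

// Arguments that would be passed to vkCmdDispatch for one chunk.
DispatchSize mesh_dispatch_size(void) {
    DispatchSize d;
    d.x = CHUNK_X / LOCAL;  // 1 workgroup across X
    d.y = CHUNK_Z / LOCAL;  // 1 workgroup across Z
    d.z = CHUNK_Y;          // one workgroup per Y level
    return d;
}
```

That is 256 workgroups of 256 threads per chunk, exactly one thread per block.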

Greedy Merge on GPU

  • Per-slice face mask: uint face_mask[16][16] — 1 bit per face per block
  • Workgroup shared memory for the current slice
  • Reduction: merge adjacent same-type faces in shared memory
  • Output: variable-length vertex stream per workgroup
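The row stage of the reduction can be modeled off-GPU first. A C sketch that merges adjacent same-type faces in one 16-wide row into runs, where each run becomes a single quad (function names are illustrative):

```c
#include <stdint.h>

// row[i] holds the block type of a visible face at column i, 0 = no face.
// Merge adjacent same-type faces into horizontal runs; each run is one
// merged quad. Returns the quad count; quad q spans width[q] columns.
int merge_row(const uint8_t row[16], uint8_t width[16]) {
    int quads = 0;
    int i = 0;
    while (i < 16) {
        if (row[i] == 0) { i++; continue; }
        int start = i;
        while (i < 16 && row[i] == row[start]) i++;   // extend the run
        width[quads++] = (uint8_t)(i - start);
    }
    return quads;
}
```

A uniform row collapses to one 16-wide quad; mixed types break the run at each type boundary. On the GPU this runs against the face mask in shared memory, one row per thread or via a parallel scan.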

Output Management

  • Allocate output buffer slots atomically: atomicAdd(vertex_counter, count)
  • Each workgroup reserves space, writes vertices, updates draw command
  • Pipeline barrier between compute dispatch and graphics draw
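The reservation step maps directly onto a fetch-and-add. A C11 sketch of the same pattern the shader would use with `atomicAdd(vertex_counter, count)` (the counter name is hypothetical):

```c
#include <stdatomic.h>
#include <stdint.h>

// Global vertex counter shared by all "workgroups". Each workgroup
// reserves a contiguous range of output slots with one fetch-add;
// the returned value is the first index of its range, so ranges
// never overlap even under concurrent reservation.
static _Atomic uint32_t vertex_counter = 0;

uint32_t reserve_vertices(uint32_t count) {
    return atomic_fetch_add(&vertex_counter, count);
}
```

The final counter value doubles as the `vertexCount` written into the indirect draw command once all workgroups have finished.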

Implementation Plan

Step 1: Basic face culling compute shader

  • No greedy merge initially — just emit 1 quad per visible face
  • Verify correctness: same visual output as CPU mesher
  • Performance measurement: compare CPU vs GPU mesh time

Step 2: Greedy merge in shared memory

  • Face mask generation in shared memory
  • Row-by-row greedy merge within each slice
  • Column merge across slices
  • This is the hard part — may need multiple passes within the workgroup
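Combining the row and column merges gives classic 2D greedy meshing over a face mask: grow a run rightward, then extend it downward while every cell in the next row matches. A C sketch on a binary mask (single block type for brevity; the real mask carries the type per cell):

```c
#include <stdint.h>

#define W 16
#define H 16

// Greedy-merge a binary face mask into rectangles: find the first
// set cell, grow right as far as possible, then grow down while the
// whole candidate row is set, then clear the rectangle and repeat.
// Returns the number of rectangles (merged quads).
int merge_mask(uint8_t mask[H][W]) {
    int rects = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            if (!mask[y][x]) continue;
            int w = 1;
            while (x + w < W && mask[y][x + w]) w++;
            int h = 1;
            for (; y + h < H; h++) {
                int full_row = 1;
                for (int i = 0; i < w; i++)
                    if (!mask[y + h][x + i]) { full_row = 0; break; }
                if (!full_row) break;
            }
            for (int dy = 0; dy < h; dy++)       // consume the rectangle
                for (int dx = 0; dx < w; dx++)
                    mask[y + dy][x + dx] = 0;
            rects++;
        }
    return rects;
}
```

A full 16×16 mask collapses to a single quad. The sequential scan above is the tricky part to parallelize inside a workgroup, which is why the issue flags multiple passes as a likely necessity.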

Step 3: Cutout and fluid passes

  • Separate dispatch for cutout blocks (alpha-tested) and fluid blocks
  • Or: single dispatch with pass tag per vertex
  • Draw commands for each pass written to separate buffers
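The per-pass split amounts to classifying each block into one of three draw buckets. A C sketch with a hypothetical classification table (the block-type IDs are made up; a real engine would read this from its block registry):

```c
#include <stdint.h>

typedef enum { PASS_SOLID = 0, PASS_CUTOUT = 1, PASS_FLUID = 2 } Pass;

// Hypothetical mapping from block type to render pass.
Pass classify_block(uint32_t block_type) {
    switch (block_type) {
        case 8: case 9:  return PASS_FLUID;   // e.g. water, lava
        case 6: case 31: return PASS_CUTOUT;  // e.g. sapling, tall grass
        default:         return PASS_SOLID;
    }
}

// Tally blocks per pass; in the shader this decides which of the
// three draw-command buffers a workgroup's quads are appended to.
void count_passes(const uint32_t *blocks, int n, int counts[3]) {
    counts[0] = counts[1] = counts[2] = 0;
    for (int i = 0; i < n; i++)
        counts[classify_block(blocks[i])]++;
}
```

Whether this is three dispatches or one dispatch with a pass tag, the classification itself is the same per-block branch.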

Step 4: Neighbor data

  • Block data for the 4 cardinal neighbors is needed for boundary faces
  • Already uploaded in GpuBlockBuffer — just need the slot index mapping
  • Pass neighbor slot indices as push constants or uniform
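Boundary sampling then reduces to redirecting out-of-range coordinates into the right neighbor slot. A C sketch, assuming the four neighbor slots arrive in −X, +X, −Z, +Z order (the ordering and the struct are illustrative, not the engine's actual layout):

```c
#include <stdint.h>

#define CHUNK 16

// Slot indices into the GPU block buffer. Assumed neighbor order:
// 0 = -X, 1 = +X, 2 = -Z, 3 = +Z.
typedef struct {
    uint32_t self_slot;
    uint32_t neighbor_slots[4];
} ChunkSlots;

// Resolve which chunk slot a block lookup at (x, z) falls into,
// wrapping the coordinate into that neighbor's local space.
uint32_t resolve_slot(const ChunkSlots *s, int *x, int *z) {
    if (*x < 0)      { *x += CHUNK; return s->neighbor_slots[0]; }
    if (*x >= CHUNK) { *x -= CHUNK; return s->neighbor_slots[1]; }
    if (*z < 0)      { *z += CHUNK; return s->neighbor_slots[2]; }
    if (*z >= CHUNK) { *z -= CHUNK; return s->neighbor_slots[3]; }
    return s->self_slot;
}
```

In the shader this is the lookup path for face checks at x = 0, x = 15, z = 0, and z = 15; a sentinel slot value can mark a non-resident neighbor, in which case the face is treated as hidden.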

Step 5: Integration with render graph

  • RenderGraph: add MeshBuildPass before OpaquePass
  • Dispatch compute for all chunks that need remeshing
  • Pipeline barrier: COMPUTE_SHADER → VERTEX_SHADER
  • OpaquePass draws from the compute-generated vertex buffer

Step 6: CPU fallback

  • Keep CPU mesher for devices without adequate compute capability
  • Runtime detection: check maxComputeWorkGroupSize and maxComputeSharedMemorySize

Files to Create

  • assets/shaders/vulkan/mesh.comp — compute meshing shader
  • assets/shaders/vulkan/mesh.comp.spv
  • src/world/gpu_mesher.zig — dispatch management, output buffer lifecycle

Files to Modify

  • src/engine/graphics/render_graph.zig — add MeshBuildPass
  • src/world/world_renderer.zig — integrate GPU mesher dispatch
  • src/world/world_streamer.zig — trigger remesh via GPU instead of CPU job
  • src/engine/graphics/vulkan/pipeline_manager.zig — compute mesh pipeline
  • build.zig — glslangValidator check

Testing

  • Visual parity with CPU mesher (face-for-face match)
  • Greedy merge produces correct merged quads
  • Cutout and fluid passes render correctly
  • Boundary faces (neighbor chunks) handled correctly
  • Performance: GPU mesh time < 1ms per chunk (vs 2-5ms CPU)
  • Works at all render distances
  • CPU fallback works on devices without compute

Risks

  • Greedy merge on GPU is complex — shared memory management, synchronization within workgroups
  • May need iterative approach: start with face culling only, add greedy merge incrementally
  • Output buffer sizing is tricky — need atomic allocation with worst-case bounds
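The worst-case bound in the last bullet is computable up front: without merging, the pathological input is a 3D checkerboard, where half the blocks are solid and every face of each is exposed. A C sketch, assuming a 16³ subchunk, 4 vertices per quad, and a hypothetical 32-byte packed vertex:

```c
#include <stdint.h>

enum {
    SUBCHUNK_BLOCKS = 16 * 16 * 16,  // 4096
    VERTS_PER_QUAD  = 4,
    VERTEX_BYTES    = 32,            // assumed packed vertex size
};

// 3D checkerboard: 2048 solid blocks, all 6 faces visible on each.
uint32_t worst_case_quads(void) {
    return (SUBCHUNK_BLOCKS / 2) * 6;   // 12288 quads
}

uint32_t worst_case_bytes(void) {
    return worst_case_quads() * VERTS_PER_QUAD * VERTEX_BYTES;  // 1.5 MiB
}
```

That is 1.5 MiB per subchunk in the worst case, which is why atomic allocation against a generous shared pool is preferable to reserving the worst case per chunk.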

Roadmap: docs/PERFORMANCE_ROADMAP.md — Batch 6, Issue 4A-2
