
[Batch 6] Greedy mesh compute shader — GPU-driven meshing #391

@MichaelFisher1997

Description

Summary

Implement a compute shader that reads chunk block data from the GPU storage buffer (#389) and produces vertex data directly on the GPU. This eliminates the CPU meshing bottleneck entirely — no worker thread meshing, no vertex upload, no staging buffer. The GPU builds meshes and immediately draws them.

Depends on: #389 (GPU block data buffer)

This is the capstone rendering optimization. Combined with MDI (#371), GPU culling (#379), and occlusion culling (#387), the entire rendering pipeline becomes GPU-driven.

Current Meshing Pipeline

  1. Worker thread reads Chunk.blocks on CPU
  2. Greedy mesher processes 16 subchunks, merges adjacent faces
  3. Produces []Vertex arrays (solid/cutout/fluid)
  4. Main thread uploads to GPU via staging buffer
  5. Vertex data lives in megabuffer, drawn via drawOffset()

Bottleneck: CPU meshing takes ~2–5 ms per chunk. With 1000+ chunks loading, the mesh queue is always behind the generation queue.

Target: Compute Meshing

Overview

[GPU Block Buffer] → [Compute Mesh Shader] → [Vertex Output Buffer] → [Draw]
                              │                                          ↑
                              └────→ [Indirect Draw Commands] ───────────┘
  1. Compute shader reads blocks from GpuBlockBuffer for one chunk
  2. For each block: check 6 face neighbors, generate visible faces
  3. Merge adjacent same-block-type faces (greedy merge)
  4. Write vertices to output buffer
  5. Write DrawIndirectCommand for draw dispatch
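Step 2 above (per-face visibility) can be prototyped on the CPU before porting it to the shader. A minimal C model, assuming a 16³ block array where type 0 is air and ignoring cross-chunk neighbors for brevity:

```c
#include <stdint.h>

#define N 16

// A face is visible when the block is solid and the neighbor across
// that face is air (or outside the array, in this simplified sketch).
static int is_solid(const uint8_t b[N][N][N], int x, int y, int z) {
    if (x < 0 || x >= N || y < 0 || y >= N || z < 0 || z >= N) return 0;
    return b[x][y][z] != 0;
}

// Count faces the mesher would emit: for every solid block, check
// its 6 neighbors and count each exposed face.
int count_visible_faces(const uint8_t b[N][N][N]) {
    static const int d[6][3] = {
        {1,0,0}, {-1,0,0}, {0,1,0}, {0,-1,0}, {0,0,1}, {0,0,-1}
    };
    int faces = 0;
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++)
            for (int z = 0; z < N; z++) {
                if (b[x][y][z] == 0) continue;
                for (int f = 0; f < 6; f++)
                    if (!is_solid(b, x + d[f][0], y + d[f][1], z + d[f][2]))
                        faces++;
            }
    return faces;
}
```

An isolated block yields 6 faces; two adjacent blocks yield 10, since the shared pair of faces is culled. The shader version performs the same test per thread, reading from the storage buffer instead of an array.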

Compute Shader Design

// mesh.comp
#version 450
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

struct Vertex { ... };  // fields omitted; matches the engine's packed vertex layout
struct DrawIndirectCommand { uint vertexCount; uint instanceCount; uint firstVertex; uint firstInstance; };

layout(std430, binding = 0) readonly buffer BlockData { uint blocks[]; };        // chunk blocks
layout(std430, binding = 1) writeonly buffer VertexOutput { Vertex vertices[]; };
layout(std430, binding = 2) buffer DrawCommand { DrawIndirectCommand cmd; };
layout(std430, binding = 3) readonly buffer NeighborData { uint neighbors[]; };  // 4 neighbor chunks

// Each workgroup processes one horizontal slice (16x16 blocks at one Y level).
// Within a slice, each thread handles one block column.
// For each block: check the 6 face neighbors and emit a quad for each visible face.
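With one workgroup per 16×16 slice, the host-side dispatch size is simple to derive. A sketch, assuming 16×256×16 chunks (i.e. the 16 stacked 16³ subchunks the CPU mesher processes) — the chunk height is an assumption, not confirmed by the issue:

```c
#include <stdint.h>

// Assumed chunk dimensions: 16 x 256 x 16 (16 subchunks of 16^3).
// local_size is 16x16x1, so one workgroup covers one Y-level slice.
enum { CHUNK_X = 16, CHUNK_Y = 256, CHUNK_Z = 16, LOCAL = 16 };

typedef struct { uint32_t x, y, z; } DispatchSize;

// Arguments that would be passed to vkCmdDispatch for one chunk.
DispatchSize mesh_dispatch_size(void) {
    DispatchSize d;
    d.x = CHUNK_X / LOCAL;  // 1 workgroup across X
    d.y = CHUNK_Z / LOCAL;  // 1 workgroup across Z
    d.z = CHUNK_Y;          // one workgroup per Y level
    return d;
}
```

That is 256 workgroups of 256 threads per chunk, exactly one thread per block.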

Greedy Merge on GPU

  • Per-slice face mask: uint face_mask[16][16] — 1 bit per face per block
  • Workgroup shared memory for the current slice
  • Reduction: merge adjacent same-type faces in shared memory
  • Output: variable-length vertex stream per workgroup
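The row stage of the reduction can be modeled off-GPU first. A C sketch that merges adjacent same-type faces in one 16-wide row into runs, where each run becomes a single quad (function names are illustrative):

```c
#include <stdint.h>

// row[i] holds the block type of a visible face at column i, 0 = no face.
// Merge adjacent same-type faces into horizontal runs; each run is one
// merged quad. Returns the quad count; quad q spans width[q] columns.
int merge_row(const uint8_t row[16], uint8_t width[16]) {
    int quads = 0;
    int i = 0;
    while (i < 16) {
        if (row[i] == 0) { i++; continue; }
        int start = i;
        while (i < 16 && row[i] == row[start]) i++;   // extend the run
        width[quads++] = (uint8_t)(i - start);
    }
    return quads;
}
```

A uniform row collapses to one 16-wide quad; mixed types break the run at each type boundary. On the GPU this runs against the face mask in shared memory, one row per thread or via a parallel scan.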

Output Management

  • Allocate output buffer slots atomically: atomicAdd(vertex_counter, count)
  • Each workgroup reserves space, writes vertices, updates draw command
  • Pipeline barrier between compute dispatch and graphics draw
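The reservation step maps directly onto a fetch-and-add. A C11 sketch of the same pattern the shader would use with `atomicAdd(vertex_counter, count)` (the counter name is hypothetical):

```c
#include <stdatomic.h>
#include <stdint.h>

// Global vertex counter shared by all "workgroups". Each workgroup
// reserves a contiguous range of output slots with one fetch-add;
// the returned value is the first index of its range, so ranges
// never overlap even under concurrent reservation.
static _Atomic uint32_t vertex_counter = 0;

uint32_t reserve_vertices(uint32_t count) {
    return atomic_fetch_add(&vertex_counter, count);
}
```

The final counter value doubles as the `vertexCount` written into the indirect draw command once all workgroups have finished.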

Implementation Plan

Step 1: Basic face culling compute shader

  • No greedy merge initially — just emit 1 quad per visible face
  • Verify correctness: same visual output as CPU mesher
  • Performance measurement: compare CPU vs GPU mesh time

Step 2: Greedy merge in shared memory

  • Face mask generation in shared memory
  • Row-by-row greedy merge within each slice
  • Column merge across slices
  • This is the hard part — may need multiple passes within the workgroup
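Combining the row and column merges gives classic 2D greedy meshing over a face mask: grow a run rightward, then extend it downward while every cell in the next row matches. A C sketch on a binary mask (single block type for brevity; the real mask carries the type per cell):

```c
#include <stdint.h>

#define W 16
#define H 16

// Greedy-merge a binary face mask into rectangles: find the first
// set cell, grow right as far as possible, then grow down while the
// whole candidate row is set, then clear the rectangle and repeat.
// Returns the number of rectangles (merged quads).
int merge_mask(uint8_t mask[H][W]) {
    int rects = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            if (!mask[y][x]) continue;
            int w = 1;
            while (x + w < W && mask[y][x + w]) w++;
            int h = 1;
            for (; y + h < H; h++) {
                int full_row = 1;
                for (int i = 0; i < w; i++)
                    if (!mask[y + h][x + i]) { full_row = 0; break; }
                if (!full_row) break;
            }
            for (int dy = 0; dy < h; dy++)       // consume the rectangle
                for (int dx = 0; dx < w; dx++)
                    mask[y + dy][x + dx] = 0;
            rects++;
        }
    return rects;
}
```

A full 16×16 mask collapses to a single quad. The sequential scan above is the tricky part to parallelize inside a workgroup, which is why the issue flags multiple passes as a likely necessity.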

Step 3: Cutout and fluid passes

  • Separate dispatch for cutout blocks (alpha-tested) and fluid blocks
  • Or: single dispatch with pass tag per vertex
  • Draw commands for each pass written to separate buffers
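The per-pass split amounts to classifying each block into one of three draw buckets. A C sketch with a hypothetical classification table (the block-type IDs are made up; a real engine would read this from its block registry):

```c
#include <stdint.h>

typedef enum { PASS_SOLID = 0, PASS_CUTOUT = 1, PASS_FLUID = 2 } Pass;

// Hypothetical mapping from block type to render pass.
Pass classify_block(uint32_t block_type) {
    switch (block_type) {
        case 8: case 9:  return PASS_FLUID;   // e.g. water, lava
        case 6: case 31: return PASS_CUTOUT;  // e.g. sapling, tall grass
        default:         return PASS_SOLID;
    }
}

// Tally blocks per pass; in the shader this decides which of the
// three draw-command buffers a workgroup's quads are appended to.
void count_passes(const uint32_t *blocks, int n, int counts[3]) {
    counts[0] = counts[1] = counts[2] = 0;
    for (int i = 0; i < n; i++)
        counts[classify_block(blocks[i])]++;
}
```

Whether this is three dispatches or one dispatch with a pass tag, the classification itself is the same per-block branch.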

Step 4: Neighbor data

  • Block data for the 4 cardinal neighbors is needed for boundary faces
  • Already uploaded in GpuBlockBuffer — just need the slot index mapping
  • Pass neighbor slot indices as push constants or uniform
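Boundary sampling then reduces to redirecting out-of-range coordinates into the right neighbor slot. A C sketch, assuming the four neighbor slots arrive in −X, +X, −Z, +Z order (the ordering and the struct are illustrative, not the engine's actual layout):

```c
#include <stdint.h>

#define CHUNK 16

// Slot indices into the GPU block buffer. Assumed neighbor order:
// 0 = -X, 1 = +X, 2 = -Z, 3 = +Z.
typedef struct {
    uint32_t self_slot;
    uint32_t neighbor_slots[4];
} ChunkSlots;

// Resolve which chunk slot a block lookup at (x, z) falls into,
// wrapping the coordinate into that neighbor's local space.
uint32_t resolve_slot(const ChunkSlots *s, int *x, int *z) {
    if (*x < 0)      { *x += CHUNK; return s->neighbor_slots[0]; }
    if (*x >= CHUNK) { *x -= CHUNK; return s->neighbor_slots[1]; }
    if (*z < 0)      { *z += CHUNK; return s->neighbor_slots[2]; }
    if (*z >= CHUNK) { *z -= CHUNK; return s->neighbor_slots[3]; }
    return s->self_slot;
}
```

In the shader this is the lookup path for face checks at x = 0, x = 15, z = 0, and z = 15; a sentinel slot value can mark a non-resident neighbor, in which case the face is treated as hidden.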

Step 5: Integration with render graph

  • RenderGraph: add MeshBuildPass before OpaquePass
  • Dispatch compute for all chunks that need remeshing
  • Pipeline barrier: COMPUTE_SHADER → VERTEX_SHADER
  • OpaquePass draws from the compute-generated vertex buffer

Step 6: CPU fallback

  • Keep CPU mesher for devices without adequate compute capability
  • Runtime detection: check maxComputeWorkGroupSize and maxComputeSharedMemorySize

Files to Create

  • assets/shaders/vulkan/mesh.comp — compute meshing shader
  • assets/shaders/vulkan/mesh.comp.spv
  • src/world/gpu_mesher.zig — dispatch management, output buffer lifecycle

Files to Modify

  • src/engine/graphics/render_graph.zig — add MeshBuildPass
  • src/world/world_renderer.zig — integrate GPU mesher dispatch
  • src/world/world_streamer.zig — trigger remesh via GPU instead of CPU job
  • src/engine/graphics/vulkan/pipeline_manager.zig — compute mesh pipeline
  • build.zig — glslangValidator check

Testing

  • Visual parity with CPU mesher (face-for-face match)
  • Greedy merge produces correct merged quads
  • Cutout and fluid passes render correctly
  • Boundary faces (neighbor chunks) handled correctly
  • Performance: GPU mesh time < 1ms per chunk (vs 2-5ms CPU)
  • Works at all render distances
  • CPU fallback works on devices without compute

Risks

  • Greedy merge on GPU is complex — shared memory management, synchronization within workgroups
  • May need iterative approach: start with face culling only, add greedy merge incrementally
  • Output buffer sizing is tricky — need atomic allocation with worst-case bounds
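The worst-case bound in the last bullet is computable up front: without merging, the pathological input is a 3D checkerboard, where half the blocks are solid and every face of each is exposed. A C sketch, assuming a 16³ subchunk, 4 vertices per quad, and a hypothetical 32-byte packed vertex:

```c
#include <stdint.h>

enum {
    SUBCHUNK_BLOCKS = 16 * 16 * 16,  // 4096
    VERTS_PER_QUAD  = 4,
    VERTEX_BYTES    = 32,            // assumed packed vertex size
};

// 3D checkerboard: 2048 solid blocks, all 6 faces visible on each.
uint32_t worst_case_quads(void) {
    return (SUBCHUNK_BLOCKS / 2) * 6;   // 12288 quads
}

uint32_t worst_case_bytes(void) {
    return worst_case_quads() * VERTS_PER_QUAD * VERTEX_BYTES;  // 1.5 MiB
}
```

That is 1.5 MiB per subchunk in the worst case, which is why atomic allocation against a generous shared pool is preferable to reserving the worst case per chunk.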

Roadmap: docs/PERFORMANCE_ROADMAP.md — Batch 6, Issue 4A-2
