[Batch 4] Dedicated transfer queue + staging buffer ring #384

@MichaelFisher1997

Summary

Create a dedicated Vulkan transfer queue with a ring-buffered staging allocator for chunk mesh uploads. Currently mesh uploads happen on the graphics queue and may cause stalls. A dedicated transfer queue allows uploads to overlap with rendering.

Depends on: #376 (new vertex format — staging buffer stride must match)

Current State

Chunk mesh uploads happen via GlobalVertexAllocator.upload():

  1. Worker thread builds mesh vertices on CPU
  2. Main thread calls rm.updateBuffer() or maps vertex buffer memory
  3. Vertex data copied to GPU-visible memory
  4. Chunk becomes renderable

The rm.updateBuffer() path may stall the graphics queue if the buffer is in use by a previous frame. At 128+ chunks with constant streaming, this creates frame hitches.

Target Architecture

Dedicated Transfer Queue

  • Query Vulkan for a queue family that reports VK_QUEUE_TRANSFER_BIT but not VK_QUEUE_GRAPHICS_BIT (a dedicated transfer family, if the device exposes one)
  • Fallback: share graphics queue if no dedicated transfer queue exists
  • Create command pool + command buffer for transfer operations
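
A minimal sketch of the family selection described above, assuming Zig 0.11 or later. It works on plain flag words rather than real VkQueueFamilyProperties so it can run without a Vulkan device. `pickTransferFamily` and `TransferQueueChoice` are hypothetical names, and preferring a family that also lacks COMPUTE is an extra assumption not stated in this issue.

```zig
const std = @import("std");

// Bit values match VkQueueFlagBits in the Vulkan headers.
const QUEUE_GRAPHICS_BIT: u32 = 0x1;
const QUEUE_COMPUTE_BIT: u32 = 0x2;
const QUEUE_TRANSFER_BIT: u32 = 0x4;

/// Hypothetical result type: which family to use and whether it is
/// separate from the graphics family.
const TransferQueueChoice = struct {
    family_index: usize,
    dedicated: bool,
};

/// `family_flags[i]` stands in for VkQueueFamilyProperties.queueFlags of family i.
/// Order of preference: transfer-only family, then any family with TRANSFER but
/// without GRAPHICS, then the graphics family as the shared fallback.
fn pickTransferFamily(family_flags: []const u32, graphics_family: usize) TransferQueueChoice {
    var non_graphics: ?usize = null;
    for (family_flags, 0..) |flags, i| {
        if ((flags & QUEUE_TRANSFER_BIT) == 0) continue;
        if ((flags & QUEUE_GRAPHICS_BIT) != 0) continue;
        if ((flags & QUEUE_COMPUTE_BIT) == 0) {
            // Transfer-only family: best match, stop searching.
            return .{ .family_index = i, .dedicated = true };
        }
        if (non_graphics == null) non_graphics = i;
    }
    if (non_graphics) |i| return .{ .family_index = i, .dedicated = true };
    return .{ .family_index = graphics_family, .dedicated = false };
}

test "falls back to the graphics family when nothing else offers transfer" {
    const families = [_]u32{QUEUE_GRAPHICS_BIT | QUEUE_COMPUTE_BIT | QUEUE_TRANSFER_BIT};
    const choice = pickTransferFamily(&families, 0);
    try std.testing.expect(choice.family_index == 0 and !choice.dedicated);
}
```

The `dedicated` flag lets the caller skip cross-queue synchronization when the fallback path is taken.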

Staging Buffer Ring

// Interface sketch: function bodies are omitted.
const StagingRing = struct {
    buffer: VkBuffer,         // TRANSFER_SRC staging buffer
    memory: VkDeviceMemory,   // HOST_VISIBLE backing allocation
    mapped: [*]u8,            // persistent mapping of `memory`
    capacity: usize,          // total ring size (e.g., 64MB)
    head: usize,              // write cursor
    tail: usize,              // read cursor (last completed fence)
    frame_fences: [MAX_FRAMES_IN_FLIGHT]VkFence,  // one fence per in-flight frame
    frame_offsets: [MAX_FRAMES_IN_FLIGHT]usize,   // ring offset recorded for each frame

    pub fn beginFrame(self: *StagingRing, frame_index: usize) usize;
    pub fn allocate(self: *StagingRing, size: usize, alignment: usize) ![]u8;
    pub fn submit(self: *StagingRing, transfers: []Transfer) !void;
};

  • Ring buffer: HOST_VISIBLE memory, persistently mapped
  • allocate() returns a slice into the ring for the caller to write into
  • submit() records vkCmdCopyBuffer commands and submits them to the transfer queue
  • beginFrame() advances tail based on previous frame's fence completion

Upload Flow

  1. Worker thread produces mesh vertex data on CPU
  2. Main thread: staging.allocate(size, alignment) → get CPU-visible slice
  3. Copy vertex data into staging slice
  4. Record vkCmdCopyBuffer(staging → megabuffer at offset)
  5. Submit to transfer queue with fence
  6. Next frame: fence completed → staging ring space reclaimed
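
Steps 2 to 4 can be sketched on the CPU side as follows, assuming Zig 0.11 or later. The `Transfer` type is referenced by submit() but not defined in this issue, so the field layout here is hypothetical; it mirrors the srcOffset, dstOffset and size fields of VkBufferCopy. `stageMesh` is likewise an illustrative helper, not an existing function.

```zig
const std = @import("std");

/// Hypothetical shape for the `Transfer` element type; fields mirror VkBufferCopy.
const Transfer = struct {
    src_offset: usize, // byte offset inside the staging ring
    dst_offset: usize, // byte offset inside the device-local megabuffer
    size: usize,
};

/// Copies CPU vertex bytes into a staging slice and describes the GPU copy
/// that should follow. `ring_base` is the start of the mapped ring and `slot`
/// is a slice previously handed out by the ring allocator.
fn stageMesh(
    ring_base: [*]const u8,
    slot: []u8,
    vertices: []const u8,
    megabuffer_offset: usize,
) Transfer {
    std.debug.assert(slot.len >= vertices.len);
    @memcpy(slot[0..vertices.len], vertices);
    return .{
        .src_offset = @intFromPtr(slot.ptr) - @intFromPtr(ring_base),
        .dst_offset = megabuffer_offset,
        .size = vertices.len,
    };
}

test "staged bytes land in the slot and the copy region matches" {
    var ring: [32]u8 = undefined;
    const verts = [_]u8{ 9, 8, 7 };
    const t = stageMesh(&ring, ring[8..16], &verts, 256);
    try std.testing.expect(t.src_offset == 8 and t.dst_offset == 256 and t.size == 3);
    try std.testing.expect(ring[8] == 9 and ring[10] == 7);
}
```

Each returned `Transfer` would become one region of a vkCmdCopyBuffer call from the staging buffer into the megabuffer.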

Implementation Plan

Step 1: Transfer queue detection + setup

  • In VulkanDevice init: query queue families for dedicated transfer queue
  • Create command pool with TRANSIENT_BIT + RESET_COMMAND_BUFFER_BIT
  • If no dedicated queue: share graphics queue (no parallelism, but staging still works)

Step 2: Staging ring allocator

  • Allocate a buffer with VK_BUFFER_USAGE_TRANSFER_SRC_BIT usage, backed by HOST_VISIBLE | HOST_COHERENT memory
  • 64MB default capacity (configurable)
  • Ring-wrap: if allocation would wrap, pad to end and allocate from start
  • Frame-aware: each frame's allocations are released when that frame's fence completes
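
The wrap and per-frame release rules above can be modeled without any Vulkan objects. The sketch below assumes Zig 0.11 or later and uses a plain byte slice in place of the mapped buffer. `RingModel`, `used`, `frame_bytes` and `current_frame` are illustrative names rather than the proposed StagingRing fields, and the model assumes frames are begun in round-robin order so that releases happen in allocation order.

```zig
const std = @import("std");

const MAX_FRAMES_IN_FLIGHT = 2;

/// CPU-only model of the ring; `storage` stands in for the mapped buffer.
const RingModel = struct {
    storage: []u8,
    head: usize = 0, // write cursor
    tail: usize = 0, // oldest byte still owned by an in-flight frame
    used: usize = 0, // bytes between tail and head, padding included
    frame_bytes: [MAX_FRAMES_IN_FLIGHT]usize = [_]usize{0} ** MAX_FRAMES_IN_FLIGHT,
    current_frame: usize = 0,

    /// Call only after the fence for `frame_index` has signaled: whatever that
    /// frame allocated on its previous turn is released.
    fn beginFrame(self: *RingModel, frame_index: usize) void {
        const released = self.frame_bytes[frame_index];
        self.tail = (self.tail + released) % self.storage.len;
        self.used -= released;
        self.frame_bytes[frame_index] = 0;
        self.current_frame = frame_index;
    }

    /// `alignment` must be a power of two.
    fn allocate(self: *RingModel, size: usize, alignment: usize) error{OutOfStagingSpace}![]u8 {
        if (size > self.storage.len) return error.OutOfStagingSpace;
        var start = (self.head + alignment - 1) & ~(alignment - 1);
        var consumed = (start - self.head) + size;
        if (start + size > self.storage.len) {
            // Would wrap: pad to the end of the ring and restart at offset 0.
            consumed = (self.storage.len - self.head) + size;
            start = 0;
        }
        if (self.used + consumed > self.storage.len) return error.OutOfStagingSpace;
        self.head = (start + size) % self.storage.len;
        self.used += consumed;
        self.frame_bytes[self.current_frame] += consumed;
        return self.storage[start .. start + size];
    }
};

test "ring wraps and reclaims per frame" {
    var backing: [64]u8 = undefined;
    var ring = RingModel{ .storage = &backing };

    ring.beginFrame(0);
    const a = try ring.allocate(40, 16); // occupies [0, 40)

    ring.beginFrame(1);
    // Only 24 bytes are free, so a 32-byte request must fail for now.
    try std.testing.expectError(error.OutOfStagingSpace, ring.allocate(32, 16));

    ring.beginFrame(0); // frame 0's fence signaled: [0, 40) is free again
    const b = try ring.allocate(32, 16); // pads [40, 64) and wraps to offset 0
    try std.testing.expect(b.ptr == a.ptr);
}
```

Counting the padding as part of the frame's allocation keeps the release arithmetic simple: advancing `tail` by `frame_bytes` always lands on the first byte of the next frame's data.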

Step 3: Integration with GlobalVertexAllocator

  • upload() method: instead of direct map+copy, allocate from staging ring, copy, then issue GPU-side copy
  • The megabuffer stays device-local (not host-visible), which is better for GPU performance
  • vkCmdCopyBuffer from staging → device-local megabuffer

Step 4: Queue synchronization

  • Transfer queue fence: signals when copy completes
  • Semaphore: transfer complete → graphics queue can consume the data
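  • Or: use vkQueueSubmit with a timeline semaphore if available (core in Vulkan 1.2, or via VK_KHR_timeline_semaphore), which can serve both the CPU-side and GPU-side waits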
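
The timeline option can be reasoned about with a small CPU-side model, assuming Zig 0.11 or later. Each transfer submit gets a monotonically increasing ticket, and a chunk becomes renderable once the observed counter reaches its ticket. `UploadTimeline` and its methods are hypothetical names; in a real build the observed value would come from the driver, for example from vkGetSemaphoreCounterValue.

```zig
const std = @import("std");

/// CPU-side model of a timeline semaphore shared by the transfer and
/// graphics queues. No Vulkan calls are made here.
const UploadTimeline = struct {
    next_ticket: u64 = 1, // value the next transfer submit will signal
    completed: u64 = 0, // highest value known to be signaled

    /// Called when a batch of copies is submitted; the returned ticket is the
    /// value the transfer queue would signal once that batch finishes.
    fn submit(self: *UploadTimeline) u64 {
        const ticket = self.next_ticket;
        self.next_ticket += 1;
        return ticket;
    }

    /// Records the latest counter value read back from the semaphore.
    /// Timeline values never decrease, so stale reads are ignored.
    fn observe(self: *UploadTimeline, signaled_value: u64) void {
        if (signaled_value > self.completed) self.completed = signaled_value;
    }

    /// A chunk may be drawn only after the copy that filled its vertices finished.
    fn isRenderable(self: UploadTimeline, ticket: u64) bool {
        return ticket <= self.completed;
    }
};

test "chunks become renderable in submit order" {
    var timeline = UploadTimeline{};
    const first = timeline.submit();
    const second = timeline.submit();
    timeline.observe(first);
    try std.testing.expect(timeline.isRenderable(first));
    try std.testing.expect(!timeline.isRenderable(second));
}
```

The same ticket could also decide when staging ring space is safe to reclaim, which would let one counter replace the per-frame fences.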

Files to Create

  • src/engine/graphics/vulkan/transfer_queue.zig — queue management, staging ring

Files to Modify

  • src/engine/graphics/vulkan/device.zig — transfer queue detection
  • src/world/chunk_allocator.zig — upload() uses staging ring
  • src/world/lod_upload_queue.zig — LOD uploads use staging ring
  • src/engine/graphics/vulkan/rhi_resource_lifecycle.zig — resource transitions for copies

Testing

  • Chunk meshes upload correctly via transfer queue
  • No frame hitches during heavy chunk streaming
  • Fallback works when no dedicated transfer queue
  • Staging ring wraps correctly without corruption
  • Memory usage stays bounded (the ring never grows beyond its configured capacity)
  • LOD mesh uploads work through same path

Roadmap: docs/PERFORMANCE_ROADMAP.md — Batch 4, Issue 3B-1

Metadata

Labels

batch-4 (Batch 4: Advanced GPU), documentation (Improvements or additions to documentation), engine, enhancement (New feature or request), hotfix, perf/rendering (Rendering pipeline performance)
