Significantly worse performance on CUDA vs Vulkan #10

@LostRuins

Hi, I've been measuring performance and noticed that this project is significantly slower on CUDA than on Vulkan, or even on CPU(!)

Not really sure why, but here are the detailed timings:

CUDA:

=== Detailed Generation Timing (20 frames) ===

  Prefill:
    Build graph:           5.4 ms
    Forward total:       232.2 ms
      Graph build:         0.2 ms
      Graph alloc:         0.4 ms
      Compute:           231.4 ms
      Data I/O:            0.2 ms

  Talker forward_step (total / per-frame):
    Total:              1042.1 ms   (52.1 ms/frame)
      Graph build:         3.1 ms   (0.2 ms/frame)
      Graph alloc:         6.7 ms   (0.3 ms/frame)
      Compute:          1029.2 ms   (51.5 ms/frame)
      Data I/O:            3.1 ms   (0.2 ms/frame)

  Code predictor (total / per-frame):
    Backend:          GGML
    Total:              2858.9 ms   (142.9 ms/frame)
      Init/KV/embed:       0.7 ms   (0.0 ms/frame)
      Prefill (2tok):    191.0 ms   (9.6 ms/frame)
      Steps (14):       2667.1 ms   (133.4 ms/frame)
      Graph build:        10.4 ms   (0.5 ms/frame)
      Graph alloc:        13.9 ms   (0.7 ms/frame)
      Compute:          2792.6 ms   (139.6 ms/frame)
      Data I/O:           41.2 ms   (2.1 ms/frame)
      CoreML total:        0.0 ms   (0.0 ms/frame)

  Embed lookups:          14.9 ms   (0.7 ms/frame)
  Other/overhead:          1.0 ms
  Total generate:       4154.4 ms
  Throughput:            207.7 ms/frame (4.8 frames/s)

Timing:
  Tokenization:    0 ms
  Speaker encode:  0 ms
  Code generation: 4189 ms
  Vocoder decode:  899 ms
  Total:           5088 ms
  Audio duration:  1.58 s
  Throughput:      0.31x realtime (RTF=3.227)

Memory:
  RSS start/end:   0.00 B -> 0.00 B
  RSS peak:        0.00 B
  Phys start/end:  0.00 B -> 0.00 B
  Phys peak:       0.00 B

TTS Generated audio in 5.10s.

Vulkan:

=== Detailed Generation Timing (20 frames) ===

  Prefill:
    Build graph:         354.3 ms
    Forward total:       704.5 ms
      Graph build:         0.2 ms
      Graph alloc:        10.3 ms
      Compute:           693.7 ms
      Data I/O:            0.3 ms

  Talker forward_step (total / per-frame):
    Total:               114.5 ms   (5.7 ms/frame)
      Graph build:         2.1 ms   (0.1 ms/frame)
      Graph alloc:        13.0 ms   (0.7 ms/frame)
      Compute:            95.3 ms   (4.8 ms/frame)
      Data I/O:            4.0 ms   (0.2 ms/frame)

  Code predictor (total / per-frame):
    Backend:          GGML
    Total:               781.6 ms   (39.1 ms/frame)
      Init/KV/embed:       1.3 ms   (0.1 ms/frame)
      Prefill (2tok):    271.3 ms   (13.6 ms/frame)
      Steps (14):        509.0 ms   (25.5 ms/frame)
      Graph build:         6.1 ms   (0.3 ms/frame)
      Graph alloc:        33.8 ms   (1.7 ms/frame)
      Compute:           712.0 ms   (35.6 ms/frame)
      Data I/O:           28.1 ms   (1.4 ms/frame)
      CoreML total:        0.0 ms   (0.0 ms/frame)

  Embed lookups:          17.0 ms   (0.8 ms/frame)
  Other/overhead:          2.2 ms
  Total generate:       1974.1 ms
  Throughput:             98.7 ms/frame (10.1 frames/s)

Timing:
  Tokenization:    0 ms
  Speaker encode:  0 ms
  Code generation: 2017 ms
  Vocoder decode:  1121 ms
  Total:           3138 ms
  Audio duration:  1.58 s
  Throughput:      0.50x realtime (RTF=1.990)

Memory:
  RSS start/end:   0.00 B -> 0.00 B
  RSS peak:        0.00 B
  Phys start/end:  0.00 B -> 0.00 B
  Phys peak:       0.00 B

TTS Generated audio in 3.15s.
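
As a sanity check on the realtime figures in both logs, here is a minimal sketch that recomputes them from the reported totals. It assumes RTF is wall-clock time divided by audio duration; the project's exact formula is not shown here, and the function name `rtf` is made up for illustration.

```python
# Recompute the realtime factor from the "Timing" blocks of both logs.
# Assumption: RTF = total wall time / audio duration (not confirmed by the project).

def rtf(total_ms: float, audio_s: float) -> float:
    """Return seconds of compute spent per second of generated audio."""
    return (total_ms / 1000.0) / audio_s

AUDIO_S = 1.58  # "Audio duration" as printed in both logs (rounded)

cuda_rtf = rtf(5088, AUDIO_S)    # CUDA "Total" in ms
vulkan_rtf = rtf(3138, AUDIO_S)  # Vulkan "Total" in ms

for name, value in (("CUDA", cuda_rtf), ("Vulkan", vulkan_rtf)):
    # 1 / RTF corresponds to the "x realtime" throughput line
    print(f"{name}: RTF ~ {value:.2f}, throughput ~ {1 / value:.2f}x realtime")
```

The recomputed values land within about 0.01 of the logged RTF=3.227 and RTF=1.990. The small difference is consistent with the audio duration being rounded to 1.58 s in the printout.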

This generally holds for longer sequences too: CUDA averages about half the speed of Vulkan.
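
To show where the gap comes from, here is a rough sketch that divides the CUDA timings by the Vulkan timings stage by stage. The numbers are copied from the two logs above; the stage labels and the `slowdown` helper are only for illustration.

```python
# Per-stage CUDA / Vulkan ratio, using values copied from the two logs.
# A ratio above 1.0 means CUDA is slower for that stage.

timings_ms = {
    # stage: (CUDA ms, Vulkan ms)
    "Prefill compute": (231.4, 693.7),
    "Talker compute": (1029.2, 95.3),
    "Code predictor prefill": (191.0, 271.3),
    "Code predictor steps": (2667.1, 509.0),
    "Vocoder decode": (899.0, 1121.0),
    "Total": (5088.0, 3138.0),
}

def slowdown(cuda_ms: float, vulkan_ms: float) -> float:
    """Return how many times slower CUDA is than Vulkan for one stage."""
    return cuda_ms / vulkan_ms

ratios = {stage: slowdown(c, v) for stage, (c, v) in timings_ms.items()}

for stage, ratio in ratios.items():
    print(f"{stage:24s} CUDA/Vulkan = {ratio:5.2f}x")
```

On these numbers, both prefill stages and the vocoder decode are actually faster on CUDA. The slowdown is concentrated in the per-step loops: talker compute is about 10.8x slower and the code predictor steps are about 5.2x slower.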

Anyone else see this?
