Significantly worse performance on CUDA vs Vulkan #10

Hi, I've been measuring performance and noticed that this project is significantly slower on CUDA than on Vulkan, or even CPU(!).

Not really sure why, but here are the detailed timings:

CUDA:
```
=== Detailed Generation Timing (20 frames) ===

  Prefill:
    Build graph:           5.4 ms
    Forward total:       232.2 ms
      Graph build:         0.2 ms
      Graph alloc:         0.4 ms
      Compute:           231.4 ms
      Data I/O:            0.2 ms

  Talker forward_step (total / per-frame):
    Total:              1042.1 ms   (52.1 ms/frame)
      Graph build:         3.1 ms   (0.2 ms/frame)
      Graph alloc:         6.7 ms   (0.3 ms/frame)
      Compute:          1029.2 ms   (51.5 ms/frame)
      Data I/O:            3.1 ms   (0.2 ms/frame)

  Code predictor (total / per-frame):
    Backend:          GGML
    Total:              2858.9 ms   (142.9 ms/frame)
      Init/KV/embed:       0.7 ms   (0.0 ms/frame)
      Prefill (2tok):    191.0 ms   (9.6 ms/frame)
      Steps (14):       2667.1 ms   (133.4 ms/frame)
      Graph build:        10.4 ms   (0.5 ms/frame)
      Graph alloc:        13.9 ms   (0.7 ms/frame)
      Compute:          2792.6 ms   (139.6 ms/frame)
      Data I/O:           41.2 ms   (2.1 ms/frame)
      CoreML total:        0.0 ms   (0.0 ms/frame)

  Embed lookups:          14.9 ms   (0.7 ms/frame)
  Other/overhead:          1.0 ms
  Total generate:       4154.4 ms
  Throughput:            207.7 ms/frame (4.8 frames/s)

Timing:
  Tokenization:    0 ms
  Speaker encode:  0 ms
  Code generation: 4189 ms
  Vocoder decode:  899 ms
  Total:           5088 ms
  Audio duration:  1.58 s
  Throughput:      0.31x realtime (RTF=3.227)

Memory:
  RSS start/end:   0.00 B -> 0.00 B
  RSS peak:        0.00 B
  Phys start/end:  0.00 B -> 0.00 B
  Phys peak:       0.00 B

TTS Generated audio in 5.10s.
```

Vulkan:
```
=== Detailed Generation Timing (20 frames) ===

  Prefill:
    Build graph:         354.3 ms
    Forward total:       704.5 ms
      Graph build:         0.2 ms
      Graph alloc:        10.3 ms
      Compute:           693.7 ms
      Data I/O:            0.3 ms

  Talker forward_step (total / per-frame):
    Total:               114.5 ms   (5.7 ms/frame)
      Graph build:         2.1 ms   (0.1 ms/frame)
      Graph alloc:        13.0 ms   (0.7 ms/frame)
      Compute:            95.3 ms   (4.8 ms/frame)
      Data I/O:            4.0 ms   (0.2 ms/frame)

  Code predictor (total / per-frame):
    Backend:          GGML
    Total:               781.6 ms   (39.1 ms/frame)
      Init/KV/embed:       1.3 ms   (0.1 ms/frame)
      Prefill (2tok):    271.3 ms   (13.6 ms/frame)
      Steps (14):        509.0 ms   (25.5 ms/frame)
      Graph build:         6.1 ms   (0.3 ms/frame)
      Graph alloc:        33.8 ms   (1.7 ms/frame)
      Compute:           712.0 ms   (35.6 ms/frame)
      Data I/O:           28.1 ms   (1.4 ms/frame)
      CoreML total:        0.0 ms   (0.0 ms/frame)

  Embed lookups:          17.0 ms   (0.8 ms/frame)
  Other/overhead:          2.2 ms
  Total generate:       1974.1 ms
  Throughput:             98.7 ms/frame (10.1 frames/s)

Timing:
  Tokenization:    0 ms
  Speaker encode:  0 ms
  Code generation: 2017 ms
  Vocoder decode:  1121 ms
  Total:           3138 ms
  Audio duration:  1.58 s
  Throughput:      0.50x realtime (RTF=1.990)

Memory:
  RSS start/end:   0.00 B -> 0.00 B
  RSS peak:        0.00 B
  Phys start/end:  0.00 B -> 0.00 B
  Phys peak:       0.00 B

TTS Generated audio in 3.15s.
```

This generally holds for longer sequences too: CUDA is on average about half the speed of Vulkan.
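To make the comparison easier to read, here is a small sketch that divides the CUDA time by the Vulkan time for a few stages. The numbers are copied from the two logs above; the stage selection and the `slowdown` helper are just my own illustration, not anything from the project.

```python
# Stage timings in ms as (CUDA, Vulkan), copied from the two logs above
# (20 frames each). The choice of stages is illustrative.
timings = {
    "Prefill forward total": (232.2, 704.5),
    "Talker forward_step total": (1042.1, 114.5),
    "Code predictor total": (2858.9, 781.6),
    "Total generate": (4154.4, 1974.1),
    "Vocoder decode": (899.0, 1121.0),
}


def slowdown(cuda_ms, vulkan_ms):
    """Return CUDA time divided by Vulkan time (>1 means CUDA is slower)."""
    return cuda_ms / vulkan_ms


ratios = {stage: slowdown(c, v) for stage, (c, v) in timings.items()}

for stage, ratio in ratios.items():
    print(f"{stage:<28} {ratio:5.2f}x")
```

On this one run, the slowdown seems concentrated in the per-step decode paths (talker roughly 9x, code predictor roughly 3.7x), while prefill and vocoder decode are actually faster on CUDA. That is a single 20-frame sample, so treat it as a hint rather than a conclusion.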

Is anyone else seeing this?
