Hi, I've been profiling this project and noticed that it is significantly slower on CUDA than on Vulkan, or even on CPU(!)
I'm not sure why, but here are the detailed timings for a 20-frame generation on each backend:
CUDA:
=== Detailed Generation Timing (20 frames) ===
Prefill:
Build graph: 5.4 ms
Forward total: 232.2 ms
Graph build: 0.2 ms
Graph alloc: 0.4 ms
Compute: 231.4 ms
Data I/O: 0.2 ms
Talker forward_step (total / per-frame):
Total: 1042.1 ms (52.1 ms/frame)
Graph build: 3.1 ms (0.2 ms/frame)
Graph alloc: 6.7 ms (0.3 ms/frame)
Compute: 1029.2 ms (51.5 ms/frame)
Data I/O: 3.1 ms (0.2 ms/frame)
Code predictor (total / per-frame):
Backend: GGML
Total: 2858.9 ms (142.9 ms/frame)
Init/KV/embed: 0.7 ms (0.0 ms/frame)
Prefill (2tok): 191.0 ms (9.6 ms/frame)
Steps (14): 2667.1 ms (133.4 ms/frame)
Graph build: 10.4 ms (0.5 ms/frame)
Graph alloc: 13.9 ms (0.7 ms/frame)
Compute: 2792.6 ms (139.6 ms/frame)
Data I/O: 41.2 ms (2.1 ms/frame)
CoreML total: 0.0 ms (0.0 ms/frame)
Embed lookups: 14.9 ms (0.7 ms/frame)
Other/overhead: 1.0 ms
Total generate: 4154.4 ms
Throughput: 207.7 ms/frame (4.8 frames/s)
Timing:
Tokenization: 0 ms
Speaker encode: 0 ms
Code generation: 4189 ms
Vocoder decode: 899 ms
Total: 5088 ms
Audio duration: 1.58 s
Throughput: 0.31x realtime (RTF=3.227)
Memory:
RSS start/end: 0.00 B -> 0.00 B
RSS peak: 0.00 B
Phys start/end: 0.00 B -> 0.00 B
Phys peak: 0.00 B
TTS Generated audio in 5.10s.
Vulkan:
=== Detailed Generation Timing (20 frames) ===
Prefill:
Build graph: 354.3 ms
Forward total: 704.5 ms
Graph build: 0.2 ms
Graph alloc: 10.3 ms
Compute: 693.7 ms
Data I/O: 0.3 ms
Talker forward_step (total / per-frame):
Total: 114.5 ms (5.7 ms/frame)
Graph build: 2.1 ms (0.1 ms/frame)
Graph alloc: 13.0 ms (0.7 ms/frame)
Compute: 95.3 ms (4.8 ms/frame)
Data I/O: 4.0 ms (0.2 ms/frame)
Code predictor (total / per-frame):
Backend: GGML
Total: 781.6 ms (39.1 ms/frame)
Init/KV/embed: 1.3 ms (0.1 ms/frame)
Prefill (2tok): 271.3 ms (13.6 ms/frame)
Steps (14): 509.0 ms (25.5 ms/frame)
Graph build: 6.1 ms (0.3 ms/frame)
Graph alloc: 33.8 ms (1.7 ms/frame)
Compute: 712.0 ms (35.6 ms/frame)
Data I/O: 28.1 ms (1.4 ms/frame)
CoreML total: 0.0 ms (0.0 ms/frame)
Embed lookups: 17.0 ms (0.8 ms/frame)
Other/overhead: 2.2 ms
Total generate: 1974.1 ms
Throughput: 98.7 ms/frame (10.1 frames/s)
Timing:
Tokenization: 0 ms
Speaker encode: 0 ms
Code generation: 2017 ms
Vocoder decode: 1121 ms
Total: 3138 ms
Audio duration: 1.58 s
Throughput: 0.50x realtime (RTF=1.990)
Memory:
RSS start/end: 0.00 B -> 0.00 B
RSS peak: 0.00 B
Phys start/end: 0.00 B -> 0.00 B
Phys peak: 0.00 B
TTS Generated audio in 3.15s.
This generally holds for longer sequences too: on average, CUDA runs at about half the speed of Vulkan.
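To make the gap easier to see per stage, here is a small Python sketch that takes the numbers from the two logs above and computes how many times longer CUDA took than Vulkan. The stage labels and the helper name `cuda_over_vulkan` are just my own shorthand for this comparison, not anything from the project.

```python
# Timings in ms as (CUDA, Vulkan), copied from the two logs above (20 frames each).
# The stage labels are shorthand chosen for this comparison, not project names.
timings_ms = {
    "prefill compute": (231.4, 693.7),
    "talker compute": (1029.2, 95.3),
    "code predictor compute": (2792.6, 712.0),
    "total generate": (4154.4, 1974.1),
    "vocoder decode": (899.0, 1121.0),
    "total": (5088.0, 3138.0),
}


def cuda_over_vulkan(cuda_ms, vulkan_ms):
    """Return how many times longer CUDA took than Vulkan (>1 means CUDA is slower)."""
    return cuda_ms / vulkan_ms


ratios = {stage: cuda_over_vulkan(c, v) for stage, (c, v) in timings_ms.items()}

for stage, ratio in ratios.items():
    print(f"{stage:24s} {ratio:5.2f}x")
```

Going by these numbers, the talker forward_step compute is where CUDA loses the most (roughly 10.8x slower), followed by the code predictor compute, while prefill compute and the vocoder decode are actually faster on CUDA in this run.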
Anyone else see this?