Averaged runtime benchmark of mlx operations, measured in milliseconds.
- mlx_gpu: mlx framework with gpu backend
- mlx_cpu: mlx framework with cpu backend
- cpu: torch framework with cpu backend
- mps: torch framework with mps (gpu) backend
- mlx_gpu/mps speedup: runtime speedup of mlx_gpu compared to mps
- mlx_gpu/mlx_cpu speedup: runtime speedup of mlx_gpu compared to mlx_cpu
- cuda/cpu speedup: runtime speedup of cuda compared to cpu
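As a reading aid, the speedup columns can be reproduced approximately from the runtime columns. The sketch below assumes the speedup is the relative gain `(baseline / candidate - 1) * 100`, which matches the tabulated values to within about one percentage point; the small differences presumably come from the runtimes being rounded to two decimals before display. The function names `speedup_percent` and `format_speedup` are hypothetical and not taken from the benchmark code, and returning `nan` for a zero runtime is a choice made for this illustration.

```python
import math


def speedup_percent(baseline_ms: float, candidate_ms: float) -> float:
    """Relative speedup of `candidate` over `baseline`, in percent.

    Positive values mean the candidate is faster than the baseline.
    Assumed formula (inferred from the tables, not from the benchmark code).
    """
    if math.isnan(baseline_ms) or math.isnan(candidate_ms) or candidate_ms == 0:
        # Missing measurements, or runtimes displayed as 0.00, cannot be compared.
        return math.nan
    return (baseline_ms / candidate_ms - 1.0) * 100.0


def format_speedup(value: float) -> str:
    """Format a speedup the way the tables display it, e.g. '+81%' or 'nan%'."""
    if math.isnan(value):
        return "nan%"
    return f"{value:+.0f}%"


# M1 MatMul row: mlx_gpu = 26.38 ms, mps = 47.87 ms.
# The "mlx_gpu/mps speedup" uses mps as the baseline and mlx_gpu as the candidate.
print(format_speedup(speedup_percent(47.87, 26.38)))  # prints "+81%"
```

Under this reading, a value of +100% means the first backend takes half the time of the second, and a negative value means it is slower.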
M1 (cores: 4E+4P+8GPU)
| Operation | mlx_gpu | mlx_cpu | mps | cpu | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|
| Argmax | 1.81 | 10.63 | 2.87 | 8.05 | +58% | +486% |
| BCE | 5.51 | 51.81 | 12.19 | 10.87 | +121% | +840% |
| Concat | 19.13 | 100.88 | 19.28 | 49.63 | +0% | +427% |
| Conv1d | 3.83 | 4.53 | 3.73 | 116.13 | -2% | +18% |
| Conv2d | 30.12 | 436.68 | 7.06 | 45.54 | -76% | +1349% |
| LeakyReLU | 2.06 | 2.90 | 1.16 | 1.37 | -43% | +41% |
| Linear | 30.41 | 73.32 | 53.70 | 117.68 | +76% | +141% |
| MatMul | 26.38 | 93.82 | 47.87 | 504.47 | +81% | +255% |
| PReLU | 3.50 | 4.54 | 1.15 | 1.32 | -67% | +29% |
| ReLU | 0.98 | 0.90 | 1.13 | 1.35 | +14% | -8% |
| SeLU | 7.81 | 14.73 | 1.14 | 7.72 | -85% | +88% |
| Sigmoid | 0.96 | 32.66 | 1.16 | 7.23 | +19% | +3287% |
| Softmax | 10.15 | 40.98 | 19.27 | 46.69 | +89% | +303% |
| Softplus | 1.07 | 33.08 | 1.73 | 10.99 | +60% | +2977% |
| Sort | 18.49 | 713.23 | 73.24 | 70.11 | +296% | +3756% |
| Sum | 11.33 | 12.70 | 16.35 | 13.43 | +44% | +12% |
| SumAll | 6.91 | 6.85 | 7.40 | 7.00 | +7% | 0% |
M1 Pro (2E+8P+16GPU+16GB) - mlx: 0.5.0
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.75 | 1.74 | 10.55 | 1.02 | 8.19 | +0% | -41% | +503% |
| BCE | 2.18 | 0.97 | 59.50 | 0.84 | 8.48 | +125% | -61% | +2629% |
| Concat | 6.14 | 6.13 | 87.88 | 6.21 | 36.74 | +0% | +1% | +1332% |
| Conv1d | 1.76 | 1.64 | 3.42 | 1.01 | 154.38 | +7% | -42% | +94% |
| Conv2d | 5.71 | 5.67 | 443.83 | 2.52 | 42.12 | +0% | -55% | +7669% |
| Gather | 3.15 | 3.17 | 4.95 | 18.87 | 9.03 | 0% | +498% | +57% |
| LeakyReLU | 0.46 | 0.44 | 0.80 | 0.47 | 1.21 | +4% | +2% | +74% |
| Linear | 9.57 | 9.76 | 34.65 | 33.21 | 127.82 | -1% | +246% | +261% |
| MatMul | 10.52 | 10.65 | 38.29 | 22.76 | 498.70 | -1% | +116% | +263% |
| PReLU | 0.48 | 0.46 | 3.37 | 0.55 | 1.07 | +3% | +15% | +607% |
| ReLU | 0.47 | 0.43 | 0.63 | 0.55 | 1.08 | +9% | +18% | +34% |
| Scatter | 0.59 | 0.57 | 30.02 | 3.38 | 1.94 | +2% | +473% | +5002% |
| ScatterSum | 0.03 | 0.04 | 0.01 | nan | 1.47 | -14% | nan% | -71% |
| ScatterMax | 0.03 | 0.04 | 0.01 | nan | 1.44 | -10% | nan% | -69% |
| SeLU | 0.51 | 0.46 | 4.86 | 0.47 | 6.72 | +12% | -8% | +849% |
| Sigmoid | 0.44 | 0.44 | 4.58 | 0.55 | 6.39 | +0% | +23% | +931% |
| Softmax | 9.44 | 7.32 | 41.66 | 5.96 | 30.23 | +28% | -36% | +341% |
| Softplus | 0.46 | 0.49 | 35.26 | 0.49 | 8.97 | -7% | +6% | +7646% |
| Sort | 1.69 | 1.72 | 258.35 | 37.76 | 58.56 | -1% | +2129% | +15156% |
| Sum | 3.38 | 3.46 | 9.25 | 6.06 | 10.02 | -2% | +79% | +173% |
| SumAll | 2.52 | 2.63 | 6.83 | 2.48 | 3.46 | -4% | -1% | +171% |
M1 Max (64GB) mlx 0.2.0
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 2.14 | 1.69 | 10.80 | 1.93 | 9.17 | +27% | -10% | +403% |
| BCE | 1.30 | 0.65 | 50.27 | 1.01 | 8.09 | +98% | -22% | +3777% |
| Concat | 3.20 | 3.20 | 92.35 | 3.27 | 24.79 | +0% | +2% | +2782% |
| Conv1d | 2.20 | 0.98 | 3.34 | 1.18 | 157.26 | +124% | -46% | +51% |
| Conv2d | 8.18 | 7.24 | 455.47 | 1.98 | 35.56 | +13% | -75% | +5468% |
| Gather | 2.51 | 2.37 | 5.94 | 9.78 | 8.92 | +5% | +289% | +136% |
| LeakyReLU | 0.54 | 0.34 | 4.40 | 0.45 | 0.63 | +59% | -15% | +719% |
| Linear | 6.73 | 6.49 | 32.46 | 16.44 | 39.44 | +3% | +144% | +382% |
| MatMul | 4.66 | 4.64 | 47.17 | 11.16 | 88.32 | +0% | +139% | +913% |
| PReLU | 0.82 | 0.36 | 2.64 | 0.44 | 0.57 | +127% | -46% | +222% |
| ReLU | 0.36 | 0.33 | 0.82 | 0.44 | 0.60 | +9% | +21% | +125% |
| Scatter | 4.11 | 4.09 | 30.31 | 1.85 | 1.78 | +0% | -55% | +637% |
| ScatterSum | 0.05 | 0.03 | 0.01 | nan | 1.35 | +42% | nan% | -81% |
| ScatterMax | 0.05 | 0.03 | 0.01 | nan | 1.35 | +34% | nan% | -81% |
| SeLU | 1.53 | 0.36 | 7.05 | 0.46 | 5.97 | +323% | -69% | +362% |
| Sigmoid | 0.38 | 0.36 | 32.57 | 0.50 | 5.43 | +7% | +30% | +8409% |
| Softmax | 4.84 | 3.71 | 43.48 | 3.88 | 28.93 | +30% | -19% | +798% |
| Softplus | 0.57 | 0.34 | 32.79 | 0.67 | 8.25 | +65% | +17% | +5642% |
| Sort | 1.08 | 0.97 | 257.18 | 20.18 | 49.30 | +10% | +1773% | +23780% |
| Sum | 1.75 | 1.74 | 8.87 | 3.10 | 10.35 | +0% | +77% | +406% |
| SumAll | 1.36 | 1.34 | 6.63 | 1.50 | 3.36 | +1% | +10% | +389% |
M1 Ultra (4E+16P+48GPU+64GB) mlx: 0.29.3 torch: 2.8.0
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.71 | 1.72 | 10.68 | 0.80 | 9.35 | 0% | -53% | +523% |
| BCE | 0.77 | 0.51 | 21.38 | 0.56 | 3.79 | +49% | -27% | +2681% |
| Concat | 2.09 | 2.20 | 76.78 | 1.84 | 6.54 | -5% | -11% | +3575% |
| Conv1d | 0.43 | 0.42 | 4.57 | 0.67 | 194.82 | +0% | +57% | +970% |
| Conv2d | 1.40 | 1.39 | 31.35 | 0.72 | 32.71 | +1% | -48% | +2137% |
| Gather | 0.69 | 0.68 | 4.90 | 2.99 | 2.79 | +1% | +332% | +608% |
| LayerNorm | 0.30 | 0.29 | 4.87 | 0.43 | 0.72 | +1% | +45% | +1543% |
| LeakyReLU | 0.29 | 0.31 | 0.81 | 0.28 | 0.29 | -6% | -3% | +175% |
| Linear | 3.40 | 3.40 | 49.12 | 3.78 | 16.24 | +0% | +11% | +1344% |
| MatMul | 4.39 | 4.45 | 26.42 | 5.25 | 27.67 | -1% | +19% | +502% |
| PReLU | 0.31 | 0.28 | 2.70 | 0.34 | 0.29 | +10% | +11% | +777% |
| ReLU | 0.28 | 0.28 | 0.66 | 0.31 | 0.27 | +1% | +10% | +137% |
| ScaledDotProductAttention | 1.51 | 1.49 | 20.98 | 1.14 | 5.36 | +1% | -24% | +1284% |
| Scatter | 0.34 | 0.36 | 17.06 | 0.72 | 1.47 | -4% | +110% | +4856% |
| ScatterSum | 0.00 | 0.00 | 0.00 | 22.86 | 1.41 | +23% | +1138326% | +2% |
| ScatterMax | 0.00 | 0.00 | 0.00 | 0.66 | 1.42 | +24% | +34180% | +8% |
| SeLU | 0.37 | 0.34 | 3.80 | 0.33 | 1.72 | +8% | -9% | +935% |
| Sigmoid | 0.37 | 0.32 | 3.15 | 0.32 | 0.63 | +15% | -12% | +753% |
| Softmax | 2.81 | 2.20 | 45.86 | 1.85 | 8.25 | +28% | -34% | +1531% |
| Softplus | 0.35 | 0.38 | 31.49 | 0.43 | 1.99 | -7% | +21% | +8857% |
| Sort | 1.07 | 1.05 | 383.92 | 92.57 | 30.71 | +1% | +8534% | +35709% |
| Sum | 1.05 | 1.05 | 16.37 | 1.70 | 5.18 | 0% | +62% | +1464% |
| SumAll | 0.85 | 0.82 | 11.90 | 0.92 | 2.19 | +3% | +7% | +1295% |
M2 () - mlx 0.2.0
| Operation | mlx_gpu | mlx_cpu | mps | cpu | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|
| Argmax | 1.71 | 16.21 | 2.71 | 8.51 | +58% | +849% |
| BCE | 3.71 | 82.34 | 13.06 | 13.34 | +251% | +2118% |
| Concat | 12.14 | 161.07 | 12.51 | 46.33 | +3% | +1226% |
| Conv1d | 3.66 | 6.01 | 3.29 | 132.69 | -10% | +64% |
| Conv2d | 27.74 | 705.78 | 5.94 | 56.88 | -78% | +2444% |
| LeakyReLU | 1.50 | 2.44 | 1.10 | 1.33 | -26% | +62% |
| Linear | 25.01 | 99.71 | 57.29 | 183.02 | +129% | +298% |
| MatMul | 22.04 | 120.61 | 78.10 | 629.63 | +254% | +447% |
| PReLU | 2.43 | 4.58 | 1.04 | 1.35 | -57% | +88% |
| ReLU | 0.77 | 1.00 | 1.00 | 1.34 | +30% | +29% |
| SeLU | 5.31 | 17.25 | 1.11 | 8.24 | -79% | +224% |
| Sigmoid | 0.77 | 52.85 | 1.13 | 7.47 | +47% | +6797% |
| Softmax | 7.07 | 65.62 | 14.54 | 60.92 | +105% | +828% |
| Softplus | 0.91 | 53.94 | 1.73 | 12.12 | +90% | +5846% |
| Sort | 16.87 | 1243.25 | 46.69 | 79.31 | +176% | +7269% |
| Sum | 9.15 | 18.38 | 10.47 | 14.19 | +14% | +100% |
| SumAll | 4.31 | 7.79 | 4.96 | 6.11 | +14% | +80% |
M2 Pro (cores: 4E+6P+16GPU) mlx 0.12.2 torch 2.1.2
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.55 | 1.52 | 9.98 | 1.28 | 7.88 | +2% | -17% | +542% |
| BCE | 2.03 | 0.82 | 59.53 | 0.74 | 8.01 | +146% | -63% | +2835% |
| Concat | 6.17 | 6.42 | 86.32 | 6.26 | 36.48 | -3% | +1% | +1299% |
| Conv1d | 1.62 | 1.49 | 3.05 | 0.90 | 147.44 | +8% | -44% | +88% |
| Conv2d | 5.20 | 5.19 | 410.99 | 2.11 | 43.67 | +0% | -59% | +7797% |
| Gather | 3.03 | 3.01 | 4.13 | 15.83 | 9.79 | +0% | +423% | +36% |
| LeakyReLU | 0.36 | 0.36 | 0.90 | 0.44 | 0.93 | 0% | +19% | +146% |
| Linear | 9.36 | 9.29 | 27.06 | 31.34 | 115.10 | +0% | +234% | +189% |
| MatMul | 10.93 | 9.89 | 35.71 | 21.59 | 754.10 | +10% | +97% | +226% |
| PReLU | 0.53 | 0.39 | 3.46 | 0.44 | 0.91 | +36% | -17% | +552% |
| ReLU | 0.41 | 0.37 | 0.73 | 0.43 | 0.92 | +11% | +4% | +79% |
| Scatter | 0.31 | 0.31 | 28.25 | 2.77 | 2.31 | 0% | +788% | +8959% |
| ScatterSum | 0.04 | 0.03 | 0.02 | nan | 1.38 | +3% | nan% | -50% |
| ScatterMax | 0.04 | 0.03 | 0.02 | nan | 1.38 | +7% | nan% | -49% |
| SeLU | 0.49 | 0.43 | 4.85 | 0.51 | 2.66 | +13% | +4% | +899% |
| Sigmoid | 0.37 | 0.37 | 4.33 | 0.48 | 2.23 | +1% | +26% | +1055% |
| Softmax | 9.25 | 6.99 | 39.72 | 4.88 | 25.00 | +32% | -47% | +329% |
| Softplus | 0.41 | 0.37 | 33.75 | 0.47 | 4.73 | +9% | +16% | +8220% |
| Sort | 1.48 | 1.49 | 242.55 | 22.40 | 51.73 | 0% | +1414% | +16295% |
| Sum | 3.24 | 3.22 | 9.11 | 3.09 | 10.11 | +0% | -4% | +180% |
| SumAll | 2.37 | 2.37 | 6.58 | 2.36 | 3.31 | +0% | 0% | +176% |
M2 Max (cores: 4E+8P+38GPU) mlx 0.5.0 torch 2.2.1
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.50 | 1.51 | 10.10 | 0.68 | 8.63 | 0% | -54% | +571% |
| BCE | 1.00 | 0.44 | 59.91 | 0.60 | 9.01 | +126% | -40% | +5880% |
| Concat | 3.18 | 3.16 | 83.04 | 3.34 | 27.24 | +0% | +4% | +2507% |
| Conv1d | 0.86 | 0.76 | 3.03 | 0.53 | 160.40 | +13% | -38% | +250% |
| Conv2d | 2.45 | 2.44 | 424.52 | 1.15 | 34.30 | +0% | -53% | +17249% |
| Gather | 1.34 | 1.57 | 3.92 | 8.12 | 8.98 | -14% | +504% | +191% |
| LeakyReLU | 0.22 | 0.30 | 0.72 | 0.30 | 1.21 | -24% | +35% | +219% |
| Linear | 5.51 | 5.63 | 23.52 | 12.97 | 37.92 | -2% | +135% | +327% |
| MatMul | 3.77 | 3.83 | 27.42 | 9.78 | 83.55 | -1% | +159% | +627% |
| PReLU | 0.28 | 0.48 | 3.27 | 0.42 | 1.03 | -41% | +50% | +1062% |
| ReLU | 0.37 | 0.24 | 0.62 | 0.35 | 0.94 | +51% | -5% | +67% |
| Scatter | 0.22 | 0.24 | 28.88 | 1.47 | 1.82 | -9% | +567% | +12984% |
| ScatterSum | 0.03 | 0.03 | 0.01 | nan | 1.37 | +10% | nan% | -69% |
| ScatterMax | 0.03 | 0.03 | 0.01 | nan | 1.39 | +10% | nan% | -68% |
| SeLU | 0.29 | 0.36 | 4.62 | 0.49 | 7.07 | -20% | +69% | +1511% |
| Sigmoid | 0.24 | 0.27 | 4.34 | 0.36 | 6.35 | -10% | +52% | +1714% |
| Softmax | 4.62 | 3.60 | 40.03 | 3.07 | 33.25 | +28% | -33% | +766% |
| Softplus | 0.25 | 0.24 | 34.73 | 0.33 | 9.39 | +3% | +31% | +13696% |
| Sort | 0.73 | 0.75 | 248.89 | 10.65 | 58.88 | -2% | +1360% | +34026% |
| Sum | 1.61 | 1.64 | 9.22 | 1.96 | 12.05 | -1% | +21% | +472% |
| SumAll | 1.20 | 1.23 | 6.86 | 1.32 | 3.84 | -2% | +9% | +471% |
M2 Ultra (cores: 8E+16P+76GPU) mlx 0.7.0
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.60 | 1.63 | 9.46 | 0.65 | 9.49 | -1% | -59% | +492% |
| BCE | 0.64 | 0.45 | 56.57 | 0.47 | 4.23 | +42% | -27% | +8702% |
| Concat | 1.69 | 1.69 | 81.95 | 1.66 | 38.93 | +0% | -1% | +4743% |
| Conv1d | 0.55 | 0.51 | 2.64 | 0.45 | 187.91 | +7% | -17% | +382% |
| Conv2d | 1.35 | 1.38 | 409.78 | 0.67 | 46.05 | -1% | -50% | +30276% |
| Gather | 0.77 | 0.79 | 3.83 | 3.92 | 11.82 | -2% | +407% | +395% |
| LeakyReLU | 0.32 | 0.25 | 0.85 | 0.21 | 1.99 | +28% | -34% | +162% |
| Linear | 2.26 | 2.23 | 16.83 | 6.67 | 39.12 | +1% | +195% | +645% |
| MatMul | 2.53 | 2.53 | 19.21 | 5.59 | 66.55 | 0% | +121% | +660% |
| PReLU | 0.37 | 0.45 | 3.15 | 0.32 | 1.61 | -18% | -13% | +759% |
| ReLU | 0.29 | 0.24 | 0.67 | 0.33 | 1.61 | +20% | +13% | +132% |
| Scatter | 0.25 | 0.25 | 27.04 | 0.73 | 1.49 | +0% | +193% | +10802% |
| ScatterSum | 0.03 | 0.03 | 0.01 | nan | 1.36 | -1% | nan% | -76% |
| ScatterMax | 0.03 | 0.03 | 0.01 | nan | 1.37 | +10% | nan% | -76% |
| SeLU | 0.46 | 0.28 | 4.50 | 0.29 | 1.86 | +65% | -36% | +877% |
| Sigmoid | 0.24 | 0.25 | 4.11 | 0.26 | 1.71 | -2% | +6% | +1606% |
| Softmax | 2.47 | 1.88 | 39.27 | 1.35 | 17.90 | +31% | -45% | +1488% |
| Softplus | 0.27 | 0.26 | 32.13 | 0.26 | 3.53 | +7% | -6% | +11598% |
| Sort | 0.48 | 0.49 | 229.84 | 6.41 | 33.91 | -1% | +1231% | +47639% |
| Sum | 0.90 | 0.91 | 9.22 | 0.95 | 6.80 | -1% | +6% | +925% |
| SumAll | 0.70 | 0.71 | 6.70 | 0.83 | 1.97 | -1% | +19% | +859% |
M3 (RAM: 16GB) - mlx 0.2.0
Average benchmark:
| Operation | mlx_gpu | mlx_cpu | mps | cpu | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|
| Argmax | 1.20 | 11.63 | 1.71 | 7.10 | +43% | +870% |
| BCE | 4.05 | 40.80 | 8.59 | 8.14 | +111% | +906% |
| Concat | 12.52 | 83.29 | 12.60 | 35.29 | +0% | +565% |
| Conv1d | 2.34 | 3.66 | 1.98 | 71.23 | -15% | +56% |
| Conv2d | 16.47 | 340.03 | 4.43 | 36.36 | -73% | +1965% |
| LeakyReLU | 1.43 | 3.05 | 1.01 | 1.07 | -29% | +113% |
| Linear | 21.55 | 71.89 | 15.84 | 122.32 | -26% | +233% |
| MatMul | 15.49 | 76.57 | 33.24 | 490.48 | +114% | +394% |
| PReLU | 2.36 | 2.76 | 0.99 | 1.11 | -58% | +16% |
| ReLU | 0.76 | 1.39 | 0.96 | 1.01 | +26% | +81% |
| SeLU | 5.23 | 7.72 | 1.02 | 6.88 | -80% | +47% |
| Sigmoid | 0.79 | 26.97 | 1.07 | 5.69 | +35% | +3309% |
| Softmax | 6.31 | 41.35 | 12.08 | 32.54 | +91% | +555% |
| Softplus | 0.73 | 26.82 | 1.08 | 9.09 | +47% | +3569% |
| Sort | 12.67 | 724.26 | 30.73 | 60.29 | +142% | +5616% |
| Sum | 6.96 | 11.24 | 6.61 | 12.27 | -5% | +61% |
| SumAll | 4.26 | 7.79 | 4.78 | 4.38 | +12% | +82% |
M3 (4E+4P+10GPU+16GB) - mlx 0.30.1
Average benchmark:
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.74 | 1.75 | 8.92 | 1.55 | 5.30 | 0% | -11% | +411% |
| BCE | 3.81 | 1.54 | 18.08 | 0.93 | 8.21 | +146% | -75% | +375% |
| Concat | 12.05 | 12.07 | 105.18 | 12.06 | 18.32 | 0% | +0% | +772% |
| Conv1d | 0.69 | 0.69 | 4.55 | 1.04 | 61.59 | 0% | +51% | +561% |
| Conv2d | 3.50 | 3.53 | 39.37 | 1.39 | 17.42 | 0% | -60% | +1024% |
| Gather | 2.41 | 2.40 | 4.49 | 9.37 | 4.42 | +0% | +288% | +86% |
| LayerNorm | 1.03 | 1.03 | 3.50 | 1.03 | 1.05 | +0% | 0% | +239% |
| LeakyReLU | 0.72 | 0.79 | 0.49 | 0.70 | 0.41 | -9% | -2% | -31% |
| Linear | 14.05 | 14.05 | 62.67 | 14.33 | 45.74 | 0% | +1% | +346% |
| MatMul | 15.41 | 14.85 | 52.70 | 31.76 | 51.60 | +3% | +106% | +241% |
| PReLU | 0.73 | 0.72 | 2.12 | 0.77 | 0.45 | +0% | +5% | +191% |
| ReLU | 0.72 | 0.71 | 0.48 | 0.73 | 0.43 | +0% | +1% | -32% |
| ScaledDotProductAttention | 5.51 | 5.45 | 12.18 | 4.72 | 7.14 | +1% | -14% | +121% |
| Scatter | 0.63 | 0.64 | 11.76 | 1.72 | 1.09 | -1% | +173% | +1775% |
| ScatterSum | 0.00 | 0.00 | 0.00 | 0.55 | 1.15 | +12% | +41096% | +6% |
| ScatterMax | 0.00 | 0.00 | 0.00 | 0.55 | 1.16 | +19% | +40426% | +2% |
| SeLU | 0.73 | 0.74 | 3.00 | 0.74 | 3.36 | -1% | +1% | +310% |
| Sigmoid | 0.73 | 0.73 | 2.54 | 0.71 | 1.13 | +0% | -2% | +245% |
| Softmax | 18.05 | 13.50 | 43.25 | 6.05 | 16.17 | +33% | -66% | +139% |
| Softplus | 0.72 | 0.72 | 22.87 | 0.74 | 4.14 | 0% | +3% | +3091% |
| Sort | 3.32 | 3.29 | 312.48 | 179.91 | 88.58 | +0% | +5317% | +9309% |
| Sum | 5.78 | 5.71 | 11.55 | 6.13 | 12.62 | +1% | +6% | +100% |
| SumAll | 4.26 | 4.23 | 8.42 | 4.54 | 5.77 | +0% | +6% | +97% |
M3 Pro (cores: 6E+5P+14GPU)
| Operation | mlx_gpu | mlx_cpu | mps | cpu | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|
| Argmax | 0.98 | 11.21 | 1.24 | 6.14 | +25% | +1041% |
| BCE | 2.70 | 39.88 | 6.87 | 6.78 | +154% | +1374% |
| Concat | 8.25 | 78.33 | 8.87 | 38.10 | +7% | +849% |
| Conv1d | 2.15 | 3.36 | 2.07 | 83.18 | -3% | +56% |
| Conv2d | 12.06 | 333.03 | 3.09 | 33.87 | -74% | +2660% |
| LeakyReLU | 1.54 | 1.53 | 1.26 | 0.96 | -18% | 0% |
| Linear | 15.30 | 52.78 | 11.44 | 91.49 | -25% | +244% |
| MatMul | 16.04 | 69.27 | 22.53 | 390.04 | +40% | +331% |
| PReLU | 2.04 | 2.80 | 1.35 | 0.91 | -34% | +37% |
| ReLU | 0.94 | 0.61 | 1.37 | 0.92 | +45% | -34% |
| SeLU | 3.98 | 10.10 | 1.27 | 4.69 | -68% | +153% |
| Sigmoid | 1.03 | 26.28 | 1.30 | 4.28 | +25% | +2446% |
| Softmax | 4.62 | 32.54 | 9.32 | 29.78 | +101% | +604% |
| Softplus | 1.02 | 25.95 | 1.26 | 6.52 | +23% | +2444% |
| Sort | 8.67 | 711.98 | 21.37 | 46.71 | +146% | +8114% |
| Sum | 4.73 | 9.81 | 5.12 | 8.83 | +8% | +107% |
| SumAll | 3.17 | 4.71 | 3.69 | 3.44 | +16% | +48% |
M3 Max (cores: 4E+12P+40GPU) mlx 0.2.0
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.57 | 1.56 | 8.34 | 1.02 | 6.14 | +0% | -35% | +430% |
| BCE | 1.12 | 0.52 | 38.72 | 0.59 | 3.73 | +114% | -47% | +3362% |
| Concat | 3.32 | 3.30 | 82.26 | 3.40 | 22.89 | +0% | +2% | +2380% |
| Conv1d | 0.85 | 0.75 | 2.40 | 0.92 | 156.00 | +13% | +8% | +182% |
| Conv2d | 4.21 | 4.14 | 329.47 | 1.42 | 31.25 | +1% | -66% | +7723% |
| Gather | 1.56 | 1.47 | 4.37 | 8.23 | 6.68 | +5% | +428% | +180% |
| LeakyReLU | 0.43 | 0.29 | 2.57 | 0.54 | 0.66 | +48% | +24% | +491% |
| Linear | 5.66 | 5.66 | 24.67 | 4.24 | 59.04 | +0% | -25% | +336% |
| MatMul | 4.20 | 4.19 | 25.57 | 7.62 | 585.74 | +0% | +81% | +508% |
| PReLU | 0.70 | 0.29 | 2.06 | 0.49 | 0.61 | +144% | -29% | +193% |
| ReLU | 0.51 | 0.35 | 0.65 | 0.78 | 0.62 | +45% | +53% | +28% |
| Scatter | 2.29 | 2.22 | 25.40 | 1.66 | 0.93 | +3% | -27% | +1009% |
| ScatterSum | 0.04 | 0.03 | 0.01 | nan | 1.22 | +52% | nan% | -81% |
| ScatterMax | 0.04 | 0.03 | 0.01 | nan | 1.23 | +52% | nan% | -81% |
| SeLU | 1.35 | 0.29 | 5.14 | 0.48 | 2.93 | +361% | -64% | +281% |
| Sigmoid | 0.30 | 0.29 | 26.28 | 0.49 | 2.85 | +4% | +62% | +8629% |
| Softmax | 4.75 | 3.59 | 35.79 | 3.40 | 16.50 | +32% | -28% | +653% |
| Softplus | 0.35 | 0.29 | 26.02 | 0.51 | 4.00 | +21% | +43% | +7257% |
| Sort | 0.77 | 0.76 | 229.39 | 8.04 | 32.43 | +1% | +942% | +29646% |
| Sum | 1.55 | 1.54 | 6.53 | 1.90 | 6.99 | +0% | +22% | +322% |
| SumAll | 1.19 | 1.19 | 4.78 | 1.32 | 3.22 | +0% | +10% | +300% |
M4 (6E+4P+10GPU+16GB) mlx: 0.20.0
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.56 | 1.49 | 8.33 | 1.43 | 5.39 | +4% | -7% | +434% |
| BCE | 3.73 | 1.61 | 35.59 | 1.19 | 8.15 | +131% | -68% | +853% |
| Concat | 12.61 | 12.42 | 50.48 | 12.59 | 29.69 | +1% | 0% | +300% |
| Conv1d | 1.77 | 1.73 | 4.55 | 1.16 | 58.55 | +2% | -34% | +156% |
| Conv2d | 4.94 | 4.99 | 42.63 | 1.48 | 25.15 | -1% | -70% | +763% |
| Gather | 3.57 | 3.53 | 3.24 | 34.09 | 9.04 | +1% | +854% | -9% |
| LeakyReLU | 0.76 | 0.76 | 0.69 | 0.82 | 0.83 | +0% | +8% | -9% |
| Linear | 12.62 | 12.67 | 60.38 | 13.17 | 116.89 | 0% | +4% | +378% |
| MatMul | 18.27 | 17.17 | 42.77 | 32.16 | 133.45 | +6% | +75% | +134% |
| PReLU | 0.91 | 0.90 | 2.15 | 0.82 | 0.79 | +1% | -9% | +136% |
| ReLU | 0.78 | 0.74 | 0.54 | 0.75 | 1.33 | +5% | -3% | -29% |
| Scatter | 0.82 | 0.79 | 9.34 | 5.89 | 0.98 | +3% | +621% | +1043% |
| ScatterSum | 0.00 | 0.00 | 0.00 | nan | 1.08 | +27% | nan% | -7% |
| ScatterMax | 0.00 | 0.00 | 0.00 | nan | 1.14 | +36% | nan% | -5% |
| SeLU | 0.89 | 0.88 | 3.65 | 0.81 | 1.65 | +1% | -8% | +308% |
| Sigmoid | 0.75 | 0.75 | 3.48 | 0.81 | 1.42 | +0% | +7% | +364% |
| Softmax | 18.11 | 13.82 | 38.51 | 6.02 | 28.30 | +31% | -66% | +112% |
| Softplus | 0.83 | 0.76 | 21.28 | 0.78 | 3.51 | +9% | -6% | +2464% |
| Sort | 1.99 | 1.99 | 218.30 | 32.71 | 98.28 | 0% | +1545% | +10884% |
| Sum | 5.90 | 6.18 | 9.00 | 6.70 | 12.98 | -4% | +13% | +52% |
| SumAll | 4.32 | 4.56 | 6.58 | 4.84 | 5.41 | -5% | +12% | +52% |
M4 Pro (4E+8P+16GPU+24GB) mlx: 0.20.0
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.48 | 1.43 | 7.96 | 1.02 | 5.15 | +3% | -30% | +437% |
| BCE | 1.47 | 0.70 | 34.00 | 0.70 | 4.26 | +110% | -52% | +2208% |
| Concat | 5.59 | 5.33 | 48.85 | 5.03 | 27.93 | +5% | -10% | +773% |
| Conv1d | 1.04 | 1.00 | 4.24 | 0.66 | 85.48 | +3% | -36% | +307% |
| Conv2d | 3.05 | 3.08 | 32.51 | 0.80 | 29.57 | 0% | -73% | +967% |
| Gather | 2.28 | 2.23 | 3.18 | 13.48 | 7.39 | +2% | +491% | +39% |
| LeakyReLU | 0.30 | 0.30 | 0.64 | 0.35 | 0.79 | +0% | +17% | +112% |
| Linear | 7.61 | 7.56 | 40.24 | 7.45 | 63.95 | +0% | -2% | +428% |
| MatMul | 8.24 | 7.54 | 21.94 | 13.81 | 137.11 | +9% | +67% | +166% |
| PReLU | 0.43 | 0.46 | 2.13 | 0.37 | 0.84 | -6% | -14% | +394% |
| ReLU | 0.29 | 0.33 | 0.42 | 0.36 | 1.17 | -12% | +25% | +45% |
| Scatter | 0.52 | 0.51 | 9.16 | 2.29 | 0.83 | +2% | +339% | +1658% |
| ScatterSum | 0.00 | 0.00 | 0.00 | nan | 1.06 | +34% | nan% | -3% |
| ScatterMax | 0.00 | 0.00 | 0.00 | nan | 1.03 | +19% | nan% | -5% |
| SeLU | 0.46 | 0.43 | 3.62 | 0.40 | 1.13 | +7% | -11% | +693% |
| Sigmoid | 0.28 | 0.29 | 3.46 | 0.34 | 1.01 | -5% | +23% | +1150% |
| Softmax | 7.23 | 5.56 | 30.51 | 3.08 | 18.53 | +30% | -57% | +321% |
| Softplus | 0.35 | 0.32 | 21.29 | 0.34 | 2.28 | +11% | -4% | +5944% |
| Sort | 1.26 | 1.23 | 214.41 | 15.29 | 56.47 | +2% | +1112% | +16912% |
| Sum | 2.25 | 2.28 | 5.91 | 2.61 | 6.09 | -1% | +16% | +163% |
| SumAll | 1.69 | 1.70 | 4.28 | 1.77 | 1.82 | 0% | +4% | +153% |
M4 Pro (4E+10P+20GPU+24GB) mlx: 0.24.1
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.42 | 1.45 | 7.67 | 0.78 | 5.01 | -1% | -45% | +438% |
| BCE | 1.46 | 0.65 | 14.54 | 0.44 | 3.65 | +124% | -69% | +896% |
| Concat | 5.54 | 5.22 | 48.61 | 4.96 | 29.00 | +6% | -10% | +777% |
| Conv1d | 0.82 | 0.82 | 3.82 | 0.47 | 113.39 | 0% | -42% | +364% |
| Conv2d | 2.52 | 2.54 | 31.65 | 0.67 | 30.70 | 0% | -73% | +1154% |
| Gather | 1.30 | 1.32 | 3.13 | 13.28 | 6.85 | -1% | +919% | +140% |
| LayerNorm | 0.42 | 0.43 | 2.62 | 0.84 | 1.16 | -1% | +96% | +517% |
| LeakyReLU | 0.43 | 0.33 | 0.60 | 0.31 | 0.63 | +30% | -27% | +39% |
| Linear | 6.37 | 6.22 | 39.72 | 6.20 | 36.36 | +2% | -2% | +523% |
| MatMul | 9.81 | 6.86 | 22.04 | 13.28 | 67.90 | +43% | +35% | +124% |
| PReLU | 0.57 | 0.31 | 3.02 | 0.34 | 0.62 | +83% | -39% | +431% |
| ReLU | 0.31 | 0.30 | 0.37 | 0.41 | 0.78 | +1% | +33% | +21% |
| ScaledDotProductAttention | 2.62 | 2.59 | 10.11 | 1.81 | 5.38 | +1% | -30% | +285% |
| Scatter | 0.37 | 0.32 | 9.06 | 2.29 | 0.75 | +17% | +512% | +2320% |
| ScatterSum | 0.00 | 0.00 | 0.00 | 0.28 | 0.97 | +18% | +22629% | +2% |
| ScatterMax | 0.00 | 0.00 | 0.00 | 0.29 | 0.98 | +33% | +24052% | +0% |
| SeLU | 0.95 | 0.34 | 6.10 | 0.32 | 0.93 | +176% | -66% | +541% |
| Sigmoid | 0.36 | 0.36 | 2.19 | 0.28 | 0.84 | +1% | -22% | +504% |
| Softmax | 7.17 | 5.37 | 32.41 | 2.77 | 13.97 | +33% | -61% | +352% |
| Softplus | 0.32 | 0.32 | 20.25 | 0.30 | 1.86 | -1% | -6% | +6274% |
| Sort | 1.29 | 1.24 | 209.96 | 14.67 | 45.28 | +4% | +1039% | +16202% |
| Sum | 2.25 | 2.21 | 9.85 | 2.49 | 5.12 | +1% | +10% | +338% |
| SumAll | 1.67 | 1.68 | 7.18 | 1.71 | 1.62 | 0% | +2% | +329% |
M4 Max (4E+12P+40GPU+128GB) mlx: 0.20.0
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.41 | 1.43 | 8.01 | 0.70 | 4.69 | -1% | -50% | +468% |
| BCE | 0.88 | 0.89 | 33.88 | 0.52 | 2.93 | -1% | -40% | +3770% |
| Concat | 2.86 | 2.87 | 47.64 | 2.67 | 19.94 | 0% | -6% | +1563% |
| Conv1d | 0.59 | 0.51 | 3.81 | 0.40 | 110.98 | +15% | -31% | +542% |
| Conv2d | 1.43 | 1.43 | 32.19 | 0.60 | 26.71 | +0% | -58% | +2152% |
| Gather | 1.08 | 1.04 | 3.07 | 7.37 | 6.20 | +3% | +584% | +185% |
| LeakyReLU | 0.24 | 0.23 | 0.67 | 0.19 | 0.75 | +2% | -17% | +181% |
| Linear | 3.35 | 3.83 | 39.12 | 3.21 | 45.63 | -12% | -4% | +1069% |
| MatMul | 4.12 | 4.21 | 21.00 | 6.36 | 139.36 | -2% | +54% | +409% |
| PReLU | 0.38 | 0.29 | 2.10 | 0.28 | 0.87 | +30% | -26% | +445% |
| ReLU | 0.23 | 0.22 | 0.42 | 0.27 | 0.87 | +3% | +21% | +86% |
| Scatter | 0.28 | 0.27 | 9.08 | 1.25 | 0.65 | +2% | +343% | +3122% |
| ScatterSum | 0.00 | 0.00 | 0.00 | nan | 1.06 | +28% | nan% | -7% |
| ScatterMax | 0.00 | 0.00 | 0.00 | nan | 1.06 | +10% | nan% | -3% |
| SeLU | 0.34 | 0.29 | 3.63 | 0.35 | 1.12 | +18% | +2% | +967% |
| Sigmoid | 0.21 | 0.22 | 3.45 | 0.27 | 0.93 | -6% | +27% | +1551% |
| Softmax | 3.97 | 3.40 | 29.65 | 1.52 | 13.74 | +16% | -61% | +647% |
| Softplus | 0.29 | 0.26 | 21.32 | 0.25 | 1.88 | +11% | -13% | +7339% |
| Sort | 0.58 | 0.59 | 207.83 | 7.18 | 41.13 | 0% | +1128% | +35475% |
| Sum | 1.25 | 1.23 | 6.28 | 1.44 | 4.54 | +1% | +15% | +403% |
| SumAll | 0.95 | 0.93 | 4.61 | 1.07 | 1.48 | +1% | +13% | +387% |
M3 Ultra (8E+20P+60GPU+96GB) mlx: 0.24.1
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 1.77 | 1.70 | 8.40 | 0.59 | 7.99 | +4% | -66% | +373% |
| BCE | 0.68 | 0.37 | 15.70 | 0.54 | 2.77 | +82% | -21% | +2201% |
| Concat | 1.88 | 1.85 | 84.41 | 1.80 | 40.87 | +1% | -4% | +4379% |
| Conv1d | 0.52 | 0.52 | 4.16 | 0.44 | 148.64 | +0% | -14% | +700% |
| Conv2d | 1.32 | 1.30 | 31.36 | 0.55 | 33.67 | +1% | -58% | +2272% |
| Gather | 0.60 | 0.72 | 3.38 | 4.11 | 14.71 | -15% | +579% | +458% |
| LeakyReLU | 0.26 | 0.26 | 0.67 | 0.22 | 1.47 | 0% | -13% | +158% |
| Linear | 2.74 | 2.77 | 52.00 | 2.53 | 45.65 | -1% | -7% | +1800% |
| MatMul | 3.38 | 3.54 | 16.76 | 4.96 | 231.43 | -4% | +46% | +395% |
| PReLU | 0.62 | 0.48 | 3.15 | 0.26 | 1.67 | +29% | -58% | +412% |
| ReLU | 0.26 | 0.35 | 0.42 | 0.22 | 1.47 | -25% | -12% | +63% |
| Scatter | 0.26 | 0.26 | 10.27 | 0.82 | 0.93 | +2% | +211% | +3792% |
| ScatterSum | 0.00 | 0.00 | 0.00 | 0.41 | 1.25 | +43% | +18803% | -33% |
| ScatterMax | 0.00 | 0.00 | 0.00 | 1.22 | 1.23 | +53% | +60915% | -27% |
| SeLU | 0.60 | 0.43 | 6.39 | 0.25 | 1.29 | +40% | -58% | +963% |
| Sigmoid | 0.26 | 0.26 | 2.21 | 0.27 | 1.25 | +2% | +2% | +740% |
| Softmax | 2.67 | 2.08 | 33.54 | 1.22 | 11.51 | +28% | -54% | +1155% |
| Softplus | 0.29 | 0.25 | 24.63 | 0.23 | 1.76 | +14% | -19% | +8357% |
| Sort | 0.73 | 0.60 | 213.11 | 6.18 | 33.69 | +22% | +746% | +29113% |
| Sum | 0.96 | 0.96 | 10.85 | 1.09 | 4.35 | 0% | +13% | +1029% |
| SumAll | 0.73 | 0.76 | 7.93 | 0.84 | 1.67 | -3% | +15% | +982% |
M5 Max (6S+12P+40GPU+128GB) mlx: 0.31.1 torch: 2.11.0
| Operation | mlx_gpu | mlx_gpu_compile | mlx_cpu | mps | cpu | mlx_gpu_compile/mlx_gpu speedup | mlx_gpu/mps speedup | mlx_gpu/mlx_cpu speedup |
|---|---|---|---|---|---|---|---|---|
| Argmax | 0.98 | 0.98 | 7.79 | 0.62 | 5.63 | 0% | -35% | +698% |
| BCE | 0.81 | 0.43 | 13.70 | 0.42 | 4.93 | +88% | -48% | +1590% |
| Concat | 2.37 | 2.28 | 35.06 | 2.25 | 5.38 | +3% | -4% | +1378% |
| Conv1d | 0.28 | 0.26 | 3.57 | 0.53 | 53.75 | +5% | +91% | +1195% |
| Conv2d | 0.80 | 0.80 | 20.69 | 0.55 | 18.46 | +0% | -31% | +2478% |
| Gather | 0.64 | 0.64 | 3.57 | 2.15 | 2.77 | 0% | +238% | +461% |
| LayerNorm | 0.31 | 0.31 | 2.56 | 0.40 | 0.47 | -1% | +29% | +725% |
| LeakyReLU | 0.33 | 0.43 | 0.47 | 0.24 | 0.19 | -22% | -27% | +42% |
| Linear | 1.53 | 1.51 | 48.03 | 3.16 | 30.80 | +0% | +107% | +3046% |
| MatMul | 3.61 | 3.33 | 31.26 | 5.47 | 30.53 | +8% | +51% | +766% |
| PReLU | 0.33 | 0.34 | 2.05 | 0.36 | 0.18 | -2% | +8% | +512% |
| ReLU | 0.26 | 0.28 | 0.43 | 0.26 | 0.16 | -8% | +2% | +69% |
| ScaledDotProductAttention | 1.41 | 1.39 | 8.50 | 1.11 | 2.73 | +1% | -21% | +502% |
| Scatter | 0.30 | 0.29 | 9.36 | 0.51 | 1.52 | +1% | +71% | +3062% |
| ScatterSum | 0.00 | 0.00 | 0.00 | 0.30 | 0.90 | +25% | +23703% | -7% |
| ScatterMax | 0.00 | 0.00 | 0.00 | 0.36 | 0.91 | +21% | +34158% | +3% |
| SeLU | 0.37 | 0.33 | 2.65 | 0.25 | 2.07 | +12% | -31% | +621% |
| Sigmoid | 0.32 | 0.30 | 2.02 | 0.24 | 0.60 | +5% | -24% | +533% |
| Softmax | 3.28 | 2.52 | 35.25 | 1.20 | 8.36 | +30% | -63% | +973% |
| Softplus | 0.26 | 0.28 | 17.34 | 0.29 | 2.23 | -8% | +13% | +6612% |
| Sort | 0.88 | 0.87 | 256.20 | 25.37 | 59.17 | +2% | +2768% | +28868% |
| Sum | 1.18 | 1.17 | 9.20 | 1.37 | 6.40 | +1% | +16% | +680% |
| SumAll | 0.89 | 0.89 | 6.74 | 0.98 | 1.44 | 0% | +10% | +658% |
Tesla V100 PCIe (32GB / Intel Xeon Gold 5120 14 cores / 28 threads @ 2.2GHz (Skylake), 60GB)
| Operation | cpu | cuda | cuda/cpu speedup |
|---|---|---|---|
| Argmax | 34.34 | 0.10 | +33411% |
| BCE | 198.19 | 0.19 | +102820% |
| Concat | 380.98 | 1.67 | +22679% |
| Conv1d | 30.21 | 0.33 | +9027% |
| Conv2d | 52.73 | 0.87 | +5938% |
| Gather | 96.61 | 0.42 | +22636% |
| LeakyReLU | 5.51 | 0.08 | +7010% |
| Linear | 901.98 | 3.79 | +23722% |
| MatMul | 1241.12 | 2.80 | +44293% |
| PReLU | 5.55 | 0.08 | +7159% |
| ReLU | 5.50 | 0.08 | +7032% |
| Scatter | 6.92 | 0.12 | +5875% |
| ScatterSum | 4.25 | 0.08 | +5058% |
| ScatterMax | nan | nan | nan% |
| SeLU | 11.56 | 0.08 | +14709% |
| Sigmoid | 9.46 | 0.08 | +12023% |
| Softmax | 221.43 | 0.71 | +31300% |
| Softplus | 22.13 | 0.08 | +27658% |
| Sort | 526.33 | 2.59 | +20202% |
| Sum | 67.43 | 0.70 | +9472% |
| SumAll | 29.82 | 0.50 | +5822% |
Tesla V100 NVLink (32GB / Intel Xeon Gold 6148 20 cores / 40 threads @ 2.4GHz (Skylake), 60GB)
| Operation | cpu | cuda | cuda/cpu speedup |
|---|---|---|---|
| Argmax | 28.23 | 0.10 | +28460% |
| BCE | 186.05 | 0.19 | +97956% |
| Concat | 531.34 | 1.67 | +31744% |
| Conv1d | 22.37 | 0.31 | +7033% |
| Conv2d | 52.89 | 0.83 | +6257% |
| Gather | 161.56 | 0.41 | +39152% |
| LeakyReLU | 16.95 | 0.08 | +21591% |
| Linear | 666.79 | 3.58 | +18532% |
| MatMul | 998.29 | 2.68 | +37198% |
| PReLU | 15.55 | 0.08 | +20584% |
| ReLU | 14.07 | 0.08 | +18496% |
| Scatter | 6.19 | 0.11 | +5548% |
| ScatterSum | 6.83 | 0.08 | +8757% |
| ScatterMax | nan | nan | nan% |
| SeLU | 20.94 | 0.08 | +27171% |
| Sigmoid | 19.82 | 0.08 | +25331% |
| Softmax | 253.76 | 0.70 | +36156% |
| Softplus | 29.21 | 0.08 | +37131% |
| Sort | 422.98 | 2.48 | +16933% |
| Sum | 69.38 | 0.70 | +9861% |
| SumAll | 31.13 | 0.50 | +6152% |
RTX4090 (Desktop / 10th Gen Intel Core i9-10940X @ 3.30GHz, 128GB)
| Operation | cpu | cuda | cuda/cpu speedup |
|---|---|---|---|
| Argmax | 6.67 | 0.04 | +14782% |
| BCE | 23.74 | 0.14 | +16992% |
| Concat | 52.08 | 1.29 | +3922% |
| Conv1d | 2.84 | 0.15 | +1753% |
| Conv2d | 6.60 | 0.25 | +2559% |
| Gather | 19.75 | 0.27 | +7340% |
| LeakyReLU | 2.44 | 0.03 | +7439% |
| Linear | 62.27 | 1.01 | +6057% |
| MatMul | 87.47 | 1.36 | +6322% |
| PReLU | 2.28 | 0.04 | +5297% |
| ReLU | 2.47 | 0.03 | +7216% |
| Scatter | 1.84 | 0.07 | +2652% |
| ScatterSum | 3.86 | 0.06 | +5919% |
| ScatterMax | 3.86 | 0.08 | +4790% |
| SeLU | 2.71 | 0.04 | +6952% |
| Sigmoid | 2.63 | 0.05 | +5626% |
| Softmax | 27.75 | 0.59 | +4634% |
| Softplus | 3.50 | 0.04 | +8149% |
| Sort | 46.67 | 0.90 | +5077% |
| Sum | 12.19 | 0.62 | +1866% |
| SumAll | 6.95 | 0.45 | +1428% |
A100 80GB PCIe (Server / Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz, 754GB)
| Operation | cpu | cuda | cuda/cpu speedup |
|---|---|---|---|
| Argmax | 5.04 | 0.06 | +7856% |
| BCE | 18.22 | 0.11 | +16097% |
| Concat | 30.47 | 0.74 | +4036% |
| Conv1d | 1029.44 | 0.13 | +811270% |
| Conv2d | 531.83 | 0.26 | +205989% |
| Gather | 9.59 | 0.30 | +3045% |
| LeakyReLU | 1.68 | 0.06 | +2579% |
| Linear | 47.44 | 2.17 | +2090% |
| MatMul | 50.91 | 2.07 | +2355% |
| PReLU | 1.60 | 0.05 | +3332% |
| ReLU | 1.43 | 0.04 | +3380% |
| Scatter | 1.61 | 0.11 | +1358% |
| ScatterSum | 4.95 | 0.06 | +7547% |
| ScatterMax | 5.39 | 0.33 | +1511% |
| SeLU | 1.82 | 0.04 | +4259% |
| Sigmoid | 3.03 | 0.04 | +7553% |
| Softmax | 18.18 | 0.36 | +5003% |
| Softplus | 2.87 | 0.04 | +6412% |
| Sort | 52.86 | 1.16 | +4449% |
| Sum | 11.38 | 0.37 | +2947% |
| SumAll | 6.85 | 0.29 | +2226% |