run_unit_test.sh
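Hello, I modified run_unit_test.sh to use `for gqa in 4; do`, `--kv-heads 8 \`, and `for seqlen in 2048 8192 16384 32768 65536 131072; do`, then ran it on an A100.
My measured results are below. I have two questions:
1. At 8k, FSA is slower than NSA in Forward, Backward, and 1F1B Total, and it is also slower at 2k. Is this caused by a problem in my test file?
2. Figure 2 of the FSA paper shows that FSA has a clear advantage in memory access volume, but does it have no advantage in GPU memory footprint? In my results FSA never uses less memory than NSA, and the gap grows with sequence length.
I have only recently started learning about attention mechanisms, so please excuse anything unprofessional in these questions.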
Configuration: seqlen=2048, block-size=64, topk=16, gqa=4
ℹ️ 📊 Performance Breakdown (ms):
┌─────────────┬──────────┬──────────┬─────────────┐
│ Phase       │ NSA      │ FSA      │ Speedup     │
├─────────────┼──────────┼──────────┼─────────────┤
│ Forward     │ 9.36     │ 21.47    │ 0.44x       │
│ Backward    │ 10.17    │ 33.62    │ 0.30x       │
│ 1F1B Total  │ 19.54    │ 55.09    │ 0.35x       │
└─────────────┴──────────┴──────────┴─────────────┘
ℹ️ 💾 Memory Usage Analysis:
ℹ️ Forward Memory Usage: NSA=0.38GB, FSA=0.39GB (FSA uses: 0.00GB more memory)
ℹ️ 1F1B Memory Usage: NSA=0.45GB, FSA=0.45GB (FSA uses: 0.00GB more memory)
Configuration: seqlen=8192, block-size=64, topk=16, gqa=4
📊 Performance Breakdown (ms):
┌─────────────┬──────────┬──────────┬─────────────┐
│ Phase       │ NSA      │ FSA      │ Speedup     │
├─────────────┼──────────┼──────────┼─────────────┤
│ Forward     │ 22.97    │ 26.06    │ 0.88x       │
│ Backward    │ 34.79    │ 50.45    │ 0.69x       │
│ 1F1B Total  │ 57.76    │ 76.51    │ 0.75x       │
└─────────────┴──────────┴──────────┴─────────────┘
ℹ️ 💾 Memory Usage Analysis:
ℹ️ Forward Memory Usage: NSA=1.11GB, FSA=1.11GB (FSA uses: 0.00GB more memory)
ℹ️ 1F1B Memory Usage: NSA=1.17GB, FSA=1.18GB (FSA uses: 0.00GB more memory)
Configuration: seqlen=16384, block-size=64, topk=16, gqa=4
ℹ️ 📊 Performance Breakdown (ms):
┌─────────────┬──────────┬──────────┬─────────────┐
│ Phase       │ NSA      │ FSA      │ Speedup     │
├─────────────┼──────────┼──────────┼─────────────┤
│ Forward     │ 46.98    │ 40.42    │ 1.16x       │
│ Backward    │ 73.51    │ 68.89    │ 1.07x       │
│ 1F1B Total  │ 120.49   │ 109.30   │ 1.10x       │
└─────────────┴──────────┴──────────┴─────────────┘
ℹ️ 💾 Memory Usage Analysis:
ℹ️ Forward Memory Usage: NSA=2.09GB, FSA=2.11GB (FSA uses: 0.02GB more memory)
ℹ️ 1F1B Memory Usage: NSA=2.14GB, FSA=2.15GB (FSA uses: 0.01GB more memory)
Configuration: seqlen=32768, block-size=64, topk=16, gqa=4
ℹ️ 📊 Performance Breakdown (ms):
┌─────────────┬──────────┬──────────┬─────────────┐
│ Phase       │ NSA      │ FSA      │ Speedup     │
├─────────────┼──────────┼──────────┼─────────────┤
│ Forward     │ 99.58    │ 79.07    │ 1.26x       │
│ Backward    │ 162.84   │ 137.36   │ 1.19x       │
│ 1F1B Total  │ 262.41   │ 216.43   │ 1.21x       │
└─────────────┴──────────┴──────────┴─────────────┘
ℹ️ 💾 Memory Usage Analysis:
ℹ️ Forward Memory Usage: NSA=4.08GB, FSA=4.37GB (FSA uses: 0.29GB more memory)
ℹ️ 1F1B Memory Usage: NSA=4.07GB, FSA=4.09GB (FSA uses: 0.02GB more memory)
Configuration: seqlen=65536, block-size=64, topk=16, gqa=4
ℹ️ 📊 Performance Breakdown (ms):
┌─────────────┬──────────┬──────────┬─────────────┐
│ Phase       │ NSA      │ FSA      │ Speedup     │
├─────────────┼──────────┼──────────┼─────────────┤
│ Forward     │ 239.90   │ 199.96   │ 1.20x       │
│ Backward    │ 375.04   │ 294.62   │ 1.27x       │
│ 1F1B Total  │ 614.94   │ 494.59   │ 1.24x       │
└─────────────┴──────────┴──────────┴─────────────┘
ℹ️ 💾 Memory Usage Analysis:
ℹ️ Forward Memory Usage: NSA=8.06GB, FSA=9.72GB (FSA uses: 1.67GB more memory)
ℹ️ 1F1B Memory Usage: NSA=7.94GB, FSA=7.97GB (FSA uses: 0.03GB more memory)
Configuration: seqlen=131072, block-size=64, topk=16, gqa=4
ℹ️ 📊 Performance Breakdown (ms):
┌─────────────┬──────────┬──────────┬─────────────┐
│ Phase       │ NSA      │ FSA      │ Speedup     │
├─────────────┼──────────┼──────────┼─────────────┤
│ Forward     │ 639.56   │ 590.68   │ 1.08x       │
│ Backward    │ 891.85   │ 711.37   │ 1.25x       │
│ 1F1B Total  │ 1531.41  │ 1302.04  │ 1.18x       │
└─────────────┴──────────┴──────────┴─────────────┘
ℹ️ 💾 Memory Usage Analysis:
ℹ️ Forward Memory Usage: NSA=16.01GB, FSA=23.50GB (FSA uses: 7.48GB more memory)
ℹ️ 1F1B Memory Usage: NSA=15.68GB, FSA=17.64GB (FSA uses: 1.97GB more memory)
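To make the trend across sequence lengths easier to read, here is a small sketch that recomputes the 1F1B speedup and the forward-pass memory overhead from the numbers pasted above. The dictionary layout and the function names (`summarize`, `first_faster_seqlen`) are my own, chosen for illustration; they are not part of the FSA repository or of run_unit_test.sh.

```python
# Values copied by hand from the logs above (A100, block-size=64, topk=16, gqa=4).
# seqlen -> (NSA 1F1B ms, FSA 1F1B ms, NSA forward GB, FSA forward GB)
RESULTS = {
    2048: (19.54, 55.09, 0.38, 0.39),
    8192: (57.76, 76.51, 1.11, 1.11),
    16384: (120.49, 109.30, 2.09, 2.11),
    32768: (262.41, 216.43, 4.08, 4.37),
    65536: (614.94, 494.59, 8.06, 9.72),
    131072: (1531.41, 1302.04, 16.01, 23.50),
}


def summarize(results):
    """Return (seqlen, speedup, forward memory overhead in %) per configuration."""
    rows = []
    for seqlen in sorted(results):
        nsa_ms, fsa_ms, nsa_gb, fsa_gb = results[seqlen]
        speedup = nsa_ms / fsa_ms  # > 1.0 means FSA is faster than NSA
        overhead_pct = (fsa_gb - nsa_gb) / nsa_gb * 100.0
        rows.append((seqlen, speedup, overhead_pct))
    return rows


def first_faster_seqlen(rows):
    """Smallest measured seqlen where FSA beats NSA, or None if it never does."""
    for seqlen, speedup, _ in rows:
        if speedup > 1.0:
            return seqlen
    return None


rows = summarize(RESULTS)
for seqlen, speedup, overhead_pct in rows:
    print(f"seqlen={seqlen:>6}  1F1B speedup={speedup:.2f}x  "
          f"forward memory overhead={overhead_pct:+.1f}%")
print("first measured seqlen where FSA is faster:", first_faster_seqlen(rows))
```

This only reorganizes the measurements already shown. Since I did not test any length between 8192 and 16384, the actual crossover point may lie anywhere in that range.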