Question about cudagraph_runtime_mode=None #46
Unanswered
brandonmmusic-max asked this question in Q&A
Replies: 0 comments
I've been experimenting with trying to train a dflash model on qwen 3.5 397b running through vLLM. Training results were pretty good, but I get a very buggy 0% acceptance rate during actual inference. I've been debugging, and I think the CUDA graphs capture the dflash forward with 0 context tokens during warmup and then replay that frozen graph during real inference, so the actual context hidden states never reach the model. Has anyone encountered that during testing? I was thinking of disabling CUDA graphs just during the dflash propose call, but I'm out of my league on this one, lol. The results of your models look great, though, and thank you all for sharing!
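For context on the suspicion above: this is consistent with how CUDA graph capture/replay generally works. A replayed graph always reads from the exact buffers it was bound to at capture time, so new inputs only take effect if they are copied into those static buffers before replay; otherwise the graph keeps computing on the warmup-time data. A minimal, GPU-free Python sketch of that semantics (the `Graph` class and names here are illustrative stand-ins, not the vLLM or PyTorch API):

```python
# Sketch of CUDA-graph capture/replay semantics: a "graph" records ops
# bound to fixed buffers at capture time and replays them verbatim,
# ignoring any freshly allocated tensors. Illustrative only.

class Graph:
    def __init__(self):
        self.ops = []

    def capture(self, fn, *buffers):
        # Record the op bound to these exact buffer objects.
        self.ops.append((fn, buffers))
        fn(*buffers)

    def replay(self):
        # Re-run the recorded ops on the captured buffers only.
        for fn, buffers in self.ops:
            fn(*buffers)

static_input = [0.0]   # buffer identity is frozen at capture time
static_output = [0.0]

def forward(inp, out):
    out[0] = inp[0] * 2.0

g = Graph()
static_input[0] = 0.0                  # warmup with "0 context tokens"
g.capture(forward, static_input, static_output)

# Wrong: a fresh tensor never reaches the captured graph.
fresh_input = [21.0]
g.replay()
print(static_output[0])                # still 0.0

# Right: copy the real context into the static buffer before replay.
static_input[0] = fresh_input[0]
g.replay()
print(static_output[0])                # 42.0
```

If the dflash proposer's warmup captures a graph whose input buffers are never refilled with the real context hidden states, replay would reproduce the warmup forward exactly, which would show up as 0% acceptance.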