Hi, I was trying to apply your work to VGG16. BP and FA run fine, but DFA, sDFA, and DRTP all run out of memory (OOM). The extra parameters added by traininghook grow rapidly with the input and output size, as well as with the depth and width of the model. Do you have any advice on scaling your work up to larger models?
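For a rough sense of scale, here is a back-of-the-envelope sketch in plain Python. It assumes (my reading, not confirmed) that every hidden layer gets one fixed random feedback matrix of shape (num_classes, C*H*W) stored in float32. The 224x224 input, 1000 classes, and the standard VGG16 layer layout are also assumptions, and the function names are hypothetical, not taken from your code.

```python
# Rough estimate of the extra memory needed for fixed random feedback matrices.
# Assumption: one matrix of shape (num_classes, C*H*W) per hidden layer.

# Standard VGG16 layout: numbers are conv output channels, "M" is a 2x2 max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def feedback_param_count(input_hw=224, num_classes=1000, fc_widths=(4096, 4096)):
    """Count feedback-matrix entries summed over all hidden layers."""
    side = input_hw
    total = 0
    for item in VGG16_CFG:
        if item == "M":
            side //= 2  # pooling halves the spatial resolution
        else:
            # 3x3 conv with padding 1 keeps the spatial size: output is item x side x side
            total += num_classes * item * side * side
    for width in fc_widths:
        total += num_classes * width  # hidden fully connected layers
    return total


def to_gib(n_params, bytes_per_param=4):
    """Convert a parameter count to GiB (float32 by default)."""
    return n_params * bytes_per_param / 2**30


if __name__ == "__main__":
    n = feedback_param_count()
    print(f"{n:,} feedback parameters, about {to_gib(n):.1f} GiB in float32")
```

Under these assumptions the feedback matrices alone come to about 13.6 billion values (roughly 50 GiB in float32), most of it from the early high-resolution conv layers.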
Thanks for your fantastic work!