In the GQA implementation of the HSTU backward operator, is the number of heads (nhead) kept consistent between inputs and outputs by performing a reduction within each group? Also, there is currently no CPU reference implementation for the GQA backward pass. Could one be added?
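
To make the question concrete, here is a minimal NumPy sketch of what I mean by intra-group reduction. It is not taken from the operator's source. The function name `reduce_kv_grad`, the `(heads, seq_len, head_dim)` layout, and the assumption that query head `h` reads KV head `h // group` are all hypothetical, chosen only for illustration. It covers only the head-count reduction step, not the full HSTU backward computation.

```python
import numpy as np


def reduce_kv_grad(d_expanded, num_kv_heads):
    """Sum per-query-head gradients over each GQA group (hypothetical helper).

    d_expanded: array of shape (num_q_heads, seq_len, head_dim), the gradient
        w.r.t. K or V as if it had been broadcast to one copy per query head.
    Returns an array of shape (num_kv_heads, seq_len, head_dim).
    """
    num_q_heads, seq_len, head_dim = d_expanded.shape
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group = num_q_heads // num_kv_heads
    # Assumption: heads of one group are contiguous, so query head h maps to
    # KV head h // group (the layout produced by np.repeat(K, group, axis=0)).
    grouped = d_expanded.reshape(num_kv_heads, group, seq_len, head_dim)
    return grouped.sum(axis=1)


rng = np.random.default_rng(0)
num_q_heads, num_kv_heads, seq_len, head_dim = 4, 2, 3, 5
d_k_expanded = rng.standard_normal((num_q_heads, seq_len, head_dim))

d_k = reduce_kv_grad(d_k_expanded, num_kv_heads)
print(d_k.shape)  # (2, 3, 5)
```

With this layout, `d_k` has the same head count as K, while dQ would keep the query head count. If the operator uses a different head-to-group mapping, the reshape above would need to change accordingly.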