We should probably log the gradient norm to TensorBoard so we can see how convergence behaves over the course of training.
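
Roughly what I have in mind, as a minimal sketch rather than a drop-in patch. It assumes we want the global L2 norm over all parameter gradients, and that the writer exposes a TensorBoard-style `add_scalar(tag, scalar_value, global_step)` method. The names `ScalarWriter`, `global_grad_norm`, `log_grad_norm`, and the `train/grad_norm` tag are made up for illustration; gradients are plain sequences of floats here so the snippet runs without any framework.

```python
import math
from typing import Iterable, Protocol, Sequence


class ScalarWriter(Protocol):
    """Hypothetical interface: anything with a TensorBoard-style add_scalar."""

    def add_scalar(self, tag: str, scalar_value: float, global_step: int) -> None: ...


def global_grad_norm(grads: Iterable[Sequence[float]]) -> float:
    """L2 norm over every gradient entry, treating all parameters as one flat vector."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def log_grad_norm(
    writer: ScalarWriter,
    grads: Iterable[Sequence[float]],
    step: int,
    tag: str = "train/grad_norm",  # assumed tag name, pick whatever fits our dashboards
) -> float:
    """Compute the global gradient norm and record it as a scalar at this step."""
    norm = global_grad_norm(grads)
    writer.add_scalar(tag, norm, step)
    return norm
```

In the training loop this would be called once per optimizer step, after the backward pass and before the gradients are zeroed. If we are on PyTorch, `torch.utils.tensorboard.SummaryWriter` has a matching `add_scalar`, and `torch.nn.utils.clip_grad_norm_` already returns the total norm, so we could log that value directly instead of recomputing it.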