Mixed Precision Training #2

@tom-m-walker

Description

Hi,

Thanks for this great work.

I noticed that the model class has the option to use FP16, but it's not used by default.

Was FP32 found to be necessary to achieve good performance? If so, were there any hypotheses about which parts of the architecture require high precision?
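To make the concern concrete, here is a small numpy sketch (not from this codebase, purely illustrative) of the kind of issue FP16 can introduce: near 1.0, FP16 values are spaced about 2**-10 apart, so a sufficiently small weight update is rounded away entirely, while the same update survives in FP32.

```python
import numpy as np

# FP16 has a 10-bit mantissa, so near 1.0 the spacing between
# representable values is about 2**-10 ~= 0.00098. An update of
# 1e-4 falls well below half that spacing and rounds away.
w16 = np.float16(1.0)
w16_updated = w16 + np.float16(1e-4)
print(w16_updated == w16)   # True: the update was lost

# The same update survives in FP32 (23-bit mantissa).
w32 = np.float32(1.0)
w32_updated = w32 + np.float32(1e-4)
print(w32_updated > w32)    # True: the update was applied
```

This is one common reason projects keep master weights (or at least some layers, e.g. normalization and loss computation) in FP32 even when using FP16 elsewhere.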

Thanks
