In this code, training appears to be called only after a simulation has ended, so I think the model trains once each episode is finished. Did you use a Monte Carlo (MC) method?