I just came across a strange problem. I slightly modified some parts of adam.lua as follows:
-- Initialization
state.t = state.t or 0
-- Exponential moving average of gradient values
state.m = state.m or x.new(x:size()):zero()
-- Exponential moving average of squared gradient values
state.v = state.v or x.new(x:size()):zero()
-- A tmp tensor to hold the sqrt(v) + epsilon
state.denom = state.denom or x.new(x:size()):zero()
-- (3) learning rate decay (annealing)
local clr = lr / (1 + state.t*lrd)
state.t = state.t + 1
local biasCorrection1 = 1 - beta1^state.t
local biasCorrection2 = 1 - beta2^state.t
-- (1) evaluate f(x) and df/dx
local fx, dfdx = opfunc(x)
-- (2) weight decay
if wd ~= 0 then
  dfdx:add(wd, x)
end
I changed the order of (1), (2) and (3), and placed
local biasCorrection1 = 1 - beta1^state.t
local biasCorrection2 = 1 - beta2^state.t
after state.t = state.t + 1. With these changes, the training losses are no longer reproducible across runs, even though I used the same seed. If I add a print() between state.t = state.t + 1 and local biasCorrection1 = 1 - beta1^state.t (the exact placement is sketched below), then I get the same training losses across multiple runs. The original adam.lua produces the same results across multiple runs.
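To make the placement concrete, here is the relevant part of my modified excerpt again with the print() inserted between the step counter update and the bias corrections (what is printed is just an example; the surrounding lines are copied from the code above):

-- (3) learning rate decay (annealing)
local clr = lr / (1 + state.t*lrd)

state.t = state.t + 1

-- with a print() added at this point, e.g. of the step counter,
-- repeated runs give me identical training losses again
print(state.t)

local biasCorrection1 = 1 - beta1^state.t
local biasCorrection2 = 1 - beta2^state.t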
Does anyone have any idea what might be happening?