Hello,
Thank you for your implementation. I have a question about the residual connection in SublayerConnection. Based on the original paper, one uses
dropout(norm(x + sublayer(x))). But your implementation has x + dropout(sublayer(norm(x))) instead. Is my understanding correct? Thanks
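
To make the two orderings concrete, here is a minimal pure-Python sketch of what I mean. It is only an illustration under my own assumptions: `sublayer` is a hypothetical stand-in (a fixed elementwise function rather than attention or a feed-forward layer), `layer_norm` has no learned gain or bias, and `dropout` is treated as the identity, as it would be in evaluation mode. None of these names are taken from the implementation itself.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize a vector to zero mean and (approximately) unit variance.
    # Assumption: no learned gain/bias parameters, to keep the sketch short.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def dropout(x):
    # Assumption: evaluation mode, so dropout passes values through unchanged.
    return list(x)

def sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward block.
    return [2.0 * v + 1.0 for v in x]

def post_norm(x):
    # Ordering as I read the paper: dropout(norm(x + sublayer(x)))
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return dropout(layer_norm(summed))

def pre_norm(x):
    # Ordering as I read the implementation: x + dropout(sublayer(norm(x)))
    return [a + b for a, b in zip(x, dropout(sublayer(layer_norm(x))))]

x = [1.0, 2.0, 4.0]
print("post-norm:", post_norm(x))
print("pre-norm: ", pre_norm(x))
```

In the first ordering the output is itself normalized, so its mean is zero. In the second, the input x is added back unnormalized, so the two functions generally give different results even with dropout disabled.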