We know that for QKV attention, the result of q @ k should be divided by sqrt(d). Does the same apply to EfficientViT?
Does ReLU-based linear attention need LayerNorm or a position embedding?
Does ReLU-based linear attention need multi-head attention?
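For context on the first question: in standard attention the sqrt(d) scaling keeps the logit variance near 1 before the softmax, while ReLU-based linear attention (as used in EfficientViT) has no softmax at all and instead normalizes each output row explicitly, so the sqrt(d) factor is not required in the same way. A minimal NumPy sketch of the two variants (function names and the `eps` stabilizer are my own, not from EfficientViT's code):

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: q @ k.T is divided by sqrt(d)
    # to keep logit variance ~1 before the softmax.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def relu_linear_attention(q, k, v, eps=1e-6):
    # ReLU-based linear attention: replace the softmax kernel with ReLU
    # feature maps, so (k.T @ v) is computed once and shared by all
    # queries, giving O(n) cost in sequence length. The row-wise
    # denominator below plays the normalizing role of the softmax, so no
    # separate sqrt(d) scaling is applied here.
    qp, kp = np.maximum(q, 0), np.maximum(k, 0)
    kv = kp.T @ v                                      # (d, d_v), query-independent
    num = qp @ kv                                      # (n, d_v)
    den = qp @ kp.sum(axis=0, keepdims=True).T + eps   # (n, 1) normalizer
    return num / den

rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = rng.standard_normal((3, n, d))
print(softmax_attention(q, k, v).shape)       # (8, 16)
print(relu_linear_attention(q, k, v).shape)   # (8, 16)
```

Both variants map (n, d) inputs to (n, d) outputs; the difference is that the linear version never materializes the (n, n) attention matrix.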