We know that for QKV attention, the result of q @ k should be divided by sqrt(d). Does the same apply to EfficientViT?
Does ReLU-based linear attention need LayerNorm or a position embedding?
Does ReLU-based linear attention need multi-head attention?
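For context on the first question: in standard attention the sqrt(d) scaling keeps the logit variance near 1 before the softmax, while ReLU-based linear attention (as used in EfficientViT) has no softmax at all and instead normalizes each output row explicitly, so the sqrt(d) factor is not required in the same way. A minimal NumPy sketch of the two variants (function names and the `eps` stabilizer are my own, not from EfficientViT's code):

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: q @ k.T is divided by sqrt(d)
    # to keep logit variance ~1 before the softmax.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def relu_linear_attention(q, k, v, eps=1e-6):
    # ReLU-based linear attention: replace the softmax kernel with ReLU
    # feature maps, so (k.T @ v) is computed once and shared by all
    # queries, giving O(n) cost in sequence length. The row-wise
    # denominator below plays the normalizing role of the softmax, so no
    # separate sqrt(d) scaling is applied here.
    qp, kp = np.maximum(q, 0), np.maximum(k, 0)
    kv = kp.T @ v                                      # (d, d_v), query-independent
    num = qp @ kv                                      # (n, d_v)
    den = qp @ kp.sum(axis=0, keepdims=True).T + eps   # (n, 1) normalizer
    return num / den

rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = rng.standard_normal((3, n, d))
print(softmax_attention(q, k, v).shape)       # (8, 16)
print(relu_linear_attention(q, k, v).shape)   # (8, 16)
```

Both variants map (n, d) inputs to (n, d) outputs; the difference is that the linear version never materializes the (n, n) attention matrix.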