I downloaded them from Hugging Face and, on inspection, noticed that the weights of layers with "local_attn" in their names are all zeros in the 14B 480p model. This is not the case for the 1.3B model. Is this intended, or should I be loading the checkpoint differently? I used a regular torch.load to obtain the state dictionary.
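For reference, the check described above can be reproduced roughly like this. It is a minimal sketch, not the exact script I ran: the helper name `zero_report` and the checkpoint path in the usage comment are hypothetical, and it assumes torch.load returns a flat name-to-tensor mapping (a checkpoint that nests the state dictionary under another key would need unwrapping first).

```python
def zero_report(state_dict, substring="local_attn"):
    """Map each matching entry name to True if every element is zero.

    Works with any values exposing .any(), e.g. torch.Tensor.
    Entries without .any() (non-tensor metadata) are skipped.
    """
    report = {}
    for name, value in state_dict.items():
        if substring in name and hasattr(value, "any"):
            # .any() is truthy if at least one element is nonzero
            report[name] = not bool(value.any())
    return report


# Usage (hypothetical path; map_location="cpu" avoids needing a GPU):
#   import torch
#   sd = torch.load("path/to/checkpoint.pth", map_location="cpu")
#   for name, is_zero in zero_report(sd).items():
#       print(name, "ALL ZERO" if is_zero else "ok")
```

Running the same helper on both the 14B and the 1.3B state dictionaries makes it easy to compare which "local_attn" entries are zero in each.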