Hi there,
I am using facebook/pe-av-large and following the example code in the model card, which computes similarity as a dot product: audio_embeds @ visual_embeds.T.
I noticed that the resulting similarity scores often exceed 1.0 (e.g., I am seeing scores around 1.1). This suggests the embeddings are not L2-normalized by default.
- Are the embeddings intended to be used as unnormalized dot products?
- Is there a known range for these scores? I am trying to set a threshold to separate "good" pairs from "bad" ones.
- Should I manually L2-normalize the embeddings so that the scores can be interpreted as cosine similarity (between -1 and 1)?
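
For reference, here is a minimal sketch of the manual normalization I have in mind. The random arrays and their shapes are hypothetical stand-ins for the model outputs, not real pe-av-large embeddings, and I use NumPy only to keep the example self-contained. On PyTorch tensors, `torch.nn.functional.normalize(x, dim=-1)` should do the same row-wise scaling.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

# Hypothetical stand-ins for the model outputs: 4 audio and 4 visual
# embeddings of dimension 16. Real embeddings would come from the model.
rng = np.random.default_rng(0)
audio_embeds = rng.normal(size=(4, 16))
visual_embeds = rng.normal(size=(4, 16))

# Raw dot products, as in the model card example: not bounded to [-1, 1].
raw_scores = audio_embeds @ visual_embeds.T

# Dot products of unit-norm vectors are cosine similarities, bounded to [-1, 1].
cosine_scores = l2_normalize(audio_embeds) @ l2_normalize(visual_embeds).T
```

With this, a threshold on `cosine_scores` would at least be on a fixed scale, though I do not know whether that matches how the model was trained or is meant to be used.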
Thanks!