Hello,
I would like to thank you for providing such a great paper to the community.
I have a few questions:
- In your paper, you mention that for text-to-image retrieval you train on both ITC (contrastive loss) and ITC (sigmoid loss). Did you try fine-tuning on just one of these losses and evaluating with the same one? How were the results?
- In your Hugging Face implementation, how would one go about training with the strategy you adopted here?