Hi,
Thanks for this amazing work, really appreciate it!
I am wondering what your training data looks like. My understanding is that you have a coin dataset containing only images, with no text, and that for each coin type you have images that differ in perspective, lighting conditions, etc. You then used this dataset to run contrastive learning on the visual encoder of the pretrained CLIP model. Is that correct?
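
To make the question concrete, here is a rough sketch of the kind of objective I have in mind. It is only my guess, not your actual code: the function names, the temperature value, and the toy 2-D embeddings are all my own assumptions. In practice the embeddings would come from the CLIP image encoder, and the labels would be the coin types.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    # Hypothetical helper: images of the same coin type are positives,
    # all other images in the batch are negatives.
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, count = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other image.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-type image contributes nothing
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of picking a positive.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0

# Toy batch: two "views" each of two made-up coin types.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

With this toy batch, the loss is lower when the labels match the clusters than when they are mixed, which is the behavior I would expect the fine-tuning to push the encoder towards.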