The main Agent that runs all the operations. TODO: turn it into a bunch of small modules that run one after another, with human inspection at each stage. The first stage predicts flag positions, then a human manually looks at them to ensure everything is correct. If an image got multiple flags, separate them into multiple images so each one fits into the next pipeline stage (one way to do the split is sketched below).
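A minimal sketch of that splitting step, assuming boxes come in as pixel-space (x1, y1, x2, y2) tuples; the function name and output layout here are illustrative, not the repo's actual API:

```python
from pathlib import Path
from PIL import Image

def split_detections(image_path, boxes, out_dir="crops"):
    """Save one cropped image per predicted flag box so each crop can
    enter the next pipeline stage on its own."""
    Path(out_dir).mkdir(exist_ok=True)
    img = Image.open(image_path)
    stem = Path(image_path).stem
    out_paths = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        out_path = Path(out_dir) / f"{stem}_flag{i}.jpg"
        img.crop((x1, y1, x2, y2)).convert("RGB").save(out_path)
        out_paths.append(out_path)
    return out_paths
```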
The next pipeline stage takes those detected flags, gets the center point of each box, casts it back to the original image dimensions, then gets the lat/long of the flag and returns the top predicted matches for that flag.
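A hedged sketch of the center-point cast, assuming the detector ran on a resized copy of the image and boxes are (x1, y1, x2, y2) in that resized space; all names are illustrative:

```python
def box_center_in_original(box, model_size, original_size):
    """Map a box center from the resized (model-input) image back to
    the original image's pixel coordinates. Sizes are (width, height)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return (cx * original_size[0] / model_size[0],
            cy * original_size[1] / model_size[1])

# box_center_in_original((100, 50, 140, 90), (640, 640), (4000, 3000))
# -> (750.0, 328.125)
```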
synthetic_dataset.py and train.py were used for debugging purposes. notebooks/raw_generation.ipynb was used to generate different scales of an image to check how clear the flag was at each scale. notebooks/kaggle-trainer.ipynb was the main Kaggle trainer.
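For reference, a small sketch of the kind of multi-scale export raw_generation.ipynb presumably does (the scale factors and output naming are illustrative assumptions):

```python
from pathlib import Path
from PIL import Image

def export_scales(image_path, factors=(1.0, 0.5, 0.25, 0.125)):
    """Save the same image at several scales so a reviewer can judge at
    which resolution the flag is still clearly visible."""
    img = Image.open(image_path).convert("RGB")
    stem = Path(image_path).stem
    for f in factors:
        resized = img.resize((int(img.width * f), int(img.height * f)))
        resized.save(f"{stem}_x{f}.jpg")
```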
As of now, the image's GPS coordinates can be extracted manually via https://exif.tools/; we need to find a Python API for this (one candidate is sketched below).
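One candidate is Pillow, which can read the GPS IFD straight out of the EXIF block; a minimal sketch, assuming the camera wrote the standard GPSLatitude/GPSLongitude rational tags:

```python
from PIL import Image

def exif_gps(image_path):
    """Return (lat, lon) in decimal degrees, or None if no GPS tags exist."""
    gps = Image.open(image_path).getexif().get_ifd(0x8825)  # 0x8825 = GPS IFD
    if not gps:
        return None
    def to_decimal(dms, ref):
        deg = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -deg if ref in ("S", "W") else deg
    # EXIF tag ids: 1 = GPSLatitudeRef, 2 = GPSLatitude,
    #               3 = GPSLongitudeRef, 4 = GPSLongitude
    return to_decimal(gps[2], gps[1]), to_decimal(gps[4], gps[3])
```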
Zero-shot model; get your reference flags here: for country flags, https://flagpedia.net/download/images; for institutions, https://flagpedia.net/organization or https://commons.wikimedia.org/wiki/Category:Flags_of_international_organizations
If you change them, make sure to rerun embed_generation.ipynb to regenerate the embeddings (a minimal sketch of that step follows). I made a dataset publicly available here: https://www.kaggle.com/datasets/zeyadcode/country-and-institutions-flags-reference
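A minimal sketch of what that regeneration has to do, assuming the reference images sit in a flags/ folder and the embeddings get saved, L2-normalized, to a single file (both assumptions, not the notebook's actual layout):

```python
import glob
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = sorted(glob.glob("flags/*.png"))
with torch.no_grad():
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    emb = model.encode_image(batch.to(device)).float()
    emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize for cosine similarity

torch.save({"paths": paths, "embeddings": emb.cpu()}, "flag_embeddings.pt")
```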
Video can be extracted directly using the Open GoPro API.
Install CLIP: pip install git+https://github.com/openai/CLIP.git
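With CLIP installed, the zero-shot lookup from the earlier stage can be a plain cosine-similarity ranking against the saved reference embeddings; a hedged example reusing the flag_embeddings.pt layout assumed above:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
ref = torch.load("flag_embeddings.pt")

def top_matches(crop_path, k=5):
    """Rank one flag crop against the reference set, highest similarity first."""
    with torch.no_grad():
        img = preprocess(Image.open(crop_path).convert("RGB")).unsqueeze(0).to(device)
        q = model.encode_image(img).float()
        q = q / q.norm(dim=-1, keepdim=True)
    sims = (q.cpu() @ ref["embeddings"].T).squeeze(0)
    scores, idx = sims.topk(k)
    return [(ref["paths"][i], s.item()) for i, s in zip(idx.tolist(), scores)]
```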