A pipeline combining BLIP ViT (https://huggingface.co/Salesforce/blip-image-captioning-large), a fine-tuned version of SmolLM2 360M (https://huggingface.co/HuggingFaceTB/SmolLM2-360M), and Stable Diffusion v1-4 (https://huggingface.co/CompVis/stable-diffusion-v1-4) for image captioning, Grad-CAM overlays, self-attention visualization, and generating new images from the extracted caption. The app forwards the generated caption into SmolLM2 to explain the words/objects in the image in more detail, and into Stable Diffusion to generate new images, with XAI (Grad-CAM, self-attention) covering the whole process. The fine-tuning notebook provides a simple workflow for fine-tuning the models, with an integrated wandb logger.
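The three stages above can be sketched as follows. The model IDs are the ones linked in this README; the prompt template and function names are illustrative assumptions, not the app's actual API, and the app uses a fine-tuned SmolLM2 checkpoint rather than the base one loaded here.

```python
def build_explain_prompt(caption: str) -> str:
    # Hypothetical prompt template for the SmolLM2 explanation stage.
    return f"Explain the following image caption in more detail: {caption}"


def run_pipeline(image):
    """image (PIL.Image) -> (caption, explanation, newly generated image)."""
    # Heavy dependencies are imported lazily so the helper above stays importable.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BlipForConditionalGeneration,
        BlipProcessor,
    )
    from diffusers import StableDiffusionPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Stage 1: image -> caption (BLIP).
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
    blip = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-large"
    ).to(device)
    inputs = processor(images=image, return_tensors="pt").to(device)
    caption = processor.decode(
        blip.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True
    )

    # Stage 2: caption -> detailed explanation (SmolLM2).
    tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
    lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M").to(device)
    ids = tok(build_explain_prompt(caption), return_tensors="pt").to(device)
    explanation = tok.decode(
        lm.generate(**ids, max_new_tokens=128)[0], skip_special_tokens=True
    )

    # Stage 3: caption -> new image (Stable Diffusion v1-4).
    sd = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
    new_image = sd(caption).images[0]
    return caption, explanation, new_image
```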
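For the Grad-CAM overlay on a ViT-style encoder such as BLIP's, the core computation reduces gradient-weighted activations of a transformer block to a heat map over the patch grid. A minimal, model-agnostic sketch of that reduction (the app's actual hook placement and layer choice may differ):

```python
import numpy as np


def token_grad_cam(activations, gradients, grid_hw):
    """Grad-CAM over ViT patch tokens.

    activations, gradients: (1 + H*W, D) arrays captured from a transformer
    block — forward activations and gradients of the target score w.r.t.
    them — including the leading CLS token. grid_hw: the (H, W) patch grid.
    Returns a heat map in [0, 1] of shape (H, W).
    """
    acts, grads = activations[1:], gradients[1:]  # drop the CLS token
    weights = grads.mean(axis=0)                  # per-channel importance, shape (D,)
    cam = np.maximum(acts @ weights, 0.0)         # ReLU of weighted sum per token
    cam = cam.reshape(grid_hw)
    if cam.max() > 0:
        cam /= cam.max()                          # normalize for overlaying
    return cam


# Example with dummy tensors shaped like a 16x16 patch grid:
acts = np.random.rand(1 + 16 * 16, 768)
grads = np.random.rand(1 + 16 * 16, 768)
heat = token_grad_cam(acts, grads, (16, 16))
```

In the app the resulting heat map would then be upsampled to the input resolution and blended over the original image.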
The app can be tested at https://huggingface.co/spaces/Fine-Tuning-DLSE-Smol2/dlasw-pipeline-deploy?logs=container, or by running it locally inside a Docker container:
```shell
docker run -it -p 7860:7860 --gpus all --platform=linux/amd64 registry.hf.space/fine-tuning-dlse-smol2-dlasw-pipeline-deploy:latest python app.py
```






