# Nano LLaVA OneVision Practice
Hi everyone, this practice guides you through training a LLaVA OneVision model with the ViT-L encoder and the Qwen-2 0.5B model. It has two parts:
- The step-by-step training of your own LLaVA OneVision model.
- The evaluation and inference of the given model.
Alternatively, you can use my completed checkpoint at https://huggingface.co/RuaZhou/Nano_LLaMA_Onevision for inference and evaluation.
The whole training process takes about a day on an RTX 4070 Ti-class graphics card. I modified the original LLaVA-NeXT code slightly so that it can be trained on a 16GB consumer-level card. I ran it on an RTX 5060 Ti.
The training process includes four parts:
- Stage-1: Pre-training of projector (about 21h)
- Stage-1.5: Mid-stage training (about 3h)
- Stage-2: Single image training (about 10h)
- Stage-2: OneVision training (about 6h)
Each part is explained in detail below.
It is strongly recommended to shut down the graphical interface temporarily while training, using the following command:
sudo systemctl isolate multi-user.target
Afterwards, restore the interface with:
sudo systemctl isolate graphical.target
This brings a speedup of nearly 12%.
First, clone this modified repository and navigate to the LLaVA-NeXT folder:
git clone https://github.com/Meur3ault/LLaVA-NeXT.git
cd LLaVA-NeXT
Then install the required packages:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip # Enable PEP 660 support.
pip install -e ".[train]"
pip install -r requirements.txt
If you want to use the flash attention package for evaluation and for a speedup, make sure your CUDA version is >= 12.0 and upgrade your torch version first. Then run:
pip install flash-attn --no-build-isolation
If you want to record training-related data automatically, like the training curves above, register an account at wandb.ai to get an API token, then:
- Set the WANDB_API_KEY environment variable.
export WANDB_API_KEY=<your_api_key>
- Install the wandb library and log in (it is already in the requirements, so you can run wandb login directly).
pip install wandb
wandb login
Finally, because the requirements list is incomplete, you may need to run the following command, or install any other package that is reported as missing:
pip install sqlitedict tenacity Levenshtein pytablewriter
First, prepare the data set for the pre-training stage by downloading the blip_558K data set. Then unzip image.zip and the JSON data, and place them at the image and data paths that LLaVA-NeXT/scripts/train/pretrain_siglip.sh indicates.
Then, from the project root, run:
./scripts/train/pretrain_siglip.sh
If the script is not executable, make it executable:
chmod +x ./scripts/train/pretrain_siglip.sh
The pre-training stage trains the projector to align the vision tokens with the language tokens.
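Before launching a long run, it may be worth checking that every image referenced by the annotation file is actually on disk. The sketch below assumes the annotation file is a JSON list of records whose optional "image" key holds a path relative to the image folder; the function name and both paths are hypothetical and should be replaced with the ones set in pretrain_siglip.sh.

```python
import json
from pathlib import Path


def find_missing_images(json_path, image_root):
    """Return the relative image paths listed in json_path that are absent under image_root.

    Assumption: the file is a JSON list of dicts, and records that carry an
    image store its relative path under the "image" key.
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    root = Path(image_root)
    missing = []
    for record in records:
        rel_path = record.get("image")  # text-only records have no "image" key
        if rel_path is not None and not (root / rel_path).is_file():
            missing.append(rel_path)
    return missing


if __name__ == "__main__":
    # Hypothetical locations; use the paths configured in your training script.
    annotation_file = Path("path/to/annotations.json")
    image_folder = Path("path/to/images")
    if annotation_file.is_file():
        missing = find_missing_images(annotation_file, image_folder)
        print(f"{len(missing)} referenced images are missing")
```

An empty result means the layout matches what the annotation file expects; a long list usually means the archive was unzipped into the wrong folder.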
Download the mid-stage data set synthdog_en here and arrange it in the directory layout that the figure below shows:
Then, from the project root, run the following command to train stage 1.5:
./scripts/train/mid_stage.sh
You can modify the code below. In particular, if you have not installed flash attention 2, change --attn_implementation from flash_attention_2 back to the torch-based sdpa.
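If you are unsure which value to use, a small check like the one below can tell you whether the flash_attn package is importable in the current environment. This is only a sketch: the helper name is hypothetical, and an importable package does not guarantee that the installed build works with your GPU and CUDA version.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, otherwise "sdpa"."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash_attn else "sdpa"


if __name__ == "__main__":
    # Prints the flag value to put into the training script.
    print(f"--attn_implementation {pick_attn_implementation()}")
```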
After finishing stage 1.5 (the mid stage), download the following single-image data sets (images and JSON files) and arrange them as below:
The selected data sets include:
- ai2d_gpt4v
- ai2d_internvl
- ai2d_llava_format
- chart2text
- chartqa
- dvqa
- image_textualization
- infographic_vqa
- infographic_vqa_gpt4v
- infographic_vqa_llava_format
The first 20% of the selected data sets is used, about 73K samples. If you want to change the training content, modify the file LLaVA-NeXT/scripts/train/nano_single_image.yaml.
Its configuration has been modified to fit a 16GB GPU (as have the mid-stage and OneVision training configurations):
- image_aspect_ratio has been reduced. The anyres_max_# setting basically means that the model keeps the 729 base tokens from the image resized to 384×384, plus up to #×729 additional tokens from AnyRes, where every 384×384 crop generates 729 tokens. Once that limit is exceeded, the additional AnyRes tokens are downsampled by bilinear interpolation to the given size.
- The learning rate and batch size have been reduced.
- image_grid_pinpoints has been modified. The model chooses the closest resizing resolution for the given image. For example, for [384, 600], the model will choose [384, 768] to split the image for AnyRes. Here the options are restricted to "[[384,384],[384,768],[768,384]]".
- The tunable parts have been limited to the projector and the LLM backbone.
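To make the two AnyRes settings more concrete, here is a rough sketch of how a pinpoint could be chosen and how many visual tokens result. It is an illustration, not the repository's code: the function names are hypothetical, "closest" is interpreted as the candidate that preserves the most image area with the least unused canvas, the anyres_max value in the usage lines is made up, and newline tokens and unpadding are ignored.

```python
PINPOINTS = [(384, 384), (384, 768), (768, 384)]
TILE = 384
TOKENS_PER_TILE = 729  # 27 x 27 patches for each 384x384 tile


def closest_pinpoint(image_size, pinpoints=PINPOINTS):
    """Pick the candidate resolution that preserves the most image area.

    Ties are broken by the smallest amount of unused canvas.
    """
    width, height = image_size
    best, best_effective, best_wasted = None, -1, float("inf")
    for cand_w, cand_h in pinpoints:
        # Largest uniform scale that still fits the image inside the candidate.
        scale = min(cand_w / width, cand_h / height)
        scaled_w, scaled_h = int(width * scale), int(height * scale)
        # Upscaling adds no detail, so cap at the original area.
        effective = min(scaled_w * scaled_h, width * height)
        wasted = cand_w * cand_h - effective
        if effective > best_effective or (
            effective == best_effective and wasted < best_wasted
        ):
            best, best_effective, best_wasted = (cand_w, cand_h), effective, wasted
    return best


def approx_visual_tokens(resolution, anyres_max):
    """Approximate token count: one base tile plus AnyRes tiles capped at anyres_max."""
    n_tiles = (resolution[0] // TILE) * (resolution[1] // TILE)
    extra_tokens = min(n_tiles, anyres_max) * TOKENS_PER_TILE
    return TOKENS_PER_TILE + extra_tokens


if __name__ == "__main__":
    chosen = closest_pinpoint((384, 600))
    print(chosen)
    # anyres_max=2 is an illustrative value, not taken from the training scripts.
    print(approx_visual_tokens(chosen, anyres_max=2))
```

With the restricted pinpoint list, an image is split into at most two 384×384 crops, which is what keeps the token count small enough for a 16GB card.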
Run the following command to execute the script:
./scripts/train/finetune_si.sh
After finishing the single-image part of stage 2, download the selected multi-image and video data sets from the multi-image data sets and lmms-lab/LLaVA-Video-178K/30_60_s_nextqa, and arrange them as below.
The selected multi-image data sets include:
- DocVQA
- OCR-VQA
- RAVEN_train_images
- Spot-the-Diff
The video data set includes:
- NextQA
In the end, the customized data set includes about 49.5K multi-image samples and about 1.7K video samples. The first 20% of the samples were used, corresponding to 9.9K multi-image samples and 0.34K video samples. If you want to change the training content, modify the file LLaVA-NeXT/scripts/train/nano_onevision.yaml.
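The subset sizes follow directly from taking the first 20% of each list. The snippet below is a minimal sketch of that selection with a hypothetical helper name; it only illustrates the arithmetic and is not the sampling code the training script uses.

```python
def take_first_percent(samples, percent):
    """Return the leading `percent` percent of `samples`, rounded down."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    keep = len(samples) * percent // 100  # integer arithmetic avoids float rounding
    return samples[:keep]


if __name__ == "__main__":
    multi_image = list(range(49_500))  # stand-in for about 49.5K multi-image samples
    video = list(range(1_700))         # stand-in for about 1.7K video samples
    print(len(take_first_percent(multi_image, 20)))
    print(len(take_first_percent(video, 20)))
```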
Besides the modifications made in the other stages, the OneVision stage further reduces image_grid_pinpoints, image_aspect_ratio, and frame_upbound. Meanwhile, model_max_length is increased so that the multi-image tokens can be processed.
Run the following command to complete the training:
./scripts/train/finetune_ov.sh
Install Jupyter if it is not already in the environment:
pip install jupyter
Then run the code in ./Test.ipynb.
The single-image tuned model is much more stable than the OneVision tuned one, which may be due to the data set quality and the coarse configuration in ./scripts/train/finetune_ov.sh.
To evaluate the trained model, you may need a Hugging Face token to log in and download the evaluation data sets:
huggingface-cli login
Installation:
Change to the root directory of the project:
cd ./LLaVA-NeXT
Install lmms-eval for evaluation:
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
Run the following commands for evaluation. If a package import fails, run pip install xxxxxx for the missing package to fix it.
The evaluation uses flash attention 2 by default; if you have not installed it, you may need to modify the evaluation code substantially.
If you want to evaluate the OneVision checkpoint trained by this repository (the huggingface checkpoint is already revised), please replace the config of LLaVA-NeXT/checkpoints/onevision/llava-onevision-google_siglip-so400m-patch14-384-Qwen_Qwen2-0.5B-Instruct-ov_stage_am9/config.json (or the config of your checkpoints) from:
to:
to avoid the zero-division error.
1. Single-image benchmark
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
--model=llava_onevision \
--model_args=pretrained=../checkpoints/onevision/llava-onevision-google_siglip-so400m-patch14-384-Qwen_Qwen2-0.5B-Instruct-ov_stage_am9,conv_template=qwen_1_5,device_map=cuda,model_name=llava_qwen \
--tasks=ai2d,chartqa,docvqa_val,mmmu_pro \
--batch_size=1
You can substitute the model with my already trained checkpoint:
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
--model=llava_onevision \
--model_args=pretrained=RuaZhou/Nano_LLaMA_Onevision,conv_template=qwen_1_5,device_map=cuda,model_name=llava_qwen \
--tasks=ai2d,chartqa,docvqa_val,mmmu_pro \
--batch_size=1
2. Video benchmark
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
--model=llava_onevision \
--model_args=pretrained=../checkpoints/onevision/llava-onevision-google_siglip-so400m-patch14-384-Qwen_Qwen2-0.5B-Instruct-ov_stage_am9,conv_template=qwen_1_5,device_map=cuda,model_name=llava_qwen \
--tasks=seedbench,ocrbench \
--batch_size=1
You can substitute the model with my already trained checkpoint:
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
--model=llava_onevision \
--model_args=pretrained=RuaZhou/Nano_LLaMA_Onevision,conv_template=qwen_1_5,device_map=cuda,model_name=llava_qwen \
--tasks=seedbench,ocrbench \
--batch_size=1
The results are listed below as the original LLaVA OneVision 0.5B score, with ours in parentheses:
Images:
- AI2D : 57.1% (ours is 46.86%)
- ChartQA: 61.4% (ours is 7.88%)
- DocVQA_val: 70.0% (ours is 15.90%)
- SeedBench (image): 65.5% (ours is 49.18%)
Videos:
- SeedBench (video): 44.2% (ours is 42.03%)
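To see at a glance where the nano model falls behind, the gaps can be computed from the numbers above. This is just arithmetic on the listed scores; the dictionary layout and function name are chosen for illustration.

```python
# (original LLaVA OneVision 0.5B score, our score), both in percent
RESULTS = {
    "AI2D": (57.1, 46.86),
    "ChartQA": (61.4, 7.88),
    "DocVQA_val": (70.0, 15.90),
    "SeedBench (image)": (65.5, 49.18),
    "SeedBench (video)": (44.2, 42.03),
}


def score_gaps(results):
    """Return original minus ours, in percentage points, rounded to two decimals."""
    return {name: round(orig - ours, 2) for name, (orig, ours) in results.items()}


if __name__ == "__main__":
    # Print the benchmarks from the largest gap to the smallest.
    for name, gap in sorted(score_gaps(RESULTS).items(), key=lambda item: -item[1]):
        print(f"{name}: {gap} points behind")
```

The largest gaps are on ChartQA and DocVQA_val, while the video SeedBench score is close to the original.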