#Nano LLaVA OneVision Practice

Hi everyone, this practice guides you through training a LLaVA OneVision model with the ViT-L encoder and the Qwen-2 0.5B model. There are two parts:

The step-by-step training of your own LLaVA-onevision model.
The evaluation and inference of the given model.

Or you can just use my completed checkpoint at https://huggingface.co/RuaZhou/Nano_LLaMA_Onevision for inference & evaluation.

The whole training process takes close to a day on an RTX 4070 Ti-level graphics card. I modified the original LLaVA-NeXT code slightly so that it can be trained on a 16GB consumer-level card. I ran it on an RTX 5060 Ti.

1 Training

The training process includes four parts:

Stage-1: Pre-training of projector (about 21h)
Stage-1.5: Mid-stage training (about 3h)
Stage-2: Single image training (about 10h)
Stage-2: OneVision training (about 6h)

Each part is explained in detail below.

It is strongly recommended to shut down the graphical interface temporarily while training, using the following command:

sudo systemctl isolate multi-user.target

Restore the interface afterwards with:

sudo systemctl isolate graphical.target

This brings a speed-up of nearly 12%.

1.1 Installation (temporarily for Linux, not Windows)

First, clone this modified repository and navigate to the LLaVA-NeXT folder:

git clone https://github.com/Meur3ault/LLaVA-NeXT.git
cd LLaVA-NeXT

Then install the required packages:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
pip install -r requirements.txt

If you want to use the flash attention package for evaluation and for speeding things up, make sure your CUDA version is >= 12.0 and upgrade your torch version first. Then run:

pip install flash-attn --no-build-isolation
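Before building flash-attn, it can help to confirm which CUDA version your torch build reports. The following is a minimal sketch, not part of the repository: the helper names cuda_at_least and detected_cuda_version are hypothetical, and the 12.0 threshold is simply the one stated above.

```python
def cuda_at_least(version, minimum=(12, 0)):
    """Return True if a CUDA version string such as "12.1" is >= minimum."""
    if not version:
        # torch.version.cuda is None on CPU-only builds of torch.
        return False
    nums = [int(part) for part in version.split(".")[:2]]
    nums += [0] * (2 - len(nums))  # treat "12" as "12.0"
    return tuple(nums) >= minimum


def detected_cuda_version():
    """Return the CUDA version torch was built with, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    version = detected_cuda_version()
    print("CUDA version reported by torch:", version)
    print("Meets the >= 12.0 requirement of this guide:", cuda_at_least(version))
```

If the check fails, upgrade torch to a build that targets CUDA 12 or later before running the pip command above.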

If you want to record training-related data automatically, such as the training curves above, register an account at wandb.ai to get a token, then:

Set the WANDB_API_KEY environment variable.

     export WANDB_API_KEY=<your_api_key>

Install the wandb library and log in (it is already in the requirements, so you can run wandb login directly).

     pip install wandb
     wandb login

Finally, because the requirements list is incomplete, you may need to run the following command, or whatever install command an import error points you to, to install the missing packages:

pip install sqlitedict tenacity Levenshtein pytablewriter
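To see in advance which of these packages are still missing, a small check like the one below may help. It is only a sketch: the module names are assumed to match the pip package names in the command above, and missing_modules is a hypothetical helper, not something the repository provides.

```python
import importlib.util

# Import names assumed to match the pip packages listed above.
EXTRA_MODULES = ["sqlitedict", "tenacity", "Levenshtein", "pytablewriter"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXTRA_MODULES)
    if missing:
        print("Still missing, install with: pip install " + " ".join(missing))
    else:
        print("All extra packages are importable.")
```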

1.2 Stage-1: Pre-training of projector

First, prepare the data set for the pre-training stage by downloading the blip_558K data set. Then unzip image.zip and the JSON data, and place them at the image and data paths that LLaVA-NeXT/scripts/train/pretrain_siglip.sh indicates.

Then, under the main path, run the command:

./scripts/train/pretrain_siglip.sh

If the script is not executable, make it executable first:

chmod +x ./scripts/train/pretrain_siglip.sh

The pre-training stage trains the projector to align the vision tokens with the language tokens.

1.3 Stage-1.5: Mid-stage training

The mid-stage data set synthdog_en should be downloaded here and arranged in the directory layout shown in the figure below:

Then, from the root path of this project, run the following command to train stage 1.5:

./scripts/train/mid_stage.sh

You can adjust the arguments in this script. In particular, if you have not installed flash attention 2, change --attn_implementation from flash_attention_2 back to the torch-based sdpa.
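If you are unsure which value to use, the sketch below picks one based on whether the flash_attn module can be found in the current environment. The function name pick_attn_implementation is hypothetical; the two values are the ones named above, and you still have to edit the script yourself.

```python
import importlib.util


def pick_attn_implementation():
    """Return the --attn_implementation value suited to this environment."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash else "sdpa"


if __name__ == "__main__":
    # Print the argument to put into ./scripts/train/mid_stage.sh
    print("--attn_implementation", pick_attn_implementation())
```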

1.4 Stage-2: Single image training

After finishing stage 1.5 (the mid stage), download the following single-image data sets (images and JSON files) and arrange them as below:

The selected data sets include:

ai2d_gpt4v
ai2d_internvl
ai2d_llava_format
chart2text
chartqa
dvqa
image_textualization
infographic_vqa
infographic_vqa_gpt4v
infographic_vqa_llava_format

Only the first 20% of the selected data sets is used, about 73K samples. If you want to change the training content, modify the file LLaVA-NeXT/scripts/train/nano_single_image.yaml.
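As an illustration of what "the first 20%" means for a list of samples, here is a minimal sketch. The actual sampling is configured in the yaml file above; take_first_fraction is a hypothetical helper, and rounding down is an assumption about how the cut-off is computed.

```python
def take_first_fraction(samples, fraction):
    """Keep the leading `fraction` of a list, rounding the count down."""
    count = int(len(samples) * fraction)
    return samples[:count]


if __name__ == "__main__":
    # Toy stand-in for the entries of one data set's JSON file.
    toy_samples = [{"id": i} for i in range(10)]
    kept = take_first_fraction(toy_samples, 0.2)
    print(len(kept), "of", len(toy_samples), "samples kept")
```

Because the selection takes the leading entries rather than a random subset, the result depends on the order of the samples in each JSON file.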

Its configuration has been modified to fit a 16GB GPU (as have the mid-stage and OneVision training configs):

image_aspect_ratio has been reduced. The setting anyres_max_# allows 729 base tokens from the image resized to 384×384, plus up to #×729 additional tokens from AnyRes, where every 384×384 split produces 729 tokens. Once that limit is exceeded, the additional AnyRes tokens are downsampled by bilinear interpolation until they fit the given size.
The learning rate and batch size have been reduced.
image_grid_pinpoints has been modified. It lists the resolutions the model can resize to, and the model chooses the closest one for the given image. For example, for [384, 600], the model chooses [384, 768] to split the image for AnyRes. Here the options are restricted to "[[384,384],[384,768],[768,384]]".
The tunable parts are limited to the projector and the LLM backbone.
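The two AnyRes settings above can be made concrete with a little arithmetic. The sketch below is an approximation under stated assumptions, not the repository's code: the resolution choice follows a "most effective pixels, least wasted pixels" heuristic that I assume is close to what the repository does, the token count ignores any separator tokens or unpadding, and the anyres_max values in the usage lines are illustrative only.

```python
PATCH_SIZE = 14   # from the encoder name siglip-so400m-patch14-384
TILE_SIZE = 384   # side length of one AnyRes split
TOKENS_PER_TILE = (TILE_SIZE // PATCH_SIZE) ** 2  # 27 * 27 = 729

# The restricted options quoted above.
PINPOINTS = [(384, 384), (384, 768), (768, 384)]


def select_best_resolution(original_size, pinpoints):
    """Pick the pinpoint that keeps the most image pixels, then wastes the least."""
    orig_w, orig_h = original_size
    best, best_effective, best_wasted = None, 0, float("inf")
    for width, height in pinpoints:
        scale = min(width / orig_w, height / orig_h)
        scaled_w, scaled_h = int(orig_w * scale), int(orig_h * scale)
        effective = min(scaled_w * scaled_h, orig_w * orig_h)
        wasted = width * height - effective
        if effective > best_effective or (
            effective == best_effective and wasted < best_wasted
        ):
            best, best_effective, best_wasted = (width, height), effective, wasted
    return best


def anyres_token_budget(grid, anyres_max):
    """Return (total visual tokens, whether AnyRes tokens get downsampled)."""
    tiles = (grid[0] // TILE_SIZE) * (grid[1] // TILE_SIZE)
    raw_extra = tiles * TOKENS_PER_TILE      # tokens from the splits
    cap = anyres_max * TOKENS_PER_TILE       # limit set by anyres_max_#
    return TOKENS_PER_TILE + min(raw_extra, cap), raw_extra > cap


if __name__ == "__main__":
    grid = select_best_resolution((384, 600), PINPOINTS)
    print("chosen grid:", grid)
    print("budget with anyres_max_2:", anyres_token_budget(grid, 2))
    print("budget with anyres_max_1:", anyres_token_budget(grid, 1))
```

For the [384, 600] example, the [384, 768] grid gives two splits, so a cap of 2 leaves the AnyRes tokens untouched, while a cap of 1 would force them to be downsampled.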

Run the script with the following command:

./scripts/train/finetune_si.sh

1.5 Stage-2: OneVision training

After finishing stage 2 on single images, download the selected multi-image sets and the video set from the multi-image data sets and lmms-lab/LLaVA-Video-178K/30_60_s_nextqa. They should be arranged as below.

The selected multi-image data sets include:

DocVQA
OCR-VQA
RAVEN_train_images
Spot-the-Diff

The video data set includes:

NextQA

Finally, the customized data set includes about 49.5K multi-image samples and about 1.7K video samples. The first 20% of the samples were used, corresponding to 9.9K multi-image samples and 0.34K video samples. If you want to change the training content, modify the file LLaVA-NeXT/scripts/train/nano_onevision.yaml.

Besides the modifications made in the other stages, the OneVision stage further reduces image_grid_pinpoints, image_aspect_ratio, and frame_upbound. Meanwhile, model_max_length is increased so that multi-image token sequences can be processed.

Run the following command to complete the training:

./scripts/train/finetune_ov.sh

2 Inference & Evaluation

2.1 Inference for images and videos

Install Jupyter if the environment does not have it:

pip install jupyter

Then run the code in ./Test.ipynb.

The single-image tuned model is much more stable than the OneVision tuned one, which may be attributable to the data set quality and the coarse config in ./scripts/train/finetune_ov.sh.

2.2 Evaluation

To evaluate the trained model, you may need a Hugging Face token to log in and download the evaluation data sets.

huggingface-cli login

Installation:

Change to the root directory of the project:

cd ./LLaVA-NeXT

Install lmms-eval for evaluation:

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval

cd lmms-eval

Run the commands below for evaluation. If a package import fails, run pip install for the missing package to fix it.

The evaluation uses flash attention 2 by default. If you have not installed it, you may need to modify the evaluation code considerably.

If you want to evaluate the OneVision checkpoint trained by this repository (the huggingface checkpoint is already revised), please replace the config of LLaVA-NeXT/checkpoints/onevision/llava-onevision-google_siglip-so400m-patch14-384-Qwen_Qwen2-0.5B-Instruct-ov_stage_am9/config.json (or the config of your checkpoints) from:

to:

to avoid the zero-division error.

1. Single-image benchmark

accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=../checkpoints/onevision/llava-onevision-google_siglip-so400m-patch14-384-Qwen_Qwen2-0.5B-Instruct-ov_stage_am9,conv_template=qwen_1_5,device_map=cuda,model_name=llava_qwen \
    --tasks=ai2d,chartqa,docvqa_val,mmmu_pro \
    --batch_size=1

You can substitute the model with my already trained checkpoint:

accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=RuaZhou/Nano_LLaMA_Onevision,conv_template=qwen_1_5,device_map=cuda,model_name=llava_qwen \
    --tasks=ai2d,chartqa,docvqa_val,mmmu_pro \
    --batch_size=1

2. Video benchmark

accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=../checkpoints/onevision/llava-onevision-google_siglip-so400m-patch14-384-Qwen_Qwen2-0.5B-Instruct-ov_stage_am9,conv_template=qwen_1_5,device_map=cuda,model_name=llava_qwen \
    --tasks=seedbench,ocrbench \
    --batch_size=1

You can substitute the model with my already trained checkpoint:

accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=RuaZhou/Nano_LLaMA_Onevision,conv_template=qwen_1_5,device_map=cuda,model_name=llava_qwen \
    --tasks=seedbench,ocrbench \
    --batch_size=1

The result is

For comparison, the original results of LLaVA OneVision 0.5B are:

Images:

AI2D: 57.1% (ours is 46.86%)
ChartQA: 61.4% (ours is 7.88%)
DocVQA_val: 70.0% (ours is 15.90%)
SeedBench (image): 65.5% (ours is 49.18%)

Videos:

SeedBench (video): 44.2% (ours is 42.03%)
