OOM experiments
| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | none | 16 | 78783.5 | mistralai/Mixtral-8x7B-Instruct-v0.1 | 8 | 1 | bfloat16 | | | | | |
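A back-of-envelope estimate suggests why this Mixtral run hits OOM even sharded across 8 GPUs. This is a hedged sketch, not a measurement: it assumes Mixtral-8x7B's commonly cited ~46.7B total parameters and the standard mixed-precision Adam accounting of ~16 bytes/param (bf16 weights + bf16 grads + fp32 master weights + fp32 momentum + fp32 variance), and ignores activations entirely.

```python
# Rough mixed-precision Adam memory accounting for Mixtral-8x7B.
# PARAMS is an assumption (~46.7B total parameters); the 16 bytes/param
# figure is the standard estimate: 2 (bf16 weights) + 2 (bf16 grads)
# + 4 (fp32 master weights) + 4 (fp32 momentum) + 4 (fp32 variance).
PARAMS = 46.7e9
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16

total_gib = PARAMS * BYTES_PER_PARAM / 2**30
per_gpu_gib = total_gib / 8  # best case: states fully sharded across 8 GPUs

print(f"{total_gib:.0f} GiB total, {per_gpu_gib:.0f} GiB per GPU")
```

Even under full sharding, the per-GPU model-state footprint alone lands close to ~87 GiB, above an 80 GB device, before any activation memory — consistent with the ~78.8 GB reserved at the point of failure.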
Failed experiments - number of GPUs not divisible by EP degree
| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | moe-scattermoe-granite-ep2 | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free-foak | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free-foak | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 16 | 0 | ibm-granite/granite-3.0-3b-a800m-instruct | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2 | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep2-padding-free | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 8 | 2276 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 8 | 2276 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 8 | 2277 | ibm-granite/granite-3.0-3b-a800m-instruct | 2 | 8 | bfloat16 | | | | | |
Failed experiments - number of experts not divisible by EP degree
| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | moe-scattermoe-granite-ep4 | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 8 | 2276 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4 | 4 | 2564 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 8 | 2276 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free | 4 | 2564 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 16 | 0 | ibm-research/moe-7b-1b-active-shared-experts | 1 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 8 | 2277 | ibm-research/moe-7b-1b-active-shared-experts | 2 | 8 | bfloat16 | | | | | |
| | moe-scattermoe-granite-ep4-padding-free-foak | 4 | 2564.5 | ibm-research/moe-7b-1b-active-shared-experts | 4 | 8 | bfloat16 | | | | | |
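The two failure modes above come down to the same sharding constraint: the expert-parallel (EP) degree must evenly divide both the world size and the model's expert count, or the experts cannot be partitioned across ranks. A minimal sketch of that validity check (function and parameter names are illustrative, not the actual fms-acceleration API):

```python
def check_ep_config(num_gpus: int, num_experts: int, ep_degree: int) -> bool:
    """Return True iff experts can be sharded evenly at this EP degree.

    Hypothetical helper illustrating the two divisibility constraints
    behind the failed runs above.
    """
    # First failure mode: ep_degree=4 on 1 or 2 GPUs fails.
    if num_gpus % ep_degree != 0:
        return False
    # Second failure mode: an expert count not divisible by ep_degree fails
    # (e.g. 10 experts with ep_degree=4), regardless of GPU count.
    if num_experts % ep_degree != 0:
        return False
    return True


# Mirrors the tables: ep4 on 2 GPUs is rejected, ep2 on 8 GPUs is fine.
print(check_ep_config(num_gpus=2, num_experts=8, ep_degree=4))   # False
print(check_ep_config(num_gpus=8, num_experts=8, ep_degree=2))   # True
```

Running such a check before launch would turn these silent failures (note the empty metric columns) into an immediate, descriptive configuration error.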
Delta with previous experiments on OOM
| epoch | framework_config | gradient_accumulation_steps | mem_nvidia_mem_reserved | model_name_or_path | num_gpus | per_device_train_batch_size | torch_dtype | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | train_tokens_per_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | none | 16 | 78783.5 | mistralai/Mixtral-8x7B-Instruct-v0.1 | 8 | 1 | bfloat16 | | | | | |
Regression testing is done as part of PR #126. The change in metrics may be attributable to transformers==4.49 but needs further investigation.