Liger kernel breaks fine-tuning #5542
Comments
I encounter the same issue when using DPO to fine-tune qwen2-vl.
I also encounter the same issue. It seems to be caused by `enable_liger_kernel: true`, which I set to reduce the memory footprint (see the sketch after these comments for why the fused loss saves memory but drops the logits).
Fixed.
How do I fix it?
```
* 'main' of github.com:hurongliang/LLaMA-Factory: (61 commits)
  update wechat
  fix hiyouga#5542
  add patch processor func
  lint
  Update constants.py
  Update template.py
  fix chat template Exaone3.0
  Update README_zh.md
  Update README.md
  update docs
  Support model Exaone3.0
  add Exaone3.0 template
  Update common.py
  Update README_zh.md
  Update README.md
  Update README.md
  Update constants.py
  Update test_mm_plugin.py
  fix template
  fix template
  fix constants
  ...
```
Not yet; the latest code (downloaded at 12.25) still has the same issue.
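For background on the memory question raised above: Liger's main saving for language-model losses comes from fusing the lm_head projection with the cross-entropy loss, so the full [tokens, vocab_size] logits tensor is never materialized. The sketch below is plain PyTorch with made-up shapes, not Liger's actual Triton kernel; it only illustrates the chunking idea and why an output produced this way can carry a loss while its `logits` field is `None`.

```python
# Conceptual sketch only: plain PyTorch with invented shapes, not Liger's Triton kernel.
import torch
import torch.nn.functional as F

hidden = torch.randn(8, 64)              # [tokens, hidden_size] from the last decoder layer
lm_head = torch.randn(32000, 64)         # [vocab_size, hidden_size] projection weight
labels = torch.randint(0, 32000, (8,))   # target token ids

# Unfused path: the full [tokens, vocab_size] logits tensor is materialized.
logits = hidden @ lm_head.t()
loss_unfused = F.cross_entropy(logits, labels)

# Fused/chunked idea: accumulate the loss per chunk so the full logits never exist at once.
loss_sum = torch.zeros(())
for h_chunk, y_chunk in zip(hidden.split(2), labels.split(2)):
    loss_sum = loss_sum + F.cross_entropy(h_chunk @ lm_head.t(), y_chunk, reduction="sum")
loss_chunked = loss_sum / labels.numel()

print(torch.allclose(loss_unfused, loss_chunked))  # True: same loss, but no full logits to return
```

DPO, however, needs the per-token logits to compute log-probabilities for the chosen and rejected responses, which appears to be why the fused loss path and the DPO trainer collide here.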
Reminder
System Info
LLaMA Factory, version 0.9.1.dev0
liger_kernel 0.3.0
transformers 4.45.0.dev0
Reproduction
```
llamafactory-cli train ./examples/train_lora/qwen2vl_loraplus_dpo_2b_20_09.yaml
```

Contents of `qwen2vl_loraplus_dpo_2b_20_09.yaml`:

```yaml
### model
model_name_or_path: Qwen/Qwen2-VL-2B-Instruct

### method
stage: dpo
do_train: true
finetuning_type: lora
lora_target: all
pref_beta: 0.3
pref_loss: sigmoid

### dataset
dataset: obrazy_rlhf_v__proba
buffer_size: 1
preprocessing_batch_size: 1
streaming: true
val_size: 260
#accelerator_config:
dispatch_batches: false
template: qwen2_vl
cutoff_len: 2748
#max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 1

### output
output_dir: saves/qwen2_vl-2b_loraplus/25v1_beta0_5_orig
logging_steps: 500
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_checkpointing: true
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 3.0
flash_attn: auto
lr_scheduler_type: cosine
max_grad_norm: 1.0
loraplus_lr_ratio: 16.0
enable_liger_kernel: true
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
max_steps: 2200

### eval
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 200
```
Expected behavior
Unfortunately, running the training with the Liger kernel enabled fails with the following error:
[rank0]: AttributeError: 'NoneType' object has no attribute 'to'
My environment:
liger_kernel 0.3.0
llamafactory 0.9.1.dev0
transformers 4.45.0.dev0
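To make the failure concrete before the full log, here is a minimal stand-alone sketch; the `FakeOutput` stub is hypothetical, not LLaMA-Factory or Liger code, and only mimics what a fused-loss forward pass appears to return (a loss but `logits=None`), which is exactly what the cast in `concatenated_forward` trips over.

```python
# Hypothetical stub reproducing the crash pattern; not LLaMA-Factory's or Liger's actual classes.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class FakeOutput:
    loss: Optional[torch.Tensor] = None
    logits: Optional[torch.Tensor] = None

# What a fused-loss forward pass effectively hands back: a loss, but no logits.
out = FakeOutput(loss=torch.tensor(0.7), logits=None)

# The failing call from src/llamafactory/train/dpo/trainer.py (line 182 in the trace below):
try:
    all_logits = out.logits.to(torch.float32)
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'to'

# A defensive check would surface the root cause instead of a bare AttributeError:
if out.logits is None:
    print("No logits returned; for DPO, set enable_liger_kernel: false or use a build with the #5542 fix.")
```

The full training log follows.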
09/25/2024 12:07:58 - INFO - llamafactory.model.model_utils.liger_kernel - Liger kernel has been applied to the model.
09/25/2024 12:07:58 - INFO - llamafactory.model.model_utils.liger_kernel - Liger kernel has been applied to the model.
[INFO|modeling_utils.py:3702] 2024-09-25 12:07:58,644 >> loading weights file model.safetensors from cache at /home/python/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct/snapshots/aca78372505e6cb469c4fa6a35c60265b00ff5a4/model.safetensors.index.json
[INFO|modeling_utils.py:1621] 2024-09-25 12:07:58,653 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1097] 2024-09-25 12:07:58,654 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}
[WARNING|logging.py:328] 2024-09-25 12:07:58,688 >> Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46
Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.88s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.88s/it]
[INFO|modeling_utils.py:4544] 2024-09-25 12:08:10,541 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.
[INFO|modeling_utils.py:4552] 2024-09-25 12:08:10,541 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at Qwen/Qwen2-VL-2B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1052] 2024-09-25 12:08:10,685 >> loading configuration file generation_config.json from cache at /home/python/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct/snapshots/aca78372505e6cb469c4fa6a35c60265b00ff5a4/generation_config.json
[INFO|configuration_utils.py:1097] 2024-09-25 12:08:10,685 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"temperature": 0.01,
"top_k": 1,
"top_p": 0.001
}
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.misc - Found linear modules: o_proj,down_proj,q_proj,k_proj,gate_proj,up_proj,v_proj
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.misc - Found linear modules: q_proj,v_proj,o_proj,gate_proj,down_proj,k_proj,up_proj
09/25/2024 12:08:11 - INFO - llamafactory.model.loader - trainable params: 9,232,384 || all params: 2,218,217,984 || trainable%: 0.4162
09/25/2024 12:08:11 - INFO - llamafactory.model.loader - trainable params: 9,232,384 || all params: 2,218,217,984 || trainable%: 0.4162
max_steps is given, it will override any value given in num_train_epochs
[WARNING|trainer.py:617] 2024-09-25 12:08:11,039 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:667] 2024-09-25 12:08:11,039 >> Using auto half precision backend
09/25/2024 12:08:11 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00.
09/25/2024 12:08:11 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00.
[INFO|trainer.py:2212] 2024-09-25 12:08:13,575 >> ***** Running training *****
[INFO|trainer.py:2213] 2024-09-25 12:08:13,575 >> Num examples = 4,400
[INFO|trainer.py:2214] 2024-09-25 12:08:13,575 >> Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:2215] 2024-09-25 12:08:13,575 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2218] 2024-09-25 12:08:13,575 >> Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:2219] 2024-09-25 12:08:13,575 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2220] 2024-09-25 12:08:13,575 >> Total optimization steps = 2,200
[INFO|trainer.py:2221] 2024-09-25 12:08:13,578 >> Number of trainable parameters = 9,232,384
0%| | 0/2200 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank0]: launch()
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 56, in run_exp
[rank0]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 81, in run_dpo
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2021, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2357, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 3454, in training_step
[rank0]: loss = self.compute_loss(model, inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/env/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1408, in compute_loss
[rank0]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 232, in get_batch_loss_metrics
[rank0]: ) = self.concatenated_forward(model, batch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 182, in concatenated_forward
[rank0]: all_logits: "torch.Tensor" = model(**batch, return_dict=True, use_cache=False).logits.to(torch.float32)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'NoneType' object has no attribute 'to'
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank1]: launch()
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 56, in run_exp
[rank1]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 81, in run_dpo
[rank1]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2021, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2357, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 3454, in training_step
[rank1]: loss = self.compute_loss(model, inputs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/env/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1408, in compute_loss
[rank1]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 232, in get_batch_loss_metrics
[rank1]: ) = self.concatenated_forward(model, batch)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 182, in concatenated_forward
[rank1]: all_logits: "torch.Tensor" = model(**batch, return_dict=True, use_cache=False).logits.to(torch.float32)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'to'
0%| | 0/2200 [00:13<?, ?it/s]
E0925 12:08:30.915000 140353497219136 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3061541) of binary: /home/python/factory/env/bin/python3
Traceback (most recent call last):
File "/home/python/factory/env/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
Others
No response