Evaluation fails when fine-tuning llama3 with eval_dataset specified and predict_with_generate enabled #5292
Labels
solved
This problem has been already solved
Comments
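I tried a qwen2 model and got the same error. With the same models, fine-tuning with LoRA or QLoRA runs normally, and full fine-tuning of the Bloomz series models also runs normally.
After testing, this error is caused by using DeepSpeed ZeRO-3: in that mode, evaluation with predict_with_generate=True raises the error.
DeepSpeed ZeRO-3 is not supported.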
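The combination identified above (a DeepSpeed ZeRO-3 config together with predict_with_generate) can be caught before launching a run. The following is a minimal sketch, not part of LLaMA-Factory: the helper names `zero_stage` and `generation_eval_is_risky` and the `risky_stages` parameter are hypothetical, and it assumes the training arguments have already been loaded into a plain dict and that the DeepSpeed config declares its stage under `zero_optimization.stage`.

```python
import json


def zero_stage(ds_config_path):
    """Return the ZeRO stage declared in a DeepSpeed JSON config (0 if absent)."""
    with open(ds_config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return int(config.get("zero_optimization", {}).get("stage", 0))


def generation_eval_is_risky(train_args, risky_stages=(3,)):
    """Hypothetical pre-flight check (an assumption, not a LLaMA-Factory API).

    Returns True when the settings match the failing case reported in this
    thread: a DeepSpeed config with a ZeRO stage in `risky_stages` combined
    with predict_with_generate.
    """
    ds_path = train_args.get("deepspeed")
    if not ds_path or not train_args.get("predict_with_generate", False):
        return False
    return zero_stage(ds_path) in risky_stages
```

For the config in this issue, `generation_eval_is_risky({"deepspeed": "examples/deepspeed/ds_z3_config.json", "predict_with_generate": True})` would return True, provided that file declares stage 3. A later comment in this thread suggests stage 2 may be affected as well, so passing `risky_stages=(2, 3)` is a conservative choice.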
hiyouga
added
solved
This problem has been already solved
and removed
pending
This problem is yet to be addressed
labels
Aug 29, 2024
Is it that the llama and qwen models themselves do not support this? Some models, such as the Bloomz series, run through successfully.
yuwangnexusera
pushed a commit
to yuwangnexusera/LLaMA-Factory
that referenced
this issue
Sep 5, 2024
DeepSpeed ZeRO-2 does not seem to be supported either.
Reminder
System Info
#### Training parameters are as follows
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
dataset: identity,alpaca_en_demo
eval_dataset: identity
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/llama3-8b/full/sft
report_to: tensorboard
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
### eval
do_eval: true
predict_with_generate: true
#val_size: 0.1
per_device_eval_batch_size: 1
#eval_strategy: steps
#eval_steps: 500
Reproduction
The error message is as follows:
***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-28 08:26:53,453 >> Num examples = 91
[INFO|trainer.py:3824] 2024-08-28 08:26:53,453 >> Batch size = 1
[rank2]: Traceback (most recent call last):
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 28, in <module>
[rank2]: main()
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 19, in main
[rank2]: run_exp()
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank2]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 107, in run_sft
[rank2]: metrics = trainer.evaluate(metric_key_prefix="eval", **gen_kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank2]: return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3666, in evaluate
[rank2]: output = eval_loop(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3857, in evaluation_loop
[rank2]: losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 99, in prediction_step
[rank2]: loss, generated_tokens, _ = super().prediction_step( # ignore the returned labels (may be truncated)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 310, in prediction_step
[rank2]: generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1989, in generate
[rank2]: result = self._sample(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2932, in _sample
[rank2]: outputs = self(**model_inputs, return_dict=True)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
[rank2]: outputs = self.model(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 944, in forward
[rank2]: layer_outputs = decoder_layer(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
[rank2]: hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 603, in forward
[rank2]: attn_output = torch.nn.functional.scaled_dot_product_attention(
[rank2]: RuntimeError: The expanded size of the tensor (32) must match the existing size (31) at non-singleton dimension 3. Target sizes: [1, 32, 1, 32]. Tensor sizes: [1, 1, 1, 31]
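To read the final RuntimeError: it reports that a tensor of shape [1, 1, 1, 31] cannot be expanded to [1, 32, 1, 32]. Expansion only succeeds when, in every dimension, the existing size equals the target size or is 1. The sketch below mimics that rule in plain Python; the function names are hypothetical, and the interpretation of the shapes (the smaller tensor presumably being the attention mask, one position shorter than the key length) is an assumption drawn from where the error was raised, not something the traceback states.

```python
def first_mismatch(existing, target):
    """Return the index of the first dimension that cannot be expanded, or None.

    Shapes are aligned from the last dimension; an existing size must equal
    the target size or be 1 (a singleton) to be expandable.
    """
    if len(existing) > len(target):
        raise ValueError("existing shape has more dimensions than target")
    # Left-pad the existing shape with singleton dimensions.
    padded = [1] * (len(target) - len(existing)) + list(existing)
    for dim, (have, want) in enumerate(zip(padded, target)):
        if have != want and have != 1:
            return dim
    return None


def can_expand(existing, target):
    """True when `existing` can be expanded to `target` under the rule above."""
    return first_mismatch(existing, target) is None


target_sizes = [1, 32, 1, 32]  # "Target sizes" from the error message
tensor_sizes = [1, 1, 1, 31]   # "Tensor sizes" from the error message

print(can_expand(tensor_sizes, target_sizes))      # False
print(first_mismatch(tensor_sizes, target_sizes))  # 3
print(can_expand([1, 1, 1, 32], target_sizes))     # True
```

The mismatch lands on dimension 3, matching "non-singleton dimension 3" in the message: the last dimension holds 31 entries where 32 are expected.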
Expected behavior
Command used to run:
CUDA_VISIBLE_DEVICES=0,1,2,3 FORCE_TORCHRUN=1 torchrun --nnodes 1 --node_rank 0 --nproc_per_node 4 src/train.py examples/demo/llama3_full_sft_ds3.yaml
Others
No response