Abnormal learning rate updates in the PPO stage #424
Comments
@hiyouga When you tested on a single GPU, did you launch directly with python? I also tested on a single GPU, but launched through accelerate (`export CUDA_VISIBLE_DEVICES=0` followed by `accelerate launch`), and the up-and-down fluctuation still appeared. I wonder whether accelerate is the cause.
OK, I will test it again.
Hi, @hannlp, there are many reasons why the learning rate fluctuates. The main reason is
After
where,
If you want to change
The original intention of gradient accumulation is to simulate a larger batch size under limited GPU memory by accumulating gradients over several forward/backward passes. For example, if a GPU holds `per_device_train_batch_size` samples at once and gradient accumulation is set to `gradient_accumulation_steps`, this simulates a batch size of `per_device_train_batch_size * gradient_accumulation_steps`. In the current code, however, the `batch_size` in PPOConfig is set to `training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps`, which means a single GPU directly holds that many samples. This doesn't seem to align with the original purpose of gradient accumulation. It feels like the more changes are made to TRL, the more problems arise. @mmbwf
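For concreteness, here is a small sketch of how those values relate, assuming TRL's `PPOConfig` fields (`batch_size`, `mini_batch_size`, `gradient_accumulation_steps`); the concrete numbers are illustrative and not taken from the repository's defaults:

```python
from trl import PPOConfig

per_device_train_batch_size = 4
gradient_accumulation_steps = 16

ppo_config = PPOConfig(
    learning_rate=1e-5,
    # one PPO rollout batch held by a single GPU: 4 * 16 = 64 samples
    batch_size=per_device_train_batch_size * gradient_accumulation_steps,
    # each optimizer micro-step only processes this many samples
    mini_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
)
```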
@hannlp TRL uses
Hi, @hannlp, @hiyouga is right. The following is a quote from the original author's explanation; I only made some adjustments to align with the current variable names. For further details, please refer to this PR huggingface/trl#546 (comment).
@mmbwf @hiyouga Thank you for your answers! But I have another question: in my environment (A100 40G), why does per_device_train_batch_size=8, gradient_accumulation_steps=2 not cause OOM, while per_device_train_batch_size=4, gradient_accumulation_steps=16 does? Logically, shouldn't the number of samples on a GPU be determined by per_device_train_batch_size? This was also the cause of my earlier question. Also, under the current setup, how can we simulate a larger batch size?
Hi, @hannlp, the cause of the OOM is not the PPO training stage, but the stage where the model generates responses and rewards. In that inference stage the whole rollout batch, i.e. `per_device_train_batch_size * gradient_accumulation_steps` samples, is processed at once, so memory cannot support a large batch for inference. If you want to increase the batch size, you can split the large batch into mini-batches for inference and then merge the results. The following is a rough implementation, just for reference:
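(The original snippet is not preserved in this thread; what follows is a minimal sketch of the same idea. The helper name `generate_in_mini_batches` and the use of `tokenizer`/`model.generate` here are illustrative assumptions.)

```python
import torch

def generate_in_mini_batches(model, tokenizer, prompts, mini_batch_size=4, **gen_kwargs):
    """Split a large batch of prompts into mini-batches for generation and
    merge the decoded responses into a single list (order preserved)."""
    responses = []
    for i in range(0, len(prompts), mini_batch_size):
        chunk = prompts[i : i + mini_batch_size]
        inputs = tokenizer(chunk, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
        # keep only the newly generated tokens, dropping the prompt part
        generated = outputs[:, inputs["input_ids"].shape[1]:]
        responses.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return responses
```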
@mmbwf You're absolutely right. Thank you so much for your patient response and code examples, and all the best to you!
@hiyouga Hi, after the code update, in the PPO stage the multi-GPU learning rate follows a full cosine cycle instead of a quarter cycle.

@luyuntao92 In principle it cannot be changed that way, since it would break single-GPU training. Please try accelerate rather than DeepSpeed.
With --lr_scheduler_type cosine in stage 3, the learning rate changed during training as follows:
However, the default num_cycles is 0.5, so the curve should not oscillate up and down; the figure below shows what I would consider a normal curve (a small scheduler sketch also follows the hyperparameters below).
Where could this bug come from?
Some hyperparameters:
--per_device_train_batch_size 1
--gradient_accumulation_steps 16
export CUDA_VISIBLE_DEVICES=0,1,2,3
accelerate launch
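For reference, a minimal standalone sketch (a toy `nn.Linear` model and plain AdamW, purely illustrative) of how the Transformers cosine schedule behaves with the default num_cycles=0.5: the learning rate should decay monotonically from its peak toward 0, with no up-and-down oscillation.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=100, num_cycles=0.5
)

lrs = []
for _ in range(100):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

# the schedule is non-increasing: num_cycles=0.5 alone cannot produce fluctuation
assert all(a >= b for a, b in zip(lrs, lrs[1:]))
```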