
Abnormal learning rate updates in the PPO stage #424

Closed

hannlp opened this issue Aug 9, 2023 · 15 comments
Labels
solved This problem has been already solved

Comments

@hannlp
Contributor

hannlp commented Aug 9, 2023

In stage 3 (PPO) I use --lr_scheduler_type cosine, and the learning rate changes during training as follows:

[screenshot: learning rate curve oscillating up and down over training steps]

However, the default num_cycles is 0.5, so the curve should not oscillate. The figure below shows the curve I would consider normal:

[screenshot: smoothly decaying cosine learning rate curve]

Where might this bug come from?

Some hyperparameters:
--per_device_train_batch_size 1
--gradient_accumulation_steps 16
export CUDA_VISIBLE_DEVICES=0,1,2,3
accelerate launch

@hiyouga
Owner

hiyouga commented Aug 10, 2023

My single-GPU test looks normal, so I suspect it is a multi-GPU issue.

--per_device_train_batch_size 1
--gradient_accumulation_steps 4

[screenshot: smoothly decaying learning rate curve from the single-GPU test]

@hannlp
Contributor Author

hannlp commented Aug 11, 2023

@hiyouga Did you launch the single-GPU test directly with python? I also tested on a single GPU, but launched it via accelerate (export CUDA_VISIBLE_DEVICES=0 accelerate launch), and the oscillation still appeared. I wonder whether accelerate is the cause.

@hiyouga
Owner

hiyouga commented Aug 11, 2023

OK, I will test it again.

@mmbwf
Contributor

mmbwf commented Sep 27, 2023

Hi @hannlp, there are several reasons why the learning rate can fluctuate. The main one is that AcceleratedScheduler defaults to step_with_optimizer=True and split_batches=False, so the lr_scheduler steps together with the optimizer and performs num_processes steps per training step.

def step(self, *args, **kwargs):
    if not self.step_with_optimizer:
        # No link between scheduler and optimizer -> just step
        self.scheduler.step(*args, **kwargs)
        return

    # Otherwise, first make sure the optimizer was stepped.
    if not self.gradient_state.sync_gradients:
        if self.gradient_state.adjust_scheduler:
            self.scheduler._step_count += 1
        return

    for opt in self.optimizers:
        if opt.step_was_skipped:
            return
    if self.split_batches:
        # Split batches -> the training dataloader batch size is not changed so one step per training step
        self.scheduler.step(*args, **kwargs)
    else:
        # Otherwise the training dataloader batch size was multiplied by `num_processes`, so we need to do
        # num_processes steps per training step
        num_processes = AcceleratorState().num_processes
        for _ in range(num_processes):
            # Special case when using OneCycle and `drop_last` was not used
            if hasattr(self.scheduler, "total_steps"):
                if self.scheduler._step_count <= self.scheduler.total_steps:
                    self.scheduler.step(*args, **kwargs)
            else:
                self.scheduler.step(*args, **kwargs)
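To see why the curve in the first screenshot bounces back up, here is a minimal sketch (my own simplified cosine-with-warmup lambda written for illustration, not code from this repo): once the scheduler's step count runs past num_training_steps, which is exactly what happens when it is stepped num_processes times per optimizer step, the cosine keeps going and the learning rate climbs again instead of staying near zero.

import math

# Simplified cosine-with-warmup lambda with num_cycles=0.5; illustrative only.
def cosine_lr(current_step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    progress = (current_step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))

# Hypothetical numbers: the schedule was planned for 100 steps, but with 4 GPUs the
# scheduler is stepped 4 times per optimizer step and ends up taking 400 steps.
for step in (50, 100, 150, 200, 300, 400):
    print(step, round(cosine_lr(step, 0, 100), 3))
# 50 -> 0.5, 100 -> 0.0, 150 -> 0.5, 200 -> 1.0, ... the lr rises again, producing the waves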

Since trl==0.5.0, the whole PPO step logic has been refactored and several new batch-size concepts have been introduced.

for i in range(ppo_epochs):
    for j in range(batch_size // backward_batch_size):
        for k in range(backward_batch_size // mini_batch_size):
            with self.accelerator.accumulate(self.model):
                ...
                self.accelerator.backward(loss)
                optimizer.step()
                optimizer.zero_grad()

...

lr_scheduler.step()

where backward_batch_size = mini_batch_size * gradient_accumulation_steps and batch_size = n * backward_batch_size.
So the total training steps calculated in this repository might not match the actual number of optimizer steps; they are only equal when batch_size == backward_batch_size and ppo_epochs == 1.

ppo_config = PPOConfig(
        model_name=model_args.model_name_or_path,
        learning_rate=training_args.learning_rate,
        mini_batch_size=training_args.per_device_train_batch_size,
        batch_size=training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps,
        gradient_accumulation_steps=training_args.gradient_accumulation_steps,
        ppo_epochs=1,
        max_grad_norm=training_args.max_grad_norm,
        seed=training_args.seed,
        optimize_cuda_cache=True,
    )

...

total_train_batch_size = (
        training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps * training_args.world_size
    )
num_training_steps = training_args.num_train_epochs * math.ceil(len(dataset) / total_train_batch_size)
lr_scheduler = get_scheduler(
        training_args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=training_args.get_warmup_steps(num_training_steps),
        num_training_steps=num_training_steps
    )
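As a back-of-the-envelope check (hypothetical numbers, not taken from this issue), you can compare the step count handed to get_scheduler with the number of scheduler steps that actually happen under the default step_with_optimizer=True, split_batches=False behaviour:

import math

# Hypothetical settings resembling the ones reported above.
per_device_train_batch_size = 1          # becomes mini_batch_size
gradient_accumulation_steps = 16
world_size = 4                           # number of GPUs
ppo_epochs = 1
dataset_len = 10_000

backward_batch_size = per_device_train_batch_size * gradient_accumulation_steps   # samples per optimizer.step()
batch_size = per_device_train_batch_size * gradient_accumulation_steps            # PPOConfig.batch_size as set above

total_train_batch_size = batch_size * world_size
planned_steps = math.ceil(dataset_len / total_train_batch_size)                   # what get_scheduler is told
optimizer_steps = planned_steps * ppo_epochs * (batch_size // backward_batch_size)
scheduler_steps = optimizer_steps * world_size   # one scheduler step per process per optimizer step

print(planned_steps, optimizer_steps, scheduler_steps)   # 157 157 628: the scheduler overshoots ~4x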

If you want to change batch_size or ppo_epochs, you can set step_scheduler_with_optimizer=False; then the lr_scheduler only steps when you call it explicitly.

ppo_config = PPOConfig(
        model_name=model_args.model_name_or_path,
        learning_rate=training_args.learning_rate,
        mini_batch_size=training_args.per_device_train_batch_size,
        batch_size=training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps,
        gradient_accumulation_steps=training_args.gradient_accumulation_steps,
        accelerator_kwargs={"step_scheduler_with_optimizer": False},
        ppo_epochs=4,
        max_grad_norm=training_args.max_grad_norm,
        seed=training_args.seed,
        optimize_cuda_cache=True,
    )
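With step_scheduler_with_optimizer=False, the scheduler only advances when it is called explicitly, for example once per PPO step. A minimal sketch of such a loop (make_experience is a hypothetical placeholder for the generation and reward code, not a function from trl or this repo):

for batch in dataloader:
    # rollout: generate responses and compute rewards (details omitted)
    queries, responses, rewards = make_experience(batch)
    stats = ppo_trainer.step(queries, responses, rewards)   # runs all optimizer steps internally
    lr_scheduler.step()                                     # exactly one scheduler step per PPO step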

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Sep 27, 2023
@hiyouga
Owner

hiyouga commented Sep 27, 2023

@mmbwf Thanks for pointing it out! It has been fixed in 35fa947

@hannlp
Contributor Author

hannlp commented Nov 9, 2023

The original intention of gradient accumulation is to simulate a larger batch size with limited GPU memory: if a GPU holds per_device_train_batch_size samples at a time and gradients are accumulated for gradient_accumulation_steps steps, it simulates a batch size of per_device_train_batch_size * gradient_accumulation_steps. However, in the code, batch_size in PPOConfig is set to training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps, which means a single GPU directly holds that many samples. This doesn't seem to align with the original purpose of gradient accumulation. It feels like the more changes are made to TRL, the more problems arise. @mmbwf

@hiyouga
Owner

hiyouga commented Nov 9, 2023

@hannlp TRL uses mini_batch_size to indicate the batch size of each forward pass. Therefore, with gradient accumulation the total batch size is mini_batch_size * gradient_accumulation_steps [1]. The batch_size in PPOConfig only represents the number of examples processed in a single ppo_trainer.step() call.

@mmbwf
Contributor

mmbwf commented Nov 10, 2023

Hi, @hannlp, @hiyouga is right. backward_batch_size = mini_batch_size * gradient_accumulation_steps is the real number of samples optimized in an optimizer.step() call.

The following is a quote from the original author's explanation; I only made some adjustments to align with the current variable names. For further details, please refer to this PR huggingface/trl#546 (comment).

I guess this is more of a stylistic thing. We really have three levels of "batch sizes":

  1. batch_size is the amount of rollout data, e,g., 8 data points
  2. backward_batch_size is the amount of data that you are actually doing a zero_grad(), loss.backward(), and optimizer.step(), e.g., 4 data points
  3. mini_batch_size is used if the backward_batch_size is too large to fit in memory: we zero_grad(), partition the backward_batch, do multiple loss.backward() calls, and then optimizer.step().

Having the terminology like this makes it clear what the real backward_batch_size is, so that we do not confound it with gradient_accumulation_steps.

From the user's perspective, one set of hyperparameters should always reliably reproduce similar learning curves. With the new API design in this PR, batch_size=8, backward_batch_size=4 will reliably reproduce similar learning curves regardless of how many gradient_accumulation_steps there are, so when doing experiment management / analysis we can just group / filter by batch_size and backward_batch_size. The existing implementation, however, would require us to calculate the real size of the data we perform an optimizer.step() on.
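To make the three levels concrete, here is a small sketch using the numbers from the quote (batch_size=8, backward_batch_size=4) plus an assumed mini_batch_size=2; it only mirrors the loop structure shown earlier and is not code from trl or this repo:

batch_size = 8            # rollout data collected per ppo_trainer.step()
backward_batch_size = 4   # samples consumed by each optimizer.step()
mini_batch_size = 2       # samples per forward/backward pass
gradient_accumulation_steps = backward_batch_size // mini_batch_size   # 2

rollout = list(range(batch_size))                 # stand-in for the collected rollout data
for start in range(0, batch_size, backward_batch_size):
    backward_batch = rollout[start:start + backward_batch_size]
    # optimizer.zero_grad() would go here
    for mb_start in range(0, len(backward_batch), mini_batch_size):
        mini_batch = backward_batch[mb_start:mb_start + mini_batch_size]
        # forward pass + loss.backward() on mini_batch would go here
    # optimizer.step() would go here -> 2 optimizer steps per PPO step in this example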

@hannlp
Contributor Author

hannlp commented Nov 11, 2023

@mmbwf @hiyouga Thank you for your answers! But I have another question: in my environment (A100 40GB), why does per_device_train_batch_size=8 with gradient_accumulation_steps=2 not cause OOM, while per_device_train_batch_size=4 with gradient_accumulation_steps=16 does? Logically, shouldn't the number of samples on a GPU be determined by per_device_train_batch_size? This was also the cause of my previous question. Also, under the current circumstances, how can we simulate a larger batch size?

@mmbwf
Contributor

mmbwf commented Nov 13, 2023

Hi @hannlp, the OOM does not come from the PPO training stage, but from the stage where the model generates responses and rewards. In this inference stage the total batch size is training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps.

So the memory cannot hold such a large batch for inference. If you want to increase the batch size, you can split the large batch into mini-batches for inference and then merge the results.

The following is a rough implementation, just for reference:

# Get inputs
queries, responses, rewards = [], [], []
for mini_batch_start in range(0, self.config.batch_size, self.config.mini_batch_size):
    mini_batch_end = mini_batch_start + self.config.mini_batch_size
    mini_batch_queries, mini_batch_responses = self.get_inputs(batch[mini_batch_start:mini_batch_end])
    queries.extend(mini_batch_queries)
    responses.extend(mini_batch_responses)

    self.tokenizer.padding_side = "right" # change padding side for rewards inference.
    mini_batch_rewards = self.get_rewards(mini_batch_queries, mini_batch_responses, unwrapped_model)
    rewards.extend(mini_batch_rewards)
    self.tokenizer.padding_side = "left" # change padding side for responses inference.

@hannlp
Contributor Author

hannlp commented Nov 13, 2023

@mmbwf You're absolutely right, thank you so much for your patient response and code examples! Hope you have a blast in your life!

hiyouga added a commit that referenced this issue Nov 13, 2023
@hiyouga
Owner

hiyouga commented Nov 13, 2023

A more accurate version is in 87390ae @hannlp

@luyuntao92

@hiyouga Hi, with the updated code, the multi-GPU learning rate in the PPO stage now follows a full cosine period instead of a quarter period.

[screenshot: learning rate curve following a full cosine period]

Since ppo_epochs does not default to 1, shouldn't the total training_steps be multiplied by ppo_epochs?

@hiyouga
Owner

hiyouga commented Jan 4, 2024

@luyuntao92 In theory it cannot be changed that way, since it would break the single-GPU case. Please try accelerate rather than DeepSpeed.
