Cannot use Quantization bit 4 for prediction #1735

Closed · 1 task done
yhyu13 opened this issue Dec 4, 2023 · 1 comment
Labels: duplicate (This issue or pull request already exists)

Comments

yhyu13 (Contributor) commented Dec 4, 2023

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Here is my training & eval & prediction script.

The LoRA training is done with quantization_bit 4, so it would be nice if we could run prediction on the trained result right away using llama_factory's pipeline with bitsandbytes quantization.

#!/bin/bash

eval "$(conda shell.bash hook)"
conda activate llama_factory

MODEL_NAME=Qwen-1_8B-Chat
STAGE=sft
EPOCH=.01 #3.0
DATA=alpaca_gpt4_zh
SAVE_PATH=./models/$STAGE/$MODEL_NAME-$STAGE-$DATA-$EPOCH
SAVE_PATH_PREDICT=./models/$STAGE/$MODEL_NAME-$STAGE-$DATA-$EPOCH/Predict
MODEL_PATH=./models/$MODEL_NAME
LoRA_TARGET=c_attn #q_proj,v_proj
TEMPLATE=qwen #default

if [ ! -d "$MODEL_PATH" ]; then
    echo "Model not found: $MODEL_PATH"
    exit 1
fi

if [ ! -d $SAVE_PATH ]; then
    mkdir -p $SAVE_PATH
fi

if [ ! -d $SAVE_PATH_PREDICT ]; then
    mkdir -p $SAVE_PATH_PREDICT
fi
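
# Train the LoRA adapter with 4-bit (QLoRA) quantization, then evaluate on the 10% validation split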

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --seed 42 \
    --stage $STAGE \
    --model_name_or_path $MODEL_PATH \
    --dataset $DATA \
    --val_size .1 \
    --template $TEMPLATE \
    --finetuning_type lora \
    --do_train \
    --lora_target $LoRA_TARGET \
    --output_dir $SAVE_PATH \
    --overwrite_output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs $EPOCH \
    --do_eval \
    --evaluation_strategy epoch \
    --per_device_eval_batch_size 4 \
    --prediction_loss_only \
    --plot_loss \
    --quantization_bit 4 \
    | tee $SAVE_PATH/training_log.txt
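
# Run prediction (generation on up to 100 samples) with the trained LoRA checkpoint, again loading the base model in 4-bit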

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage $STAGE \
    --model_name_or_path $MODEL_PATH \
    --do_predict \
    --max_samples 100 \
    --predict_with_generate \
    --dataset $DATA \
    --template $TEMPLATE \
    --finetuning_type lora \
    --checkpoint_dir $SAVE_PATH \
    --output_dir $SAVE_PATH_PREDICT \
    --per_device_eval_batch_size 4 \
    --quantization_bit 4 \
    | tee $SAVE_PATH_PREDICT/predict_log.txt

The prediction step produces the following output:

/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
12/04/2023 14:00:16 - WARNING - llmtuner.model.parser - Evaluating model in 4/8-bit mode may cause lower scores.
12/04/2023 14:00:16 - WARNING - llmtuner.model.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|training_args.py:1345] 2023-12-04 14:00:16,347 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-04 14:00:16,347 >> PyTorch: setting up devices
/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/training_args.py:1711: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
12/04/2023 14:00:16 - INFO - llmtuner.model.parser - Process rank: 0, device: cuda:0, n_gpu: 1
  distributed training: True, compute dtype: None
12/04/2023 14:00:16 - INFO - llmtuner.model.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=True,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict/runs/Dec04_14-00-16_yhyu13fuwuqi,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=8,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
12/04/2023 14:00:16 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
Using custom data configuration default-d0b7f73168407ceb
Loading Dataset Infos from /home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/datasets/packaged_modules/json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (/home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from /home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file qwen.tiktoken
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file tokenizer.json
[INFO|configuration_utils.py:713] 2023-12-04 14:00:17,767 >> loading configuration file ./models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:713] 2023-12-04 14:00:17,768 >> loading configuration file ./models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:775] 2023-12-04 14:00:17,769 >> Model config QWenConfig {
  "_name_or_path": "./models/Qwen-1_8B-Chat",
  "architectures": [
    "QWenLMHeadModel"
  ],
  "attn_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_qwen.QWenConfig",
    "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
  },
  "bf16": false,
  "emb_dropout_prob": 0.0,
  "fp16": false,
  "fp32": false,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "kv_channels": 128,
  "layer_norm_epsilon": 1e-06,
  "max_position_embeddings": 8192,
  "model_type": "qwen",
  "no_bias": true,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "onnx_safe": null,
  "rotary_emb_base": 10000,
  "rotary_pct": 1.0,
  "scale_attn_weights": true,
  "seq_length": 8192,
  "softmax_in_fp32": false,
  "tie_word_embeddings": false,
  "tokenizer_class": "QWenTokenizer",
  "transformers_version": "4.34.1",
  "use_cache": true,
  "use_cache_kernel": false,
  "use_cache_quantization": false,
  "use_dynamic_ntk": true,
  "use_flash_attn": "auto",
  "use_logn_attn": true,
  "vocab_size": 151936
}

12/04/2023 14:00:17 - INFO - llmtuner.model.loader - Quantizing model to 4 bit.
[INFO|modeling_utils.py:2990] 2023-12-04 14:00:17,792 >> loading weights file ./models/Qwen-1_8B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:1220] 2023-12-04 14:00:17,792 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:770] 2023-12-04 14:00:17,792 >> Generate config GenerationConfig {}

Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
[INFO|modeling_utils.py:3103] 2023-12-04 14:00:18,107 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.49it/s]
[INFO|modeling_utils.py:3775] 2023-12-04 14:00:19,537 >> All model checkpoint weights were used when initializing QWenLMHeadModel.

[INFO|modeling_utils.py:3783] 2023-12-04 14:00:19,538 >> All the weights of QWenLMHeadModel were initialized from the model checkpoint at ./models/Qwen-1_8B-Chat.
If your task is similar to the task the model of the checkpoint was trained on, you can already use QWenLMHeadModel for predictions without further training.
[INFO|configuration_utils.py:728] 2023-12-04 14:00:19,539 >> loading configuration file ./models/Qwen-1_8B-Chat/generation_config.json
[INFO|configuration_utils.py:770] 2023-12-04 14:00:19,540 >> Generate config GenerationConfig {
  "chat_format": "chatml",
  "do_sample": true,
  "eos_token_id": 151643,
  "max_new_tokens": 512,
  "max_window_size": 6144,
  "pad_token_id": 151643,
  "repetition_penalty": 1.1,
  "top_k": 0,
  "top_p": 0.8
}

12/04/2023 14:00:19 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/peft/tuners/lora/bnb.py:213: UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
  warnings.warn(
12/04/2023 14:00:20 - INFO - llmtuner.model.adapter - Merged 1 model checkpoint(s).
12/04/2023 14:00:20 - INFO - llmtuner.model.adapter - Loaded fine-tuned model from checkpoint(s): ./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01
12/04/2023 14:00:20 - INFO - llmtuner.model.loader - trainable params: 0 || all params: 1836828672 || trainable%: 0.0000
12/04/2023 14:00:20 - INFO - llmtuner.model.loader - This IS expected that the trainable params is 0 if you are using model for inference only.
12/04/2023 14:00:20 - INFO - llmtuner.data.template - Add eos token: <|endoftext|>
12/04/2023 14:00:20 - INFO - llmtuner.data.template - Add pad token: <|endoftext|>
Loading cached processed dataset at /home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-71179182d092b457.arrow
[INFO|training_args.py:1345] 2023-12-04 14:00:21,076 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-04 14:00:21,076 >> PyTorch: setting up devices
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 100662, 108136, 101124, 45139, 1773, 151645, 198, 151644, 77091, 198]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
保持健康的三个提示。<|im_end|>
<|im_start|>assistant

Traceback (most recent call last):
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 50, in run_sft
    trainer = CustomSeq2SeqTrainer(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 56, in __init__
    super().__init__(
  File "/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py", line 412, in __init__
    raise ValueError(
ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details

But prediction and training cannot be enabled at the same time either:

Traceback (most recent call last):
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/train/tuner.py", line 20, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/model/parser.py", line 112, in get_train_args
    raise ValueError("`predict_with_generate` cannot be set as True while training.")
ValueError: `predict_with_generate` cannot be set as True while training.

Expected behavior

We should find a way to support evaluation and prediction with quantization as well.
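
As a possible interim workaround (a sketch only, not a confirmed fix: the export flags below follow the src/export_model.py usage documented in the LLaMA-Factory README of this era, so please verify them against your checkout), the LoRA checkpoint could first be merged into an unquantized model and prediction then run on the merged weights, where the 4-bit Trainer check does not apply:

# Hypothetical workaround sketch: merge the LoRA adapter into full-precision weights,
# then predict on the merged model without --quantization_bit.
EXPORT_PATH=$SAVE_PATH/Merged

CUDA_VISIBLE_DEVICES=0 python src/export_model.py \
    --model_name_or_path $MODEL_PATH \
    --template $TEMPLATE \
    --finetuning_type lora \
    --checkpoint_dir $SAVE_PATH \
    --export_dir $EXPORT_PATH

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage $STAGE \
    --model_name_or_path $EXPORT_PATH \
    --do_predict \
    --max_samples 100 \
    --predict_with_generate \
    --dataset $DATA \
    --template $TEMPLATE \
    --finetuning_type lora \
    --output_dir $SAVE_PATH_PREDICT \
    --per_device_eval_batch_size 4 \
    | tee $SAVE_PATH_PREDICT/predict_log.txt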

System Info

Ubuntu 22.04, RTX 3090, PyTorch 2.1.1, CUDA 12.1, FlashAttention 2
Latest LLaMA-Factory (commit d3dccd0)
Base model https://huggingface.co/Qwen/Qwen-1_8B-Chat

Others

No response

hiyouga added the pending (This problem is yet to be addressed) label on Dec 5, 2023
hiyouga (Owner) commented Dec 5, 2023

See #1462.
Besides, we recommend using --fp16 together with the --quantization_bit argument.
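
Following that suggestion, the prediction command from the reproduction script would simply gain a --fp16 flag (a minimal sketch reusing the same variables; whether this alone resolves the ValueError depends on the fix discussed in #1462):

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage $STAGE \
    --model_name_or_path $MODEL_PATH \
    --do_predict \
    --max_samples 100 \
    --predict_with_generate \
    --dataset $DATA \
    --template $TEMPLATE \
    --finetuning_type lora \
    --checkpoint_dir $SAVE_PATH \
    --output_dir $SAVE_PATH_PREDICT \
    --per_device_eval_batch_size 4 \
    --quantization_bit 4 \
    --fp16 \
    | tee $SAVE_PATH_PREDICT/predict_log.txt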

hiyouga added the duplicate (This issue or pull request already exists) label and removed the pending (This problem is yet to be addressed) label on Dec 5, 2023
hiyouga closed this as completed on Dec 5, 2023
hiyouga added a commit that referenced this issue on Dec 20, 2023