Reminder
I have read the README and searched the existing issues.
Reproduction
Here is my Training & Eval & Prediction script.
The LoRA training was done with 4-bit quantization (quantization bit 4), so it would be nice if we could run prediction on the trained results right away through LLaMA-Factory's pipeline with bitsandbytes quantization.
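The shell script itself is not reproduced here, so below is a hypothetical reconstruction of the prediction-stage arguments, pieced together from the Seq2SeqTrainingArguments dump and the paths in the log; the key names (in particular checkpoint_dir, dataset, and template) are my assumption and may not match the actual script.

```python
# Hypothetical reconstruction of the predict run (not the original script).
# Values are taken from the log below; key names are assumed from LLaMA-Factory's CLI.
predict_args = {
    "stage": "sft",
    "do_predict": True,
    "model_name_or_path": "./models/Qwen-1_8B-Chat",
    "checkpoint_dir": "./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01",  # trained LoRA checkpoint
    "dataset": "alpaca_gpt4_zh",          # backed by alpaca_gpt4_data_zh.json in the log
    "template": "qwen",                   # assumed; the log shows the ChatML prompt format
    "finetuning_type": "lora",
    "quantization_bit": 4,                # bitsandbytes 4-bit, same as during training
    "predict_with_generate": True,
    "per_device_eval_batch_size": 4,
    "output_dir": "./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict",
}
```

With these arguments, the run spits out the following output: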
/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
warnings.warn(
12/04/2023 14:00:16 - WARNING - llmtuner.model.parser - Evaluating model in 4/8-bit mode may cause lower scores.
12/04/2023 14:00:16 - WARNING - llmtuner.model.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|training_args.py:1345] 2023-12-04 14:00:16,347 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-04 14:00:16,347 >> PyTorch: setting up devices
/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/training_args.py:1711: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
warnings.warn(
12/04/2023 14:00:16 - INFO - llmtuner.model.parser - Process rank: 0, device: cuda:0, n_gpu: 1
distributed training: True, compute dtype: None
12/04/2023 14:00:16 - INFO - llmtuner.model.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=True,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict/runs/Dec04_14-00-16_yhyu13fuwuqi,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=8,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
12/04/2023 14:00:16 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
Using custom data configuration default-d0b7f73168407ceb
Loading Dataset Infos from /home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/datasets/packaged_modules/json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (/home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from /home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file qwen.tiktoken
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file tokenizer.json
[INFO|configuration_utils.py:713] 2023-12-04 14:00:17,767 >> loading configuration file ./models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:713] 2023-12-04 14:00:17,768 >> loading configuration file ./models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:775] 2023-12-04 14:00:17,769 >> Model config QWenConfig {
"_name_or_path": "./models/Qwen-1_8B-Chat",
"architectures": [
"QWenLMHeadModel"
],
"attn_dropout_prob": 0.0,
"auto_map": {
"AutoConfig": "configuration_qwen.QWenConfig",
"AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
},
"bf16": false,
"emb_dropout_prob": 0.0,
"fp16": false,
"fp32": false,
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 11008,
"kv_channels": 128,
"layer_norm_epsilon": 1e-06,
"max_position_embeddings": 8192,
"model_type": "qwen",
"no_bias": true,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"onnx_safe": null,
"rotary_emb_base": 10000,
"rotary_pct": 1.0,
"scale_attn_weights": true,
"seq_length": 8192,
"softmax_in_fp32": false,
"tie_word_embeddings": false,
"tokenizer_class": "QWenTokenizer",
"transformers_version": "4.34.1",
"use_cache": true,
"use_cache_kernel": false,
"use_cache_quantization": false,
"use_dynamic_ntk": true,
"use_flash_attn": "auto",
"use_logn_attn": true,
"vocab_size": 151936
}
12/04/2023 14:00:17 - INFO - llmtuner.model.loader - Quantizing model to 4 bit.
[INFO|modeling_utils.py:2990] 2023-12-04 14:00:17,792 >> loading weights file ./models/Qwen-1_8B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:1220] 2023-12-04 14:00:17,792 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:770] 2023-12-04 14:00:17,792 >> Generate config GenerationConfig {}
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
[INFO|modeling_utils.py:3103] 2023-12-04 14:00:18,107 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.49it/s]
[INFO|modeling_utils.py:3775] 2023-12-04 14:00:19,537 >> All model checkpoint weights were used when initializing QWenLMHeadModel.
[INFO|modeling_utils.py:3783] 2023-12-04 14:00:19,538 >> All the weights of QWenLMHeadModel were initialized from the model checkpoint at ./models/Qwen-1_8B-Chat.
If your task is similar to the task the model of the checkpoint was trained on, you can already use QWenLMHeadModel for predictions without further training.
[INFO|configuration_utils.py:728] 2023-12-04 14:00:19,539 >> loading configuration file ./models/Qwen-1_8B-Chat/generation_config.json
[INFO|configuration_utils.py:770] 2023-12-04 14:00:19,540 >> Generate config GenerationConfig {
"chat_format": "chatml",
"do_sample": true,
"eos_token_id": 151643,
"max_new_tokens": 512,
"max_window_size": 6144,
"pad_token_id": 151643,
"repetition_penalty": 1.1,
"top_k": 0,
"top_p": 0.8
}
12/04/2023 14:00:19 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/peft/tuners/lora/bnb.py:213: UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
warnings.warn(
12/04/2023 14:00:20 - INFO - llmtuner.model.adapter - Merged 1 model checkpoint(s).
12/04/2023 14:00:20 - INFO - llmtuner.model.adapter - Loaded fine-tuned model from checkpoint(s): ./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01
12/04/2023 14:00:20 - INFO - llmtuner.model.loader - trainable params: 0 || all params: 1836828672 || trainable%: 0.0000
12/04/2023 14:00:20 - INFO - llmtuner.model.loader - This IS expected that the trainable params is 0 if you are using model for inference only.
12/04/2023 14:00:20 - INFO - llmtuner.data.template - Add eos token: <|endoftext|>
12/04/2023 14:00:20 - INFO - llmtuner.data.template - Add pad token: <|endoftext|>
Loading cached processed dataset at /home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-71179182d092b457.arrow
[INFO|training_args.py:1345] 2023-12-04 14:00:21,076 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-04 14:00:21,076 >> PyTorch: setting up devices
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 100662, 108136, 101124, 45139, 1773, 151645, 198, 151644, 77091, 198]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
保持健康的三个提示。<|im_end|>
<|im_start|>assistant
Traceback (most recent call last):
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 50, in run_sft
    trainer = CustomSeq2SeqTrainer(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 56, in __init__
    super().__init__(
  File "/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py", line 412, in __init__
    raise ValueError(
ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details
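For context on why this run crashes: the LoRA weights were merged into the 4-bit base model (see the llmtuner.model.adapter lines above), so the Trainer is handed a purely quantized model with no PEFT adapters attached and refuses to be constructed, even though this is a prediction-only run. A paraphrased sketch of that kind of guard, an approximation rather than the exact transformers code:

```python
# Approximate shape of the Trainer's init-time guard (paraphrased, not copied from
# transformers): a model loaded purely in 4-/8-bit via bitsandbytes is rejected
# unless PEFT adapters are still attached on top of it.
def reject_purely_quantized(model, is_peft_model: bool) -> None:
    quantized = getattr(model, "is_loaded_in_4bit", False) or getattr(
        model, "is_loaded_in_8bit", False
    )
    if quantized and not is_peft_model:
        raise ValueError(
            "You cannot perform fine-tuning on purely quantized models. "
            "Please attach trainable adapters on top of the quantized model."
        )
```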
But prediction and training cannot be enabled at the same time:
Traceback (most recent call last):
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/train/tuner.py", line 20, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/model/parser.py", line 112, in get_train_args
    raise ValueError("`predict_with_generate` cannot be set as True while training.")
ValueError: `predict_with_generate` cannot be set as True while training.
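This second failure comes from LLaMA-Factory's own argument validation rather than from transformers. A paraphrased sketch of the check referenced in the traceback (src/llmtuner/model/parser.py, line 112); the exact condition upstream may differ:

```python
# Paraphrased sketch of the argument check referenced above; not the verbatim code.
def validate_predict_with_generate(do_train: bool, predict_with_generate: bool) -> None:
    if do_train and predict_with_generate:
        raise ValueError("`predict_with_generate` cannot be set as True while training.")
```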
Expected behavior
We should find a way to support evaluation and prediction with quantization as well.
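One possible interim workaround (my assumption, not an official LLaMA-Factory path) is to skip the merge step and keep the LoRA adapter attached as a PeftModel on top of the 4-bit base, so the model handed to downstream code is not "purely quantized". A minimal sketch outside the llmtuner pipeline:

```python
# Minimal sketch (assumption, not LLaMA-Factory's own code path): load the 4-bit
# base with bitsandbytes and attach the LoRA checkpoint without merging it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "./models/Qwen-1_8B-Chat",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    trust_remote_code=True,  # required by the remote QWen modeling code
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base, "./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01"  # LoRA checkpoint from the log
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("./models/Qwen-1_8B-Chat", trust_remote_code=True)
```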
System Info
Ubuntu 22.04, RTX 3090, PyTorch 2.1.1, CUDA 12.1, FlashAttention 2
Latest LLaMA-Factory, commit d3dccd0
Base model: https://huggingface.co/Qwen/Qwen-1_8B-Chat
Others
No response