4-bit quantization of LLaMA using GPTQ
GPTQ is a state-of-the-art one-shot weight quantization method.
This code is based on GPTQ
This version has been created and tested for use with KoboldAI
- Optimized CPU Offloading
- Optimized GPU Splitting
- Backwards Compatibility with older GPTQ-models
Currently, `groupsize` and `act-order` do not work together; you must choose one of them.
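The restriction can be checked before a long quantization run starts. The following is a minimal sketch, not part of this repository: `check_quant_options` is a hypothetical helper, and treating `groupsize == -1` as "grouping disabled" is an assumption made for illustration.

```python
def check_quant_options(groupsize: int = -1, act_order: bool = False) -> None:
    """Reject option combinations that this version cannot handle.

    Assumption for this sketch: groupsize == -1 means grouping is disabled.
    """
    if groupsize != -1 and act_order:
        raise ValueError(
            "groupsize and act-order do not work together; choose one of them"
        )


# Either option on its own passes the check.
check_quant_options(groupsize=128, act_order=False)
check_quant_options(groupsize=-1, act_order=True)
```

Calling `check_quant_options(groupsize=128, act_order=True)` raises `ValueError` instead of failing later in the run.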
LLaMA-7B
LLaMA-7B | Bits | group-size | memory(MiB) | Wikitext2 | checkpoint size(GB) |
---|---|---|---|---|---|
FP16 | 16 | - | 13940 | 5.68 | 12.5 |
RTN | 4 | - | - | 6.29 | - |
GPTQ | 4 | - | 4740 | 6.09 | 3.5 |
RTN | 3 | - | - | 25.54 | - |
GPTQ | 3 | - | 3852 | 8.07 | 2.7 |
GPTQ | 3 | 128 | 4116 | 6.61 | 3.0 |
LLaMA-13B
LLaMA-13B | Bits | group-size | memory(MiB) | Wikitext2 | checkpoint size(GB) |
---|---|---|---|---|---|
FP16 | 16 | - | OOM | 5.09 | 24.2 |
RTN | 4 | - | - | 5.53 | - |
GPTQ | 4 | - | 8410 | 5.36 | 6.5 |
RTN | 3 | - | - | 11.40 | - |
GPTQ | 3 | - | 6870 | 6.63 | 5.1 |
GPTQ | 3 | 128 | 7277 | 5.62 | 5.4 |
LLaMA-33B
LLaMA-33B | Bits | group-size | memory(MiB) | Wikitext2 | checkpoint size(GB) |
---|---|---|---|---|---|
FP16 | 16 | - | OOM | 4.10 | 60.5 |
RTN | 4 | - | - | 4.54 | - |
GPTQ | 4 | - | 19493 | 4.45 | 15.7 |
RTN | 3 | - | - | 14.89 | - |
GPTQ | 3 | - | 15493 | 5.69 | 12.0 |
GPTQ | 3 | 128 | 16566 | 4.80 | 13.0 |
LLaMA-65B
LLaMA-65B | Bits | group-size | memory(MiB) | Wikitext2 | checkpoint size(GB) |
---|---|---|---|---|---|
FP16 | 16 | - | OOM | 3.53 | 121.0 |
RTN | 4 | - | - | 3.92 | - |
GPTQ | 4 | - | OOM | 3.84 | 31.1 |
RTN | 3 | - | - | 10.59 | - |
GPTQ | 3 | - | OOM | 5.04 | 23.6 |
GPTQ | 3 | 128 | OOM | 4.17 | 25.6 |
Quantization requires a large amount of CPU memory. However, the memory required can be reduced by using swap memory.
Depending on the GPUs/drivers, there may be a difference in performance, which decreases as the model size increases (IST-DASLab/gptq#1).
According to the GPTQ paper, as the size of the model increases, the difference in performance between FP16 and GPTQ decreases.
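The FP16-to-GPTQ gap and the size reduction can be read directly off the tables above. The sketch below copies the FP16 and 4-bit GPTQ rows (no group-size) and computes both figures; the Wikitext2 column is treated as perplexity (lower is better), and the names `wikitext2`, `checkpoint_gb`, and `summarize` are illustrative only.

```python
# (FP16, GPTQ 4-bit) pairs copied from the tables above.
wikitext2 = {
    "7B": (5.68, 6.09),
    "13B": (5.09, 5.36),
    "33B": (4.10, 4.45),
    "65B": (3.53, 3.84),
}
checkpoint_gb = {
    "7B": (12.5, 3.5),
    "13B": (24.2, 6.5),
    "33B": (60.5, 15.7),
    "65B": (121.0, 31.1),
}


def summarize():
    """Return the perplexity gap and checkpoint size ratio per model size."""
    rows = {}
    for size, (fp16_ppl, gptq_ppl) in wikitext2.items():
        fp16_gb, gptq_gb = checkpoint_gb[size]
        rows[size] = {
            "ppl_gap": round(gptq_ppl - fp16_ppl, 2),    # GPTQ minus FP16
            "size_ratio": round(fp16_gb / gptq_gb, 2),   # FP16 size / GPTQ size
        }
    return rows


for size, row in summarize().items():
    print(size, row)
```

With these table values, the 4-bit gap is 0.41 for LLaMA-7B and 0.31 for LLaMA-65B, and the 4-bit checkpoint is between roughly 3.6 and 3.9 times smaller than the FP16 one.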
pip install git+https://github.com/0cc4m/GPTQ-for-LLaMa@c884b421a233f9603d8224c9b22c2d83dd2c1fc4
old instructions:
If you don't have conda, install it first.
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
git clone https://github.com/0cc4m/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
# Benchmark performance for FC2 layer of LLaMA-7B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py
- `torch`: tested on v2.0.0+cu117
- `transformers`: tested on v4.28.0.dev0
- `datasets`: tested on v2.10.1
- `safetensors`: tested on v0.3.0
- (to run 4-bit kernels: setup for compiling PyTorch CUDA extensions, see also https://pytorch.org/tutorials/advanced/cpp_extension.html, tested on CUDA 11.7)
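To see how a local environment compares with the tested versions listed above, a small check with the standard library's `importlib.metadata` may help. This is a sketch only: `TESTED` and `compare_with_tested` are hypothetical names, and a version mismatch does not necessarily mean the code will fail.

```python
from importlib import metadata

# Versions this README reports as tested.
TESTED = {
    "torch": "2.0.0+cu117",
    "transformers": "4.28.0.dev0",
    "datasets": "2.10.1",
    "safetensors": "0.3.0",
}


def compare_with_tested(tested=TESTED):
    """Map each package name to (installed version or None, tested version)."""
    report = {}
    for name, tested_version in tested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, tested_version)
    return report


for name, (installed, tested_version) in compare_with_tested().items():
    print(f"{name}: installed={installed}, tested={tested_version}")
```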
All experiments were run on a single NVIDIA RTX3090.
In this version of GPTQ, commands are invoked by module name instead of by Python file name.
Old Command | New Way |
---|---|
python llama.py | python -m gptq.llama |
python gptj.py | python -m gptq.gptj |
python opt.py | python -m gptq.opt |
python gptneox.py | python -m gptq.gptneox |
python llama_inference.py | python -m gptq.llama_inference |
python llama_inference_offload.py | python -m gptq.llama_inference_offload |
python convert_llama_weights_to_hf.py | python -m gptq.convert_llama_weights_to_hf |
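Existing scripts or notes that still use the old form can be rewritten mechanically using the mapping in the table above. The helper below is a hypothetical illustration, not something shipped with this package; it only recognizes the script names listed in the table.

```python
import shlex

# Script names from the "Old Command" column of the table above.
KNOWN_SCRIPTS = {
    "llama.py",
    "gptj.py",
    "opt.py",
    "gptneox.py",
    "llama_inference.py",
    "llama_inference_offload.py",
    "convert_llama_weights_to_hf.py",
}


def to_module_command(old_command: str) -> str:
    """Rewrite 'python <script>.py ...' as 'python -m gptq.<script> ...'."""
    parts = shlex.split(old_command)
    for i, part in enumerate(parts):
        if part in KNOWN_SCRIPTS:
            module = "gptq." + part[: -len(".py")]
            parts[i : i + 1] = ["-m", module]
            return shlex.join(parts)
    raise ValueError(f"no known GPTQ script found in: {old_command!r}")


print(to_module_command("python llama.py ./llama-hf/llama-7b c4 --wbits 4"))
```

Arguments before and after the script name are passed through unchanged, so environment prefixes such as `CUDA_VISIBLE_DEVICES=0` survive the rewrite.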
# Convert LLaMA to hf
python -m gptq.convert_llama_weights_to_hf --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf
# Benchmark language generation with 4-bit LLaMA-7B:
# Save compressed model
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --save llama7b-4bit.pt
# Or save compressed `.safetensors` model
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --save_safetensors llama7b-4bit.safetensors
# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama ./llama-hf/llama-7b c4 --wbits 4 --load llama7b-4bit.pt --benchmark 2048 --check
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python -m gptq.llama ./llama-hf/llama-7b c4 --benchmark 2048 --check
# model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama_inference ./llama-hf/llama-7b --wbits 4 --load llama7b-4bit.pt --text "this is llama"
# Model inference with the saved model, with offload. (This is very slow. It is a simple implementation and could be improved with technologies like FlexGen: https://github.com/FMInference/FlexGen.)
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama_inference_offload ./llama-hf/llama-7b --wbits 4 --load llama7b-4bit.pt --text "this is llama" --pre_layer 16
With LLaMA-65B on a single RTX 3090 and `pre_layer` set to 50, generating 45 tokens (from 5 to 50 tokens) takes about 180 seconds.
The CUDA kernels support 2, 3, 4, and 8 bits; the faster CUDA kernels support 2, 3, and 4 bits.
In general, 4-bit quantization with a groupsize of 128 is recommended.
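To illustrate what the group size controls, here is a minimal pure-Python sketch of group-wise round-to-nearest (RTN) quantization, the baseline that appears in the tables above. It is not the GPTQ algorithm and not this repository's kernel code: GPTQ additionally adjusts the remaining weights to compensate for rounding error, which is omitted here. The function names and the synthetic weight row are assumptions for illustration.

```python
import math


def quantize_rtn(weights, bits=4, groupsize=128):
    """Quantize and dequantize a row with one (scale, minimum) pair per group."""
    maxq = 2 ** bits - 1  # 15 for 4 bits
    dequantized = []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / maxq or 1.0  # fall back to 1.0 for constant groups
        for w in group:
            code = round((w - lo) / scale)  # integer code in [0, maxq]
            dequantized.append(lo + scale * code)
    return dequantized


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic row: 128 small weights followed by 128 larger ones.
row = [0.1 * math.sin(i) for i in range(128)] + [math.sin(i) for i in range(128)]

err_whole_row = mean_squared_error(row, quantize_rtn(row, groupsize=len(row)))
err_grouped = mean_squared_error(row, quantize_rtn(row, groupsize=128))
print(f"one scale per row: {err_whole_row:.6f}")
print(f"groupsize 128:     {err_grouped:.6f}")
```

With a single scale for the whole row, the small weights share a grid sized for the large ones. Giving each group of 128 weights its own scale lets the grid follow the local range, at the cost of storing extra scale and minimum values per group.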
This code is based on GPTQ
Thanks to Meta AI for releasing LLaMA, a powerful LLM.