Optimize q4_matmul #275
Thanks for this. Interestingly, I have more or less the same optimization in V2 already. The difference in my tests has been minimal, though, and I'm doing it mostly for the sake of the new quant format, but I guess that's because I have nothing to test on that's older than Ampere. I'm really surprised there's this much of a difference here, given that the data you're explicitly caching in SMEM in these tests is all of 8 kB in total, and Turing is also supposed to have a shared architecture for L1 cache and SMEM, with about the same performance on both. I'll do some tests and merge this in a few hours if it doesn't break anything. But in the meantime, could you test if there's a further difference in performance with the |
Before this PR:
$ python test_benchmark_inference.py -p -d models/LLaMA-7B-4bit-128g -cs --matmul_fused_remap
/home/qc/Workspace/NotMe/exllama/cuda_ext.py:82: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
none_tensor = torch.empty((1, 1), device = "meta")
-- Tokenizer: models/LLaMA-7B-4bit-128g/tokenizer.model
-- Model config: models/LLaMA-7B-4bit-128g/config.json
-- Model: models/LLaMA-7B-4bit-128g/llama-7b-4bit-128g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --matmul_fused_remap
-- --concurrent_streams
-- Options: ['perf']
** Time, Load model: 1.36 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): 128
-- Act-order (inferred): yes
** VRAM, Model: [cuda:0] 3,638.47 MB
** VRAM, Cache: [cuda:0] 1,024.00 MB
-- Warmup pass 1...
** Time, Warmup: 0.44 seconds
-- Warmup pass 2...
** Time, Warmup: 0.42 seconds
-- Inference, first pass.
** Time, Inference: 0.59 seconds
** Speed: 3233.46 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 34.70 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 46.09 tokens/second
** VRAM, Inference: [cuda:0] 143.92 MB
** VRAM, Total: [cuda:0] 4,806.38 MB
After this PR:
$ python test_benchmark_inference.py -p -d models/LLaMA-7B-4bit-128g -cs --matmul_fused_remap
/home/qc/Workspace/NotMe/exllama/cuda_ext.py:82: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
none_tensor = torch.empty((1, 1), device = "meta")
-- Tokenizer: models/LLaMA-7B-4bit-128g/tokenizer.model
-- Model config: models/LLaMA-7B-4bit-128g/config.json
-- Model: models/LLaMA-7B-4bit-128g/llama-7b-4bit-128g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --matmul_fused_remap
-- --concurrent_streams
-- Options: ['perf']
** Time, Load model: 1.40 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): 128
-- Act-order (inferred): yes
** VRAM, Model: [cuda:0] 3,638.47 MB
** VRAM, Cache: [cuda:0] 1,024.00 MB
-- Warmup pass 1...
** Time, Warmup: 0.45 seconds
-- Warmup pass 2...
** Time, Warmup: 0.42 seconds
-- Inference, first pass.
** Time, Inference: 0.60 seconds
** Speed: 3216.22 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 79.21 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 159.22 tokens/second
** VRAM, Inference: [cuda:0] 143.92 MB
** VRAM, Total: [cuda:0] 4,806.38 MB |
Your original memory access pattern cannot be coalesced. In addition, x_map values are calculated multiple times, which is redundant. |
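A rough CPU-side sketch of the redundancy point above, not the actual kernel: it assumes x_map is used as a gather index into the input vector (x[x_map[i]]), and the function names are hypothetical. The first version repeats the x_map lookup for every output column; the second remaps once into a small buffer, which stands in for staging the values in shared memory. It does not model coalescing, only the repeated lookups.

```python
def matvec_fused_remap(x, x_map, w):
    """Compute y[j] = sum_i x[x_map[i]] * w[i][j], indexing x_map in the inner loop."""
    rows, cols = len(w), len(w[0])
    y = [0.0] * cols
    lookups = 0
    for j in range(cols):
        for i in range(rows):
            y[j] += x[x_map[i]] * w[i][j]  # x_map[i] is re-read for every column j
            lookups += 1
    return y, lookups


def matvec_cached_remap(x, x_map, w):
    """Same result, but the remapped input is gathered once and then reused."""
    rows, cols = len(w), len(w[0])
    x_cached = [x[k] for k in x_map]  # stand-in for a shared-memory staging buffer
    lookups = rows                    # one x_map read per row, independent of cols
    y = [0.0] * cols
    for j in range(cols):
        for i in range(rows):
            y[j] += x_cached[i] * w[i][j]
    return y, lookups


# Tiny made-up example: 3 input rows, 2 output columns.
x = [1.0, 2.0, 3.0]
x_map = [2, 0, 1]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

y_fused, n_fused = matvec_fused_remap(x, x_map, w)
y_cached, n_cached = matvec_cached_remap(x, x_map, w)
print(y_fused, n_fused)
print(y_cached, n_cached)
```

Both versions return the same vector; only the number of x_map reads differs (rows × cols versus rows).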
Well, without the fused remap parameter the remapping is done exactly once, from global to global memory, but the state should be in at least L2 cache by that point, and subsequently reading from L1 should not be slower. However, I haven't done nearly as much profiling on the V1 kernel as I have on V2, so I may have missed a lot. Perhaps coalescing matters more than I thought in places. And upon some further testing, this is faster, even on the 4090, but only for some models. For others it's considerably slower, and I'll need a moment to figure out why, or if it should be switchable to get the best of both worlds depending on e.g. model size. |
Let me guess. Do those models have a huge group size or no group size? |
You're right, they have no group size, which is to say they have one group as large as the hidden dim of the model. So they'll be using a lot of SMEM and occupancy will be terrible with this approach. But no reason this couldn't just be limited to 128 rows or something, in that case, so performance should be the same. I am getting broken output, though, so there's definitely something amiss. Note that the perplexity test runs sequences larger than the threshold that triggers reconstruction, where the custom kernel is bypassed in favor of just temporarily reconstructing the FP16 weights and using cuBLAS, since it's invariably faster. If you run the benchmark script with
Obviously that isn't right. I'm really hoping it's fixable because those speeds are... well, technically they're higher than the theoretical maximum for 1 TB/s VRAM bandwidth, which I guess is concerning, too. I'm investigating. |
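The bandwidth sanity check mentioned above can be sketched as simple arithmetic. This assumes that generating one token reads every weight byte exactly once and that the "MB" in the logs means MiB; both are assumptions, and the helper names are hypothetical. The model size and token rate are taken from the benchmark logs in this thread, and the 1 TB/s figure from the comment above.

```python
MIB = 1024 ** 2


def max_tokens_per_second(model_bytes, bandwidth_bytes_per_s):
    """Upper bound on decode speed if each token requires one full pass over the weights."""
    return bandwidth_bytes_per_s / model_bytes


def implied_bandwidth(model_bytes, tokens_per_s):
    """Weight traffic (bytes/s) that a measured token rate would require."""
    return model_bytes * tokens_per_s


# "VRAM, Model: 3,638.47 MB" from the logs; treating MB as MiB is an assumption.
model_bytes = 3638.47 * MIB

# Ceiling for the 1 TB/s VRAM bandwidth mentioned in the comment.
ceiling = max_tokens_per_second(model_bytes, 1e12)

# Traffic implied by the 159.22 tokens/second reading in the "After this PR" log.
implied = implied_bandwidth(model_bytes, 159.22)

print(f"ceiling at 1 TB/s: {ceiling:.1f} tokens/s")
print(f"implied traffic at 159.22 tokens/s: {implied / 1e9:.0f} GB/s")
```

If the implied traffic for a measured speed exceeds what the card's memory can deliver, the kernel is most likely skipping work rather than running faster, which is consistent with the broken output reported here.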
I'll look into it. |
@turboderp It's fixed. Quite a stupid mistake. 😢 And now the speed is much lower, but still faster than before.
$ python test_benchmark_inference.py -p -d models/LLaMA-7B-4bit-128g -cs
/home/qc/Workspace/NotMe/exllama/cuda_ext.py:82: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
none_tensor = torch.empty((1, 1), device = "meta")
-- Tokenizer: models/LLaMA-7B-4bit-128g/tokenizer.model
-- Model config: models/LLaMA-7B-4bit-128g/config.json
-- Model: models/LLaMA-7B-4bit-128g/llama-7b-4bit-128g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --concurrent_streams
-- Options: ['perf']
** Time, Load model: 1.37 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): 128
-- Act-order (inferred): yes
** VRAM, Model: [cuda:0] 3,638.47 MB
** VRAM, Cache: [cuda:0] 1,024.00 MB
-- Warmup pass 1...
** Time, Warmup: 0.44 seconds
-- Warmup pass 2...
** Time, Warmup: 0.42 seconds
-- Inference, first pass.
** Time, Inference: 0.59 seconds
** Speed: 3264.15 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 45.67 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 67.89 tokens/second
** VRAM, Inference: [cuda:0] 143.92 MB
** VRAM, Total: [cuda:0] 4,806.38 MB
$ python test_benchmark_inference.py -v -d models/LLaMA-7B-4bit-128g
/home/qc/Workspace/NotMe/exllama/cuda_ext.py:82: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
none_tensor = torch.empty((1, 1), device = "meta")
-- Tokenizer: models/LLaMA-7B-4bit-128g/tokenizer.model
-- Model config: models/LLaMA-7B-4bit-128g/config.json
-- Model: models/LLaMA-7B-4bit-128g/llama-7b-4bit-128g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- Options: ['validate']
** Time, Load model: 1.38 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): 128
-- Act-order (inferred): yes
** VRAM, Model: [cuda:0] 3,638.47 MB
** VRAM, Cache: [cuda:0] 1,024.00 MB
-- Testing 8 chunks.
** Perplexity (reconstruct): 6.0643
-- Testing 8 chunks.
** Perplexity (quant, token): 6.0777
** Generation: 'To be or not to be, that is the question.\nThe answer is: Yes and no. The first part of this sentence is a simple'
I'm now working on improving the performance of other group sizes. |
@turboderp Could you benchmark the latest commit on your models? |
I ran some tests on the 4090, but the latest version is about 10% slower than the original. However, if I change the blocksize back to 32, the performance is comparable, and somewhat better for 3B models:
Doing the same tests on the 3090 paints a different picture, with similar performance for 32 and 128 threads per block, and overall improvements on the order of 10%:
I'm inclined to say that with the 32 block dim it's an improvement overall, if not on Ada then at least on Ampere, which is still great. Does setting |
To improve performance on 30/40-series cards, maybe you can try something like |
I'm pretty sure it's latency bound right now, at least on the 4090. Getting around 700 GB/s throughput which, even assuming a little overhead, suggests there's room for improvement. But I'm also reaching 100% occupancy, which means to further hide the latency, the ratio of compute to memory access has to go up. Loading |
It's quite a bit slower on a 6700 XT with a 13B 128g model: around a 5% slowdown. From experience, I expect this to also be a slowdown on Pascal cards. |
That's frustrating. Seems like we need more discussion before merging. Do you have any idea how it causes the slowdown? The only card I have is a 2070S so I cannot dig up further. |
I don't have the answer to that. But with exllama v2 around the corner, is it really worth it to spend time trying to optimize exllama v1? |
Fine. I didn't know about exllama v2 when I wrote this PR. You can close this PR if you want. @turboderp |
I'm not an exllama maintainer or developer; turboderp is the only one who should decide whether this optimization is worth it or not. What I wanted to convey is that I, personally, don't want to spend time investigating why it is a slowdown on my card when exllama v2 might get released soon. Maybe exllama v2 will never get released, or will be different enough (like focusing on 2/3-bit quantization) to not make exllama v1 useless, and then my decision would have been wrong. |
I think it's worth keeping, but it should probably be switchable one way or another if there's performance degradation on ROCm. Probably it could just switch at compile time based on
As for V2, it supports GPTQ models as well as the new quant format, and performance is considerably better, at least on the cards I have access to. On 4090 it's 10-15% faster than V1 (so far), and the 3090-Ti is only 6% slower than the 4090, which tracks with the kernel being close to optimal, for some definition of optimal. So I also don't want to devote too much time to optimizing V1. If anything I'd rather back-port the new kernel at some point. Though in the meantime there's nothing wrong with making better use of SMEM, as in this PR. |
Pascal is nigh unusable for this regardless. I'd be happy with a speedup for supported cards. Looking forward to v2 to get higher perplexity quants perhaps. I am noticing the difference between GGML and GPTQ now but the latter is much better at memory management. |
Well, it has to be switchable regardless if it's an issue for ROCm. So I'll just switch on both the CUDA arch version and |
There, I finally figured out how GitHub works! And thanks for the optimization @QuarticCat. It should now be enabled when CUDA_ARCH >= 700 but not on ROCm, using the old version as a fallback. It gets a bit messy of course, but I'm not expecting to add too much more functionality to this version anyway. If there's a further performance benefit to be had on certain arch versions, it should be simple enough to add some more conditional code to select a different |
Performance changes
Before:
After:
Benchmarked on RTX 2070 Super. Other models cannot fit in VRAM. Expect less speedup if the model contains less x_map.
PPL changes
Before:
After:
Delta = 0.0005