Add support for QRWKV6 hybrid models & slight optimization for RWKV6 #11001

MollySophia · 2024-12-28T10:22:55Z

QRWKV6-32B is a new model by Recursal that combines the Qwen2.5 architecture with RWKV6.
It 'converts' the QKV attention of a Qwen2.5-32B-Instruct model into RWKV6 linear attention, keeping the knowledge of the original Qwen model while gaining the advantages of linear models (constant VRAM usage and FLOPs per token, independent of context length).
More info/model for testing: https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1
Some converted GGUF for testing: https://huggingface.co/mollysama/QRWKV6-32B-Instruct-Preview-GGUF
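
To illustrate the "constant VRAM" point, here is a rough sketch comparing how a softmax-attention KV cache grows with context length against a fixed-size linear-attention state. All dimensions below are hypothetical and chosen only for illustration; they are not the QRWKV6-32B configuration, and the sketch ignores smaller buffers such as token-shift state.

```python
def kv_cache_bytes(n_ctx, n_layer, n_kv_head, head_dim, bytes_per_elem=2):
    """Softmax attention: keys and values are kept for every past token."""
    return 2 * n_layer * n_ctx * n_kv_head * head_dim * bytes_per_elem


def linear_state_bytes(n_layer, n_head, head_dim, bytes_per_elem=4):
    """Linear attention: one head_dim x head_dim state matrix per head per layer."""
    return n_layer * n_head * head_dim * head_dim * bytes_per_elem


# Hypothetical model dimensions (assumption, for illustration only).
N_LAYER, N_HEAD, N_KV_HEAD, HEAD_DIM = 32, 32, 8, 64

for n_ctx in (1024, 8192, 65536):
    kv = kv_cache_bytes(n_ctx, N_LAYER, N_KV_HEAD, HEAD_DIM)
    st = linear_state_bytes(N_LAYER, N_HEAD, HEAD_DIM)
    print(f"n_ctx={n_ctx:6d}  kv_cache={kv / 2**20:9.1f} MiB  state={st / 2**20:6.1f} MiB")
```

The KV cache term scales linearly with `n_ctx`, while the state term has no `n_ctx` argument at all.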

Changes in this PR:

- Add a gated linear attention (GLA) op with CPU and CUDA implementations; it looks like a simplified version of the RWKV6 wkv attention.
- Model conversion and inference for QRWKV6-32B.
- RWKV6 optimizations: graph simplification, and concatenated lerp weights to reduce CPU overhead during inference (credit to @compilade).
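
For readers unfamiliar with gated linear attention, the following is a minimal single-head reference sketch. It assumes the usual recurrence S_t = diag(g_t) S_{t-1} + k_t^T v_t with output o_t = scale * q_t S_t; the function name and argument layout are hypothetical, and the actual CPU/CUDA kernels in this PR may differ in memory layout and details.

```python
def gated_linear_attention(q, k, v, g, state, scale=1.0):
    """Run one attention head over a sequence of T tokens.

    q, k, v, g: lists of length T; each entry is a list of D floats.
    state: D x D matrix (list of lists), updated in place.
    Returns a list of T output vectors, each of length D.
    """
    dim = len(state)
    outputs = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, g):
        out = [0.0] * dim
        for i in range(dim):
            for j in range(dim):
                # Decay row i by its gate, then add the outer product k_t (x) v_t.
                state[i][j] = state[i][j] * g_t[i] + k_t[i] * v_t[j]
                # Read out with the query, using the already-updated state.
                out[j] += q_t[i] * state[i][j]
        outputs.append([scale * x for x in out])
    return outputs


# Tiny example: D = 2, T = 2, starting from a zero state.
state = [[0.0, 0.0], [0.0, 0.0]]
result = gated_linear_attention(
    q=[[1.0, 0.0], [1.0, 1.0]],
    k=[[1.0, 2.0], [0.0, 1.0]],
    v=[[3.0, 4.0], [1.0, 1.0]],
    g=[[0.5, 0.5], [0.5, 0.5]],
    state=state,
)
print(result)
```

The state is a fixed D x D matrix per head, so the work per token does not depend on how many tokens came before. RWKV6's wkv additionally applies a per-channel bonus term to the current token, which is presumably the simplification the description refers to.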

Testing details:

32B Q4_0/Q4_K quantized model running on a single 4090 with decent speed:

$ ./build/bin/llama-bench -m ../QRWKV6-32B-Instruct-Preview-v0.1-Q4_0.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| rwkv6qwen2 32B Q4_0            |  19.34 GiB |    34.74 B | CUDA       |  99 |         pp512 |        819.60 ± 1.01 |
| rwkv6qwen2 32B Q4_0            |  19.34 GiB |    34.74 B | CUDA       |  99 |         tg128 |         32.72 ± 0.01 |

build: 5a73dbcb (4397)

wikitext2 PPLs:

| Quant type | PPL                |
| ---------- | -----------------: |
| f32        | 5.6987 +/- 0.03365 |
| q8_0       | 5.7005 +/- 0.03370 |
| q6_k       | 5.7126 +/- 0.03376 |
| q5_k_s     | 5.7339 +/- 0.03393 |
| q4_k_m     | 5.7921 +/- 0.03428 |
| q4_0       | 5.8568 +/- 0.03481 |
| q3_k_m     | 6.0677 +/- 0.03635 |
| q2_k       | 7.4547 +/- 0.04597 |
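
To make the quantization cost easier to compare, the short script below computes each quant type's perplexity increase relative to f32, using only the numbers listed above. The helper name is hypothetical.

```python
# wikitext2 perplexities copied from the list above (error bars omitted).
PPL = {
    "f32": 5.6987,
    "q8_0": 5.7005,
    "q6_k": 5.7126,
    "q5_k_s": 5.7339,
    "q4_k_m": 5.7921,
    "q4_0": 5.8568,
    "q3_k_m": 6.0677,
    "q2_k": 7.4547,
}


def relative_increase_percent(ppl, baseline_key="f32"):
    """Return {quant: percent increase in PPL over the baseline quant}."""
    base = ppl[baseline_key]
    return {name: 100.0 * (value / base - 1.0) for name, value in ppl.items()}


for name, pct in relative_increase_percent(PPL).items():
    print(f"{name:8s} {PPL[name]:.4f}  {pct:+6.2f}%")
```

Note that the differences between f32, q8_0 and q6_k are smaller than the reported error bars, so only the lower-bit quants show a clearly measurable degradation.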

Performance of QRWKV6-32B before and after concatenating the lerp weights:

(Sorry for the image attachment)

before:
$ ./build/bin/llama-bench -m ../QRWKV6-32B-Instruct-Preview-v0.1/QRWKV6-32B-Instruct-Preview-v0.1-F16.gguf -sm none -mg 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 4: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 5: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 6: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 7: NVIDIA H800, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| rwkv6qwen2 32B F16             |  65.26 GiB |    34.74 B | CUDA       |  99 |  none |         pp512 |        697.64 ± 0.59 |
| rwkv6qwen2 32B F16             |  65.26 GiB |    34.74 B | CUDA       |  99 |  none |         tg128 |         21.91 ± 0.00 |

build: b7b45753 (4397)

after:
$ ./build/bin/llama-bench -m ../QRWKV6-32B-Instruct-Preview-v0.1/QRWKV6-32B-Instruct-Preview-v0.1-F16-fused-lerp.gguf -sm none -mg 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 4: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 5: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 6: NVIDIA H800, compute capability 9.0, VMM: yes
  Device 7: NVIDIA H800, compute capability 9.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| rwkv6qwen2 32B F16             |  65.26 GiB |    34.74 B | CUDA       |  99 |  none |         pp512 |        731.32 ± 1.10 |
| rwkv6qwen2 32B F16             |  65.26 GiB |    34.74 B | CUDA       |  99 |  none |         tg128 |         26.51 ± 0.01 |

build: b7b45753 (4397)
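
The idea behind the fused-lerp variant can be sketched as follows. This is only an illustration under assumptions: it takes the lerp to have the form x + (x_prev - x) * w, uses plain Python lists instead of ggml tensors, and all function names are hypothetical. Instead of issuing one interpolation per weight vector, the weight vectors are concatenated once and processed in a single pass.

```python
def lerp_separate(x, x_prev, weights):
    """One interpolation per weight vector: x + (x_prev - x) * w."""
    return [
        [xi + (pi - xi) * wi for xi, pi, wi in zip(x, x_prev, w)]
        for w in weights
    ]


def fuse_weights(weights):
    """Concatenate the per-projection weight vectors into one flat buffer."""
    return [wi for w in weights for wi in w]


def lerp_fused(x, x_prev, fused, n_embd):
    """Single pass over the concatenated buffer, then split per projection."""
    out = [
        x[i % n_embd] + (x_prev[i % n_embd] - x[i % n_embd]) * w
        for i, w in enumerate(fused)
    ]
    return [out[c:c + n_embd] for c in range(0, len(out), n_embd)]


x = [1.0, 2.0]
x_prev = [3.0, 0.0]
weights = [[0.5, 0.25], [0.0, 1.0], [1.0, 0.0]]  # illustrative values

fused = fuse_weights(weights)  # done once, e.g. at conversion time
assert lerp_fused(x, x_prev, fused, n_embd=2) == lerp_separate(x, x_prev, weights)
```

Both paths produce the same values; the fused form just replaces several small operations with one larger one, which is where the reduced CPU overhead would come from. In the two runs above, pp512 goes from 697.64 to 731.32 t/s (about 5%) and tg128 from 21.91 to 26.51 t/s (about 21%).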

Signed-off-by: Molly Sophia <[email protected]>

MollySophia · 2025-01-07T00:31:36Z

Hi! @ggerganov
May I request a review? :3

ggerganov

I haven't tested the models. ggml-ci is passing on my CUDA machine.

ggml/src/ggml-cuda/gla.cu


MollySophia added 9 commits January 3, 2025 16:56

WIP: Add support for RWKV6Qwen2

f298f03

Signed-off-by: Molly Sophia <[email protected]>

RWKV: Some graph simplification

385b611

Signed-off-by: Molly Sophia <[email protected]>

Add support for RWKV6Qwen2 with cpu and cuda GLA

fab0aa7

Signed-off-by: Molly Sophia <[email protected]>

RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead

bc930cd

Signed-off-by: Molly Sophia <[email protected]>

Fix some typos

f2c1a5c

Signed-off-by: Molly Sophia <[email protected]>

code format changes

aaa870e

Signed-off-by: Molly Sophia <[email protected]>

Fix wkv test & add gla test

00930e6

Signed-off-by: Molly Sophia <[email protected]>

Fix cuda warning

08cf560

Signed-off-by: Molly Sophia <[email protected]>

Update README.md

331581b

Signed-off-by: Molly Sophia <[email protected]>

MollySophia force-pushed the rwkv6qwen2 branch from 69148cf to 331581b Compare January 3, 2025 09:21

ggerganov approved these changes Jan 7, 2025

ggml/src/ggml-cuda/gla.cu (outdated)

ggerganov requested a review from compilade January 7, 2025 08:58

Update ggml/src/ggml-cuda/gla.cu

aed0afb

Co-authored-by: Georgi Gerganov <[email protected]>
