Command-R (CohereForAI model) tokenization disagrees with HF implementation #6104
The most likely reason is the pre-tokenization regex that llama.cpp applies before the BPE merges (see lines 9989 to 9999 in b5f4ae0).

I would guess that Command-R uses some other regex. AFAIK each BPE-based model can use an arbitrary regex. The problem becomes more significant because in C++ we can't simply use the built-in regex functions, since they don't fully support Unicode (or something like that, I'm no expert). So we have to do stuff like this: #4070. But then this becomes very slow, so we have to custom-implement the regex functions ourselves.

So, long story short, the BPE tokenization in llama.cpp is not guaranteed to match the reference implementation for every model.
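For illustration, here is the well-known GPT-2 pre-tokenization pattern, which is the kind of regex being discussed. Whether llama.cpp used exactly this pattern at b5f4ae0, and what Command-R's own pattern is, are assumptions to check against the respective configs. The sketch uses Python's third-party `regex` module because the stdlib `re` lacks `\p{...}` Unicode property classes:

```python
import regex as re  # pip install regex; stdlib `re` has no \p{...} support

# GPT-2's pre-tokenization pattern (from OpenAI's original encoder.py).
GPT2_PATTERN = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# The text is first split into fragments like this, and the BPE merges are
# then applied within each fragment. A model whose tokenizer config declares
# a different pattern splits differently and produces different token ids.
print(GPT2_PATTERN.findall("Hello world!\n\nHow are you?"))
```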
Aha. Thanks @ggerganov for the info. I'll keep an eye on the regex work and may help with it, depending on how much of my own time I can invest.
Cohere and a lot of other models use HuggingFace's tokenizer, so a drop-in fix is to use that library for tokenization (just feed it the corresponding tokenizer config, e.g. this) and avoid reimplementation and future maintenance in this project. The only issue is that the library is in Rust. @ggerganov, would you be open to bringing Rust into llama.cpp?
3rd party projects can always use an external library to tokenize and pass the tokens to llama.cpp.
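A minimal sketch of that workaround, assuming HuggingFace's `tokenizers` Python bindings and a tokenizer.json downloaded from the Command-R repo on the Hub (the file path is a placeholder):

```python
from tokenizers import Tokenizer

# Load the exact tokenizer config the model was trained with.
tok = Tokenizer.from_file("tokenizer.json")

ids = tok.encode("Hello world!").ids
# Feed these ids to llama.cpp (via whatever binding you use) instead of
# letting llama.cpp tokenize the raw string itself.
print(ids)
```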
Command-R support was recently merged here: #6033
This issue is also discussed here, where I initially thought it might be a bug on the HF implementation's side: https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27

The model uses BPE; however, something about the tokenization is not exactly the same. I don't think it has any major impact on output quality, but it does lead to the implementations disagreeing a little on top logits in some of my tests.
To test Command-R tokens we can use this with the HF model:
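A minimal sketch along those lines, assuming the standard transformers API and the CohereForAI/c4ai-command-r-v01 repo id (a recent transformers release may be required for the Cohere tokenizer class):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# Read the same prompt file that is fed to llama.cpp below.
with open("prompt.txt", encoding="utf-8") as f:
    text = f.read()

print(tokenizer.encode(text))
```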
llama.cpp comparison (I hacked `tokenize` to read the string from a file given by a filepath in argv[2] instead of tokenizing argv[2] directly... do any of the CLI tools print the tokens without having to do that?). To put the token lists side by side for readability:
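A small helper along those lines, purely illustrative (the function name and sample ids are hypothetical); it prints the two id lists column by column, so a disagreement and the positional offset it causes downstream are both visible:

```python
from itertools import zip_longest

def side_by_side(hf_ids, llamacpp_ids):
    # Pad the shorter list so trailing tokens still line up in the output.
    for hf, ll in zip_longest(hf_ids, llamacpp_ids, fillvalue="-"):
        mark = "" if hf == ll else "   <-- differs"
        print(f"{hf!s:>8}  {ll!s:>8}{mark}")

side_by_side([5, 206, 206, 9], [5, 2126, 9])
```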
The tokenizers don't seem exactly the same. It seems that llama.cpp is more eager than the HF version to emit 2126 for \n\n. I verified with Cohere that their implementation is correct (https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27); I initially thought llama.cpp was correct and theirs was buggy.

The model might be slightly smarter if we match the tokenization, since that would match how the model was trained. From empirical testing I really don't think this impacts output quality in any material way, but it can influence the ordering of the top tokens enough to be noticeable. In a llama.cpp vs HF test with a prompt of about 2200 tokens, 7 tokens diverged (every case was two 206s vs one 2126), and the top logits reordered themselves. Maybe with particular kinds of prompts the divergence in tokenization would be much greater and the output much different.
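As a quick way to reproduce the \n\n case described above on the HF side (ids 206 and 2126 are taken from this issue; verify them against the actual vocab before relying on them):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# Per this issue, HF splits "\n\n" into two single-newline tokens (206, 206)
# where llama.cpp prefers the single double-newline token 2126.
print(tok.encode("\n\n", add_special_tokens=False))
```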
I'll volunteer to investigate and do a PR, with an ETA of some time next week when I can invest more time. I haven't read the tokenization code in either HF or llama.cpp yet as of opening this issue.