
Command-R (CohereForAI model) tokenization disagrees with HF implementation #6104

Closed
Noeda opened this issue Mar 16, 2024 · 5 comments

@Noeda
Contributor

Noeda commented Mar 16, 2024

Command-R support was recently merged here: #6033

This issue is also discussed here, where I initially thought it might be a bug on the HF implementation side: https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27

The model uses BPE; however, the tokenization is not exactly the same. I don't think it has any major impact on output quality, but it does lead to the two implementations disagreeing slightly on the top logits in some of my tests.

To test Command-R tokenization, we can use this with the HF model:

#!/usr/bin/env python3

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

test_string = """
This is a sentence.

### Sentence
"""

print(tokenizer.encode(test_string))

# -> [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]

llama.cpp comparison (I hacked the tokenize example to read the string from a file whose path is given in argv[2], instead of tokenizing argv[2] directly... do any of the CLI tools print the tokens without having to do that?)

$ tokenize ~/text-generation-webui/models/commandr_dev_f16.gguf tokens_test2
<omitted output until token list>

     5 -> ''
   206 -> '
'
  4184 -> 'This'
  1801 -> ' is'
  1671 -> ' a'
 27281 -> ' sentence'
    21 -> '.'
  2126 -> '

'
  2680 -> '###'
195143 -> ' Sentence'
   206 -> '
'

To put the token lists side by side for readability:

HF:
# [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]

llama.cpp:
# [5, 206, 4184, 1801, 1671, 27281, 21, 2126,     2680, 195143, 206]

The part that's different is two 206s vs one 2126. (206 = '\n', 2126 = '\n\n').
As far as I can tell, both implementations always decode back to the original string exactly.
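
As a quick sanity check of that claim, here is a small sketch reusing the tokenizer and test_string from the Python snippet above; it only verifies that the two token lists round-trip to the same text through the HF tokenizer, not that llama.cpp's own detokenizer agrees:

# Decode both token lists with the HF tokenizer from the snippet above
hf_tokens        = [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
llama_cpp_tokens = [5, 206, 4184, 1801, 1671, 27281, 21, 2126, 2680, 195143, 206]

decoded_hf   = tokenizer.decode(hf_tokens, skip_special_tokens=True)
decoded_lcpp = tokenizer.decode(llama_cpp_tokens, skip_special_tokens=True)

print(decoded_hf == decoded_lcpp)  # expected: True
print(decoded_hf == test_string)   # expected: True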

The tokenizers don't seem to be exactly the same: llama.cpp appears more eager to emit 2126 for \n\n than the HF version.

I verified with Cohere that their implementation is correct (https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27); I initially thought llama.cpp was correct and theirs was buggy.

The model might be slightly smarter if we matched the tokenization, since it would then match how the model was trained. From empirical testing I really don't think this impacts output quality in any material way, but it can influence the ordering of the top tokens enough to be noticeable. In a llama.cpp vs HF comparison on a test prompt of about 2200 tokens, 7 tokens diverged (all of them two 206s vs one 2126) and the logits reordered themselves. Maybe with particular kinds of prompts the divergence in tokenization would be much greater and the output much more different.

I'll offer to investigate and do a PR, with an ETA of sometime next week when I can invest more time. I haven't read the tokenization code in either HF or llama.cpp yet as of opening this issue.

@ggerganov
Owner

The most likely reason is that llama.cpp currently implements a specific BPE pre-processing regex. I'm not even sure which one exactly anymore - likely the one used in GPT-2, per the comment in the code below, though one should verify:

llama.cpp/llama.cpp

Lines 9989 to 9999 in b5f4ae0

std::vector<std::string> bpe_gpt2_preprocess(const std::string & text) {
    std::vector<std::string> bpe_words;
    std::vector<std::string> bpe_encoded_words;
    std::string token = "";
    // GPT2 system regex: 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
    bool collecting_numeric = false;
    bool collecting_letter = false;
    bool collecting_special = false;
    bool collecting_whitespace_lookahead = false;
    bool collecting = false;

I would guess that Command-R uses some other regex. AFAIK each BPE-based model can use an arbitrary regex:

https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py?rgh-link-date=2024-03-04T08%3A25%3A04Z
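
For illustration, here is a minimal sketch of what such a pre-tokenization regex does, applying the GPT-2 pattern quoted in the snippet above to the test string from this issue (using the third-party regex module, since Python's stdlib re does not support \p{...} classes); whether Command-R actually uses this exact pattern is an assumption that still needs to be verified:

#!/usr/bin/env python3
# Split the test string with the GPT-2 pre-tokenization regex quoted above.
# Requires the third-party "regex" module: pip install regex
import regex

gpt2_pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

test_string = "\nThis is a sentence.\n\n### Sentence\n"

print(regex.findall(gpt2_pattern, test_string))
# -> ['\n', 'This', ' is', ' a', ' sentence', '.', '\n', '\n', '###', ' Sentence', '\n']
# Note that '\n\n' is split into two separate '\n' pre-tokens here (matching the
# two 206 tokens from the HF tokenizer), whereas llama.cpp emits a single 2126.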

The problem becomes more significant because in C++ we can't simply use the built-in regex functions, since they don't fully support Unicode (or something like that, I'm no expert). So we have to do things like #4070. But then this becomes very slow, so we have to custom-implement the regex handling, like the bpe_gpt2_preprocess() above.

So, long story short, the BPE tokenization in llama.cpp has to be improved. I've created an issue and sketched a rough plan of what needs to be done: #5981

@Noeda
Contributor Author

Noeda commented Mar 16, 2024

Aha, thanks @ggerganov for the info. I'll keep an eye on the regex work and maybe help with it, depending on how much of my own time I can invest.

@daulet

daulet commented Mar 20, 2024

Cohere and a lot of other models use HuggingFace's tokenizers library, so a drop-in fix is to use that library for tokenization (just feed it the corresponding tokenizer config, e.g. this one) and avoid reimplementation and future maintenance in this project. The only issue is that the library is in Rust.
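
As a rough sketch of what using that library looks like via its Python bindings (assuming a tokenizer.json downloaded locally from the CohereForAI/c4ai-command-r-v01 repo; the file path here is a placeholder):

# Sketch: tokenize with HuggingFace's tokenizers library (Rust core, Python bindings).
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
enc = tok.encode("\nThis is a sentence.\n\n### Sentence\n")
print(enc.ids)     # token ids; should match the HF transformers output above
print(enc.tokens)  # the corresponding token pieces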

@ggerganov would you be open to bringing Rust into llama.cpp?

@ggerganov
Owner

3rd party projects can always use an external library to tokenize and pass the tokens to llama.cpp. It's not very convenient, but for now I don't want to incorporate a specific 3rd-party implementation into the project. It's unfortunate that we don't properly support all sorts of tokenizers, but hopefully with time we will improve, at least to support the important bits.

@github-actions github-actions bot added the stale label Apr 21, 2024

github-actions bot commented May 5, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed May 5, 2024