
Command-R (CohereForAI model) tokenization disagrees with HF implementation #6104

Closed
Noeda opened this issue Mar 16, 2024 · 5 comments

@Noeda
Contributor

Noeda commented Mar 16, 2024

Command-R support was recently merged here: #6033

This issue is also discussed here, where I initially thought it might be a bug on the HF implementation side: https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27

The model uses BPE; however, the tokenization is not exactly the same. I don't think it has any major impact on output quality, but it does lead to the two implementations disagreeing slightly on the top logits in some of my tests.

To test Command-R tokenization, we can use this with the HF model:

#!/usr/bin/env python3

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

test_string = """
This is a sentence.

### Sentence
"""

print(tokenizer.encode(test_string))

# -> [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]

llama.cpp comparison (I hacked the tokenize example to read the string from a file whose path is given in argv[2], instead of tokenizing argv[2] directly... do any of the CLI tools print the tokens without having to do that?)

$ tokenize ~/text-generation-webui/models/commandr_dev_f16.gguf tokens_test2
<omitted output until token list>

     5 -> ''
   206 -> '
'
  4184 -> 'This'
  1801 -> ' is'
  1671 -> ' a'
 27281 -> ' sentence'
    21 -> '.'
  2126 -> '

'
  2680 -> '###'
195143 -> ' Sentence'
   206 -> '
'

To put the token lists side by side for readability:

HF:
# [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]

llama.cpp:
# [5, 206, 4184, 1801, 1671, 27281, 21, 2126,     2680, 195143, 206]

The part that's different is two 206s vs one 2126. (206 = '\n', 2126 = '\n\n').
As far as I can tell, both implementations always decode back to the original string exactly.
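
As a quick sanity check of that claim, here is a small sketch reusing the tokenizer and test_string from the Python snippet above; it only verifies that the two token lists round-trip to the same text through the HF tokenizer, not that llama.cpp's own detokenizer agrees:

# Decode both token lists with the HF tokenizer from the snippet above
hf_tokens        = [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]
llama_cpp_tokens = [5, 206, 4184, 1801, 1671, 27281, 21, 2126, 2680, 195143, 206]

decoded_hf   = tokenizer.decode(hf_tokens, skip_special_tokens=True)
decoded_lcpp = tokenizer.decode(llama_cpp_tokens, skip_special_tokens=True)

print(decoded_hf == decoded_lcpp)  # expected: True
print(decoded_hf == test_string)   # expected: True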

The tokenizers don't seem to be exactly the same: llama.cpp appears more eager to emit 2126 for \n\n than the HF version.

I verified with Cohere that their implementation is correct (https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27); I initially thought llama.cpp was correct and theirs was buggy.

The model might be slightly smarter if we matched the tokenization, since it would then match how the model was trained. From empirical testing I really don't think this impacts output quality in any material way, but it can influence the ordering of the top tokens enough to be noticeable. In a llama.cpp vs HF comparison on a test prompt of about 2200 tokens, 7 tokens diverged (all of them two 206s vs one 2126) and the logits reordered themselves. Maybe with particular kinds of prompts the divergence in tokenization would be much greater and the output much more different.

I'll offer to investigate and do a PR, with an ETA of sometime next week when I can invest more time. I haven't read the tokenization code in either HF or llama.cpp yet as of opening this issue.

@ggerganov
Owner

The most likely reason is that llama.cpp currently implements a specific BPE pre-processing regex. I'm not even sure which one exactly anymore - likely the one used in GPT-2, per the comment in the code below, though one should verify:

llama.cpp/llama.cpp

Lines 9989 to 9999 in b5f4ae0

std::vector<std::string> bpe_gpt2_preprocess(const std::string & text) {
    std::vector<std::string> bpe_words;
    std::vector<std::string> bpe_encoded_words;
    std::string token = "";
    // GPT2 system regex: 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
    bool collecting_numeric = false;
    bool collecting_letter = false;
    bool collecting_special = false;
    bool collecting_whitespace_lookahead = false;
    bool collecting = false;

I would guess that Command-R uses some other regex. AFAIK each BPE-based model can use an arbitrary regex:

https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py?rgh-link-date=2024-03-04T08%3A25%3A04Z
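
For illustration, here is a minimal sketch of what such a pre-tokenization regex does, applying the GPT-2 pattern quoted in the snippet above to the test string from this issue (using the third-party regex module, since Python's stdlib re does not support \p{...} classes); whether Command-R actually uses this exact pattern is an assumption that still needs to be verified:

#!/usr/bin/env python3
# Split the test string with the GPT-2 pre-tokenization regex quoted above.
# Requires the third-party "regex" module: pip install regex
import regex

gpt2_pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

test_string = "\nThis is a sentence.\n\n### Sentence\n"

print(regex.findall(gpt2_pattern, test_string))
# -> ['\n', 'This', ' is', ' a', ' sentence', '.', '\n', '\n', '###', ' Sentence', '\n']
# Note that '\n\n' is split into two separate '\n' pre-tokens here (matching the
# two 206 tokens from the HF tokenizer), whereas llama.cpp emits a single 2126.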

The problem becomes more significant because in C++ we can't simply use the built-in regex functions, since they don't fully support Unicode (or something like that, I'm no expert). So we have to do things like #4070. But then this becomes very slow, so we have to custom-implement the regex handling, like the bpe_gpt2_preprocess() above.

So, long story short, the BPE tokenization in llama.cpp has to be improved. I've created an issue and sketched a rough plan of what needs to be done: #5981

@Noeda
Contributor Author

Noeda commented Mar 16, 2024

Aha, thanks @ggerganov for the info. I'll keep an eye on the regex work and maybe help with it, depending on how much of my own time I can invest.

@daulet

daulet commented Mar 20, 2024

Cohere and a lot of other models use HuggingFace's tokenizers library, so a drop-in fix is to use that library for tokenization (just feed it the corresponding tokenizer config, e.g. this one) and avoid reimplementation and future maintenance in this project. The only issue is that the library is in Rust.
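
As a rough sketch of what using that library looks like via its Python bindings (assuming a tokenizer.json downloaded locally from the CohereForAI/c4ai-command-r-v01 repo; the file path here is a placeholder):

# Sketch: tokenize with HuggingFace's tokenizers library (Rust core, Python bindings).
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
enc = tok.encode("\nThis is a sentence.\n\n### Sentence\n")
print(enc.ids)     # token ids; should match the HF transformers output above
print(enc.tokens)  # the corresponding token pieces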

@ggerganov would you be open to bringing Rust into llama.cpp?

@ggerganov
Owner

3rd party projects can always use an external library to tokenize and pass the tokens to llama.cpp. It's not very convenient, but for now I don't want to incorporate a specific 3rd-party implementation into the project. It's unfortunate that we don't properly support all sorts of tokenizers, but hopefully with time we will improve, at least to support the important bits.

@github-actions github-actions bot added the stale label Apr 21, 2024

github-actions bot commented May 5, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed May 5, 2024