Add Huggingface tokenzier support #189

Merged: 1 commit, Nov 29, 2023

Contributor

@DOGEwbx DOGEwbx commented Nov 28, 2023

Add logic to decide whether to use the Hugging Face tokenizer or the SentencePiece tokenizer.
This supports models that use a Hugging Face tokenizer, such as Falcon and Deepseek Coder.
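
A minimal sketch of the kind of selection logic described here, assuming the choice is made by checking which tokenizer file exists in the local model directory. `tokenizer.model` (SentencePiece) and `tokenizer.json` (Hugging Face) are the conventional file names; the function name `pick_tokenizer_backend` and the preference order are hypothetical and not taken from this PR.

```python
from pathlib import Path


def pick_tokenizer_backend(model_dir: str) -> str:
    """Return "sentencepiece" or "huggingface" for a local model directory."""
    root = Path(model_dir)
    # Assumption: a SentencePiece model file, when present, takes precedence.
    if (root / "tokenizer.model").is_file():
        return "sentencepiece"
    # Otherwise fall back to the Hugging Face tokenizer definition.
    if (root / "tokenizer.json").is_file():
        return "huggingface"
    raise FileNotFoundError(
        f"No tokenizer.model or tokenizer.json found in {model_dir}"
    )
```

Raising on a missing file, rather than silently picking a default, keeps a misconfigured model directory from producing garbage tokens later on.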

@CyberTimon
Contributor

Thank you so much for this work, @DOGEwbx. I've been waiting for Deepseek Coder support for a long time. This is very helpful.

@CyberTimon
Contributor

Is creating Exl2 quants with this also possible?

@DOGEwbx
Contributor Author

DOGEwbx commented Nov 28, 2023

@CyberTimon Thanks for your interest in our work. I haven't tested this on EXL2 quant files, but since all the modifications are in the tokenizer part, I don't think there will be problems with the specific data format.
I'm not very familiar with model quantization techniques; if you could give me some model checkpoints or conversion scripts, I can run some tests.

@turboderp
Member

At the very least, this is going to take some time to review.

Transformers is a massive dependency to include just to support one model (Falcon still wouldn't work as there are other architectural differences).

As for remote code, my guess would be that 90% of users are unaware of the risks involved, so it should at least be opt-in.

I'll need a moment to think about it, to test that this doesn't break functionality like constrained sampling, and make sure there really isn't a better way.

@DOGEwbx
Contributor Author

DOGEwbx commented Nov 28, 2023

Thanks for your reply.
For the transformers dependency issue, using the Tokenizers module instead would be one solution (but it would support Fast tokenizers only).
And you're right about trust_remote_code; I don't know the risks either. We could disable it, since the model should be stored locally.

@SinanAkkoyun
Contributor

SinanAkkoyun commented Nov 28, 2023

Is there any specific way to use the fork? With pip install transformers the fork does not work for me:

python examples/chat.py -m ../../models/deepseek/deepseek-coder-1.3b-instruct-GPTQ/ -mode deepseek
 -- Model: ../../models/deepseek/deepseek-coder-1.3b-instruct-GPTQ/
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: deepseek
 -- System prompt:

You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.

User: How can I tell the time in python?

���ııiiiii iii iv i i ii io io io io io io io io [... "io" repeats for the rest of the output ...]

6.7B shows very similar behaviour, but most of the time it ends up in an invisible output loop in the chat example.

I get the same behaviour no matter which prompt format I use (I also tested the deepseek instruct format).

Maybe I am just doing something wrong; I'd appreciate help.

@turboderp
Member

@SinanAkkoyun The model seems to use a linear RoPE scaling factor of 4. I've been able to get coherent output out of the 1.3B model at least, using that.
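
To illustrate what a linear RoPE scaling factor does, here is a minimal sketch of the rotation angles for one token position. It assumes the standard rotary formulation with base 10000 and treats linear scaling as dividing the position index by the factor; the function name `rope_angles` is hypothetical and this is not the code used in this repository.

```python
def rope_angles(position: int, head_dim: int, scale: float = 1.0,
                base: float = 10000.0) -> list[float]:
    """Rotation angles for one token position under linear RoPE scaling."""
    # Linear scaling compresses positions: position p is treated as p / scale.
    scaled_pos = position / scale
    # One angle per pair of dimensions: theta_i = pos * base^(-2i / head_dim).
    return [scaled_pos * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]
```

With `scale=4.0`, position 8 gets the same angles that position 2 gets without scaling. A model trained with the factor therefore sees unfamiliar angles when it is run without it, which is consistent with the incoherent output reported above.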

@DOGEwbx The Tokenizers library seems like a more reasonable dependency, especially if it's optional. It largely mirrors Transformers, so it should be possible to adapt it to the code in this PR. There are still a few things I need to sort out and verify, like how control symbols are encoded, optional BOS/EOS tokens, that the vocabulary is preprocessed correctly, how UTF-8 characters are emitted and so on. I'll get to that in a few hours.
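
One common way to keep a library like Tokenizers optional is to guard the import and fail only when the feature is actually requested. The sketch below assumes that pattern; `Tokenizer.from_file` is part of the Tokenizers library, while `HAS_TOKENIZERS` and `load_hf_tokenizer` are hypothetical names, not the ones used in the merged code.

```python
try:
    # Optional dependency: only needed for models without a SentencePiece file.
    from tokenizers import Tokenizer
    HAS_TOKENIZERS = True
except ImportError:
    Tokenizer = None
    HAS_TOKENIZERS = False


def load_hf_tokenizer(tokenizer_json_path: str):
    """Load a tokenizer.json file, or fail with a clear message."""
    if not HAS_TOKENIZERS:
        raise RuntimeError(
            "This model needs the 'tokenizers' package: pip install tokenizers"
        )
    return Tokenizer.from_file(tokenizer_json_path)
```

Users who only run SentencePiece-based models never hit the import, so the package does not have to be installed for them.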

It's definitely not a trivial issue. I see over on the llama.cpp repo a whole bunch of people have been working on it for some weeks now.

As for remote code, the issue is that with the option enabled, AutoModel.from_pretrained and apparently also AutoTokenizer.from_pretrained will import and run architectures distributed with models. The architectures are defined as Python code, and with no sandboxing this code has access to your entire userspace. So it can steal browser cookies, take screenshots, read the clipboard, etc. Crucially, users usually don't think of downloading and trying out a new model as downloading and installing software, especially in UIs like text-generation-webui that let you download and run models with a couple of clicks.
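
A minimal sketch of what making remote code opt-in could look like on the command line. The flag name `--trust-remote-code` and both helper functions are hypothetical; the only real API referenced is the `trust_remote_code` keyword argument of `from_pretrained` in Transformers, which appears in a comment and is not called.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Load a local model (sketch)")
    parser.add_argument("-m", "--model_dir", required=True,
                        help="Path to the local model directory")
    # Hypothetical flag: stays off unless the user explicitly asks for it.
    parser.add_argument("--trust-remote-code", action="store_true",
                        help="Allow Python code shipped with the model to run")
    return parser


def from_pretrained_kwargs(args: argparse.Namespace) -> dict:
    # Would be forwarded as AutoTokenizer.from_pretrained(args.model_dir, **kwargs).
    return {"trust_remote_code": args.trust_remote_code}
```

Because the default is `False`, a user who simply downloads and runs a model never executes code distributed with it unless they pass the flag knowingly.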

@SinanAkkoyun
Contributor

SinanAkkoyun commented Nov 28, 2023

@turboderp Thank youu, 6.7B is working coherently :)

@SinanAkkoyun
Contributor

SinanAkkoyun commented Nov 28, 2023

@turboderp However, I can't seem to get 1.3B to output coherent responses. What params did you use?

EXL2 GPTQ:

python examples/chat.py -m ../../models/deepseek/deepseek-coder-1.3b-instruct-GPTQ/ -mode deepseek -rs 4
 -- Model: ../../models/deepseek/deepseek-coder-1.3b-instruct-GPTQ/
 -- Options: ['rope_scale 4.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: deepseek
 -- System prompt:

You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.

User: How can you tell the time in python but backwards?

In Python 3, there is no builtin function that directly provides a date/time object's creation timestamp (like UNIX_TIMESTAMP or UTCINFO) back towards present as it would if we used `datetime` module from standard library like this :


   from datetime import datetime     #importing necessary libraries         .           print(f"Current Timebackwards {now - timed  (secon
   elta(seconds=1)}")       exit()        else
                                                    ____         ^______      / \              |         |           |
               TIME               0                   6947825                      tic ()                   def now():
      return                                  .....                                nt                           ~                         ~                             ...                           __maintainable..>                                        PEP
   ....                  stderr                       errno                        sys            │
 ├───�                                               �                                           <-- OSError                              EOFError                               ModuleNotFoundError                                   tracebacks for debugging
                 eeeffttsssyyyy yrry rrrty jjjk kkk lll mmnn ssst ddmmm MMDDYYY hhMMSSXZtz AABBCCXXV VVARAARAAAPPPPZZQTTJKLLLAACCUURRNNOOEEEENNNNSRRRLLEEEEEWROOOORWWWMERLFHGDOIAMAEEDAIEMANESANTTATOMTAASIISOTUTULISIONETIMELIBLIOWEITIVIRTHLYDECKSHAKINREIDSMARTADMINISTERTIANTECHNIQUEANDINTERACTIVEFUNCTIONSLACKMELODYSBERGINSONSOCIALSCIENTIFICCOMPLETECOMMONCONTRASTNEWSPIPELINECUSTOMECREATIONPROCESSORDATAVERSIENTIDEFIABLEPREDICTPERCEPTORYOURSPACEUSERSAVESSLHAILTONLINEPARTIALDEREFINEFORCEPSIZEXPARAMERRONEIGHTSUMMARYDISPLOSIONSHELPFAKEIDENTITYKEYERRORVALUESUCURIZEMPTYTOSTRINGFORMATTERUNAVAILVALUEATTRNAMEOUTPUTTYPEJSONBUNDLESINGLETOPATHSYNERRAISEIOFFILEPATHREADFROMCURRENTDIRWRITEFILECREATESESSIONSFMLFLASKJOINTFRAMEGRAPHDATABASECONFIGUREPYTESTCAUSEMOVEALLCAPSUCLUBNETMARKSIGNALEMENTSUBSCRIPTVisualizationDataFrameGROUPPOSITIONDATASTRATEFETYPEINDIAUTHORMACHIEFPROMPTUSERGETHOURLONGTEXTCOLORADDHEADMANAGMENTOFFSETCOHERENCEFERMIEXTENDPRIMARYSOURCEWHICHMODELEDGELOADDBSQLPARSERQLIMITPROPDRUIDOWNREFCOUNTSELECTRETURNCONTENTSIGNPOSTFIXMSGSODDAHLTFULLDEBUGLOGGERMAINSTAPIREGRESSORSUPPORTSTATUSBORDERBOUNDARYCRASHOPTIMIZEEXTENDPASSWORDCHECKPOSTMULTITHREADARGUENCESOLCALLBACKLISTENABLEDFREEFORESPECTORTICKMAXFEATUREDSIMPUTFIRSTROWINDEXCOLLECTPROJECTOBJECTSKILLSTATELOWESTGLBLIGHTMONOCLEARNBASECOMICSCHARTDRILLHTMLTAGCLASSIFICTYPECPythonBaseExceptionFileExistsErr@Python Base Exception Fatal Error In File error

Or is this just due to 4-bit quantization? The bf16 model responds with great answers for its 1.3B size.

@turboderp
Member

turboderp commented Nov 29, 2023

I think maybe you're just asking too much of a tiny model. And quantization is known to affect smaller models more severely anyway. Remember you can also just run the FP16 version to compare.

@turboderp turboderp merged commit 8c2be34 into turboderp-org:master Nov 29, 2023
@turboderp
Member

There. I rewrote it to use the Tokenizers library instead, as an optional dependency, and it seems to run okay now. It seems to consistently encode and decode the same as a HF AutoTokenizer. Encoding seems to work correctly during quantization as well.
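
A check like the one described, that two tokenizers encode the same way, can be sketched as a small comparison harness. This is an illustration under assumptions and not the test that was actually run: `find_encoding_mismatches` is a hypothetical helper that takes any two encode functions returning token ID sequences.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def find_encoding_mismatches(
    encode_a: Callable[[str], Sequence[int]],
    encode_b: Callable[[str], Sequence[int]],
    samples: Iterable[str],
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, ids_a, ids_b) for every sample the two encoders disagree on."""
    mismatches = []
    for text in samples:
        ids_a, ids_b = list(encode_a(text)), list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches
```

With real tokenizers, `encode_a` might wrap a Transformers tokenizer (`lambda s: hf_tok.encode(s, add_special_tokens=False)`) and `encode_b` a Tokenizers one (`lambda s: tk.encode(s, add_special_tokens=False).ids`), where `hf_tok` and `tk` are placeholder variable names. An empty result over a varied sample set, including control symbols and multi-byte UTF-8 text, would support the claim of consistent encoding.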

I also added a workaround for the Tokenizer bug where some added tokens would decode incorrectly. Still need to test it with some of the other models that lack a SentencePiece tokenizer model.

@CyberTimon
Contributor

Thank you

@SinanAkkoyun
Contributor

SinanAkkoyun commented Nov 29, 2023

> I think maybe you're just asking too much of a tiny model. And quantization is known to affect smaller models more severely anyway. Remember you can also just run the FP16 version to compare.

Yes, that's what puzzled me: the FP16 model ran perfectly fine and conquered most basic coding tasks easily.
It would be cool to have a super fast model capable of basic coding, but perhaps 4-bit is just not enough. I just wanted to make sure that it has nothing to do with hyperparameters.

> There. I rewrote it to use the Tokenizers library instead

Thank you so much!

4 participants