use_fp8_context_fmha broken outputs #1539
Comments
Can you try to set
The following builds, including
Thanks for the experiments. Have you tried fp8 context fmha with a smaller model like 7B or 13B? We have verified that llama 7b works well, but it is possible that larger model sizes do not work as expected with fp8 context fmha. I will also give it a try locally.
@siddhatiwari what is the input in your tests? To make sure we are aligned, could you try the run.py and summarize.py tests?
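For reference, a rough sketch of how those example scripts are typically invoked (paths here are placeholders, and flag names can vary across TensorRT-LLM versions):

```bash
# Placeholder paths; point these at your own engine and tokenizer directories.
ENGINE_DIR=./llama_engine
HF_MODEL_DIR=./Llama-2-70b-hf

# Single-prompt generation check
python3 examples/run.py \
    --engine_dir "$ENGINE_DIR" \
    --tokenizer_dir "$HF_MODEL_DIR" \
    --max_output_len 64 \
    --input_text "What is the capital of the USA?"

# Summarization accuracy check against the TRT-LLM engine
python3 examples/summarize.py \
    --test_trt_llm \
    --engine_dir "$ENGINE_DIR" \
    --hf_model_dir "$HF_MODEL_DIR"

# For TP>1 engines, prefix each command with: mpirun -n <tp_size> --allow-run-as-root
```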
@PerkzZheng thanks for pointing out the tests. I got unrelated runtime errors with run.py, but the summarize.py output looks correct. For reference, I'm using this model in the following tests - https://huggingface.co/NousResearch/Llama-2-70b-hf
Then I tried running the same simple prompt multiple times with no concurrency, and another run with concurrent requests. The no-concurrency outputs were good, but I got bad outputs with concurrent requests.
Prompt:
No concurrency (1 request at a time):
Concurrent requests:
I'm not sure what the issue is; I'll debug further and also try running run.py again.
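A minimal sketch of that comparison against a Triton deployment. The endpoint URL, the "ensemble" model name, and the request fields are assumptions based on the default tensorrtllm_backend ensemble and may differ in your deployment:

```bash
# Assumed Triton HTTP endpoint and ensemble model name.
URL=http://localhost:8000/v2/models/ensemble/generate
BODY='{"text_input": "What is the capital of the USA?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'

# No concurrency: one request at a time
for i in $(seq 1 4); do
  curl -s -X POST "$URL" -d "$BODY"
  echo
done

# Concurrent requests: fire the same request in parallel, then wait for all of them
for i in $(seq 1 8); do
  curl -s -X POST "$URL" -d "$BODY" &
done
wait
```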
@siddhatiwari thanks. I have reproduced this and will let you know when I have a fix.
@siddhatiwari you can check out next week's update on the main branch for the fix. The output for multiple requests should be good with the update.
@siddhatiwari The fix has been included in PR #1639; please verify again with the latest main branch. Thanks!
Thank you for the update! @PerkzZheng @kaiyux Unfortunately I'm still getting the same issue where outputs for concurrent requests are bad. The following info uses a Llama2 7B model instead of 70B (for quicker builds).
Prompt:
Request parameters:
Single request outputs:
Concurrent request output:
In case my setup is incorrect, here are the specific commands with uploaded builds that I used to reproduce the issue:
Base model: https://huggingface.co/NousResearch/Llama-2-7b-hf
TensorRT-LLM build commands:
TensorRT-LLM engine (output of build commands): https://huggingface.co/sdtw/llama-2-7b-trtllm-0.11.0.dev2024052100
Triton TensorRT-LLM Backend (built with Dockerfile.trt_llm_backend for
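For anyone comparing setups, the general shape of an fp8 engine build looks roughly like this. This is a sketch only, not the exact commands used above; flag names and defaults vary between TensorRT-LLM versions:

```bash
HF_MODEL_DIR=./Llama-2-7b-hf
CKPT_DIR=./llama-7b-fp8-ckpt
ENGINE_DIR=./llama-7b-fp8-engine

# Quantize the HF checkpoint to fp8 (weights and KV cache)
python3 examples/quantization/quantize.py \
    --model_dir "$HF_MODEL_DIR" \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --output_dir "$CKPT_DIR"

# Build the engine with fp8 context FMHA (typically used together with paged context FMHA)
trtllm-build \
    --checkpoint_dir "$CKPT_DIR" \
    --output_dir "$ENGINE_DIR" \
    --max_batch_size 64 \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable
```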
@siddhatiwari can you pull the latest main branch and rebuild the trt-llm package, as shown here? I don't see an issue with either llama 7b or 70b.
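A sketch of rebuilding the wheel from the main branch, along the lines of the official build instructions. The --cuda_architectures value is an assumption for H100; check the repo docs for your setup:

```bash
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive

# Build the wheel (inside the development container), then install it
python3 ./scripts/build_wheel.py --clean --cuda_architectures "90-real"
pip install ./build/tensorrt_llm*.whl
```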
I still get the same issue with that command. Can you share the engine build commands and models you used, in case those are different?
@siddhatiwari see the commands below. I am using llama v2 7b locally, but that should not make any difference when reproducing your issue.
The outputs would be like:
There might be several factors that make the results different:
@PerkzZheng thanks, I got good outputs using the exact same commands you listed, but I got bad outputs when I tweaked the commands for tp=2. Tensor parallelism might be the cause of the different results, like you mentioned. Model: https://huggingface.co/NousResearch/Llama-2-7b-hf
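Roughly, the tp=2 tweak boils down to sharding the checkpoint at quantization time and launching the runtime with one MPI rank per GPU. A sketch under the same assumptions as the fp8 build above (illustrative paths, version-dependent flags):

```bash
# Shard the fp8 checkpoint across 2 GPUs at quantization time
python3 examples/quantization/quantize.py \
    --model_dir ./Llama-2-7b-hf \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --tp_size 2 \
    --output_dir ./llama-7b-fp8-tp2-ckpt

# Build one engine per rank (--workers parallelizes the per-rank builds)
trtllm-build \
    --checkpoint_dir ./llama-7b-fp8-tp2-ckpt \
    --output_dir ./llama-7b-fp8-tp2-engine \
    --workers 2 \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable

# TP=2 engines need one rank per GPU at runtime
mpirun -n 2 --allow-run-as-root python3 examples/run.py \
    --engine_dir ./llama-7b-fp8-tp2-engine \
    --tokenizer_dir ./Llama-2-7b-hf \
    --max_output_len 64 \
    --input_text "What is the capital of the USA?"
```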
@siddhatiwari it looks like what you have shared just gives the same results for batch size > 1? Can you give another example here?
It seems that some TP builds with certain inputs cause bad outputs. Below are different model and TP builds, each tested with 3 different inputs; I've also listed the outputs and noted which ones were bad. (When I first tested TRT LLM version 0.11.0.dev2024052100 and got bad outputs, I was using a fine-tuned 70B llama2 with a high batch size and high requests per second, like the build params listed here: #1539 (comment). Maybe high batch size and high throughput increase the probability of these bad outputs?)
These are the base models used to build the following engines:
7B, TP=2 Build commands:
Input: "What is the capital of the USA?"
Input: "Jupiter is the biggest planet in "
Input: "In this essay I will explain "
70B, TP=2 Build commands:
Input: "What is the capital of the USA?"
Input: "Jupiter is the biggest planet in "
Input: "In this essay I will explain "
70B, TP=4 Build commands:
Input: "What is the capital of the USA?"
Input: "Jupiter is the biggest planet in "
Input: "In this essay I will explain "
@siddhatiwari so for 7B TP=1, all results are good, right? I suspect the all-reduce kernels amplify the quantization errors. Also, please set
I am experiencing similar issues. I am using LLAMA3 8B with lora weights and get significantly worse results when making calls concurrently than when running one at a time. After seeing this thread I just tested with
@TheCodeWrangler could you give the fix shown here a try if you are using IFB + the triton backend?
@PerkzZheng outputs with use_fp8_context_fmha seem fixed now for most cases, including when using triton server, but they are still broken with enable_xqa. You mentioned before that xqa is not compatible, so maybe this is expected?
It should work with the latest main branch (even release 0.10, if I remember correctly).
Hi @siddhatiwari, is there any update on this ticket? If not, we'll close it soon.
System Info
CPU architecture: x86_64
Host RAM: 1TB
GPU: 8xH100 SXM
Container: Manually built container with TRT 9.3 (Dockerfile.trt_llm_backend)
TensorRT-LLM version: 0.10.0.dev2024043000
Driver Version: 535.161.07
CUDA Version: 12.2
OS: Ubuntu 22.04
Who can help?
@byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Build llama 70b with the following parameters:
Sample output:
It's alright. I understand. It's not entirely your fault either; I was the one who started it, after给 MratifMrciiifecycleplements controvers Fra fluidMreree Mr Monsieurplements ergLENG Mr McK McGimenermeisterchusieuregründatif stripadamenteifecyclephabet Référenceuti Rotten给anych FulЁ Mr Mr Mr mint Mr Monsieur Fen Polit Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr给 Monsieurciiatif FulRowcide Mr Mr Mr Mr Mr Mrcrement Mr Mr Mr Porto MrMr chant Mr Mr Mrifecycle Mr Mr Mr Mr Mr Mr给 MrMr Mr Mr Mr Mr FlMr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mratif Mr Mr Mr Mr Mr Mr Mr Mr给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给
Expected behavior
Should not have broken output
Actual behavior
Has broken output
Additional notes
Same issue with use_paged_context_fmha enabled.