I saw the following page:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md
My hardware is an A100 GPU. I benchmarked the Llama 3 8B model and reached a speed of about 2000 tokens per second with an input length of 35 tokens and an output length of 250 tokens. However, the page above reports 6552.62 tokens per second for an input length of 128 tokens and an output length of 128 tokens.
If possible, could you share the commands used to convert and build the model for that benchmark? I achieved 2000 tokens per second with float16, which is a significant gap compared to the numbers on that page.
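For reference, a quick calculation of the gap between the two reported numbers (both figures are taken directly from the text above; note the request shapes differ, 35/250 vs. 128/128, so the comparison is only indicative):

```python
# Indicative comparison of the two throughput figures from this issue.
# The request shapes differ, so this is not an apples-to-apples comparison.
my_tps = 2000.0         # measured on A100, float16, input 35 / output 250
reported_tps = 6552.62  # from perf-overview.md, input 128 / output 128

ratio = reported_tps / my_tps
print(f"Reported throughput is about {ratio:.2f}x higher")
```

This prints a ratio of roughly 3.3x, which could plausibly come from differences in batch size, concurrency, or build options rather than the dtype alone.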