
Throughput Measurements #2648

Open
Alireza3242 opened this issue Jan 2, 2025 · 3 comments

@Alireza3242

I saw the following page:

https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md

My hardware is an A100 GPU. I benchmarked the Llama 3 8B model and reached about 2,000 tokens per second with an input of 35 tokens and an output of 250 tokens. However, that page reports 6,552.62 tokens per second for an input of 128 tokens and an output of 128 tokens.

If possible, please share the commands used to convert and build the model for this case. I achieved 2,000 tokens per second with float16, which is a large gap compared to the numbers on that page.
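
For reference, here is roughly how I measure tokens per second. This is a minimal sketch using TensorRT-LLM's high-level LLM API; the model name, prompt, batch size, and output length are placeholders, argument names may differ slightly between releases, and the official perf-overview numbers come from NVIDIA's own benchmarking harness, not a script like this:

```python
import time

from tensorrt_llm import LLM, SamplingParams

# Build (or load a cached) FP16 engine from the Hugging Face checkpoint.
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

batch_size = 64                          # placeholder; larger batches raise tokens/s
prompts = ["The capital of France is"] * batch_size
params = SamplingParams(max_tokens=250)  # matches the 250-token outputs above

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

# Aggregate generated-token throughput across the whole batch.
generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s")
```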

@Alireza3242 (Author)

I increased the batch size and reached 5,500 tokens per second.
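
That scaling is roughly what back-of-the-envelope arithmetic predicts: aggregate throughput is total generated tokens divided by wall-clock time, and the per-decode-step latency on an A100 stays nearly flat as the batch grows, until the GPU becomes compute-bound. A sketch with assumed numbers:

```python
# Illustrative only: both numbers below are assumptions, not measurements.
output_tokens = 250       # tokens generated per request
step_latency_s = 0.012    # assumed per-decode-step latency, ~flat across batches

for batch in (1, 8, 64, 256):
    wall_clock = output_tokens * step_latency_s       # one decode step per token
    throughput = batch * output_tokens / wall_clock   # aggregate tokens/s
    print(f"batch={batch:>3}: ~{throughput:,.0f} tokens/s")
```

In practice the per-step latency rises once the GPU saturates, so throughput flattens out at high batch sizes rather than scaling linearly forever.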

@nv-guomingz (Collaborator)
What's your A100's memory size, is it the PCI-E or SXM version, and which TRT-LLM version are you using?

@Alireza3242 (Author) commented Jan 7, 2025

A100 80GB SXM, TensorRT-LLM 0.15.
