I saw the following page:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md
My hardware is an A100 GPU. I benchmarked the Llama 3 8B model and reached a speed of about 2000 tokens per second with an input length of 35 tokens and an output length of 250 tokens. However, the page above reports 6552.62 tokens per second for an input length of 128 tokens and an output length of 128 tokens.
If possible, could you share the commands used to convert and build the model for that benchmark? I achieved 2000 tokens per second with float16, which is a significant gap compared to the numbers on that page.
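For reference, a quick calculation of the gap between the two reported numbers (both figures are taken directly from the text above; note the request shapes differ, 35/250 vs. 128/128, so the comparison is only indicative):

```python
# Indicative comparison of the two throughput figures from this issue.
# The request shapes differ, so this is not an apples-to-apples comparison.
my_tps = 2000.0         # measured on A100, float16, input 35 / output 250
reported_tps = 6552.62  # from perf-overview.md, input 128 / output 128

ratio = reported_tps / my_tps
print(f"Reported throughput is about {ratio:.2f}x higher")
```

This prints a ratio of roughly 3.3x, which could plausibly come from differences in batch size, concurrency, or build options rather than the dtype alone.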