The estimated generation throughput is higher than the measured results.
| Model | Batch Size | TP | Input Length | Output Length | TRT-LLM throughput (out tok/s/GPU) | LLM-Viewer est. throughput |
|---|---|---|---|---|---|---|
| LLaMA 7B | 256 | 1 | 128 | 128 | 5,353 | 8,934.54 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,518 | 2,796.58 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 547 | 788.73 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 613 | 1,169.17 |
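To make the gap in the throughput numbers concrete, here is a minimal sketch that computes the ratio of estimated to measured throughput for each row. The values are copied from the table above; the variable and function names are only illustrative and are not part of LLM-Viewer or TensorRT-LLM.

```python
# Rows copied from the table above:
# (batch, input_len, output_len, measured tok/s/GPU, estimated throughput)
rows = [
    (256, 128, 128, 5353, 8934.54),
    (32, 128, 2048, 1518, 2796.58),
    (32, 2048, 128, 547, 788.73),
    (16, 2048, 2048, 613, 1169.17),
]


def overestimate(measured, estimated):
    """How many times higher the estimate is than the measurement."""
    return estimated / measured


ratios = [overestimate(m, e) for *_, m, e in rows]

for (batch, inp, out, _, _), r in zip(rows, ratios):
    print(f"bs={batch:<3} in={inp:<4} out={out:<4} est/measured = {r:.2f}")

# If the ratio were constant across workloads, relative comparisons would
# still hold. A spread well above 1.0 means the error depends on the workload.
spread = max(ratios) / min(ratios)
print(f"max ratio / min ratio = {spread:.2f}")
```

The ratio is not constant across the four configurations, which is why a ranking of two devices derived from the estimate may not match a ranking derived from measurements.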
For prefill, the estimated time is lower than the measured first-token latency.
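| Model | bs | tp | input | TensorRT-LLM 1st latency (ms) | LLM-Viewer est. 1st latency (sec) | LLM-Viewer est. (ms) |
|---|---|---|---|---|---|---|
| LLaMA 7B | 1 | 1 | 128 | 16.1 | 0.006977 | 6.976999894 |
| LLaMA 7B | 1 | 1 | 2048 | 120.5 | 0.10088071 | 100.88071 |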
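The prefill gap can be looked at both as an absolute difference and as a ratio. The sketch below uses the two rows above and assumes the measured first-token latency is reported in milliseconds, as the side-by-side comparison with the "est (ms)" column suggests. The names are illustrative only.

```python
# (input_len, measured 1st-token latency in ms (assumed unit), estimated ms)
prefill_rows = [
    (128, 16.1, 6.976999894),
    (2048, 120.5, 100.88071),
]

results = []
for input_len, measured_ms, est_ms in prefill_rows:
    gap_ms = measured_ms - est_ms   # time the estimate does not account for
    ratio = measured_ms / est_ms    # how many times slower reality is
    results.append((input_len, gap_ms, ratio))
    print(f"input={input_len:<5} gap={gap_ms:.2f} ms  measured/est={ratio:.2f}")
```

In these two rows the relative error is much larger for the short input than for the long one. That pattern would be consistent with per-request overheads the estimate does not model, but two data points are not enough to confirm it.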
This error makes it impossible to use LLM-Viewer to predict how two hardware devices compare on the same task.
I feel that estimating precise computation time with an operator-level roofline model is very unreliable, and I would like to hear your opinion, @hahnyuan.
You make a very good point. The time estimated by the roofline model is the fastest the hardware could possibly go. Our goal was to help people understand the most important factors that affect how fast large language models (LLMs) run, so I think comparing how different factors affect the time is still useful.
But we have to remember that the predicted numbers will always be faster than what really happens: they are best-case times, not measured ones. Maybe we should add a note reminding users that the report only shows the upper bound under ideal conditions.
The most important thing is that this tool helps us learn what makes LLMs run fast or slow. Even if the absolute time is off, seeing how the different parts interact still gives a good picture of the bottlenecks.
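For readers unfamiliar with why a roofline estimate is a best case, here is a minimal sketch of the idea. It is not LLM-Viewer's actual code. The hardware numbers (312 TFLOPS dense FP16 and roughly 2 TB/s of HBM bandwidth for an A100) and the model numbers (7e9 parameters, 2 bytes each, weights-only memory traffic) are assumptions chosen for illustration.

```python
# Assumed, illustrative hardware and model parameters (not from LLM-Viewer).
PEAK_FLOPS = 312e12      # A100 dense FP16 Tensor Core peak, FLOP/s
PEAK_BANDWIDTH = 2.0e12  # approximate HBM bandwidth, bytes/s
N_PARAMS = 7e9           # rough parameter count for a 7B model
BYTES_PER_PARAM = 2      # FP16 weights


def roofline_time(flops, bytes_moved):
    """Best-case time: the slower of perfect compute and perfect memory access."""
    compute_time = flops / PEAK_FLOPS
    memory_time = bytes_moved / PEAK_BANDWIDTH
    bound = "compute" if compute_time >= memory_time else "memory"
    return max(compute_time, memory_time), bound


def linear_layers_time(batch, tokens):
    """Roofline time for the weight matmuls only (attention and KV cache ignored)."""
    flops = 2 * N_PARAMS * batch * tokens       # one multiply-add per weight per token
    bytes_moved = N_PARAMS * BYTES_PER_PARAM    # each weight read once
    return roofline_time(flops, bytes_moved)


decode_time, decode_bound = linear_layers_time(batch=1, tokens=1)
prefill_time, prefill_bound = linear_layers_time(batch=1, tokens=2048)

print(f"decode step : {decode_time * 1e3:.2f} ms ({decode_bound}-bound)")
print(f"prefill 2048: {prefill_time * 1e3:.2f} ms ({prefill_bound}-bound)")
```

Because both terms assume 100% utilization of the peak rates, the result is a lower bound on time. Kernel launch overhead, imperfect tiling, and communication only add to it.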
I understand that software often fails to fully utilize HBM bandwidth or max out Tensor Cores.
I'm attempting to use this project to compare the performance of an LLM inference task on two types of hardware. However, I often obtain results from LLM-Viewer that contradict the actual measurements. Please forgive me for not providing precise results, as some hardware information is confidential. So, I am currently using a very naive roofline model in my project. https://github.com/feifeibear/LLMRoofline
Therefore, I'm particularly curious about the purpose of this project:
1. What is the most significant benefit of analyzing every operator's arithmetic intensity (AI)? I understand that it could help quantify the effects of some optimizations. Is there any other, more practical use for this project?
2. Could we establish a more precise performance model, perhaps using sampling and fitting methods, to predict the cost of different tasks?
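As one possible reading of the "sampling and fitting" idea in the second question, the sketch below fits a single efficiency factor that scales the roofline estimate toward measured throughput, using least squares through the origin. The data are the four LLaMA 7B rows reported earlier in this issue. The function name and the choice of a one-parameter model are assumptions for illustration, and four samples are far too few for a real calibration.

```python
def fit_efficiency(estimated, measured):
    """Scale k minimizing sum((k * e - m)^2) over all samples (closed form)."""
    numerator = sum(e * m for e, m in zip(estimated, measured))
    denominator = sum(e * e for e in estimated)
    return numerator / denominator


# Throughput samples from this issue: roofline estimate vs. TensorRT-LLM result.
estimated = [8934.54, 2796.58, 788.73, 1169.17]
measured = [5353, 1518, 547, 613]

k = fit_efficiency(estimated, measured)
calibrated = [k * e for e in estimated]
rel_errors = [abs(c - m) / m for c, m in zip(calibrated, measured)]

print(f"fitted efficiency factor k = {k:.3f}")
for c, m, err in zip(calibrated, measured, rel_errors):
    print(f"calibrated={c:8.1f}  measured={m:6d}  rel. error={err:.1%}")
```

A single factor cannot reproduce every row equally well, which suggests that a useful fitted model would need more features (for example batch size and sequence lengths) and many more sampled measurements per device.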
I compared the LLM-Viewer results with the TensorRT-LLM A100 performance numbers published by NVIDIA.