How can I get throughput for a generative model #2

Open
feifeibear opened this issue Mar 7, 2024 · 6 comments

Comments

@feifeibear
Contributor

I would like to get the throughput measured as (generated tokens) / (overall latency = prefill + decode elapsed time).
Could you please provide an example of this?

The function analyze() does not have a prompt_len parameter.
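
For reference, a minimal sketch of the throughput definition above, assuming the prefill and decode latencies are available separately. The function and argument names here are hypothetical placeholders, not LLM-Viewer's actual API:

```python
# Minimal sketch: throughput = generated tokens / (prefill + decode latency).
# All names and numbers are placeholders for illustration only.

def throughput_tokens_per_sec(batch_size: int,
                              gen_len: int,
                              prefill_latency_s: float,
                              decode_latency_s: float) -> float:
    """Generated tokens divided by the overall (prefill + decode) latency."""
    total_latency_s = prefill_latency_s + decode_latency_s
    return batch_size * gen_len / total_latency_s

# Example with made-up latencies:
print(throughput_tokens_per_sec(batch_size=64, gen_len=512,
                                prefill_latency_s=1.5, decode_latency_s=10.0))
```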

@hahnyuan
Owner

hahnyuan commented Mar 8, 2024

I have added a chat stage for your requirement. It is available on http://llm-viewer.com/ in versions above 0.3.5.

@feifeibear
Contributor Author

Could you tell me how to use it in my code? What are the differences from PR #1?

@hahnyuan
Owner

PR #1 has been merged. The result may be slightly different when generating long sequences, as I have used an approximation in the web mode to reduce the analysis cost and the web response time. However, when the sequence length is small, the result remains the same. Rest assured, I have tested the difference between them and it is less than 1%.
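
For illustration, one generic way such an approximation could work is to evaluate the per-token decode cost at only a few context lengths and integrate the curve, instead of analyzing every decode step. This is just a sketch of the general idea, not necessarily the approximation LLM-Viewer's web mode actually uses:

```python
import numpy as np

def approx_decode_latency(per_token_latency_fn, prompt_len, gen_len, samples=8):
    """Approximate the summed per-token decode latency by sampling a few
    KV-cache lengths and integrating the latency curve (trapezoidal rule)."""
    xs = np.linspace(prompt_len, prompt_len + gen_len - 1, samples)
    ys = np.array([per_token_latency_fn(int(x)) for x in xs])
    return float(np.trapz(ys, xs))
```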

@feifeibear
Contributor Author

Thanks. Could you provide an API in the codebase so it can be used from Python?

@feifeibear
Contributor Author

feifeibear commented Mar 11, 2024

Using your latest web view, we can see that the latency for bs=64, in=512, out=512 is 8.2 s.
[screenshot of the web view result]

However, if I use the analyze_generate_task() API, the latency is over 11 s:
nvidia_A100_80G: 1st token latency 1.4909548877801777, total latency 11.478970542701218, throughput 2854.6113850632005 Token/sec
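
For what it's worth, the reported throughput is roughly consistent with (batch_size × out) / total latency; the exact formula used internally is an assumption here:

```python
# Sanity check of the log line above (assumed relationship, not taken from the codebase).
batch_size, out_len = 64, 512
total_latency_s = 11.478970542701218            # from the log above
print(batch_size * out_len / total_latency_s)   # ≈ 2854.8 Token/sec, close to the reported value
```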

My code is here #3

@feifeibear
Contributor Author

I have fixed the bug; the inconsistency between the web view and the command line came from the use_flash_attn flag.
