How can I get throughput for a generative model #2

Open
feifeibear opened this issue Mar 7, 2024 · 6 comments

Comments

@feifeibear
Contributor

I would like to get the throughput measured as (generated tokens) / (overall latency = prefill + decode elapsed time).
Could you please provide an example of this?

The function analyze() does not have a prompt_len parameter.
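
For reference, a minimal sketch of the throughput definition above, assuming the prefill and decode latencies are available separately. The function and argument names here are hypothetical placeholders, not LLM-Viewer's actual API:

```python
# Minimal sketch: throughput = generated tokens / (prefill + decode latency).
# All names and numbers are placeholders for illustration only.

def throughput_tokens_per_sec(batch_size: int,
                              gen_len: int,
                              prefill_latency_s: float,
                              decode_latency_s: float) -> float:
    """Generated tokens divided by the overall (prefill + decode) latency."""
    total_latency_s = prefill_latency_s + decode_latency_s
    return batch_size * gen_len / total_latency_s

# Example with made-up latencies:
print(throughput_tokens_per_sec(batch_size=64, gen_len=512,
                                prefill_latency_s=1.5, decode_latency_s=10.0))
```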

@hahnyuan
Owner

hahnyuan commented Mar 8, 2024

I have added a chat stage for your requirement. It is available on http://llm-viewer.com/ in versions above 0.3.5.

@feifeibear
Contributor Author

Could you tell me how to use it in my code? What are the differences from PR #1?

@hahnyuan
Owner

PR #1 has been merged. The result may be slightly different when generating long sequences, as I have used an approximation in the web mode to reduce the analysis cost and the web response time. However, when the sequence length is small, the result remains the same. Rest assured, I have tested the difference between them and it is less than 1%.
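
For illustration, one generic way such an approximation could work is to evaluate the per-token decode cost at only a few context lengths and integrate the curve, instead of analyzing every decode step. This is just a sketch of the general idea, not necessarily the approximation LLM-Viewer's web mode actually uses:

```python
import numpy as np

def approx_decode_latency(per_token_latency_fn, prompt_len, gen_len, samples=8):
    """Approximate the summed per-token decode latency by sampling a few
    KV-cache lengths and integrating the latency curve (trapezoidal rule)."""
    xs = np.linspace(prompt_len, prompt_len + gen_len - 1, samples)
    ys = np.array([per_token_latency_fn(int(x)) for x in xs])
    return float(np.trapz(ys, xs))
```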

@feifeibear
Contributor Author

Thanks. Could you provide an API in the codebase so it can be used from Python?

@feifeibear
Contributor Author

feifeibear commented Mar 11, 2024

Using your latest web view, we can see that the latency for bs=64, in=512, out=512 is 8.2 s.
[screenshot of the web view result]

However, if I use the analyze_generate_task() API, the latency is over 11 s:
nvidia_A100_80G: 1st token latency 1.4909548877801777, total latency 11.478970542701218, throughput 2854.6113850632005 Token/sec
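
For what it's worth, the reported throughput is roughly consistent with (batch_size × out) / total latency; the exact formula used internally is an assumption here:

```python
# Sanity check of the log line above (assumed relationship, not taken from the codebase).
batch_size, out_len = 64, 512
total_latency_s = 11.478970542701218            # from the log above
print(batch_size * out_len / total_latency_s)   # ≈ 2854.8 Token/sec, close to the reported value
```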

My code is here #3

@feifeibear
Contributor Author

I have fixed the bug; the inconsistency between the web view and the command line came from the use_flash_attn flag.
