How can I get throughput for a generative model #2
I have updated a
Could you tell me how to use it in my code? What are the differences from PR #1?
PR #1 has been merged. The result may differ slightly when generating long sequences, because I used an approximation in the web mode to reduce the analysis cost and the web response time. When the sequence length is small, however, the results are identical. Rest assured, I have tested the difference between them and it is less than 1%.
Thanks. Could you provide an API in the codebase so it can be used from Python?
Using your latest webview, the latency for bs=64, in=512, out=512 is 8.2s. However, if I use the analyze_generate_task() API, the latency is over 11s. My code is here: #3
I have fixed the bug; the inconsistency between the webview and the command line came from the use_flash_attn flag.
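For reference, a minimal sketch of how the command-line API might be invoked so that it matches the webview configuration above. Only analyze_generate_task and use_flash_attn appear in this thread; the import path and the other parameter names (prompt_len, gen_len, batchsize) are assumptions, not the repo's confirmed signature:

```python
# Hypothetical sketch: run the analyzer with settings matching the webview
# example above (bs=64, in=512, out=512, FlashAttention enabled).
from llm_viewer import analyze_generate_task  # import path is an assumption

result = analyze_generate_task(
    prompt_len=512,       # in=512 (parameter name is an assumption)
    gen_len=512,          # out=512 (parameter name is an assumption)
    batchsize=64,         # bs=64 (parameter name is an assumption)
    use_flash_attn=True,  # align with the webview, per the fix above
)
print(result)
```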
I would like to get the throughput measured as (generated tokens) / (overall latency = prefill + decode elapsed time). Could you please provide an example of this? The function analyze() does not have a prompt_len parameter.
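A minimal sketch of that throughput computation, under the assumption that the analysis result exposes separate prefill and decode latencies; the result keys shown here are hypothetical, not the repo's confirmed API:

```python
# Hypothetical sketch: throughput = generated tokens / (prefill + decode latency).
# Assumes the analysis result exposes prefill and decode latencies in seconds;
# the key names below are illustrative, not the repo's confirmed API.
from llm_viewer import analyze_generate_task  # import path is an assumption

batchsize, gen_len = 64, 512
result = analyze_generate_task(
    prompt_len=512, gen_len=gen_len, batchsize=batchsize, use_flash_attn=True
)

overall_latency = result["prefill_latency"] + result["decode_latency"]  # seconds
throughput = batchsize * gen_len / overall_latency  # generated tokens per second
print(f"throughput: {throughput:.1f} tokens/s")
```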