Hi,
We recently hit a deadlock in our system when we reached the rate limit for Llama 3 70B on Groq. It was a bit of a mess: the program kept retrying, which only made things worse by consuming more tokens and extending the wait until the rate limit reset.
As a quick fix, I bundled Llama 3, Llama 3.1, and Gemma 2 into a wrapper. The idea was that if we hit a rate limit error with one model, the system would temporarily switch to another. This works because, at least with Groq, the rate limits are per model.
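For concreteness, here is a minimal sketch of that quick fix, assuming the Groq Python SDK's OpenAI-style client and its RateLimitError exception; the model IDs are only examples and would need to match whatever IDs your account actually exposes:

```python
from groq import Groq, RateLimitError

client = Groq()  # reads GROQ_API_KEY from the environment

# Example model IDs; substitute the ones available on your account.
MODELS = ["llama3-70b-8192", "llama-3.1-70b-versatile", "gemma2-9b-it"]

def chat_with_fallback(messages):
    """Try each model in order, falling through when one is rate limited."""
    last_error = None
    for model in MODELS:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError as err:
            # Groq tracks rate limits per model, so another model in the
            # list may still have headroom.
            last_error = err
    raise last_error  # every model in the list is currently rate limited
```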
Now, I made sure beforehand that switching between these models wouldn’t significantly affect our results. But I’m wondering if we could discuss a more formal way to implement this trick.
Think of it like a connection pool for databases, except it would be a ‘rate limit pool’ for LLMs. How could we design this to be more robust and reusable?
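One possible shape for such a pool, as a minimal sketch: the class below is provider-agnostic (the `invoke` and `is_rate_limit` callables and the cooldown value are hypothetical, injected by the caller), keeps a per-model cooldown, and hands each request to the first model that is not currently benched.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class RateLimitPool:
    models: Sequence[str]                       # interchangeable model IDs
    invoke: Callable[[str, list], object]       # invoke(model, messages) -> response
    is_rate_limit: Callable[[Exception], bool]  # classify provider errors
    cooldown_s: float = 60.0                    # how long to bench a limited model
    _benched_until: dict = field(default_factory=dict, init=False)

    def _available(self):
        """Models whose cooldown has expired (or that were never benched)."""
        now = time.monotonic()
        return [m for m in self.models if self._benched_until.get(m, 0.0) <= now]

    def complete(self, messages):
        """Call the first available model; bench any model that hits its limit."""
        candidates = self._available() or list(self.models)  # last resort: try them all
        last_error = None
        for model in candidates:
            try:
                return self.invoke(model, messages)
            except Exception as err:
                if not self.is_rate_limit(err):
                    raise
                self._benched_until[model] = time.monotonic() + self.cooldown_s
                last_error = err
        raise last_error  # all models are currently rate limited
```

Wiring the Groq client from the earlier snippet into `invoke` and a check for its rate-limit exception into `is_rate_limit` would reproduce the quick fix, while letting the same pool be reused for any provider whose limits are tracked per model.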
Replies: 1 comment

We already support this, load balancing: https://docs.litellm.ai/docs/routing
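For reference, a rough sketch of what that looks like with LiteLLM's Router, based on the linked routing docs; the deployment names and Groq model IDs below are illustrative, and the exact parameters should be checked against the current documentation:

```python
from litellm import Router

# Two Groq deployments grouped under separate names; GROQ_API_KEY is
# read from the environment.
model_list = [
    {"model_name": "llama-groq", "litellm_params": {"model": "groq/llama3-70b-8192"}},
    {"model_name": "gemma-groq", "litellm_params": {"model": "groq/gemma2-9b-it"}},
]

router = Router(
    model_list=model_list,
    fallbacks=[{"llama-groq": ["gemma-groq"]}],  # if llama-groq fails, retry on gemma-groq
)

response = router.completion(
    model="llama-groq",
    messages=[{"role": "user", "content": "Hello"}],
)
```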