Hi,
We recently hit a deadlock in our system when we reached the rate limit for Llama 3 70B on Groq. It was a bit of a mess: the program kept retrying, which only made things worse by consuming more tokens and extending the wait until the rate limit reset.
As a quick fix, I bundled Llama 3, Llama 3.1, and Gemma 2 into a wrapper. The idea was that if we hit a rate limit error with one model, the system would temporarily switch to another. This works because, at least with Groq, the rate limits are per model.
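For concreteness, here is a minimal sketch of that quick fix, assuming the Groq Python SDK's OpenAI-style client and its RateLimitError exception; the model IDs are only examples and would need to match whatever IDs your account actually exposes:

```python
from groq import Groq, RateLimitError

client = Groq()  # reads GROQ_API_KEY from the environment

# Example model IDs; substitute the ones available on your account.
MODELS = ["llama3-70b-8192", "llama-3.1-70b-versatile", "gemma2-9b-it"]

def chat_with_fallback(messages):
    """Try each model in order, falling through when one is rate limited."""
    last_error = None
    for model in MODELS:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError as err:
            # Groq tracks rate limits per model, so another model in the
            # list may still have headroom.
            last_error = err
    raise last_error  # every model in the list is currently rate limited
```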
Now, I made sure beforehand that switching between these models wouldn’t significantly affect our results. But I’m wondering if we could discuss a more formal way to implement this trick.
Think of it like a connection pool for databases, except it would be a ‘rate limit pool’ for LLMs. How could we design this to be more robust and reusable?
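One possible shape for such a pool, as a minimal sketch: the class below is provider-agnostic (the `invoke` and `is_rate_limit` callables and the cooldown value are hypothetical, injected by the caller), keeps a per-model cooldown, and hands each request to the first model that is not currently benched.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class RateLimitPool:
    models: Sequence[str]                       # interchangeable model IDs
    invoke: Callable[[str, list], object]       # invoke(model, messages) -> response
    is_rate_limit: Callable[[Exception], bool]  # classify provider errors
    cooldown_s: float = 60.0                    # how long to bench a limited model
    _benched_until: dict = field(default_factory=dict, init=False)

    def _available(self):
        """Models whose cooldown has expired (or that were never benched)."""
        now = time.monotonic()
        return [m for m in self.models if self._benched_until.get(m, 0.0) <= now]

    def complete(self, messages):
        """Call the first available model; bench any model that hits its limit."""
        candidates = self._available() or list(self.models)  # last resort: try them all
        last_error = None
        for model in candidates:
            try:
                return self.invoke(model, messages)
            except Exception as err:
                if not self.is_rate_limit(err):
                    raise
                self._benched_until[model] = time.monotonic() + self.cooldown_s
                last_error = err
        raise last_error  # all models are currently rate limited
```

Wiring the Groq client from the earlier snippet into `invoke` and a check for its rate-limit exception into `is_rate_limit` would reproduce the quick fix, while letting the same pool be reused for any provider whose limits are tracked per model.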
Replies: 1 comment

We already support this, load balancing: https://docs.litellm.ai/docs/routing
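For reference, a rough sketch of what that looks like with LiteLLM's Router, based on the linked routing docs; the deployment names and Groq model IDs below are illustrative, and the exact parameters should be checked against the current documentation:

```python
from litellm import Router

# Two Groq deployments grouped under separate names; GROQ_API_KEY is
# read from the environment.
model_list = [
    {"model_name": "llama-groq", "litellm_params": {"model": "groq/llama3-70b-8192"}},
    {"model_name": "gemma-groq", "litellm_params": {"model": "groq/gemma2-9b-it"}},
]

router = Router(
    model_list=model_list,
    fallbacks=[{"llama-groq": ["gemma-groq"]}],  # if llama-groq fails, retry on gemma-groq
)

response = router.completion(
    model="llama-groq",
    messages=[{"role": "user", "content": "Hello"}],
)
```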