🎅 I WISH LITELLM HAD... #361

krrishdholakia · 2023-09-13T19:40:55Z

This is a ticket to track a wishlist of items you wish LiteLLM had.

COMMENT BELOW 👇

With your request 🔥 - if we have any questions, we'll follow up in comments / via DMs

Respond with ❤️ to any request you would also like to see

P.S.: Come say hi 👋 on the Discord

krrishdholakia · 2023-09-13T19:44:04Z

[LiteLLM Client] Add new models via UI

Thinking aloud it seems intuitive that you'd be able to add new models / remap completion calls to different models via UI. Unsure on real problem though.

krrishdholakia · 2023-09-13T19:46:30Z

User / API Access Management

Different users have access to different models. It'd be helpful if there was a way to maybe leverage the BudgetManager to gate access. E.g. GPT-4 is expensive, i don't want to expose that to my free users but i do want my paid users to be able to use it.
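One way to picture the gating described above is a plain allowlist keyed by user tier. This is only a sketch: the tier names, the `ALLOWED_MODELS` table, and both helper functions are hypothetical and do not use BudgetManager or any other LiteLLM API.

```python
# Hypothetical tier-to-model allowlist; tier names and contents are illustrative only.
ALLOWED_MODELS = {
    "free": {"gpt-3.5-turbo"},
    "paid": {"gpt-3.5-turbo", "gpt-4"},
}


def can_use_model(user_tier: str, model: str) -> bool:
    """Return True if the given tier is allowed to call the given model."""
    return model in ALLOWED_MODELS.get(user_tier, set())


def resolve_model(user_tier: str, requested: str, fallback: str = "gpt-3.5-turbo") -> str:
    """Return the requested model if permitted, otherwise a cheaper fallback."""
    if can_use_model(user_tier, requested):
        return requested
    if can_use_model(user_tier, fallback):
        return fallback
    raise PermissionError(f"tier {user_tier!r} has no access to {requested!r}")
```

The resolved name would then be passed as the `model` argument of the completion call, so free users silently get the cheaper model while paid users keep GPT-4.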

krrishdholakia · 2023-09-13T19:48:57Z

cc: @yujonglee @WilliamEspegren @zakhar-kogan @ishaan-jaff @PhucTranThanh feel free to add any requests / ideas here.

ishaan-jaff · 2023-09-13T19:49:49Z

[Spend Dashboard] View analytics for spend per llm and per user

This allows me to see what my most expensive llms are and what users are using litellm heavily

ishaan-jaff · 2023-09-13T19:51:34Z

Auto select the best LLM for a given task

If it's a simple task, like responding to "hello", litellm should auto-select a cheaper but faster LLM like j2-light.
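A very rough illustration of the idea, assuming "simple" can be approximated by prompt length. The `pick_model` helper, the word threshold, and the choice of `gpt-4` as the stronger model are assumptions for this sketch, not existing LiteLLM behavior.

```python
def pick_model(
    prompt: str,
    cheap: str = "j2-light",
    strong: str = "gpt-4",
    max_simple_words: int = 8,
) -> str:
    """Route short, single-line prompts to the cheap model and everything else to the strong one."""
    words = prompt.split()
    is_simple = len(words) <= max_simple_words and "\n" not in prompt
    return cheap if is_simple else strong
```

A real selector would likely need a better signal than word count (for example a small classifier), but the call site would look the same: pick a model name first, then pass it to the completion call.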

Pipboyguy · 2023-09-13T21:43:33Z

Integration with NLP Cloud

krrishdholakia · 2023-09-13T22:04:01Z

That's awesome @Pipboyguy - dm'ing on linkedin to learn more!

krrishdholakia · 2023-09-14T17:56:09Z

@ishaan-jaff check out this truncate param in the cohere api

This looks super interesting. Similar to your token trimmer: if the prompt exceeds the context window, trim it in a particular manner.

I would maybe only run trimming on user/assistant messages and not touch the system prompt (this works for RAG scenarios as well).
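The trimming rule suggested above (drop user/assistant turns, keep the system prompt) could look roughly like this. It is a sketch under stated assumptions: `rough_token_count` is a crude word-count stand-in for a real tokenizer, and `trim_messages` here is a hypothetical helper written for this example, not the library's implementation.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_messages(
    messages: List[Message],
    max_tokens: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[Message]:
    """Drop the oldest user/assistant messages until the total fits; never drop system messages."""
    kept = list(messages)
    total = sum(count(m["content"]) for m in kept)
    i = 0
    while total > max_tokens and i < len(kept):
        if kept[i]["role"] == "system":
            i += 1  # the system prompt is left untouched
            continue
        total -= count(kept[i]["content"])
        del kept[i]  # the next message shifts into position i
    return kept
```

Dropping the oldest turns first keeps the most recent context, which is usually what a chat or RAG flow cares about. If the system prompt alone exceeds the budget, this sketch returns it anyway rather than truncating it.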

haseeb-heaven · 2023-09-17T00:00:25Z

Option to use Inference API so we can use any model from Hugging Face 🤗

krrishdholakia · 2023-09-17T00:20:03Z

@haseeb-heaven you can already do this -

litellm/litellm/llms/huggingface_restapi.py

Line 53 in a63784d

completion_url = f"https://api-inference.huggingface.co/models/{model}"

from litellm import completion 
response = completion(model="huggingface/gpt2", messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)

haseeb-heaven · 2023-09-17T00:30:12Z

> @haseeb-heaven you can already do this -
>
> litellm/litellm/llms/huggingface_restapi.py
>
> Line 53 in a63784d
>
> completion_url = f"https://api-inference.huggingface.co/models/{model}"
>
> from litellm import completion
> response = completion(model="huggingface/gpt2", messages=[{"role": "user", "content": "Hey, how's it going?"}])
> print(response)

Wow, great, thanks! It's working. Nice feature.

smig23 · 2023-09-18T02:39:52Z

Support for inferencing using models hosted on Petals swarms (https://github.com/bigscience-workshop/petals), both public and private.

ishaan-jaff · 2023-09-18T16:11:27Z

@smig23 what are you trying to use petals for ? We found it to be quite unstable and it would not consistently pass our tests

shauryr · 2023-09-18T17:28:54Z

finetuning wrapper for openai, huggingface etc.

krrishdholakia · 2023-09-18T18:37:02Z

@shauryr i created an issue to track this - feel free to add any missing details here

smig23 · 2023-09-18T18:57:48Z

> @smig23 what are you trying to use petals for ? We found it to be quite unstable and it would not consistently pass our tests

Specifically for my aims, I'm running a private swarm as an experiment, with a view to implementing it within a private organization that has idle GPU resources, but distributed ones. The initial target would be inferencing, and if litellm were able to be the abstraction layer, it would allow the flexibility to go in another direction with hosting in the future.

ranjancse26 · 2023-09-19T05:02:17Z

I wish LiteLLM had direct support for fine-tuning models. Based on the blog post below, I understand that in order to fine-tune, one needs specific knowledge of the LLM provider and then has to follow their instructions or library for fine-tuning the model. Why not have LiteLLM do all the abstraction and handle the fine-tuning aspects as well?

https://docs.litellm.ai/docs/tutorials/finetuned_chat_gpt
https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

ranjancse26 · 2023-09-19T07:31:45Z

I wish LiteLLM had support for open-source embeddings like sentence-transformers, hkunlp/instructor-large, etc.

Sorry, based on the documentation below, it seems there's only support for OpenAI embeddings.

https://docs.litellm.ai/docs/embedding/supported_embedding

ranjancse26 · 2023-09-19T09:21:00Z

I wish LiteLLM had an integration with the Cerebrium platform. Please check the link below for the prebuilt models.

https://docs.cerebrium.ai/cerebrium/prebuilt-models

ishaan-jaff · 2023-09-19T16:19:28Z

@ranjancse26 what models on cerebrium do you want to use with LiteLLM ?

ranjancse26 · 2023-09-19T16:30:20Z

@ishaan-jaff Cerebrium has a lot of pre-built models. The focus should be on consuming the open-source models first, e.g. Llama 2, GPT4All, Falcon, FlanT5, etc. I am mentioning this as a first step. However, it's a good idea to have LiteLLM take care of the internal communication with the custom-built models too, in turn based on the API which Cerebrium exposes.

ishaan-jaff · 2023-09-19T18:44:22Z

@smig23 We've added support for petals to LiteLLM https://docs.litellm.ai/docs/providers/petals

ranjancse26 · 2023-09-21T00:25:23Z

I wish LiteLLM had built-in support for the majority of provider operations rather than targeting text generation alone. Consider Cohere as an example: the endpoint below allows users to have conversations with a Large Language Model (LLM) from Cohere.

https://docs.cohere.com/reference/post_chat

ranjancse26 · 2023-09-21T00:32:02Z

I wish LiteLLM had a ton of support and examples for users developing apps with the RAG pattern. It's more or less mandatory to follow standard best practices, and we would all like to have that support.

ranjancse26 · 2023-09-21T00:36:39Z

I wish LiteLLM had use-case-driven examples for beginners. Keeping day-to-day use cases in mind, it would be a good idea to come up with a great sample which covers the following aspects.

Text classification
Text summarization
Text translation
Text generation
Code generation

ranjancse26 · 2023-09-21T00:39:56Z

I wish LiteLLM supported various known or popular vector DBs. Here are a couple of them to begin with.

Pinecone
Qdrant
Weaviate
Milvus
DuckDB
Sqlite

ranjancse26 · 2023-09-21T00:49:23Z

I wish LiteLLM had built-in support for performing web scraping or getting real-time data using a known provider like serpapi. It would help users build custom AI models or integrate with LLMs to perform retrieval-augmented generation.

https://serpapi.com/blog/llms-vs-serpapi/#serpapi-google-local-results-parser
https://colab.research.google.com/drive/1Q9VvVzjZJja7_y2Ls8qBkE_NApbLiqly?usp=sharing

krrishdholakia · 2024-10-17T00:34:15Z

@yigitkonur streaming with vercel sdk works with their openai integration currently

@GildeshAbhay replied on the issue you created - sample code for how you'd want this to work would be helpful

WissamAntoun · 2024-10-24T21:33:43Z

Support for Reranker API for Huggingface's Text Embedding Inference

wesleyearlstander · 2024-11-08T04:50:56Z

I wish litellm had module federation. With the fast-approaching era of real-time AI, loading only the necessary provider packages will be crucial in keeping system latency low.

databill86 · 2024-11-08T08:29:18Z

Feature Request: Request Throttling/Queueing for Rate Limit Management

Related to @denisergashbaev's comments here and here, which perfectly describe this need.

Desired Functionality

+1 to the request for a global throttling mechanism with queuing. To expand on @denisergashbaev's description:

- When a deployment is approaching its rate limit (RPM/TPM)
- Instead of failing or routing to another deployment
- The requests should be queued and processed in order
- Each request waits until it can be safely sent without exceeding the rate limit
- Using a global state (e.g., Redis) to coordinate across multiple instances

Current Solutions vs Desired Behavior

Current: Request Prioritization

The current priority queue implementation (docs) focuses on prioritizing between requests but does not prevent rate limit errors. If there's only one deployment, requests will still fail when hitting rate limits rather than being queued.

Current: Usage-based Routing

The current routing strategy (docs) helps distribute load across multiple deployments but doesn't solve the fundamental issue of managing rate limits through queuing.

Example Use Case

from litellm import Router

router = Router(
    model_list=[{
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "openai-tier1",
            "rpm": 60  # 60 requests per minute
        }
    }]
)

Desired behavior: If 100 requests come in within a minute:

- First 60 requests process normally
- Next 40 requests are queued
- Queued requests automatically process as capacity becomes available
- No rate limit errors are thrown
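The desired behavior can be illustrated with a small scheduling sketch. It assumes a sliding-window limit and a single process; `schedule_sends` is a hypothetical helper written for this example, and a real deployment would keep the window state in a shared store such as Redis rather than in memory.

```python
from collections import deque
from typing import Deque, Iterable, List


def schedule_sends(arrivals: Iterable[float], rpm: int, window: float = 60.0) -> List[float]:
    """Return the earliest send time for each request, in arrival order,
    such that no more than `rpm` sends fall within any `window`-second span."""
    recent: Deque[float] = deque()  # send times of the last `rpm` requests
    sends: List[float] = []
    for arrival in arrivals:
        send_at = arrival
        if sends:
            send_at = max(send_at, sends[-1])  # preserve FIFO order
        if len(recent) == rpm:
            # The oldest of the last `rpm` sends must leave the window first.
            send_at = max(send_at, recent[0] + window)
            recent.popleft()
        recent.append(send_at)
        sends.append(send_at)
    return sends
```

With `rpm=60` and 100 requests all arriving at `t=0`, the first 60 are scheduled immediately and the remaining 40 are held until `t=60`, so nothing is sent that would exceed the limit. Note that this sketch releases the held requests together once capacity frees up; spreading them out would need an extra pacing step.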

Benefits

- More reliable systems - no rate limit errors
- Simpler implementation for users - no need to handle rate limit errors
- Works even with single deployment scenarios
- Prevents rate limit exhaustion: Currently, simple retries (like RateLimitErrorRetries) can actually worsen the situation by accumulating failed requests and further exhausting rate limits. Even with router_settings like retry_after or timeout, we can still have these problems. A proper queuing system would handle this gracefully, ensuring failed requests don't compound the rate limit problem.

This feature would be incredibly valuable for the community, as evidenced by multiple users requesting similar functionality. LiteLLM is already an amazing tool for LLM deployment management, and this addition would make it even more robust for production use cases.

This feature would be incredibly valuable for the community. LiteLLM is already an amazing tool, I'm still testing it in multiple scenarios, but I think this addition would make it even more robust for production use cases.

krrishdholakia · 2024-11-08T10:27:02Z

@databill86 requests which fail due to rate limit errors are kept in queue and retried until the timeout for the request is hit

databill86 · 2024-11-08T14:57:09Z

> @databill86 requests which fail due to rate limit errors are kept in queue and retried until the timeout for the request is hit

Thanks for the response! However, there's a crucial distinction to make here.

The current retry mechanism can actually worsen rate limit issues, particularly with OpenAI:

Failed requests count against limits: As per OpenAI's documentation, unsuccessful requests still contribute to your per-minute limit. Simply retrying failed requests (even from a queue) will:
- Count against your rate limits
- Potentially trigger more rate limits
- Leave you unable to process any requests for a longer period

What we need instead:
- A global state tracking current TPM/RPM usage
- Process requests from queue only when we know capacity is available
- Intelligent throttling when multiple requests arrive during wait periods
- Avoid sending all queued requests at once when capacity becomes available

The key difference is proactive vs reactive handling:

Current: Reactive - Wait for failures, then retry (which counts against limits)
Needed: Proactive - Track usage and only send requests when we know they won't exceed limits

This would provide much better resource utilization and prevent the "cascade effect" where retries compound the rate limit problem.

krrishdholakia · 2024-11-08T16:43:15Z

> Failed requests count against limits: As per OpenAI's documentation, unsuccessful requests still contribute to your per-minute limit. Simply retrying failed requests (even from a queue) will

we wait based on the retry-after header present in the rate limit error, so we don't trigger this issue. Here's the test -

litellm/tests/local_testing/test_router.py

Line 2349 in 1bef645

assert int(response_headers["retry-after"]) == cooldown_time
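For readers unfamiliar with the pattern, waiting on a `retry-after` header can be sketched as below. The `retry_delay` helper, its default, and its cap are assumptions made for this illustration; it is not the router's actual code.

```python
from typing import Mapping


def retry_delay(headers: Mapping[str, str], default: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying, taken from a numeric Retry-After header when present."""
    # Header names are case-insensitive, so compare in lowercase.
    raw = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    try:
        delay = float(raw) if raw is not None else default
    except ValueError:
        delay = default  # e.g. an HTTP-date value, which this sketch does not parse
    return max(0.0, min(delay, cap))
```

A caller would sleep for `retry_delay(response_headers)` seconds before sending the request again, instead of retrying immediately.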

> Needed: Proactive - Track usage and only send requests when we know they won't exceed limits

this already exists. use rate limit aware routing - https://docs.litellm.ai/docs/routing#advanced---routing-strategies-%EF%B8%8F

lazariv · 2024-11-14T14:23:04Z

Allow configuring API-baseurl for audio/speech endpoint.

Currently only OpenAI, Azure and Vertex are supported. It would be nice to allow configuring the api_base parameter so that self-hosted TTS engines (with an OpenAI API), such as https://github.com/matatonic/openedai-speech, can be used by setting e.g.:

- model_name: tts
  litellm_params:
    model: openai/tts-1
    api_base: https://local/tts/engine
    api_key: os.environ/OPENAI_API_KEY

lazariv · 2024-11-15T13:18:06Z

Groups of models

Provide a possibility to create groups of models (e.g. "Free tier models", "Public models", etc.), so that a specific virtual key can be given access to such groups.

Currently a virtual key can be given access only per team, which doesn't scale if many teams are present, and adding a new public model requires editing all teams.

regismesquita · 2024-11-22T15:36:31Z

I would love to be able to have the citations field included in the response body when using Perplexity. Currently, I was able to achieve this for non-streaming responses using the success hook, but I had no luck with streaming responses.

krrishdholakia · 2024-11-22T16:20:38Z

@lazariv this already exists - https://docs.litellm.ai/docs/proxy/tag_routing

derekalia · 2024-11-23T20:16:39Z

Pixtral vision support - mistralai/Pixtral-Large-Instruct-2411

jtsai-quid · 2024-11-25T03:22:26Z

Adding tokenize and detokenize to the llm utils endpoints, please 🙏

Tomato6966 · 2024-11-26T14:08:05Z

I wish litellm would support: "updating assistants" through "PATCH /assistants/:assistantId", deleting Threads through "DELETE /threads/:threadId".

Else: Very great project!

hao0608 · 2024-11-29T06:36:53Z

Support Xinference rerank model

CheshireAI · 2024-11-29T19:20:10Z

I wish there was support for local stable diffusion and/or comfyui

pazevedo-hyland · 2024-12-03T09:37:52Z

Embedding models on langchain. (Currently only Chat Interface exists)

ivanbelenky · 2024-12-03T14:44:31Z

I wish it had no dependencies apart from httpx and pydantic and that the arrows coming out of the hype train not intersect with each other

dym-ok · 2024-12-11T10:35:13Z

I wish this beautiful library supported Bedrock Inference Profiles.

We use them to attribute costs.

abourget · 2024-12-11T21:06:05Z

I wish it had an abstraction to submit traces to its different logging backends like langfuse and friends,
I wish it was a receptor of OpenTelemetry data, and would repackage and forward to its backends.
Does that exist?

brooksc · 2024-12-14T20:11:26Z

Have you thought about adding a "meta model" option where a user could specify:

- Here are all the "services" I have access to - e.g. openai, anthropic, aws, ollama, etc.
- I want a model that can do coding well, vision, classification, tool using, etc.
- I want to prioritize a model based on cost, speed, quality or multiple criteria in this order...

And litellm with everything it knows would just pick the best available model.

I see you have a json file with pricing and model capabilities.

I didn't see that anything like this exists, nor did Gemini research find anything. https://g.co/gemini/share/e704a93c8938

This would require collecting data on all the benchmarks, e.g. how well each model did on coding benchmarks vs. the others, to make a selection. You have the data on cost. I didn't check if you have tokens per second.

There is probably some memory required - e.g. validate that project X works on each of the models, due to the variations in model execution. But once you do a "benchmark" pass to validate functionality against various tests, it becomes a preferred-model selection when considering the algorithm for which one to pick.

I'm asking about this because it feels like one of the major taxes of setting up a new project or 3rd party/oss project is figuring out which model to use, optimizing for cost, etc. Sometimes I have a more powerful machine on my local network with ollama that I want to use when it's available; other times I want to use a cloud service or my local ollama.

I want that to all happen automagically... e.g. use AI to select the AI model
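The selection step described in this comment could be sketched as a filter followed by a priority-ordered sort. Everything here is hypothetical: the `ModelInfo` fields, the `select_model` function, and the catalog entries with their placeholder scores are invented for illustration and are not taken from LiteLLM's pricing/capabilities JSON.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Sequence, Set


@dataclass(frozen=True)
class ModelInfo:
    name: str
    provider: str
    capabilities: FrozenSet[str]
    cost: float     # relative cost, lower is better
    speed: float    # relative speed, higher is better
    quality: float  # relative quality, higher is better


# Placeholder catalog: names and scores are made up for this sketch.
CATALOG = [
    ModelInfo("cloud-large", "openai", frozenset({"coding", "vision", "tools"}), 10.0, 2.0, 9.0),
    ModelInfo("cloud-small", "openai", frozenset({"coding", "tools"}), 1.0, 8.0, 6.0),
    ModelInfo("local-small", "ollama", frozenset({"coding"}), 0.0, 4.0, 4.0),
]


def select_model(
    catalog: Iterable[ModelInfo],
    providers: Set[str],
    needs: Set[str],
    priorities: Sequence[str],
) -> Optional[ModelInfo]:
    """Pick the best model among accessible providers that covers every needed capability."""
    for p in priorities:
        if p not in ("cost", "speed", "quality"):
            raise ValueError(f"unknown priority: {p!r}")
    candidates = [m for m in catalog if m.provider in providers and needs <= m.capabilities]
    if not candidates:
        return None

    def sort_key(m: ModelInfo):
        # Lower is better for cost; negate the others so min() prefers higher values.
        return tuple(m.cost if p == "cost" else -getattr(m, p) for p in priorities)

    return min(candidates, key=sort_key)
```

Earlier entries in `priorities` dominate later ones, matching the "multiple criteria in this order" wording. The hard part the comment points at, collecting trustworthy benchmark and speed data, is outside this sketch.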

brooksc · 2024-12-14T20:18:23Z

Another suggestion.

I'm lazy; I don't want to read all of your docs to figure out the answer to what I want. I want to ask ChatGPT, Claude, Gemini, etc. to get the answer for me. The thing is, they aren't very good at browsing your website yet.

One suggestion is to create a serialized version of the docs in a /llms.txt, like https://llmstxt.org/, so I can just feed it this URL. Hopefully they eventually get smart enough to look for this if it exists.

For now I'll use https://uithub.com/BerriAI/litellm/tree/main/docs/my-website/docs?accept=text/html&maxTokens=50000&ext=md but this isn't well known and it may not contain what you want to prioritize in the index.

Ideally you'd also have links on your site to "Ask ChatGPT about these docs" with an input box which then opens

https://chat.openai.com/?q=https%3A%2F%2Fdocs.litellm.ai%2Fllms.txt+yourquery&model=gpt-4o

sort of like the old google site search... hopefully we don't have to do that too long.
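Building that kind of link is mostly URL encoding. The sketch below reproduces the example URL from this comment; `ask_chatgpt_url` is a hypothetical helper, and `https://docs.litellm.ai/llms.txt` is the location proposed here, not a file confirmed to exist.

```python
from urllib.parse import urlencode


def ask_chatgpt_url(
    query: str,
    docs_url: str = "https://docs.litellm.ai/llms.txt",  # proposed location, an assumption
    model: str = "gpt-4o",
) -> str:
    """Build a chat link whose prompt is the docs URL followed by the user's question."""
    params = {"q": f"{docs_url} {query}", "model": model}
    return "https://chat.openai.com/?" + urlencode(params)
```

`urlencode` percent-encodes the docs URL and turns the space into `+`, which is the shape of the link shown above.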

Something this would also enable is "I'm using litellm... analyze my code and look at llms.txt and see what other features I should consider leveraging".

d4g · 2024-12-18T09:13:19Z

I wish I could enable citations for perplexity on litellm via the config.yaml so I would get citations in open-webui.
#6662

krrishdholakia · 2024-12-18T14:31:35Z

@d4g we already return the perplexity citations. If there's a Param needed just add it under 'litellm_params'

d4g · 2024-12-18T14:41:23Z

Where and how? In the yaml?

krrishdholakia · 2024-12-18T15:21:14Z

just checked perplexity doc. no param needed, it should be returned automatically (see the 200 status code response) - https://docs.perplexity.ai/api-reference/chat-completions

For any provider-specific param, see here - https://docs.litellm.ai/docs/completion/provider_specific_params#proxy-usage

githubuser16384 · 2024-12-26T16:33:40Z

I wish there was vision support for LLM providers that provide vision support through their official documentation. Case in point- Groq. Reference: https://console.groq.com/docs/vision

krrishdholakia · 2024-12-26T21:26:01Z

@githubuser16384 litellm already supports vision on all models - https://docs.litellm.ai/docs/completion/vision

Created a ticket to add an example on groq docs for this.

krrishdholakia pinned this issue Sep 13, 2023

krrishdholakia changed the title ~~LiteLLM Wishlist~~ 🎅 I WISH LITELLM ADDED... Sep 14, 2023

krrishdholakia changed the title ~~🎅 I WISH LITELLM ADDED...~~ 🎅 I WISH LITELLM HAD... Sep 14, 2023

markoff-dev mentioned this issue Dec 12, 2024

[Feature]: Batching in LiteLLM for models that do not have native batching support. #7194

Open

krrishdholakia mentioned this issue Dec 26, 2024

I wish there was vision support for LLM providers that provide vision support through their official documentation. Case in point- Groq. Reference: https://console.groq.com/docs/vision #7433

Closed
