[Bug]: Deferred cache key calculation leads to caching failure when modifying inputs #7316
Comments
The fact you don't have a cache hit makes sense. This isn't a bug; your input is different. Perhaps try recreating this ticket as a feature request for what you'd want to happen here, as this is not a bug.
Sorry, I wasn't clear. Between runs of the script, the first request is not cached, and should be. (It sounds like you interpreted me as saying there should be a cache hit on the second request, within one run of that script; I agree that shouldn't happen.) If I run a script like that one twice in a row, with an identical starting `messages` list, the first request of the second run should hit the cache entry written by the first run, but it doesn't.
Minimal reproducing script:

```python
import litellm

litellm._turn_on_debug()
litellm.enable_cache(type="disk")

import sys


async def go():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")
    print(response.choices[0].message.content, file=sys.stderr)
    print("\n\n*** AFTER FIRST CALL, BEFORE MODIFYING MESSAGES ***\n\n", file=sys.stderr)
    messages.append(response.choices[0].message)
    messages.append({"role": "user", "content": "Say a different random thing."})
    print("\n\n*** BEFORE SECOND CALL ***\n\n", file=sys.stderr)
    await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")


import asyncio

asyncio.run(go())
```

Logs from two successive runs: (I'm not sure if this happens when using the synchronous `litellm.completion` as well.)
@npt do you see this error if you just wait 2s before calling? The reason is that we write to cache in a background task (we don't want to slow down the actual request/response). In our testing we just add a slight delay before running the 2nd call to allow the cache write to happen.
Can confirm this test passes after adding a 1s sleep:

```python
import litellm

litellm._turn_on_debug()
litellm.enable_cache(type="disk")

import sys


async def go():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(
        messages=messages, model="anthropic/claude-3-5-haiku-20241022"
    )
    print(response.choices[0].message.content, file=sys.stderr)
    print(
        "\n\n*** AFTER FIRST CALL, BEFORE MODIFYING MESSAGES ***\n\n",
        file=sys.stderr,
    )
    await asyncio.sleep(1)
    response_2 = await litellm.acompletion(
        messages=messages, model="anthropic/claude-3-5-haiku-20241022"
    )
    print(response_2.choices[0].message.content, file=sys.stderr)
    assert response.id == response_2.id  # use id to confirm cache hit


import asyncio

asyncio.run(go())
```
I don't have the option to message you on LinkedIn. I do think you're missing something. When I add a sleep before modifying `messages`, the first request of a second run does hit the cache; without the sleep, it doesn't, because the hashed cache key for the deferred write is computed from the already-modified `messages` list. This can be fixed, while still writing to cache in a background task, by calculating the hashed cache key before returning from `acompletion`.
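To make that suggestion concrete, here is a rough sketch of the idea with a toy in-memory cache, not LiteLLM's actual internals (`completion_with_deferred_write`, `cache_key`, and `CACHE` are made up for illustration): the key is hashed from the inputs before the coroutine returns, and the deferred background write reuses that precomputed key.

```python
import asyncio
import hashlib
import json

# Hypothetical in-memory cache standing in for a real cache backend.
CACHE: dict[str, dict] = {}


def cache_key(messages: list) -> str:
    # Stand-in for the hashed cache key: a hash of the serialized inputs.
    return hashlib.sha256(json.dumps(messages).encode()).hexdigest()


async def completion_with_deferred_write(messages: list) -> dict:
    key = cache_key(messages)  # key computed eagerly, before returning
    if key in CACHE:
        return CACHE[key]

    response = {"content": "pretend LLM response"}

    async def write_to_cache() -> None:
        # The write is still deferred to a background task, but it uses the
        # precomputed key, so later caller-side mutations of `messages`
        # cannot change where the response is stored.
        CACHE[key] = response

    asyncio.get_running_loop().create_task(write_to_cache())
    return response


async def main() -> None:
    messages = [{"role": "user", "content": "Say one random thing."}]
    first = await completion_with_deferred_write(messages)
    messages.append({"role": "assistant", "content": first["content"]})  # mutate right away
    await asyncio.sleep(0)  # let the background write run

    # A fresh, identical starting list hits the cache despite the mutation above.
    second = await completion_with_deferred_write(
        [{"role": "user", "content": "Say one random thing."}]
    )
    assert second is first


asyncio.run(main())
```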
@npt I'm confused by this point: what does 'hit cache' here mean, since you're also in agreement this wouldn't result in a cache hit? Log 2 tells me this did check the cache.
The second run would not result in a cache hit, so yes, I agree we can reduce the cache calculation, but I'm not sure I see the 'bug' being hit here. Perhaps a test which raises an exception based on your expected behaviour would help? (The one shared previously just does prints, so I'm not sure what the 'bug' is.)
(Since the above was a bit peeved: I'm overall very happy with LiteLLM. I appreciate that it gives me a unified API across models, structured-output support for Anthropic, and caching right out of the box, and I expect to use and appreciate more features in the future. I'm just also frustrated that I tripped over this one thing.)

Here's a test that asserts the expected behavior, and fails:

```python
import asyncio

import litellm

litellm.enable_cache()


async def run_test():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")
    # await asyncio.sleep(2)
    messages.append(response.choices[0].message)
    return response.id


async def go():
    id1 = await run_test()
    await asyncio.sleep(2)
    id2 = await run_test()
    assert id1 == id2


asyncio.run(go())
```

When I uncomment the commented sleep, it passes. The second call of `run_test()` only gets a cache hit when there is a sleep between the `acompletion` call and the mutation of `messages`.
More to the point, it works if `messages` isn't modified after the call at all:

```python
import asyncio

import litellm

litellm.enable_cache()


async def run_test():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")
    return response.id


async def go():
    id1 = await run_test()
    await asyncio.sleep(2)
    id2 = await run_test()
    assert id1 == id2


asyncio.run(go())
```

(so the modification of `messages` immediately after the `acompletion` call is what prevents the cache hit).
Thanks for this @npt - reopening the ticket, I'll use your test for repro.
What happened?
I observed that, when using code like the minimal reproducing script in the comments above (where `messages` is appended to immediately after `acompletion` returns), LiteLLM's caching doesn't work. (At least the disk cache doesn't; I haven't checked other cache types.)
In the verbose logs, I found that the response to the first request is written to a different hashed cache key than the one checked at the start. AFAICT this is because the hashed cache key for the deferred cache write is calculated after `acompletion` returns, from the same `messages` object that was passed in, and my code modifies that list as soon as `acompletion` returns.

When I pass `messages=messages.copy()` into `acompletion` instead, caching works again.

It seems like the hashed cache key should be calculated before `acompletion` returns (even though the actual write is deferred), so that modifying the inputs after `acompletion` returns can't change which key the response is written under.
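For illustration only (none of this is LiteLLM code, and `cache_key` is a made-up helper), here is a minimal sketch of the race: a key computed from the mutable `messages` list after the caller has modified it no longer matches the key that was checked at the start, whereas a key captured eagerly still does.

```python
import hashlib
import json


def cache_key(messages: list) -> str:
    # Stand-in for the hashed cache key: a hash of the serialized inputs.
    return hashlib.sha256(json.dumps(messages).encode()).hexdigest()


messages = [{"role": "user", "content": "Say one random thing."}]

key_checked = cache_key(messages)  # key used for the cache lookup at the start of the call
eager_key = cache_key(messages)    # key captured before the call returns (the behavior I'd expect)

# The caller mutates the list as soon as the call returns.
messages.append({"role": "assistant", "content": "Okay."})

# A key computed afterwards, as the deferred write appears to do,
# no longer matches the key that was checked at the start.
deferred_key = cache_key(messages)

assert eager_key == key_checked
assert deferred_key != key_checked
```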
Relevant log output
No response
Are you a ML Ops Team?
No
What LiteLLM version are you on ?
v1.55.6
Twitter / LinkedIn details
No response