
[Bug]: Deferred cache key calculation leads to caching failure when modifying inputs #7316

Open
npt opened this issue Dec 19, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@npt

npt commented Dec 19, 2024

What happened?

I observed that, when using code like:

messages = [...]
response = await acompletion(messages=messages, ...)
messages.append(response.choices[0].message)
messages.append(...)
response = await acompletion(messages=messages, ...)

LiteLLM's caching doesn't work. (At least the disk cache doesn't; I haven't checked other cache types.)

In the verbose logs, I found that the response to the first request is written under a different hashed cache key than the one checked at the start of the call. AFAICT this is because:

  • The cache key is calculated twice, when reading and when writing
  • The second calculation happens after acompletion returns...
  • ...using a shallow copy of the arguments.

When I pass messages=messages.copy() into acompletion instead, caching works again.
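
For concreteness, here is the workaround spelled out end to end (model and prompt borrowed from the repro script further down this thread; the only change from the failing version is the .copy() on the first call):

import asyncio
import litellm

litellm.enable_cache(type="disk")

async def main():
    messages = [{"role": "user", "content": "Say one random thing."}]
    # Hand acompletion its own copy of the list, so appending to `messages`
    # afterwards cannot affect the deferred cache-key calculation.
    response = await litellm.acompletion(
        messages=messages.copy(), model="anthropic/claude-3-5-haiku-20241022"
    )
    messages.append(response.choices[0].message)
    messages.append({"role": "user", "content": "Say a different random thing."})
    await litellm.acompletion(
        messages=messages, model="anthropic/claude-3-5-haiku-20241022"
    )

asyncio.run(main())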

It seems like one of the following should happen:

  • the hashed cache key should only be calculated once
  • or, if there's a reason to calculate it twice, it should be calculated before acompletion returns (even though the actual write is deferred)
  • or, if the write-key calculation has to be deferred, it should use a deep copy of the arguments, taken before acompletion returns

Relevant log output

No response

Are you a ML Ops Team?

No

What LiteLLM version are you on?

v1.55.6

Twitter / LinkedIn details

No response

@npt added the bug label Dec 19, 2024
@krrishdholakia
Contributor

messages.append(response.choices[0].message)
messages.append(...)
response = await acompletion(messages=messages, ...)

the fact you don't have a cache hit makes sense; this isn't a bug.

your input is different. perhaps try recreating this ticket as a feature request for what you'd want to happen here.

@krrishdholakia closed this as not planned Dec 19, 2024
@npt
Author

npt commented Dec 19, 2024

Sorry, I wasn't clear. Between runs of the script, the first request is not cached, and should be. (It sounds like you interpreted me as saying there should be a cache hit on the second request, within one run of that script — I agree that shouldn't happen.)

If I run a script like that one twice in a row, with an identical starting messages value:

  • Expected behavior: on the second run, the first acompletion call hits cache.
  • Observed behavior: the second run does not hit cache.
  • If I change the first call to acompletion(messages=messages.copy(), ...), then the second run does hit the cache.

Minimal reproducing script:

import litellm
litellm._turn_on_debug()
litellm.enable_cache(type="disk")

import sys
async def go():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")
    print(response.choices[0].message.content, file=sys.stderr)
    print("\n\n*** AFTER FIRST CALL, BEFORE MODIFYING MESSAGES ***\n\n", file=sys.stderr)
    messages.append(response.choices[0].message)
    messages.append({"role": "user", "content": "Say a different random thing."})
    print("\n\n*** BEFORE SECOND CALL ***\n\n", file=sys.stderr)
    await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")


import asyncio
asyncio.run(go())

Logs from two successive runs:
log.1.txt
log.2.txt

(I'm not sure if this happens when using the synchronous completion.)

@krrishdholakia
Contributor

@npt do you see this error if you just wait 2s before calling?

The reason is that we write to cache in a background task (don't want to slow down actual request/response).

In our testing we just add a slight delay before running the 2nd call to allow a cache write to happen -

await asyncio.sleep(0.5)

@krrishdholakia
Contributor

Can confirm this test passes after adding a 1s sleep

import litellm

litellm._turn_on_debug()
litellm.enable_cache(type="disk")

import sys

async def go():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(
        messages=messages, model="anthropic/claude-3-5-haiku-20241022"
    )
    print(response.choices[0].message.content, file=sys.stderr)
    print(
        "\n\n*** AFTER FIRST CALL, BEFORE MODIFYING MESSAGES ***\n\n",
        file=sys.stderr,
    )
    await asyncio.sleep(1)
    response_2 = await litellm.acompletion(
        messages=messages, model="anthropic/claude-3-5-haiku-20241022"
    )
    print(response_2.choices[0].message.content, file=sys.stderr)

    assert response.id == response_2.id  # use id to confirm cache hit

import asyncio

asyncio.run(go())

@krrishdholakia closed this as not planned Dec 20, 2024
@krrishdholakia
Contributor

@npt happy to continue this conversation over linkedin/dm's if you think i'm missing something here


@npt
Author

npt commented Dec 21, 2024

I don't have the option to message you on LinkedIn.

I do think you're missing something. When I add a sleep before modifying messages, it works as desired, but I shouldn't have to do that, and if I do have to, it should be documented. Making multiple completion calls in the same conversation by mutating the same messages list is a common pattern (the function calling example in the docs does it), and doing so shouldn't break caching, or (as it did for me) force the user to dig through the debug logs and library code to figure out why their responses aren't being cached.

This can be fixed, while still writing to cache in a background task, by calculating the hashed cache key before acompletion returns (or by reusing the hashed cache key that was already calculated during the cache lookup).
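
To illustrate the idea, here is a rough, hypothetical sketch (illustrative names only, not LiteLLM's actual internals or API): compute the key while the arguments are still untouched, and hand that precomputed key to the deferred write.

import asyncio
import hashlib
import json

_cache: dict = {}  # stand-in for a real cache backend

def make_cache_key(**kwargs) -> str:
    # Serialize the arguments as they are right now; later mutations of the
    # caller's `messages` list can no longer change this key.
    blob = json.dumps(kwargs, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()

async def cached_completion(call_model, **kwargs):
    key = make_cache_key(**kwargs)  # calculated once, before returning
    if key in _cache:
        return _cache[key]

    response = await call_model(**kwargs)

    async def write_to_cache():
        # The write is still deferred to a background task, but it reuses the
        # key (and could also take a deep copy of kwargs) captured above.
        _cache[key] = response

    asyncio.create_task(write_to_cache())
    return response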

@krrishdholakia
Contributor

@npt i'm confused by this point

Observed behavior: the second run does not hit cache.

what does 'hit cache' mean here, since you're also in agreement this wouldn't result in a cache hit?

log 2 tells me this did check cache

15:35:05 - LiteLLM:DEBUG: caching_handler.py:140 - Checking Cache

@krrishdholakia
Contributor

krrishdholakia commented Dec 21, 2024

using the hashed cache key that was already calculated when doing the cache lookup

the second run would not result in a cache hit, so yes i agree we can reduce the cache calculation, but i'm not sure i see the 'bug' being hit here

perhaps a test which raises an exception based on your expected behaviour would help? (the one shared previously just does prints, so i'm not sure what the 'bug' is)

@npt
Author

npt commented Dec 21, 2024

(Since the above was a bit peeved: I'm overall very happy with LiteLLM, appreciate that it gives me a unified API across models + structured-output support for Anthropic + caching right out of the box, and expect to use and appreciate more features in the future. I'm just also frustrated that I tripped over this one thing.)

Here's a test that asserts the expected behavior, and fails:

import asyncio
import litellm

litellm.enable_cache()

async def run_test():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")
    # await asyncio.sleep(2)
    messages.append(response.choices[0].message)
    return response.id

async def go():
    id1 = await run_test()
    await asyncio.sleep(2)
    id2 = await run_test()
    assert id1 == id2

asyncio.run(go())

When I uncomment the commented sleep, it passes. The second call of run_test should result in a cache hit, since it calls acompletion with the same arguments as before.

@npt
Author

npt commented Dec 23, 2024

More to the point, it works if run_test doesn't sleep and also doesn't append, i.e. this passes:

import asyncio
import litellm

litellm.enable_cache()

async def run_test():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")
    return response.id

async def go():
    id1 = await run_test()
    await asyncio.sleep(2)
    id2 = await run_test()
    assert id1 == id2

asyncio.run(go())

(so the modification of messages immediately after the acompletion call is what's causing the problem; it's not that the extra sleep is just giving enough time for the cache write to happen, or something)
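
Here is a litellm-free illustration of the aliasing involved (plain Python, just to show why a shallow copy of the call arguments is not enough once the caller has mutated the list):

import json

messages = [{"role": "user", "content": "Say one random thing."}]
kwargs = {"messages": messages, "model": "some-model"}

# A shallow copy of kwargs still shares the same inner `messages` list.
kwargs_shallow = kwargs.copy()
key_at_lookup_time = json.dumps(kwargs_shallow, sort_keys=True)

# The caller appends to `messages` after the call returns...
messages.append({"role": "assistant", "content": "Something random."})

# ...so a key computed later, from the shallow copy, no longer matches.
key_at_write_time = json.dumps(kwargs_shallow, sort_keys=True)
assert key_at_lookup_time != key_at_write_time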

@krrishdholakia
Contributor

thanks for this @npt - reopening the ticket, i'll use your test for repro
