
[Bug]: Deferred cache key calculation leads to caching failure when modifying inputs #7316

Open
npt opened this issue Dec 19, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@npt

npt commented Dec 19, 2024

What happened?

I observed that, when using code like:

messages = [...]
response = await acompletion(messages=messages, ...)
messages.append(response.choices[0].message)
messages.append(...)
response = await acompletion(messages=messages, ...)

LiteLLM's caching doesn't work. (At least the disk cache doesn't; I haven't checked other cache types.)

In the verbose logs, I found that the response to the first request is written under a different hashed cache key than the one checked at the start of the call. AFAICT this is because:

  • The cache key is calculated twice, when reading and when writing
  • The second calculation happens after acompletion returns...
  • ...using a shallow copy of the arguments.

When I pass messages=messages.copy() into acompletion instead, caching works again.
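
For concreteness, here is the workaround spelled out end to end (model and prompt borrowed from the repro script further down this thread; the only change from the failing version is the .copy() on the first call):

import asyncio
import litellm

litellm.enable_cache(type="disk")

async def main():
    messages = [{"role": "user", "content": "Say one random thing."}]
    # Hand acompletion its own copy of the list, so appending to `messages`
    # afterwards cannot affect the deferred cache-key calculation.
    response = await litellm.acompletion(
        messages=messages.copy(), model="anthropic/claude-3-5-haiku-20241022"
    )
    messages.append(response.choices[0].message)
    messages.append({"role": "user", "content": "Say a different random thing."})
    await litellm.acompletion(
        messages=messages, model="anthropic/claude-3-5-haiku-20241022"
    )

asyncio.run(main())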

It seems like one of the following should happen:

  • the hashed cache key should only be calculated once
  • or, if there's a reason to calculate it twice, it should be calculated before acompletion returns (even though the actual write is deferred)
  • or, if the write-key calculation has to be deferred, it should use a deep copy of the arguments, taken before acompletion returns

Relevant log output

No response

Are you a ML Ops Team?

No

What LiteLLM version are you on?

v1.55.6

Twitter / LinkedIn details

No response

@npt added the bug label Dec 19, 2024
@krrishdholakia
Contributor

messages.append(response.choices[0].message)
messages.append(...)
response = await acompletion(messages=messages, ...)

the fact you don't have a cache hit makes sense; this isn't a bug.

your input is different. perhaps try recreating this ticket as a feature request for what you'd want to happen here.

@krrishdholakia closed this as not planned Dec 19, 2024
@npt
Author

npt commented Dec 19, 2024

Sorry, I wasn't clear. Between runs of the script, the first request is not cached, and should be. (It sounds like you interpreted me as saying there should be a cache hit on the second request, within one run of that script — I agree that shouldn't happen.)

If I run a script like that one twice in a row, with an identical starting messages value:

  • Expected behavior: on the second run, the first acompletion call hits cache.
  • Observed behavior: the second run does not hit cache.
  • If I change the first call to acompletion(messages=messages.copy(), ...), then the second run does hit the cache.

Minimal reproducing script:

import litellm
litellm._turn_on_debug()
litellm.enable_cache(type="disk")

import sys
async def go():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")
    print(response.choices[0].message.content, file=sys.stderr)
    print("\n\n*** AFTER FIRST CALL, BEFORE MODIFYING MESSAGES ***\n\n", file=sys.stderr)
    messages.append(response.choices[0].message)
    messages.append({"role": "user", "content": "Say a different random thing."})
    print("\n\n*** BEFORE SECOND CALL ***\n\n", file=sys.stderr)
    await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")


import asyncio
asyncio.run(go())

Logs from two successive runs:
log.1.txt
log.2.txt

(I'm not sure if this happens when using the synchronous completion.)

@krrishdholakia
Contributor

@npt do you see this error if you just wait 2s before calling?

The reason is that we write to cache in a background task (don't want to slow down actual request/response).

In our testing we just add a slight delay before running the 2nd call to allow a cache write to happen -

await asyncio.sleep(0.5)

@krrishdholakia
Contributor

Can confirm this test passes after adding a 1s sleep

import litellm

litellm._turn_on_debug()
litellm.enable_cache(type="disk")

import sys

async def go():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(
        messages=messages, model="anthropic/claude-3-5-haiku-20241022"
    )
    print(response.choices[0].message.content, file=sys.stderr)
    print(
        "\n\n*** AFTER FIRST CALL, BEFORE MODIFYING MESSAGES ***\n\n",
        file=sys.stderr,
    )
    await asyncio.sleep(1)
    response_2 = await litellm.acompletion(
        messages=messages, model="anthropic/claude-3-5-haiku-20241022"
    )
    print(response_2.choices[0].message.content, file=sys.stderr)

    assert response.id == response_2.id  # use id to confirm cache hit

import asyncio

asyncio.run(go())

@krrishdholakia closed this as not planned Dec 20, 2024
@krrishdholakia
Contributor

@npt happy to continue this conversation over linkedin/dm's if you think i'm missing something here


@npt
Author

npt commented Dec 21, 2024

I don't have the option to message you on LinkedIn.

I do think you're missing something. When I add a sleep before modifying messages, it works as desired, but I shouldn't have to do that, and if I do have to, it should be documented. Making multiple completion calls in the same conversation by mutating the same messages list is a common pattern (the function calling example in the docs does it), and doing so shouldn't break caching, or (as it did for me) force the user to dig through the debug logs and library code to figure out why their responses aren't being cached.

This can be fixed, while still writing to cache in a background task, by calculating the hashed cache key before acompletion returns (or by reusing the hashed cache key that was already calculated during the cache lookup).
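
To illustrate the idea, here is a rough, hypothetical sketch (illustrative names only, not LiteLLM's actual internals or API): compute the key while the arguments are still untouched, and hand that precomputed key to the deferred write.

import asyncio
import hashlib
import json

_cache: dict = {}  # stand-in for a real cache backend

def make_cache_key(**kwargs) -> str:
    # Serialize the arguments as they are right now; later mutations of the
    # caller's `messages` list can no longer change this key.
    blob = json.dumps(kwargs, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()

async def cached_completion(call_model, **kwargs):
    key = make_cache_key(**kwargs)  # calculated once, before returning
    if key in _cache:
        return _cache[key]

    response = await call_model(**kwargs)

    async def write_to_cache():
        # The write is still deferred to a background task, but it reuses the
        # key (and could also take a deep copy of kwargs) captured above.
        _cache[key] = response

    asyncio.create_task(write_to_cache())
    return response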

@krrishdholakia
Contributor

@npt i'm confused by this point

Observed behavior: the second run does not hit cache.

what does 'hit cache' mean here, since you're also in agreement this wouldn't result in a cache hit?

log 2 tells me this did check cache

15:35:05 - LiteLLM:DEBUG: caching_handler.py:140 - Checking Cache

@krrishdholakia
Contributor

krrishdholakia commented Dec 21, 2024

using the hashed cache key that was already calculated when doing the cache lookup

the second run would not result in a cache hit, so yes i agree we can reduce the cache calculation, but i'm not sure i see the 'bug' being hit here

perhaps a test which raises an exception based on your expected behaviour would help? (the one shared previously just does prints, so i'm not sure what the 'bug' is)

@npt
Author

npt commented Dec 21, 2024

(Since the above was a bit peeved: I'm overall very happy with LiteLLM, appreciate that it gives me a unified API across models + structured-output support for Anthropic + caching right out of the box, and expect to use and appreciate more features in the future. I'm just also frustrated that I tripped over this one thing.)

Here's a test that asserts the expected behavior, and fails:

import asyncio
import litellm

litellm.enable_cache()

async def run_test():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")
    # await asyncio.sleep(2)
    messages.append(response.choices[0].message)
    return response.id

async def go():
    id1 = await run_test()
    await asyncio.sleep(2)
    id2 = await run_test()
    assert id1 == id2

asyncio.run(go())

When I uncomment the commented sleep, it passes. The second call of run_test should result in a cache hit, since it calls acompletion with the same arguments as before.

@npt
Author

npt commented Dec 23, 2024

More to the point, it works if run_test doesn't sleep and also doesn't append, i.e. this passes:

import asyncio
import litellm

litellm.enable_cache()

async def run_test():
    messages = [{"role": "user", "content": "Say one random thing."}]
    response = await litellm.acompletion(messages=messages, model="anthropic/claude-3-5-haiku-20241022")
    return response.id

async def go():
    id1 = await run_test()
    await asyncio.sleep(2)
    id2 = await run_test()
    assert id1 == id2

asyncio.run(go())

(so the modification of messages immediately after the acompletion call is what's causing the problem; it's not that the extra sleep is just giving enough time for the cache write to happen, or something)
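
Here is a litellm-free illustration of the aliasing involved (plain Python, just to show why a shallow copy of the call arguments is not enough once the caller has mutated the list):

import json

messages = [{"role": "user", "content": "Say one random thing."}]
kwargs = {"messages": messages, "model": "some-model"}

# A shallow copy of kwargs still shares the same inner `messages` list.
kwargs_shallow = kwargs.copy()
key_at_lookup_time = json.dumps(kwargs_shallow, sort_keys=True)

# The caller appends to `messages` after the call returns...
messages.append({"role": "assistant", "content": "Something random."})

# ...so a key computed later, from the shallow copy, no longer matches.
key_at_write_time = json.dumps(kwargs_shallow, sort_keys=True)
assert key_at_lookup_time != key_at_write_time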

@krrishdholakia
Contributor

thanks for this @npt - reopening the ticket, i'll use your test for repro
