-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable semantic cache with @CacheResult #659
base: main
Are you sure you want to change the base?
Conversation
I think it should
This is a tricky one... Maybe we should skip it initially |
If a prompt leads to tool executions, I think we shouldn't cache the response. Because if we then served the same response again without actually invoking the tools again, that would be really weird. (And re-invoking the tools is probably not a good idea either, they may be highly non-deterministic and non-idempotent and may change the outcome) |
It's certainly weird either way |
if (cache != null) { | ||
log.debug("Attempting to obtain AI response from cache"); | ||
|
||
Embedding query = context.embeddingModel.embed(userMessage.text()).content(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC here we should also check that there is no previous conversation (no user message in the memory), because we want to only support caching for the first message, right? Or are you checking this elsewhere already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache is configurable with the property quarkus.langchain4j.cache.max-size
(default 1). Should this be hidden from developers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we're not understanding each other - cache max size is the maximum number of prompts stored in it, right?
But here I am talking about the fact that we should only attempt to use the cache when it's for the first prompt in the conversation, so that we avoid the situation that you described in #637 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, max-size
is the maximum number of prompts stored, where prompts should be the combination of SystemMessage + UserMessage
. Each cache as id, if possible the id of the cache is unique (for example if we are in a RequestScope
context), if not the id will always be the value default
.
Id with context:
<ContextId>#<Class>.<Method>
Id without context:
default
Having said that, I don't think we understand each other (but that's my problem 😵).
To avoid returning a wrong cached response like in the example in #637 (comment), is it not enough to have a max-size
of 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having said that, I don't think we understand each other (but that's my problem 😵). To avoid returning a wrong cached response like in the example in #637 (comment), is it not enough to have a
max-size
of 1?
A single Cache
instance contains prompts (system+user message pairs) from MULTIPLE conversations; and from each such conversation, it stores only the initial prompt, do I get that right? Therefore, the max-size
refers to the number of conversations from which we store the initial prompt? But then that has nothing to do with how many messages have passed in the current conversation no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding was we would have a cache for each AI service method, but its life would be application-scoped and store the first prompt of every conversation that starts by calling that method.
Ok, so in this case the id of the cache will always be: #<ClassName>.<MethodName>
In the example above: #org.acme.exampleLLMService.fiveLinesPoem
and #org.acme.exampleLLMService.tenLinesPoem
for all the users.
How to avoid memory problems, that's an important question - set a reasonable TTL and periodically clean it, or I'm not sure we could use weak references to have the GC clean up stuff over time.
Yes, in this case it might be better to set a default TTL and of course document the fact that the developer can find an out-of-memory problem if the TTL value is too high.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok and the default max-size
will surely be more than one, because we will store prompts from potentially different users. So actually, the max-size
will probably be the main safeguard against eating too much memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let me go back to the example with the last changes we have in mind:
package org.acme.example;
@RegisterAiService
@CacheResult
@SystemMessage("You are a poet")
public interface LLMService {
@UserMessage("Write a poem of 5 lines about {poem}")
public String fiveLinesPoem(String poem);
@UserMessage("Write a poem of 10 lines about {poem}")
public String tenLinesPoem(String poem);
}
Configuration:
max-size: 2
ttl: 5m
ADM
and JM
, wants to create poems.
Two caches are created:
Cache#1: #org.acme.exampleLLMService.fiveLinesPoem
Cache#2: #org.acme.exampleLLMService.tenLinesPoem
Steps:
-
ADM calls
-> fiveLinesPoem("dog");
-> No value in the cache;
-> Calls the LLM;
-> Stores the response in theCache#1
-
JM calls
-> fiveLinesPoem("cat");
-> No value in the cache;
-> Calls the LLM;
-> Stores the response in theCache#1
[MAX-SIZE REACHED] -
ADM calls
-> fiveLinesPoem("turtle");
-> No value in the cache;
-> Calls the LLM;
-> The value is NOT cached [max size reached] -
ADM calls
-> fiveLinesPoem("cat");
-> Return the cached value. [Cache#1
]
The two cached responses will be deleted after 5 minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this makes sense to me. But let's ask @cescoffier to confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestion about that? @cescoffier
core/runtime/src/main/java/io/quarkiverse/langchain4j/runtime/cache/AiCacheStore.java
Show resolved
Hide resolved
I'm also thinking about a cache property that allows the developer to decide which type of embedding to use, something like |
That should point not at the provider name, but at the name of the configured model, or whatever is the proper term here (I mean like the |
I think that the whole cache idea is that in general similar question should produce similar result, so caching in some cases makes sense. When tools are involved this is less likely to happen, as we have absolutely no idea what response the tool will return even for the exact same input. So, I'd say that it should not cache result, unless the user somehow defines that the tool is idempotent (always produces same output for a given input) or cachable itself. |
Do you think it would make sense to enable caching for the Streaming API if the method or class has the |
I think we can leave streaming out for a first version and think about it some more |
What is the state of this? |
It's been a while, but the last few messages were about the scope of the cache, specifically whether or not it should be shared between users. |
In this PR, the idea is to implement the semantic cache as discussed in #637.
The idea is to allow the developer to use the @CacheResult annotation to decide which methods need to use the semantic cache.
The idea is to create different classes following what was done for
ChatMemory
and useDefaultMemoryIdProvider
to create an isolated cache for each user if possible (I don't know if this is enough to avoid possible data leakage).An open point is what to do if the
@Tools
annotation is present.Shouldn't
@CacheResult
be usable in this case?TODO LIST:
There is still work to be done, but I think there is enough code to understand if the direction taken is the right one or if changes need to be made.