[BUG]: KernelMemory - Simultaneous execution of AskDocument & ImportDocument #918
Comments
I haven't worked much on the kernel memory side of things, but from that call stack I can see it's trying to use an executor after it has been disposed inside the [...]. I would guess that you're not meant to make a second call while a previous async call is still ongoing. Right now I don't think anyone is working much on the KM integration, so if you'd like to dig in and PR a fix (e.g. a check that throws if you try to make a second call, or a true fix that allows simultaneous calls) I'll be happy to review it :) |
I see a lot of issues around KM and LLamaSharp. RAG is a very popular technology, so it would be very good to get this working properly. Of all the problems in 0.16.0, I am currently most concerned about the 2-fold drop in GPU performance :( |
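The "check that throws if you try to make a second call" suggested above could look roughly like the sketch below. `SingleCallGuard` is a hypothetical helper invented for illustration; it is not part of LLamaSharp or KernelMemory and uses only `System.Threading` primitives.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not a LLamaSharp type): rejects a call that
// starts while a previous one is still in flight.
public sealed class SingleCallGuard
{
    private int _busy; // 0 = idle, 1 = a call is running

    // Marks the guard as busy, or throws if it already is.
    // Dispose the returned token when the call has finished.
    public IDisposable Enter()
    {
        if (Interlocked.CompareExchange(ref _busy, 1, 0) != 0)
            throw new InvalidOperationException("A previous call is still running.");
        return new Releaser(this);
    }

    private sealed class Releaser : IDisposable
    {
        private SingleCallGuard? _owner;

        public Releaser(SingleCallGuard owner) => _owner = owner;

        public void Dispose()
        {
            // Exchange makes a double Dispose harmless.
            var owner = Interlocked.Exchange(ref _owner, null);
            if (owner != null) Volatile.Write(ref owner._busy, 0);
        }
    }
}
```

A caller would wrap the whole inference in `using (guard.Enter()) { ... }`, so an overlapping call fails fast with a clear message instead of hitting a disposed context later.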
They don't do anything with the context at all. They don't create or delete it. KM ----> LLama ----> Please explain why the context is created in the constructor and immediately deleted, and then re-created each time in InferAsync?
|
I can't remember why that was done, I think it was added ages ago as a hacky fix to work around an issue. I've been planning to overhaul the executors soon and hopefully that will get removed.
The As to the bug, I don't think |
A solution to the problem has been found!!! But no, with this solution the response contains garbage. In LLamaSharpTextGenerator.GenerateTextAsync the executor needs to be created every time:
This option gives an error: :( |
|
Please explain what "context space" means. |
When a context is created, you choose how many tokens can fit into that context (by setting it here). Every single token needs to fit into that space, so if you load up with a context of e.g. 1000 you cannot have more than 1000 tokens processed by the model in total. |
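As a small worked example of that limit: the prompt tokens and the generated tokens share the same space, so the room left for generation is the context size minus the prompt length. `ContextBudget` below is a hypothetical helper written for illustration, not a LLamaSharp API.

```csharp
using System;

// Hypothetical helper (not a LLamaSharp type): computes how many tokens
// can still be generated once the prompt occupies part of the context.
public static class ContextBudget
{
    public static int RemainingTokens(int contextSize, int promptTokens)
    {
        if (contextSize <= 0) throw new ArgumentOutOfRangeException(nameof(contextSize));
        if (promptTokens < 0) throw new ArgumentOutOfRangeException(nameof(promptTokens));
        if (promptTokens > contextSize)
            throw new InvalidOperationException("The prompt alone does not fit into the context.");
        return contextSize - promptTokens;
    }
}
```

With a context of 1000 and a 900-token prompt, `ContextBudget.RemainingTokens(1000, 900)` returns 100, so at most 100 tokens can be generated before the context is full.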
I set ContextSize = 8192 |
The problem is that a single context is created in the constructor, and several InteractiveExecutors then share it. That's the problem. In general, I see that there are large gaps in the executors. |
Unfortunately, in its current form, none of this works with KM in multithreaded cases, i.e. when several users are working with the application at the same time or ImportDocument is running! |
I've tried everything, it seems. My code:
// Another implementation option GenerateTextAsync
Even the second consecutive call fails: llama_get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true. This error can be corrected:
But the simultaneous call still doesn't work: CUDA error: operation not permitted when stream is capturing. I don't know what else to look at, short of loading a separate model for each call. |
I am not surprised that your code does not work. You need to reimplement it completely. It does not make any sense to work with the context in several functions while using a StatelessExecutor! StatelessExecutor is a very good option for KernelMemory, but you must not rely on the context: the context is created while doing the inference and then destroyed. This is also best for multi-threaded applications because the GPU memory is released earlier.
Furthermore, multi-threading with a GPU is more difficult with LLamaSharp than it seems at first sight. GPU memory is always allocated on the first GPU first, so if a second request fills the GPU memory, the first request cannot allocate GPU memory for its context and you will get NoKvslot messages (not enough GPU memory for the context).
The best implementation for the moment depends on how many GPUs you have. If you have only one small GPU, you need to make sure that there is only one inference call on the GPU at a time; its GPU memory must be released before a new inference call can start. If you have several GPUs, you need to place the inference calls on different GPUs. If you have one large GPU, you may place, for example, two inference calls on it at once. You must build some relatively complex logic into the code that uses the GPU. We will not be able to help you with this, but if you understand the code, it will be relatively straightforward to do. Try first to review some existing examples and make some yourself with different options. |
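For the single-GPU case described above (one inference call at a time, or two on a large GPU), a minimal sketch of such a gate could be built on `SemaphoreSlim`. `InferenceGate` is a hypothetical name invented here; the sketch uses only .NET types and says nothing about how LLamaSharp itself allocates GPU memory.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not a LLamaSharp type): lets at most
// maxConcurrent inference calls run at the same time.
public sealed class InferenceGate : IDisposable
{
    private readonly SemaphoreSlim _slots;

    public InferenceGate(int maxConcurrent = 1)
    {
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
    }

    // Waits for a free slot, runs the work, then frees the slot,
    // even if the work throws or is cancelled.
    public async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> work,
        CancellationToken cancellationToken = default)
    {
        await _slots.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            return await work(cancellationToken).ConfigureAwait(false);
        }
        finally
        {
            _slots.Release();
        }
    }

    public void Dispose() => _slots.Dispose();
}
```

The delegate passed to `RunAsync` should run the whole inference to completion (for a streamed result, collect the text inside the delegate), so the slot is not released while the context is still in use. One gate instance would have to be shared by every caller, for example by registering it as a singleton.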
This code is written based on: I started rewriting this code because it doesn't work from different threads at the same time.
|
Again, it is more difficult than you think. The LLamaSharp library does not provide you with all possible options; sometimes you need to extend the library for your needs, and sometimes you need to suggest an improvement and post a PR. This is one of the extensions that you will need in order to use the StatelessExecutor in LlamaSharpTextGenerator.
Here, we do not pass the context to LlamaSharpTextGenerator because it will use the StatelessExecutor! You will also need to remove the context from LlamaSharpTextGenerator and create a special version of LlamaSharpTextGenerator |
I've already rewritten that, of course. Please, I have already delved very deeply into the code :) |
I rewrote the code:
Using Services.AddTransient:
When calling from different threads at the same time, I get errors like this: CUDA error: operation not permitted when stream is capturing |
It wasn't in my code! |
Description
System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'Cannot use this SafeLLamaContextHandle - it has been disposed'.
at LLama.Native.SafeLLamaContextHandle.ThrowIfDisposed()
at LLama.Native.SafeLLamaContextHandle.GetLogitsIth(Int32 i)
at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext()
at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
at Microsoft.KernelMemory.Handlers.SummarizationHandler.SummarizeAsync(String content, IContext context)
at Microsoft.KernelMemory.Handlers.SummarizationHandler.SummarizeAsync(String content, IContext context)
at Microsoft.KernelMemory.Handlers.SummarizationHandler.InvokeAsync(DataPipeline pipeline, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Pipeline.InProcessPipelineOrchestrator.RunPipelineAsync(DataPipeline pipeline, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Pipeline.BaseOrchestrator.ImportDocumentAsync(String index, DocumentUploadRequest uploadRequest, IContext context, CancellationToken cancellationToken)
Reproduction Steps
KernelMemory - Simultaneous execution of AskDocument & ImportDocument
Environment & Configuration
Known Workarounds