Hi there!

I'm looking for a way to share model weights among two or more generators/caches.

The reason for this: I want to keep one cache for my "main line" of iterative generations and use other caches for auxiliary generations (mainly agent/verifier tasks). Of course I could use batching instead, but with a fixed batch size that costs performance whenever only some of the slots are occupied (or am I getting something wrong here?).

Thanks!

-
The simple way to do it is just to create two generators, each with its own cache but both referencing the same model. They should work independently:

```python
generator_1 = ExLlamaV2StreamingGenerator(model, cache_1, tokenizer)
generator_2 = ExLlamaV2StreamingGenerator(model, cache_2, tokenizer)

generator_1.begin_stream_ex(...)
generator_2.begin_stream_ex(...)

while True:
    res_1 = generator_1.stream_ex()
    res_2 = generator_2.stream_ex()
    ...
```

Just note that even though the model is stateless, it is not thread-safe. I plan to replace the generator/cache system with a more versatile paged-attention scheme soon.
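For context, here is a minimal sketch of the setup the snippet above assumes: weights loaded once, two independent caches, and both streams interleaved on a single thread. The model directory, prompts, and stop conditions below are placeholder assumptions, not part of the original answer:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load()  # weights are loaded once and shared by everything below
tokenizer = ExLlamaV2Tokenizer(config)

# Two independent caches, each with its own KV memory, referencing the same model
cache_1 = ExLlamaV2Cache(model)
cache_2 = ExLlamaV2Cache(model)

generator_1 = ExLlamaV2StreamingGenerator(model, cache_1, tokenizer)
generator_2 = ExLlamaV2StreamingGenerator(model, cache_2, tokenizer)

settings = ExLlamaV2Sampler.Settings()
for g in (generator_1, generator_2):
    g.set_stop_conditions([tokenizer.eos_token_id])

generator_1.begin_stream_ex(tokenizer.encode("Main-line prompt"), settings)
generator_2.begin_stream_ex(tokenizer.encode("Auxiliary prompt"), settings)

# Interleave the two streams on one thread, since the model is not thread-safe
done_1 = done_2 = False
while not (done_1 and done_2):
    if not done_1:
        res_1 = generator_1.stream_ex()
        done_1 = res_1["eos"]
    if not done_2:
        res_2 = generator_2.stream_ex()
        done_2 = res_2["eos"]
```

Each cache still reserves its own KV memory, so running two caches roughly doubles the cache VRAM, even though the weights themselves are stored only once.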