Hi there!

I'm looking for a way to share model weights among two or more generators/caches.

The reason for this: I want to keep one cache for my "main line" of iterative generations and use other caches for auxiliary generations (mainly agent/verifier tasks). Of course I could use batching instead, but with a fixed batch size that costs performance whenever only some of the slots are occupied (or am I getting something wrong here?).

Thanks!

-
The simple way to do it is just to create two generators, each with its own cache but both referencing the same model. They should work independently:

```python
generator_1 = ExLlamaV2StreamingGenerator(model, cache_1, tokenizer)
generator_2 = ExLlamaV2StreamingGenerator(model, cache_2, tokenizer)

generator_1.begin_stream_ex(...)
generator_2.begin_stream_ex(...)

while True:
    res_1 = generator_1.stream_ex()
    res_2 = generator_2.stream_ex()
    ...
```

Just note that even though the model is stateless, it is not thread-safe. I plan to replace the generator/cache system with a more versatile paged-attention scheme soon.
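For context, here is a minimal sketch of the setup the snippet above assumes: weights loaded once, two independent caches, and both streams interleaved on a single thread. The model directory, prompts, and stop conditions below are placeholder assumptions, not part of the original answer:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load()  # weights are loaded once and shared by everything below
tokenizer = ExLlamaV2Tokenizer(config)

# Two independent caches, each with its own KV memory, referencing the same model
cache_1 = ExLlamaV2Cache(model)
cache_2 = ExLlamaV2Cache(model)

generator_1 = ExLlamaV2StreamingGenerator(model, cache_1, tokenizer)
generator_2 = ExLlamaV2StreamingGenerator(model, cache_2, tokenizer)

settings = ExLlamaV2Sampler.Settings()
for g in (generator_1, generator_2):
    g.set_stop_conditions([tokenizer.eos_token_id])

generator_1.begin_stream_ex(tokenizer.encode("Main-line prompt"), settings)
generator_2.begin_stream_ex(tokenizer.encode("Auxiliary prompt"), settings)

# Interleave the two streams on one thread, since the model is not thread-safe
done_1 = done_2 = False
while not (done_1 and done_2):
    if not done_1:
        res_1 = generator_1.stream_ex()
        done_1 = res_1["eos"]
    if not done_2:
        res_2 = generator_2.stream_ex()
        done_2 = res_2["eos"]
```

Each cache still reserves its own KV memory, so running two caches roughly doubles the cache VRAM, even though the weights themselves are stored only once.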