[REQUEST] Better document rope_scaling/rope_alpha in wiki, and add config of yarn_rope_factor/yarn_rope_original_max_position_embeddings #239
Comments
Yarn scaling can be specified in the model's config.json.
I was looking into the yarn settings and possibly exposing them for override in tabby, but this opens up a whole can of worms in terms of the number of args and sanity-checking the configuration. Both of the options currently exposed by tabby - linear rope and ntk/alpha - can be applied (which doesn't necessarily mean they should be) regardless of whether the model has another rope method specified (yarn/su/llama3). However, the yarn/su/llama3 rope settings would be mutually exclusive with each other, and we would need to enforce that only one of them is active at a time.
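A minimal sketch of the kind of sanity check described above, assuming hypothetical key names (none of these are actual tabbyAPI options):

```python
# Hypothetical sanity check for the "only one rope method at a time"
# constraint. Key names are illustrative, not real tabbyAPI settings.
def validate_rope_settings(cfg: dict) -> None:
    exclusive_methods = ["yarn", "su", "llama3"]
    active = [m for m in exclusive_methods if cfg.get(f"{m}_rope_factor") is not None]
    if len(active) > 1:
        raise ValueError(f"Only one rope scaling method may be active, got: {active}")
    # Linear rope scale and ntk/alpha scaling are not checked here, since
    # (per the comment above) they can be applied regardless of the method.
```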
Yes 😅
Honestly, I either don't fully understand how this is supposed to work or I'm missing something. Let's say I'm using Qwen32B, which has a default context length of 32,768. If I specify rope_scaling: 4.0 in the config.json, should Tabby load it with a 130k context? For me this doesn't happen; it still loads with the default 32,768 context length. Even if this worked, such a large context is too much for my system; I'm limited to 65k for this model.

It would be great if there were a way to specify the maximum context length for each model individually, as well as assign them aliases, for example Qwen32B-32k, Qwen32B-65k, and so on, so the same model can be loaded under different names via the OAI API.

In practice, when I specify max_seq_len: 65535 in the config, the model does load with those settings. However, this applies to all models at once. For instance, if I want to load Qwen14B with a 131k context, it becomes an issue without manually editing the config. The only solution I've found so far is to create separate YAML configs for specific models and launch them directly through Tabby with commands like:

Each time I have to stop and restart Tabby with new settings, which is not very convenient. Perhaps I don't fully understand how to set up automatic changes properly. I'd appreciate any advice on this issue.
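On the first question above (whether rope_scaling: 4.0 should give roughly a 130k context), the expected window is just the scaling factor times the model's native length, assuming linear or yarn-style factor semantics:

```python
# Rough check of the context window implied by a rope scaling factor.
original_max_position_embeddings = 32768  # Qwen32B default per the comment
rope_scaling_factor = 4.0                 # value set in config.json

scaled_context = int(original_max_position_embeddings * rope_scaling_factor)
print(scaled_context)  # 131072 -- the ~130k context the comment expected
```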
Do we really need to edit the max_position_embeddings parameter? I'm a bit confused.
Changing max_seq_len or cache_size in the yml file, along with editing the model configuration as shown above, is the only thing that allows me to load a context length above 32k for the model.
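A minimal sketch of that kind of config.json edit, assuming the common Hugging Face layout for rope_scaling; the exact schema varies by model family, so treat the field names as assumptions:

```python
import json
from pathlib import Path

config_path = Path("models/Qwen32B/config.json")  # hypothetical model directory
cfg = json.loads(config_path.read_text())

# Raise the advertised context window and declare yarn scaling.
cfg["max_position_embeddings"] = 65536
cfg["rope_scaling"] = {
    "type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
}

config_path.write_text(json.dumps(cfg, indent=2))
```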
I didn't know that the tabby_config.yml file could be placed in the directory of each model. This could indeed be useful if it works as intended. However, I just tried this approach, and it didn't work for me. In any case, this method doesn't cover the need to load the same model with different context lengths, which can matter, for example, when connecting a draft model. Why don't you recommend using separate configurations for connecting different models?

That said, I've already solved the issue in my own way: I wrote a small proxy application that starts and stops Tabby with the required configuration based on requests coming in through the OpenAI API. It's a bit of an unconventional solution, but it works for me exactly as I wanted.
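For reference, the restart-proxy approach could look roughly like the sketch below; the model aliases, YAML paths, and the --config flag are assumptions rather than confirmed tabbyAPI usage:

```python
# Hypothetical sketch: restart tabbyAPI with a per-model YAML config
# whenever an OpenAI-style request names a different model alias.
import subprocess
import sys

# Alias -> per-model config file (paths are made up for illustration).
CONFIGS = {
    "Qwen32B-32k": "configs/qwen32b-32k.yml",
    "Qwen32B-65k": "configs/qwen32b-65k.yml",
    "Qwen14B-131k": "configs/qwen14b-131k.yml",
}

current_proc = None
current_alias = None

def ensure_model(alias: str) -> None:
    """Stop the running tabbyAPI instance and start one for the requested alias."""
    global current_proc, current_alias
    if alias == current_alias:
        return
    if current_proc is not None:
        current_proc.terminate()
        current_proc.wait()
    # "--config" is an assumed way to point tabbyAPI at a specific YAML file.
    current_proc = subprocess.Popen([sys.executable, "main.py", "--config", CONFIGS[alias]])
    current_alias = alias
```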
Problem
No response
Solution
These are the comments from the exllamav2 source; adding them to the wiki would make it more informative:

Also add yarn_rope_factor/yarn_rope_original_max_position_embeddings as options in config.yml.
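A rough sketch of how the requested keys might be passed from tabby's model settings through to exllamav2; the attribute names mirror the issue title, but whether ExLlamaV2Config exposes them exactly like this depends on the exllamav2 version:

```python
from exllamav2 import ExLlamaV2Config

def apply_yarn_overrides(config: ExLlamaV2Config, model_settings: dict) -> None:
    # Illustrative plumbing only; key and attribute names follow the issue title.
    factor = model_settings.get("yarn_rope_factor")
    original = model_settings.get("yarn_rope_original_max_position_embeddings")
    if factor is not None:
        config.yarn_rope_factor = factor
    if original is not None:
        config.yarn_rope_original_max_position_embeddings = original
```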
Alternatives
No response
Explanation
See https://github.com/turboderp/exllamav2/blob/master/exllamav2/config.py and https://github.com/turboderp/exllamav2/blob/master/exllamav2/model_init.py:
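Since the referenced comments are not quoted here, the short version as I read those files is that tabby's existing rope_scale and rope_alpha options correspond to linear position-embedding scaling and NTK alpha scaling on ExLlamaV2Config. A hedged sketch (attribute names worth verifying against the installed exllamav2 version):

```python
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "models/Qwen32B"  # hypothetical model path
config.prepare()

config.scale_pos_emb = 4.0      # linear rope scaling ("rope_scale" in tabby)
config.scale_alpha_value = 2.5  # NTK/alpha scaling ("rope_alpha" in tabby)
```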
Examples
No response
Additional context
No response