Add logic to calculate how much space to allocate for completion requests #205
Merged
Conversation
adrastogi requested review from manodasanW, craigloewen-msft, krschau, EricJohnson327, and jasmilimsft on June 6, 2024 at 17:23
@EricJohnson327 / @krschau, FYI: this PR adds a new package reference (Microsoft.ML.Tokenizers) that I believe will need to be added to the feed. Thank you!
krschau approved these changes on Jun 6, 2024
adrastogi commented on Jun 6, 2024
dhoehna reviewed on Jun 10, 2024
dhoehna reviewed on Jun 10, 2024
manodasanW reviewed on Jun 10, 2024
manodasanW approved these changes on Jun 10, 2024
Summary of the pull request
Our implementation doesn't try to right-size the number of tokens that completion requests use, which sometimes causes failures because the total request is too large. This PR adds logic to calculate how many tokens to specify, which should mitigate the problem.
References and relevant issues
Closes #194
Detailed description of the pull request / Additional comments
The model we are using (gpt-35-turbo-instruct) has a fixed context window (4096 tokens), which is shared across the input prompt and the response produced by the model. https://platform.openai.com/docs/models/gpt-3-5-turbo
Callers of the API can specify how many tokens the model may allocate to the response via the max_tokens parameter. https://platform.openai.com/docs/api-reference/chat/create#chat-create-max_tokens
We observed that with larger or more complex projects, responses weren't being produced: our completion calls specified a fixed max_tokens of 2000, so whenever the prompt for a particular completion was on the larger side, the request was rejected because the total number of tokens exceeded the model's limit.
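For example (with illustrative numbers), a prompt of roughly 2,300 tokens plus the fixed 2,000-token response allocation adds up to about 4,300 tokens, which exceeds the 4,096-token window, so the request fails even though the model could have answered within a smaller response budget.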
OpenAI provides a Python library, tiktoken, for calculating the number of tokens a given input string consumes when processed by a particular model family, and Microsoft has a managed implementation here: https://github.com/microsoft/Tokenizer
This PR takes advantage of this functionality to calculate how many tokens to allocate for the completion requests.
From analyzing the observed failures, the input prompts are generally just a bit larger than half of the max token limit (just enough for the previous fixed 2000-token allocation to overflow). It is possible that for an exceptionally large input we will not leave enough space for the model to complete a response; I updated that case to throw an exception so we can see whether it is a common occurrence and tune the behavior from there. (In the long term, we may want to move to a model with a larger context window.)
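For reference, a minimal sketch of the token-budgeting idea is below. It is not the exact code in this PR: the class name, constants, and the minimum-response floor are illustrative, and it assumes the Microsoft.ML.Tokenizers package exposes a TiktokenTokenizer.CreateForModel factory and a CountTokens method (exact API names vary across package versions).

```csharp
using System;
using Microsoft.ML.Tokenizers;

public static class CompletionTokenBudget
{
    private const int ContextWindow = 4096;    // total window for gpt-35-turbo-instruct
    private const int MinResponseTokens = 256; // illustrative floor for a usable response

    // Returns the max_tokens value to send with the completion request,
    // based on how many tokens the prompt itself consumes.
    public static int CalculateMaxResponseTokens(string prompt)
    {
        var tokenizer = TiktokenTokenizer.CreateForModel("gpt-3.5-turbo");
        int promptTokens = tokenizer.CountTokens(prompt);

        int remaining = ContextWindow - promptTokens;
        if (remaining < MinResponseTokens)
        {
            // Surface oversized prompts instead of silently truncating, so we can
            // track how often this happens and tune the behavior later.
            throw new InvalidOperationException(
                $"Prompt uses {promptTokens} tokens, leaving only {remaining} of {ContextWindow} for the response.");
        }

        return remaining;
    }
}
```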
Validation steps performed
I used several test prompts that were previously failing (e.g., generating an Orleans project, generating a tic-tac-toe GUI app), and those no longer generate any errors. I also did some basic scenario tests to ensure that I didn't regress anything.
PR checklist