
Add logic to calculate how much space to allocate for completion requests #205

Merged · 6 commits merged into main from user/adrastogi/prompt-window-fix on Jun 10, 2024

Conversation

@adrastogi (Contributor) commented on Jun 6, 2024

Summary of the pull request

Our implementation doesn't try to right-size the number of tokens that completion requests should use, which sometimes results in failures because the total request is too large. This PR adds logic to calculate how many tokens to specify, which should help mitigate this problem.

References and relevant issues

Closes #194

Detailed description of the pull request / Additional comments

The model we are using (gpt-35-turbo-instruct) has a fixed context window (4096 tokens), which is shared across the input prompt and the response produced by the model. https://platform.openai.com/docs/models/gpt-3-5-turbo

Callers of the API can specify how many tokens the model may allocate to the response via the max_tokens parameter. https://platform.openai.com/docs/api-reference/chat/create#chat-create-max_tokens
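As an illustration (not code from this repository, which uses the .NET SDK), a completion call in Python with the openai package sets this parameter as shown below; the fixed 2000-token value is the one this PR replaces with a computed budget:

```python
# Hypothetical illustration of a completion request against gpt-3.5-turbo-instruct.
# max_tokens caps how many tokens are reserved for the model's response.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Generate a C# class for a tic-tac-toe board.",  # placeholder prompt
    max_tokens=2000,  # the fixed allocation that caused overflows on large prompts
)
print(response.choices[0].text)
```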

We observed that with more complex or larger projects, responses weren't being produced because our completion calls specified a fixed max token amount (2000 tokens). When the prompt for a particular completion is on the larger side, the request is rejected because the model sees that the total number of tokens exceeds its limit. For example, a 2,500-token prompt plus a 2,000-token completion budget adds up to 4,500 tokens, which overflows the 4,096-token window.

OpenAI provides a Python library, tiktoken, for calculating the number of tokens that a particular input string consumes when processed by a particular model family, and Microsoft has a managed implementation here: https://github.com/microsoft/Tokenizer
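The change in this repository uses the managed Microsoft.ML.Tokenizers package, but the counting step it relies on can be sketched with the Python tiktoken package (the prompt string is just a placeholder):

```python
# Minimal sketch of counting prompt tokens with OpenAI's tiktoken package.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")
prompt = "Generate a C# class for a tic-tac-toe board."
prompt_tokens = len(encoding.encode(prompt))
print(f"Prompt consumes {prompt_tokens} tokens")
```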

This PR takes advantage of this functionality to calculate how many tokens to allocate for the completion requests.

From analyzing various observed failures, the input prompts are generally just a bit larger than half of the token limit, i.e., just enough that the previous fixed 2000-token allocation would cause an overflow. It is possible that an extremely large input will not leave enough space for the model to produce a response; I updated that case to throw an exception so that we can see whether this is a common occurrence and tune the behavior from there. (In the long term, we may want to move to a model with a larger context window.) A sketch of this budgeting logic appears below.
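Putting the pieces together, the budgeting logic described above can be sketched as follows. This is an illustrative Python version, not the PR's actual C# implementation; the function name, the tokenizer choice, and the minimum-completion threshold are assumptions:

```python
# Sketch: size max_tokens from the space left in the model's context window.
import tiktoken

CONTEXT_WINDOW = 4096        # gpt-3.5-turbo-instruct's fixed context size
MIN_COMPLETION_TOKENS = 256  # assumed floor below which we give up


def compute_max_tokens(prompt: str) -> int:
    """Return the token budget for the completion, given the prompt's size."""
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")
    prompt_tokens = len(encoding.encode(prompt))
    available = CONTEXT_WINDOW - prompt_tokens
    if available < MIN_COMPLETION_TOKENS:
        # Surface an error so we can observe how often prompts leave too little room.
        raise ValueError(
            f"Prompt uses {prompt_tokens} tokens, leaving only {available} "
            f"of {CONTEXT_WINDOW} for the completion."
        )
    return available
```

The computed value would then be passed as max_tokens on the completion request instead of the previous fixed 2000-token allocation.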

Validation steps performed

I used several test prompts that were previously failing (e.g., generating an Orleans project, generating a tic-tac-toe GUI app), and those no longer generate any errors. I also did some basic scenario tests to ensure that I didn't regress anything.

PR checklist

  • Closes #xxx
  • Tests added/passed
  • Documentation updated

@adrastogi (Contributor, Author) commented:

@EricJohnson327 / @krschau, FYI for you as this PR adds a new package reference that I believe will need to be added to the feed (Microsoft.ML.Tokenizers). Thank you!

@adrastogi adrastogi merged commit efad074 into main Jun 10, 2024
3 checks passed
@adrastogi adrastogi deleted the user/adrastogi/prompt-window-fix branch June 10, 2024 21:16
Linked issue this pull request may close: Token limit errors observed in Quickstart Playground (#194)