
Embeddings: batch size vs context length #963

Open

dluc opened this issue Oct 30, 2024 · 8 comments

dluc commented Oct 30, 2024

Description

I’m using two models, openchat_3.5.Q5_K_M.gguf to generate text and nomic-embed-text-v1.5.Q8_0.gguf to calculate text embeddings.

When I input text that exceeds 512 tokens - in my case, it’s 979 tokens - embedding generation throws this exception:

System.ArgumentException: Input contains more tokens than configured batch size (Parameter 'batch')
at LLama.LLamaContext.Decode(LLamaBatch batch)

However, the model documentation specifies a context length of 8192 tokens.

Questions:

  • Any ideas about the exception? Is it a bug in the model or in LLamaSharp?
  • Does LLamaSharp support loading two different models?
martindevans (Member)

Batch size is the maximum number of tokens that can be processed at once; it's separate from the context size.

For text generation you can feed the model multiple batches before generating a response.

For embedding, I think the embedder currently requires that all of your text is sent in a single batch. So you'll need a larger batch size for embeddings.

does LlamaSharp support loading two different models?

You're already loading openchat_3.5.Q5_K_M.gguf and nomic-embed-text-v1.5.Q8_0.gguf, which is two different models. So I'm not quite sure what you're asking, sorry.
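
For reference, loading both side by side looks roughly like this (a sketch only; class and property names are as I understand the current LLamaSharp API, and the Embeddings flag may be named differently in older versions):

// Sketch: each model is loaded independently with its own parameters.
var chatParams = new ModelParams("openchat_3.5.Q5_K_M.gguf");
using var chatWeights = LLamaWeights.LoadFromFile(chatParams);
using var chatContext = chatWeights.CreateContext(chatParams);

var embedParams = new ModelParams("nomic-embed-text-v1.5.Q8_0.gguf")
{
    Embeddings = true // enable embedding mode for this model (property name assumed)
};
using var embedWeights = LLamaWeights.LoadFromFile(embedParams);
using var embedder = new LLamaEmbedder(embedWeights, embedParams);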

dluc (Author) commented Oct 30, 2024

Here's our code, where the exception is thrown:

public async Task<Embedding> GenerateEmbeddingAsync(string text)
{
    if (this._log.IsEnabled(LogLevel.Trace))
    {
        this._log.LogTrace("Generating embedding, input token size: {0}", this._textTokenizer.CountTokens(text));
    }

    // Throws `System.ArgumentException`
    var embeddings = await this._embedder.GetEmbeddings(text);

    return new Embedding(embeddings[0]);
}

Batch size is the maximum number of tokens that can be processed at once; it's separate from the context size.

The string is 979 tokens, and I would expect GetEmbeddings to generate one embedding (one array with a single element to be precise).

Is there something to change in the method above?

martindevans (Member)

The string is 979 tokens, and I would expect GetEmbeddings to generate one embedding (one array with a single element to be precise).

Sounds right.

Since you must process everything for embeddings in one batch, your batch size must be set to 979 or greater.

dluc (Author) commented Oct 31, 2024

Looking at the examples, there's no code setting the batch size - how is the batch size set?

e.g. https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Examples/Examples/GetEmbeddings.cs

dluc (Author) commented Oct 31, 2024

Trying to run https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Examples/Examples/GetEmbeddings.cs throws the same exception:

Unhandled exception. System.ArgumentException: Input contains more tokens than configured batch size (Parameter 'batch')
at LLama.LLamaContext.Decode(LLamaBatch batch) in LLama/LLamaContext.cs:line 403
at LLama.LLamaContext.<>c__DisplayClass42_0.b__0() in LLama/LLamaContext.cs:line 414
at System.Threading.Tasks.Task`1.InnerInvoke()
at System.Threading.Tasks.Task.<>c.<.cctor>b__281_0(Object obj)
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
at LLama.LLamaEmbedder.GetEmbeddings(String input, CancellationToken cancellationToken) in LLama/LLamaEmbedder.cs:line 88
at LLama.Examples.Examples.GetEmbeddings.Run() in LLama.Examples/Examples/GetEmbeddings.cs:line 42
at ExampleRunner.Run() in LLama.Examples/ExampleRunner.cs:line 57
at Program.<Main>$(String[] args) in LLama.Examples/Program.cs:line 40
at Program.<Main>(String[] args)

martindevans (Member)

Batch size is set in the ModelParams (see here). If you don't set it, the default is 512, which is large enough for the examples but not for your use case.
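
Something like this should work for your 979-token input (a sketch; property names follow the ModelParams API as I understand it, and the Embeddings flag is an assumption that may differ across versions):

// Sketch: raise BatchSize so the whole input fits in a single batch.
// It must be at least as large as the longest text you will embed.
var parameters = new ModelParams("nomic-embed-text-v1.5.Q8_0.gguf")
{
    ContextSize = 8192,  // the model's advertised context length
    BatchSize = 8192,    // default is 512; must cover your largest single input
    Embeddings = true    // property name assumed; may differ across versions
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var embedder = new LLamaEmbedder(weights, parameters);
var embeddings = await embedder.GetEmbeddings(longText); // longText: your 979-token string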

dluc (Author) commented Nov 2, 2024

Wouldn't it be easier if the batch size were automatically set to match the maximum token count? Is there any benefit to having a lower default?

For instance, if a model supports up to 8192 tokens per embedding, automatically setting batch size to 8192 would replicate the behavior seen in HF, OpenAI, etc.

martindevans (Member)

A large batch size is costly (it takes extra memory), and it's generally not worth making it very large: for text generation, after the initial prompt you'll be submitting just one token at a time. For embedding it's different - you must make the batch size as large as the largest input you'll ever need an embedding for, since it can't (currently) be split across multiple batches.
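
If you do keep a fixed batch size, a defensive check before calling the embedder avoids the ArgumentException (a sketch built on the snippet earlier in this thread; maxBatchSize is a hypothetical variable holding whatever BatchSize you configured):

// Sketch: reject (or chunk) inputs that would not fit in a single batch.
var tokenCount = this._textTokenizer.CountTokens(text);
if (tokenCount > maxBatchSize) // maxBatchSize: the BatchSize set in ModelParams (assumed)
{
    throw new ArgumentException(
        $"Input is {tokenCount} tokens, but the configured batch size is {maxBatchSize}.");
}

var embeddings = await this._embedder.GetEmbeddings(text);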
