
Embeddings: batch size vs context length #963

Open

dluc opened this issue Oct 30, 2024 · 8 comments

dluc commented Oct 30, 2024

Description

I’m using two models, openchat_3.5.Q5_K_M.gguf to generate text and nomic-embed-text-v1.5.Q8_0.gguf to calculate text embeddings.

When I input text that exceeds 512 tokens - in my case, it’s 979 tokens - embedding generation throws this exception:

System.ArgumentException: Input contains more tokens than configured batch size (Parameter 'batch')
at LLama.LLamaContext.Decode(LLamaBatch batch)

However, the model documentation specifies a context length of 8192 tokens.

Questions:

  • Any ideas about the exception? Is it a bug in the model or in LLamaSharp?
  • Does LLamaSharp support loading two different models?
martindevans (Member)

Batch size is the maximum number of tokens that can be processed at once; it's separate from the context size.

For text generation you can feed the model multiple batches before generating a response.

For embedding, I think the embedder currently requires that all of your text is sent in a single batch. So you'll need a larger batch size for embeddings.

does LlamaSharp support loading two different models?

You're already loading openchat_3.5.Q5_K_M.gguf and nomic-embed-text-v1.5.Q8_0.gguf, which is two different models. So I'm not quite sure what you're asking, sorry.
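
For reference, loading both side by side looks roughly like this (a sketch only; class and property names are as I understand the current LLamaSharp API, and the Embeddings flag may be named differently in older versions):

// Sketch: each model is loaded independently with its own parameters.
var chatParams = new ModelParams("openchat_3.5.Q5_K_M.gguf");
using var chatWeights = LLamaWeights.LoadFromFile(chatParams);
using var chatContext = chatWeights.CreateContext(chatParams);

var embedParams = new ModelParams("nomic-embed-text-v1.5.Q8_0.gguf")
{
    Embeddings = true // enable embedding mode for this model (property name assumed)
};
using var embedWeights = LLamaWeights.LoadFromFile(embedParams);
using var embedder = new LLamaEmbedder(embedWeights, embedParams);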

dluc (Author) commented Oct 30, 2024

Here's our code, where the exception is thrown:

public async Task<Embedding> GenerateEmbeddingAsync(string text)
{
    if (this._log.IsEnabled(LogLevel.Trace))
    {
        this._log.LogTrace("Generating embedding, input token size: {0}", this._textTokenizer.CountTokens(text));
    }

    // Throws `System.ArgumentException`
    var embeddings = await this._embedder.GetEmbeddings(text);

    return new Embedding(embeddings[0]);
}

Batch size is the maximum number of tokens that can be processed at once; it's separate from the context size.

The string is 979 tokens, and I would expect GetEmbeddings to generate one embedding (one array with a single element to be precise).

Is there something to change in the method above?

martindevans (Member)

The string is 979 tokens, and I would expect GetEmbeddings to generate one embedding (one array with a single element to be precise).

Sounds right.

Since you must process everything for embeddings in one batch, your batch size must be set to 979 or greater.

dluc (Author) commented Oct 31, 2024

Looking at the examples, there's no code setting the batch size - how is the batch size set?

e.g. https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Examples/Examples/GetEmbeddings.cs

dluc (Author) commented Oct 31, 2024

Trying to run https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Examples/Examples/GetEmbeddings.cs throws the same exception:

Unhandled exception. System.ArgumentException: Input contains more tokens than configured batch size (Parameter 'batch')
at LLama.LLamaContext.Decode(LLamaBatch batch) in LLama/LLamaContext.cs:line 403
at LLama.LLamaContext.<>c__DisplayClass42_0.b__0() in LLama/LLamaContext.cs:line 414
at System.Threading.Tasks.Task`1.InnerInvoke()
at System.Threading.Tasks.Task.<>c.<.cctor>b__281_0(Object obj)
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
at LLama.LLamaEmbedder.GetEmbeddings(String input, CancellationToken cancellationToken) in LLama/LLamaEmbedder.cs:line 88
at LLama.Examples.Examples.GetEmbeddings.Run() in LLama.Examples/Examples/GetEmbeddings.cs:line 42
at ExampleRunner.Run() in LLama.Examples/ExampleRunner.cs:line 57
at Program.<Main>$(String[] args) in LLama.Examples/Program.cs:line 40
at Program.<Main>(String[] args)

martindevans (Member)

Batch size is set in the ModelParams (see here). If you don't set it, the default is 512, which is large enough for the examples but not for your use case.
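
Something like this should work for your 979-token input (a sketch; property names follow the ModelParams API as I understand it, and the Embeddings flag is an assumption that may differ across versions):

// Sketch: raise BatchSize so the whole input fits in a single batch.
// It must be at least as large as the longest text you will embed.
var parameters = new ModelParams("nomic-embed-text-v1.5.Q8_0.gguf")
{
    ContextSize = 8192,  // the model's advertised context length
    BatchSize = 8192,    // default is 512; must cover your largest single input
    Embeddings = true    // property name assumed; may differ across versions
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var embedder = new LLamaEmbedder(weights, parameters);
var embeddings = await embedder.GetEmbeddings(longText); // longText: your 979-token string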

dluc (Author) commented Nov 2, 2024

Wouldn't it be easier if the batch size were automatically set to match the maximum token count? Is there any benefit to having a lower default?

For instance, if a model supports up to 8192 tokens per embedding, automatically setting batch size to 8192 would replicate the behavior seen in HF, OpenAI, etc.

martindevans (Member)

A large batch size is costly (it takes extra memory), and it's generally not worth making it very large: for text generation, after the initial prompt you'll be submitting just one token at a time. For embedding it's different - you must make the batch size as large as the largest input you'll ever need an embedding for, since it can't (currently) be split across multiple batches.
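
If you do keep a fixed batch size, a defensive check before calling the embedder avoids the ArgumentException (a sketch built on the snippet earlier in this thread; maxBatchSize is a hypothetical variable holding whatever BatchSize you configured):

// Sketch: reject (or chunk) inputs that would not fit in a single batch.
var tokenCount = this._textTokenizer.CountTokens(text);
if (tokenCount > maxBatchSize) // maxBatchSize: the BatchSize set in ModelParams (assumed)
{
    throw new ArgumentException(
        $"Input is {tokenCount} tokens, but the configured batch size is {maxBatchSize}.");
}

var embeddings = await this._embedder.GetEmbeddings(text);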
