Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network

Instant CLIP Tokenizer is a fast pure-Rust text tokenizer for OpenAI's CLIP model. It is intended to be a replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with OpenCLIP and other implementations using the same tokenizer.

In addition to being usable as a Rust crate, it includes Python bindings built with PyO3, so it can also be used as a native Python module.

In the microbenchmarks included in this repository, Instant CLIP Tokenizer is ~70x faster than the Python implementation (with preprocessing and caching disabled to ensure a fair comparison).

Using the library

Rust

[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }

Python (>= 3.9)

pip install instant-clip-tokenizer

Using the library requires numpy >= 1.16.0 to be installed in your Python environment (e.g., via pip install numpy).

Examples

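use instant_clip_tokenizer::{Token, Tokenizer};

fn main() {
    // Create a tokenizer.
    let tokenizer = Tokenizer::new();

    // `encode` writes the tokens for the input text into the provided vector.
    let mut tokens = Vec::new();
    tokenizer.encode("A person riding a motorcycle", &mut tokens);

    // Convert each `Token` to its numeric id for printing.
    let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
    println!("{:?}", tokens);
    // -> [320, 2533, 6765, 320, 10297]
}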

import instant_clip_tokenizer

tokenizer = instant_clip_tokenizer.Tokenizer()

tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)

# -> [320, 2533, 6765, 320, 10297]

batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)

# -> [[49406   320  2533  6765 49407]
#     [49406  1883   997 49407     0]]
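The batch output above suggests a fixed-width row layout: a start marker (49406), the token ids, an end marker (49407), and zero padding up to context_length, with longer inputs truncated to fit. The following is a minimal pure-Python sketch of that layout, inferred only from the sample output; it is not the library's implementation. The function name pad_batch and the constant names are hypothetical, and the ids [1883, 997] for "Hi there" are read off the second output row.

```python
# Values taken from the sample batch output; the names are illustrative only.
START_OF_TEXT = 49406
END_OF_TEXT = 49407
PAD = 0


def pad_batch(encoded, context_length):
    """Lay out already-encoded token id lists as fixed-width rows (a sketch)."""
    if context_length < 2:
        raise ValueError("context_length must leave room for start and end markers")
    rows = []
    for tokens in encoded:
        # Reserve two slots for the start and end markers, truncating if needed.
        body = list(tokens)[: context_length - 2]
        row = [START_OF_TEXT, *body, END_OF_TEXT]
        # Fill any remaining slots with the padding value.
        row += [PAD] * (context_length - len(row))
        rows.append(row)
    return rows


batch = pad_batch([[320, 2533, 6765, 320, 10297], [1883, 997]], context_length=5)
for row in batch:
    print(row)
```

With context_length=5, the first input keeps only its first three token ids, while the second is padded with a single zero, matching the rows shown above.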

Testing

To run the tests, use:

cargo test --all-features

You can also test the Python bindings with:

make test-python
