Tokenization of Multilingual Texts using Language-Specific Tokenizers
Multi-Tokenizer is a Python package that tokenizes multilingual text using language-specific tokenizers. It is designed for use in a variety of applications, including natural language processing, machine learning, and data analysis. Behind the scenes, the package uses the lingua library to detect the language of each text segment, builds language-specific tokenizers with the tokenizers library, and then tokenizes each segment with the appropriate tokenizer. Multi-Tokenizer introduces additional special tokens to mark the language-specific segments; these tokens make it possible to reconstruct the original text after tokenization and allow models to differentiate between the languages in the input.
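To make the flow concrete, here is a minimal sketch of the same idea using the lingua and tokenizers libraries directly: detect language segments, route each segment to its own tokenizer, and mark the switch with a language tag. The tokenize_multilingual helper and the Hugging Face checkpoints named below are illustrative assumptions, not the package's internal implementation.
from lingua import Language, LanguageDetectorBuilder
from tokenizers import Tokenizer
# Detector restricted to the languages we expect in the input.
detector = LanguageDetectorBuilder.from_languages(Language.ENGLISH, Language.HINDI).build()
# One tokenizer per language plus a language tag used as a special token.
# The checkpoints are placeholders; any Hub repo that ships a tokenizer.json works.
lang_tokenizers = {
    Language.ENGLISH: ("<EN>", Tokenizer.from_pretrained("gpt2")),
    Language.HINDI: ("<HI>", Tokenizer.from_pretrained("bert-base-multilingual-cased")),
}
def tokenize_multilingual(text: str) -> list[str]:
    """Detect per-language segments, then tokenize each with its own tokenizer."""
    tokens = []
    for result in detector.detect_multiple_languages_of(text):
        segment = text[result.start_index:result.end_index]
        tag, tok = lang_tokenizers[result.language]
        tokens.append(tag)  # marks the language switch, so the text can be reconstructed
        tokens.extend(tok.encode(segment).tokens)
    return tokens
print(tokenize_multilingual("Translate this to english - बिल्ली बहुत प्यारी है."))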
pip install multi-tokenizer
git clone https://github.com/chandralegend/multi-tokenizer.git
cd multi-tokenizer
pip install .
from multi_tokenizer import MultiTokenizer, PretrainedTokenizers
# specify the language tokenizers to be used
lang_tokenizers = [
PretrainedTokenizers.ENGLISH,
PretrainedTokenizers.CHINESE,
PretrainedTokenizers.HINDI,
]
# create a multi-tokenizer object (split_text=True to split the text into segments, for better language detection)
tokenizer = MultiTokenizer(lang_tokenizers, split_text=True)
sentence = "Translate this hindi sentence to english - बिल्ली बहुत प्यारी है."
# Pretokenize the text
pretokenized_text = tokenizer.pre_tokenize(sentence) # [('<EN>', (0, 1)), ('Translate', (1, 10)), ('Ġthis', (10, 15)), ('Ġhindi', (15, 21)), ...]
# Encode the text
ids, tokens = tokenizer.encode(sentence) # [3, 7235, 6614, 86, 755, 775, 10763, 83, 19412, 276, ...], ['<EN>', 'Tr', 'ans', 'l', 'ate', 'Ġthis', 'Ġhind', ...]
# Decode the tokens
decoded_text = tokenizer.decode(ids) # Translate this hindi sentence to english - बिल्ली बहुत प्यारी है.
- Use the VSCode Dev Containers for easy setup (Recommended)
- Install dev dependencies
pip install poetry
poetry install
- Add the directory to safe.directory
git config --global --add safe.directory /workspaces/multi-tokenizer
- Run the following command to lint and format the code
pre-commit run --all-files
- To install pre-commit hooks, run the following command (Recommended)
pre-commit install
- Run the tests using the following command
pytest -n "auto"
- Approach 1: Individual tokenizers for each language
- Approach 2: Unified tokenization across languages using UTF-8 encodings (see the sketch below)
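For contrast with Approach 1, here is a minimal sketch of what Approach 2 could look like with the tokenizers library: a single byte-level BPE tokenizer trained across languages, so any UTF-8 string maps onto one shared vocabulary. The corpus, vocabulary size, and special tokens below are illustrative assumptions, not the package's actual training setup.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
# One tokenizer for all languages: byte-level BPE over raw UTF-8 bytes.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=32000,  # assumed size
    special_tokens=["<EN>", "<ZH>", "<HI>"],  # language tags, as in the usage example
)
# Tiny stand-in corpus; a real run would use large multilingual text.
corpus = [
    "Translate this hindi sentence to english.",
    "बिल्ली बहुत प्यारी है.",
    "这只猫非常可爱。",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)
print(tokenizer.encode("बिल्ली बहुत प्यारी है.").tokens)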