diff --git a/README.md b/README.md
index f5917e4..e60b160 100644
--- a/README.md
+++ b/README.md
@@ -1,22 +1,16 @@
-# Tokenization of Multilingual Texts using Language-Specific Tokenizers
+# Multi-Tokenizer
+Tokenization of Multilingual Texts using Language-Specific Tokenizers
 
-## Approaches
+[![PyPI version](https://img.shields.io/pypi/v/multi-tokenizer.svg)](https://pypi.org/project/multi-tokenizer/)
 
-1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md)
-2. [Approach 2: Unified tokenization approach across languages using utf-8 encondings](support/proposal_2.md)
+## Overview
 
-## Evaluation
-
-- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies)
-- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis)
-- [Comparative Analysis](support/evaluation.md#8-comparative-analysis)
-- [Implementation Plan](support/evaluation.md#9-implementation-plan)
-- [Future Expansion](support/evaluation.md#10-future-expansion)
+Multi-Tokenizer is a Python package that tokenizes multilingual texts with language-specific tokenizers. It is designed for a variety of applications, including natural language processing, machine learning, and data analysis. Behind the scenes, the package uses the `lingua` library to detect the language of each text segment, creates language-specific tokenizers with the `tokenizers` library, and then tokenizes each segment with the appropriate tokenizer. Multi-Tokenizer also introduces special tokens that mark each language-specific segment; these make it possible to reconstruct the original text after tokenization and let models differentiate between the languages in the input.
 
 ## Development Setup
 
 ### Prerequisites
 
-- Use the Dev Container for easy setup
+- Use the VS Code Dev Containers extension for easy setup (recommended)
 - Install dev dependencies
 
   ```bash
   pip install poetry
@@ -42,3 +36,24 @@ Run the tests using the following command
 ```bash
 pytest -n "auto"
 ```
+
+## Approaches
+
+1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md)
+2. [Approach 2: Unified tokenization approach across languages using UTF-8 encodings](support/proposal_2.md)
+
+## Evaluation
+
+- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies)
+- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis)
+- [Comparative Analysis](support/evaluation.md#8-comparative-analysis)
+- [Implementation Plan](support/evaluation.md#9-implementation-plan)
+- [Future Expansion](support/evaluation.md#10-future-expansion)
+
+## Contributors
+
+- [Rob Neuhaus](https://github.com/rrenaud) - ⛴👨🏻‍✈️
+- [Chandra Irugalbandara](https://github.com/chandralegend)
+- [Alvanli](https://github.com/alvanli)
+- [Vishnu Vardhan](https://github.com/VishnuVardhanSaiLanka)
+- [Anthony Susevski](https://github.com/asusevski)
diff --git a/pyproject.toml b/pyproject.toml
index fd9f5c9..ddf6fa8 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "multi-tokenizer"
-version = "0.1.0"
+version = "0.1.1"
 description = ""
 authors = ["chandralegend "]
 license = "MIT"
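
For illustration, here is a minimal sketch of the routing idea the new Overview section describes: detect per-language segments with `lingua`, then tokenize each segment with a language-specific tokenizer from the `tokenizers` library, wrapping each segment in language marker tokens. The tokenizer checkpoints, the marker-token strings, and the `tokenize_multilingual` helper are assumptions made for this sketch, not Multi-Tokenizer's actual internals or public API.

```python
# Sketch of language-routed tokenization as described in the README overview.
# Checkpoints and marker tokens below are illustrative, not the package's own.
from lingua import Language, LanguageDetectorBuilder
from tokenizers import Tokenizer

# Build a detector restricted to the languages we expect to see.
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH
).build()

# One pre-trained tokenizer per language (checkpoints chosen for illustration).
tokenizers_by_lang = {
    Language.ENGLISH: Tokenizer.from_pretrained("bert-base-uncased"),
    Language.FRENCH: Tokenizer.from_pretrained("camembert-base"),
}

def tokenize_multilingual(text: str) -> list[str]:
    """Detect per-language segments, tokenize each with the matching
    tokenizer, and wrap every segment in language marker tokens."""
    tokens: list[str] = []
    for result in detector.detect_multiple_languages_of(text):
        segment = text[result.start_index : result.end_index]
        lang = result.language
        marker = lang.iso_code_639_1.name.lower()  # e.g. "en", "fr"
        tokens.append(f"<{marker}>")
        tokens.extend(tokenizers_by_lang[lang].encode(segment).tokens)
        tokens.append(f"</{marker}>")
    return tokens

print(tokenize_multilingual("The dinner was delicious. Merci beaucoup!"))
```

Marker tokens like `<en>`/`</en>` are what make the language boundaries recoverable after tokenization, which is the property the Overview highlights.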