feat: Update multi-tokenizer to version 0.1.1 and improve Chinese output in proposal_2.md
chandralegend committed Jul 21, 2024
1 parent ed7330c commit 242d1c9
Showing 2 changed files with 28 additions and 13 deletions.
39 changes: 27 additions & 12 deletions README.md
@@ -1,22 +1,16 @@
-# Tokenization of Multilingual Texts using Language-Specific Tokenizers
+# Multi-Tokenizer
+Tokenization of Multilingual Texts using Language-Specific Tokenizers

-## Approaches
+[![PyPI version](https://img.shields.io/pypi/v/multi-tokenizer.svg)](https://pypi.org/project/multi-tokenizer/)

-1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md)
-2. [Approach 2: Unified tokenization approach across languages using UTF-8 encodings](support/proposal_2.md)
+## Overview

-## Evaluation
-
-- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies)
-- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis)
-- [Comparative Analysis](support/evaluation.md#8-comparative-analysis)
-- [Implementation Plan](support/evaluation.md#9-implementation-plan)
-- [Future Expansion](support/evaluation.md#10-future-expansion)
+Multi-Tokenizer is a Python package for tokenizing multilingual text with language-specific tokenizers. It is designed for use in a variety of applications, including natural language processing, machine learning, and data analysis. Behind the scenes, the package uses the `lingua` library to detect the language of each text segment and the `tokenizers` library to build language-specific tokenizers, then tokenizes each segment with the appropriate tokenizer. Multi-Tokenizer also introduces special tokens that mark the language of each segment; these let models differentiate between the languages in the text and allow the original segments to be reconstructed after tokenization.

## Development Setup

### Prerequisites
-- Use the Dev Container for easy setup
+- Use the VSCode Dev Containers for easy setup (Recommended)
- Install dev dependencies
```bash
pip install poetry
@@ -42,3 +36,24 @@ Run the tests using the following command
```bash
pytest -n "auto"
```

+## Approaches
+
+1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md)
+2. [Approach 2: Unified tokenization approach across languages using UTF-8 encodings](support/proposal_2.md)
+
+## Evaluation
+
+- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies)
+- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis)
+- [Comparative Analysis](support/evaluation.md#8-comparative-analysis)
+- [Implementation Plan](support/evaluation.md#9-implementation-plan)
+- [Future Expansion](support/evaluation.md#10-future-expansion)
+
+## Contributors
+
+- [Rob Neuhaus](https://github.com/rrenaud) - ⛴👨🏻‍✈️
+- [Chandra Irugalbandara](https://github.com/chandralegend)
+- [Alvanli](https://github.com/alvanli)
+- [Vishnu Vardhan](https://github.com/VishnuVardhanSaiLanka)
+- [Anthony Susevski](https://github.com/asusevski)
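
The overview added above describes the core mechanism: detect the language of each segment of the text, tokenize each segment with that language's tokenizer, and wrap the result in language-marking special tokens so the original segments can be recovered. Below is a minimal sketch of that idea — not the multi-tokenizer API, which this diff does not show. Only the `lingua` calls are real library API (from the `lingua-language-detector` package); the stand-in tokenizers and the `<en>`/`</en>` marker format are illustrative assumptions.

```python
from lingua import Language, LanguageDetectorBuilder

# Restrict detection to the languages we have tokenizers for.
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.CHINESE
).build()

# Stand-in tokenizers; the real package would load trained `tokenizers` models.
tokenize_by_language = {
    Language.ENGLISH: lambda s: s.split(),
    Language.CHINESE: lambda s: list(s.replace(" ", "")),
}

def tokenize_multilingual(text: str) -> list[str]:
    """Tokenize each detected monolingual span with its own tokenizer,
    wrapping it in (assumed) language-marker special tokens."""
    tokens: list[str] = []
    for span in detector.detect_multiple_languages_of(text):
        language = span.language
        segment = text[span.start_index:span.end_index]
        code = language.iso_code_639_1.name.lower()  # e.g. "en", "zh"
        tokens.append(f"<{code}>")   # hypothetical language-start marker
        tokens.extend(tokenize_by_language[language](segment.strip()))
        tokens.append(f"</{code}>")  # hypothetical language-end marker
    return tokens

print(tokenize_multilingual("Hello world 你好世界"))
# e.g. ['<en>', 'Hello', 'world', '</en>', '<zh>', '你', '好', '世', '界', '</zh>']
```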
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "multi-tokenizer"
version = "0.1.0"
version = "0.1.1"
description = ""
authors = ["chandralegend <irugalbandarachandra@gmail.com>"]
license = "MIT"
