feat: Update multi-tokenizer to version 0.1.1 and improve Chinese output in proposal_2.md
chandralegend committed Jul 21, 2024
1 parent ed7330c commit 242d1c9
Showing 2 changed files with 28 additions and 13 deletions.
39 changes: 27 additions & 12 deletions README.md
@@ -1,22 +1,16 @@
-# Tokenization of Multilingual Texts using Language-Specific Tokenizers
+# Multi-Tokenizer
+Tokenization of Multilingual Texts using Language-Specific Tokenizers

-## Approaches
+[![PyPI version](https://img.shields.io/pypi/v/multi-tokenizer.svg)](https://pypi.org/project/multi-tokenizer/)

-1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md)
-2. [Approach 2: Unified tokenization approach across languages using UTF-8 encodings](support/proposal_2.md)
+## Overview

-## Evaluation
-
-- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies)
-- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis)
-- [Comparative Analysis](support/evaluation.md#8-comparative-analysis)
-- [Implementation Plan](support/evaluation.md#9-implementation-plan)
-- [Future Expansion](support/evaluation.md#10-future-expansion)
+Multi-Tokenizer is a Python package for tokenizing multilingual text with language-specific tokenizers. It is designed for use in a variety of applications, including natural language processing, machine learning, and data analysis. Behind the scenes, the package uses the `lingua` library to detect the language of each text segment and the `tokenizers` library to build language-specific tokenizers, then tokenizes each segment with the appropriate tokenizer. Multi-Tokenizer also introduces special tokens that mark the language of each segment; these let models differentiate between the languages in the text and allow the original segments to be reconstructed after tokenization.

## Development Setup

### Prerequisites
-- Use the Dev Container for easy setup
+- Use the VSCode Dev Containers for easy setup (Recommended)
- Install dev dependencies
```bash
pip install poetry
@@ -42,3 +36,24 @@ Run the tests using the following command
```bash
pytest -n "auto"
```

+## Approaches
+
+1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md)
+2. [Approach 2: Unified tokenization approach across languages using UTF-8 encodings](support/proposal_2.md)
+
+## Evaluation
+
+- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies)
+- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis)
+- [Comparative Analysis](support/evaluation.md#8-comparative-analysis)
+- [Implementation Plan](support/evaluation.md#9-implementation-plan)
+- [Future Expansion](support/evaluation.md#10-future-expansion)
+
+## Contributors
+
+- [Rob Neuhaus](https://github.com/rrenaud) - ⛴👨🏻‍✈️
+- [Chandra Irugalbandara](https://github.com/chandralegend)
+- [Alvanli](https://github.com/alvanli)
+- [Vishnu Vardhan](https://github.com/VishnuVardhanSaiLanka)
+- [Anthony Susevski](https://github.com/asusevski)
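
The overview added above describes the core mechanism: detect the language of each segment of the text, tokenize each segment with that language's tokenizer, and wrap the result in language-marking special tokens so the original segments can be recovered. Below is a minimal sketch of that idea — not the multi-tokenizer API, which this diff does not show. Only the `lingua` calls are real library API (from the `lingua-language-detector` package); the stand-in tokenizers and the `<en>`/`</en>` marker format are illustrative assumptions.

```python
from lingua import Language, LanguageDetectorBuilder

# Restrict detection to the languages we have tokenizers for.
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.CHINESE
).build()

# Stand-in tokenizers; the real package would load trained `tokenizers` models.
tokenize_by_language = {
    Language.ENGLISH: lambda s: s.split(),
    Language.CHINESE: lambda s: list(s.replace(" ", "")),
}

def tokenize_multilingual(text: str) -> list[str]:
    """Tokenize each detected monolingual span with its own tokenizer,
    wrapping it in (assumed) language-marker special tokens."""
    tokens: list[str] = []
    for span in detector.detect_multiple_languages_of(text):
        language = span.language
        segment = text[span.start_index:span.end_index]
        code = language.iso_code_639_1.name.lower()  # e.g. "en", "zh"
        tokens.append(f"<{code}>")   # hypothetical language-start marker
        tokens.extend(tokenize_by_language[language](segment.strip()))
        tokens.append(f"</{code}>")  # hypothetical language-end marker
    return tokens

print(tokenize_multilingual("Hello world 你好世界"))
# e.g. ['<en>', 'Hello', 'world', '</en>', '<zh>', '你', '好', '世', '界', '</zh>']
```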
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "multi-tokenizer"
version = "0.1.0"
version = "0.1.1"
description = ""
authors = ["chandralegend <irugalbandarachandra@gmail.com>"]
license = "MIT"
