1 parent 1138a5c, commit c6e6ff5
Showing 16 changed files with 472 additions and 9 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
{ | ||
"name": "Python 3", | ||
"image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye", | ||
"features": { | ||
"ghcr.io/meaningful-ooo/devcontainer-features/fish:1": { | ||
"fisher": true | ||
}, | ||
"ghcr.io/devcontainers/features/conda:1": { | ||
"addCondaForge": true, | ||
"version": "latest" | ||
} | ||
}, | ||
"postCreateCommand": "pip3 install --user -r dev.requirements.txt" | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
[flake8] | ||
exclude = fixtures, build, examples, temp* | ||
plugins = flake8_import_order, flake8_docstrings, flake8_comprehensions, flake8_bugbear, flake8_annotations, pep8_naming, flake8_simplify | ||
max-line-length = 120 | ||
ignore = E203, W503, ANN101, ANN102, I201, ANN401, D401, SIM115 |
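The `E203` and `W503` entries in the `ignore` list are commonly disabled in projects formatted with Black, because Black's output triggers both checks. The sketch below illustrates the two patterns; the function names are hypothetical and are not part of this commit.

```python
def window(values: list[int], start: int, width: int) -> list[int]:
    """Return `width` items beginning one position after `start`."""
    # Black puts spaces around ':' in complex slices; E203 would flag the
    # whitespace before ':'.
    return values[start + 1 : start + 1 + width]


def total(first: int, second: int, third: int) -> int:
    """Return the sum of three integers."""
    # Black breaks long expressions before binary operators; W503 would
    # flag each line that starts with '+'.
    return (
        first
        + second
        + third
    )
```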
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
name: Linting and Pre-commit checks | ||
|
||
on: | ||
pull_request: | ||
push: | ||
branches: | ||
- main | ||
|
||
jobs: | ||
pre-commit: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- name: Check out code | ||
uses: actions/checkout@v2 | ||
|
||
- name: Set up Python | ||
uses: actions/setup-python@v2 | ||
with: | ||
python-version: '3.12' | ||
|
||
- name: Install pre-commit | ||
run: pip install pre-commit | ||
|
||
- name: Run pre-commit hooks | ||
run: pre-commit run --all-files |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
name: PyPI Release | ||
|
||
on: | ||
workflow_dispatch: | ||
|
||
jobs: | ||
release: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- name: Checkout code | ||
uses: actions/checkout@v2 | ||
- name: Set up Python | ||
uses: actions/setup-python@v2 | ||
with: | ||
python-version: '3.12' | ||
- name: Install Poetry | ||
run: | | ||
pip install poetry | ||
- name: Build and publish package | ||
run: | | ||
poetry config pypi-token.pypi ${{ secrets.PYPI_TOKEN }} | ||
poetry build | ||
poetry publish | ||
- run: pip install githubrelease | ||
- run: python scripts/gh_release.py | ||
env: | ||
GITHUB_TOKEN: ${{ secrets.GH_TOKEN }} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# See https://pre-commit.com for more information | ||
# See https://pre-commit.com/hooks.html for more hooks | ||
repos: | ||
- repo: https://github.com/pre-commit/pre-commit-hooks | ||
rev: v3.2.0 | ||
hooks: | ||
- id: trailing-whitespace | ||
- id: end-of-file-fixer | ||
- id: check-yaml | ||
- id: check-added-large-files | ||
- repo: https://github.com/psf/black | ||
rev: 24.4.2 | ||
hooks: | ||
- id: black | ||
- repo: https://github.com/PyCQA/flake8 | ||
rev: 7.1.0 | ||
hooks: | ||
- id: flake8 | ||
exclude: 'venv|.conda|.git|.vscode|__pycache__|tests|examples|build|dist|.*.egg-info|.*.egg|temp*' | ||
additional_dependencies: [pep8-naming, flake8_import_order, flake8_docstrings, flake8_comprehensions, flake8_bugbear, flake8_annotations, flake8_simplify] | ||
- repo: https://github.com/pre-commit/mirrors-mypy | ||
rev: v1.10.1 | ||
hooks: | ||
- id: mypy | ||
exclude: 'venv|.conda|.git|.vscode|__pycache__|tests|examples|vendor|build|dist|.*.egg-info|.*.egg|temp*' | ||
args: | ||
- --follow-imports=silent | ||
- --ignore-missing-imports |
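The `exclude` values in this configuration are Python regular expressions that pre-commit searches for anywhere in each file path. They are not shell globs. The sketch below uses the flake8 hook's pattern to show how the `temp*` fragment behaves under that interpretation. It matches `tem` followed by any number of `p` characters, so it can exclude more files than a glob reading suggests. The file paths are hypothetical.

```python
import re

# Pattern copied from the flake8 hook's `exclude` key.
EXCLUDE = re.compile(
    r"venv|.conda|.git|.vscode|__pycache__|tests|examples|build|dist"
    r"|.*.egg-info|.*.egg|temp*"
)


def is_excluded(path: str) -> bool:
    """Return True if the pattern matches anywhere in the path."""
    return EXCLUDE.search(path) is not None


# Hypothetical paths, used only to illustrate the matching rules.
print(is_excluded("temp_output/run.py"))         # intended exclusion
print(is_excluded("multi_tokenizer/system.py"))  # "system" contains "tem"
print(is_excluded("multi_tokenizer/core.py"))    # no alternative matches
```

If only paths that start with `temp` should be skipped, an anchored alternative such as `^temp` is one option. Whether that fits depends on the repository layout.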
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,44 @@ | ||
# Tokenization of Multilingual Texts using Language-Specific Tokenizers | ||
|
||
## Approaches | ||
|
||
1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md) | ||
2. [Approach 2: Unified tokenization approach across languages using UTF-8 encodings](support/proposal_2.md) | ||
|
||
## Evaluation | ||
|
||
- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies) | ||
- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis) | ||
- [Comparative Analysis](support/evaluation.md#8-comparative-analysis) | ||
- [Implementation Plan](support/evaluation.md#9-implementation-plan) | ||
- [Future Expansion](support/evaluation.md#10-future-expansion) | ||
|
||
## Development Setup | ||
|
||
### Prerequisites | ||
- Use the Dev Container for easy setup | ||
- Install dev dependencies | ||
```bash | ||
pip install poetry | ||
poetry install | ||
``` | ||
|
||
### Linting, Formatting and Type Checking | ||
- Add the repository directory to Git's `safe.directory` list | ||
```bash | ||
git config --global --add safe.directory /workspaces/multi-tokenizer | ||
``` | ||
- Run the following command to lint and format the code | ||
```bash | ||
pre-commit run --all-files | ||
``` | ||
- To install pre-commit hooks, run the following command (Recommended) | ||
```bash | ||
pre-commit install | ||
``` | ||
|
||
### Running the tests | ||
Run the tests with the following command (the `-n` option is provided by the `pytest-xdist` plugin): | ||
```bash | ||
pytest -n "auto" | ||
``` |
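For reference, pytest collects functions named `test_*` from files named `test_*.py`. The sketch below shows the shape of such a test. The `split_on_whitespace` helper is a hypothetical stand-in, since this commit adds only the package docstring and no tokenizer code.

```python
def split_on_whitespace(text: str) -> list[str]:
    """Split text on runs of whitespace (hypothetical stand-in tokenizer)."""
    return text.split()


def test_split_on_whitespace() -> None:
    """Check that repeated spaces do not produce empty tokens."""
    assert split_on_whitespace("hola  mundo") == ["hola", "mundo"]
```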
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
pytest | ||
pre-commit | ||
poetry |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
"""Multi-tokenizer package.""" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
[mypy] | ||
python_version = 3.12 | ||
ignore_missing_imports = True | ||
explicit_package_bases = True | ||
|
||
[mypy-requests.*] | ||
ignore_missing_imports = True |