Commit

init python package structure
chandralegend committed Jul 19, 2024
1 parent 1138a5c commit c6e6ff5
Showing 16 changed files with 472 additions and 9 deletions.
14 changes: 14 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,14 @@
{
  "name": "Python 3",
  "image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye",
  "features": {
    "ghcr.io/meaningful-ooo/devcontainer-features/fish:1": {
      "fisher": true
    },
    "ghcr.io/devcontainers/features/conda:1": {
      "addCondaForge": true,
      "version": "latest"
    }
  },
  "postCreateCommand": "pip3 install --user -r dev.requirements.txt"
}
5 changes: 5 additions & 0 deletions .flake8
@@ -0,0 +1,5 @@
[flake8]
exclude = fixtures, build, examples, temp*
plugins = flake8_import_order, flake8_docstrings, flake8_comprehensions, flake8_bugbear, flake8_annotations, pep8_naming, flake8_simplify
max-line-length = 120
ignore = E203, W503, ANN101, ANN102, I201, ANN401, D401, SIM115
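Two of the ignored codes, E203 and W503, are commonly disabled when Black is the formatter, because Black's output triggers them. The snippet below is a small illustration, not code from this commit; the function names `tail` and `total` are made up for the example.

```python
# Illustrative only: `tail` and `total` are hypothetical helpers, not part of this commit.


def tail(items: list[int], offset: int) -> list[int]:
    """Return the items after position `offset`."""
    # Black puts spaces around ':' when a slice bound is a complex expression.
    # pycodestyle reports the space before ':' as E203.
    return items[offset + 1 :]


def total(first: int, second: int, third: int) -> int:
    """Add three numbers."""
    # When an expression is too long for one line, Black breaks it *before*
    # each binary operator. pycodestyle reports that layout as W503.
    # (The expression is kept short here for readability.)
    return (
        first
        + second
        + third
    )


print(tail([1, 2, 3, 4], 1))
print(total(1, 2, 3))
```

Ignoring both codes lets flake8 and Black run in the same pre-commit pass without contradicting each other.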
25 changes: 25 additions & 0 deletions .github/workflows/linitng.yml
@@ -0,0 +1,25 @@
name: Linting and Pre-commit checks

on:
  pull_request:
  push:
    branches:
      - main

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.12

      - name: Install pre-commit
        run: pip install pre-commit

      - name: Run pre-commit hooks
        run: pre-commit run --all-files
27 changes: 27 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,27 @@
name: PyPI Release

on:
  workflow_dispatch:

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.12'
      - name: Install Poetry
        run: |
          pip install poetry
      - name: Build and publish package
        run: |
          poetry config pypi-token.pypi ${{ secrets.PYPI_TOKEN }}
          poetry build
          poetry publish
      - run: pip install githubrelease
      - run: python scripts/gh_release.py
        env:
          GITHUB_TOKEN: ${{ secrets.GH_TOKEN }}
28 changes: 28 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,28 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 7.1.0
    hooks:
      - id: flake8
        exclude: 'venv|.conda|.git|.vscode|__pycache__|tests|examples|build|dist|.*.egg-info|.*.egg|temp*'
        additional_dependencies: [pep8-naming, flake8_import_order, flake8_docstrings, flake8_comprehensions, flake8_bugbear, flake8_annotations, flake8_simplify]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.1
    hooks:
      - id: mypy
        exclude: 'venv|.conda|.git|.vscode|__pycache__|tests|examples|vendor|build|dist|.*.egg-info|.*.egg|temp*'
        args:
          - --follow-imports=silent
          - --ignore-missing-imports
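The `exclude` values in this config are regular expressions, not glob patterns. As far as I understand pre-commit's behavior, the pattern is applied as an unanchored search against each file path, so fragments such as `temp*` or `build` can match more files than intended. The sketch below approximates that matching with Python's `re` module; `is_excluded` and the sample paths are hypothetical and exist only for illustration.

```python
import re

# Exclude pattern copied from the flake8 hook in this config.
EXCLUDE = (
    r"venv|.conda|.git|.vscode|__pycache__|tests|examples|build|dist"
    r"|.*.egg-info|.*.egg|temp*"
)


def is_excluded(path: str) -> bool:
    """Approximate pre-commit's exclude check (assumed: unanchored regex search)."""
    return re.search(EXCLUDE, path) is not None


# Hypothetical paths, chosen to show intended and unintended matches.
sample_paths = [
    "tests/test_basic.py",           # intended: matches "tests"
    "multi_tokenizer/tokenizer.py",  # not excluded
    "multi_tokenizer/system.py",     # unintended: "temp*" matches "tem"
    "multi_tokenizer/builder.py",    # unintended: matches "build"
]

for path in sample_paths:
    print(f"{path}: excluded={is_excluded(path)}")
```

If stricter matching is wanted, one option is to anchor each alternative to a directory boundary, for example `^(tests|build|dist)/`, and to write `temp.*` where a glob-style `temp*` was meant.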
40 changes: 35 additions & 5 deletions README.md
@@ -1,14 +1,44 @@
# Tokenization of Multilingual Texts using Language-Specific Tokenizers
## Approaches
# Tokenization of Multilingual Texts using Language-Specific Tokenizers

## Approaches

1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md)
2. [Approach 2: Unified tokenization approach across languages using UTF-8 encodings](support/proposal_2.md)

# Evaluation
## Evaluation

- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies)
- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis)
- [Comparative Analysis](support/evaluation.md#8-comparative-analysis)
- [Implementation Plan](support/evaluation.md#9-implementation-plan)
- [Future Expansion](support/evaluation.md#10-future-expansion)
- [Future Expansion](support/evaluation.md#10-future-expansion)

## Development Setup

### Prerequisites
- Use the Dev Container for easy setup
- Install dev dependencies
```bash
pip install poetry
poetry install
```

### Linting, Formatting and Type Checking
- Add the directory to safe.directory
```bash
git config --global --add safe.directory /workspaces/multi-tokenizer
```
- Run the following command to lint and format the code
```bash
pre-commit run --all-files
```
- To install pre-commit hooks, run the following command (Recommended)
```bash
pre-commit install
```

### Running the tests
Run the tests using the following command
```bash
pytest -n "auto"
```
3 changes: 3 additions & 0 deletions dev.requirements.txt
@@ -0,0 +1,3 @@
pytest
pre-commit
poetry
Empty file added docs/.gitkeep
Empty file.
1 change: 1 addition & 0 deletions multi_tokenizer/__init__.py
@@ -0,0 +1 @@
"""Multi-tokenizer package."""
7 changes: 7 additions & 0 deletions mypy.ini
@@ -0,0 +1,7 @@
[mypy]
python_version = 3.12
ignore_missing_imports = True
explicit_package_bases = True

[mypy-requests.*]
ignore_missing_imports = True