kl3m-embedding-research

Versions

There have been a number of breaking changes in the transformers, tokenizers, and CUDA libraries over the last 12 months.

To replicate the training process or run MTEB benchmarks (a quick environment check is sketched after this list), you may need to:

  • Use the transformers and tokenizers versions specified in the pyproject.toml file.
  • Remove cached files under ~/.cache/huggingface/hub/.
  • Check that you are running on H100s with a CUDA driver version between 535 and 560.
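A minimal sketch for checking the local environment against these requirements. It only uses the standard transformers, tokenizers, and torch packages; compare the printed versions against the pins in pyproject.toml.

# check_env.py -- quick environment sanity check (illustrative, not part of the repo)
import torch
import transformers
import tokenizers

print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # H100s should report an "NVIDIA H100 ..." device name here
    print("device:", torch.cuda.get_device_name(0))
    print("CUDA runtime:", torch.version.cuda)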

Description

This ALEA project contains the research pipeline for the KL3M embedding models.

(The KL3M tokenizers have been moved to the kl3m-tokenizers repository.)

Model Cards

TODO

Training a Model

You can replicate an existing model or train your own as follows:

  1. Pick a model configuration under the models/ directory.
  2. Review the config.json and training.json files for details related to the model architecture and training parameters.
  3. Run the training script for the model you want to train using the commands below.
  4. Monitor progress with the describe.py script using the commands below.

Model training can be resumed as long as the log.jsonl file is present in the model configuration or checkpoint path.
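As a rough illustration of steps 2 and the resume check above, the sketch below loads a model configuration and looks for log.jsonl. The file names come from this README; the code only prints the JSON keys, since the exact schema of config.json and training.json is not documented here.

# inspect_run.py -- illustrative sketch; JSON schemas are not assumed beyond key listing
import json
from pathlib import Path

model_dir = Path("models/kl3m-embedding-005")

# Step 2: review the architecture and training parameters
config = json.loads((model_dir / "config.json").read_text())
training = json.loads((model_dir / "training.json").read_text())
print("config keys:", sorted(config.keys()))
print("training keys:", sorted(training.keys()))

# Resume check: training can continue only if log.jsonl is present
log_path = model_dir / "log.jsonl"
if log_path.exists():
    last = json.loads(log_path.read_text().splitlines()[-1])
    print(f"resumable run found, last logged step: {last['step']}")
else:
    print("no log.jsonl found; training will start from scratch")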

DeBERTa-based Models

torch only (single GPU)

$ PYTHONPATH=. poetry run python3 kl3m_embeddings/embeddings/deberta/train_deberta_single.py models/kl3m-embedding-005/

torch + deepspeed (multi-node, multi-GPU)

$ DS_SKIP_CUDA_CHECK=1 PYTHONPATH=. poetry run deepspeed kl3m_embeddings/embeddings/deberta/train_deberta_deepspeed.py models/kl3m-embedding-005-deepspeed-2/

Monitoring Progress

$ PYTHONPATH=. poetry run python3 kl3m_embeddings/embeddings/describe.py models/kl3m-embedding-005/log.jsonl

Example Outputs

Progress Example

Training:   4%|█▊        | 7247/200000 [09:38<4:41:05, 11.43it/s, loss=1.37, loss_100=2.623, loss_1000=4.955, last_eval=5.69, grad_norm=1.12, lr=2.0e-04, step_time=0.08, token_rate=86553.61]

Sample Log Line (log.jsonl)

{"step": 2600, "epoch": 1, "lr": 0.0002, "sample_time": 0.0018472671508789062, "reduced_dim": 64, "task": "mlm", "num_samples": 128, "num_identifiers": 2, "num_tokens": 16384, "samples_by_dataset": {"ukleg": 64, "govinfo": 64}, "tokens_by_dataset": {"ukleg": 8192, "govinfo": 8192}, "loss": 8.297395706176758, "forward_time": 0.0015826225280761719, "backward_time": 0.0047855377197265625, "clip_threshold": 3.105683786869049, "step_time": 1.8407979011535645, "total_time": 298.2537636756897, "token_rate": 119195.14296106658, "time": "2024-10-22T09:12:42.395676"}

Sample Eval Line (eval.jsonl)

{"step": 2600, "mean": 6.590894358307123, "median": 6.974860191345215, "std": 1.9348315678504489, "min": 0.1022321879863739, "p5": 3.3413278245925904, "p95": 8.781183547973633, "max": 13.027746200561523, "num_samples": 1000, "svd_mean_ratio_1": 2.2945302575826645, "svd_median_ratio_1": 2.4049798250198364}

Example plots: loss_by_step.png, step_time_components.png, learning_rate_loss.png, svd_metrics.png
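As a hedged sketch, a plot similar to loss_by_step.png can be recreated directly from log.jsonl with matplotlib; this is illustrative only and is not the repository's actual plotting code, and the output path and styling are assumptions.

# plot_loss.py -- illustrative recreation of a loss-by-step plot
import json
import matplotlib.pyplot as plt

steps, losses = [], []
with open("models/kl3m-embedding-005/log.jsonl") as f:
    for line in f:
        record = json.loads(line)
        steps.append(record["step"])
        losses.append(record["loss"])

plt.plot(steps, losses, linewidth=0.8)
plt.xlabel("step")
plt.ylabel("loss")
plt.title("Training loss by step")
plt.savefig("loss_by_step.png", dpi=150)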

License

This ALEA project is released under the MIT License. See the LICENSE file for details.

Support

If you encounter any issues or have questions about using this ALEA project, please open an issue on GitHub.

Learn More

To learn more about ALEA and its software and research projects like KL3M, visit the ALEA website.
