There have been a number of breaking changes in the `transformers`, `tokenizers`, and CUDA libraries over the last 12 months. In order to replicate the training process or run `mteb` benchmarks, you may need to:
- Use the `transformers` and `tokenizers` versions specified in the `pyproject.toml` file.
- Remove cached files under `~/.cache/huggingface/hub/`.
- Check that you are running on H100s with an NVIDIA driver version between 535 and 560.
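A typical environment reset might look like the following; this is a minimal sketch, and the exact commands (Poetry install, cache removal, driver check) are illustrative rather than a documented workflow:

```bash
# Install the pinned dependency versions from pyproject.toml
poetry install

# Remove previously cached Hugging Face files
rm -rf ~/.cache/huggingface/hub/

# Confirm GPU model and driver version (expected: H100s, driver 535-560)
nvidia-smi --query-gpu=name,driver_version --format=csv
```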
This ALEA project contains the research pipeline for the KL3M embedding models.
(The KL3M tokenizers have been moved to the kl3m-tokenizers repository.)
TODO
You can replicate or train your own model like this:
- Pick a model configuration under the `models/` directory.
- Review the `config.json` and `training.json` files for details related to the model architecture and training parameters.
- Run the training script for the model you want to train using the commands below.
- Monitor progress with the `describe.py` script using the commands below.
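For orientation, a model configuration directory referenced in these steps looks roughly like this; the layout is inferred from the files mentioned above and in the sample output below, not taken from project documentation:

```text
models/kl3m-embedding-005/
├── config.json      # model architecture
├── training.json    # training hyperparameters
├── log.jsonl        # per-step training log (enables resumption)
└── eval.jsonl       # periodic evaluation results
```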
Model training can be resumed as long as the `log.jsonl` file is present in the model configuration or checkpoint path.
Single-process training:

```bash
$ PYTHONPATH=. poetry run python3 kl3m_embeddings/embeddings/deberta/train_deberta_single.py models/kl3m-embedding-005/
```

DeepSpeed training:

```bash
$ DS_SKIP_CUDA_CHECK=1 PYTHONPATH=. poetry run deepspeed kl3m_embeddings/embeddings/deberta/train_deberta_deepspeed.py models/kl3m-embedding-005-deepspeed-2/
```

Monitoring with `describe.py`:

```bash
$ PYTHONPATH=. poetry run python3 kl3m_embeddings/embeddings/describe.py models/kl3m-embedding-005/log.jsonl
```
Progress Example:

```text
Training: 4%|█▊ | 7247/200000 [09:38<4:41:05, 11.43it/s, loss=1.37, loss_100=2.623, loss_1000=4.955, last_eval=5.69, grad_norm=1.12, lr=2.0e-04, step_time=0.08, token_rate=86553.61]
```
Sample Log Line (`log.jsonl`):

```json
{"step": 2600, "epoch": 1, "lr": 0.0002, "sample_time": 0.0018472671508789062, "reduced_dim": 64, "task": "mlm", "num_samples": 128, "num_identifiers": 2, "num_tokens": 16384, "samples_by_dataset": {"ukleg": 64, "govinfo": 64}, "tokens_by_dataset": {"ukleg": 8192, "govinfo": 8192}, "loss": 8.297395706176758, "forward_time": 0.0015826225280761719, "backward_time": 0.0047855377197265625, "clip_threshold": 3.105683786869049, "step_time": 1.8407979011535645, "total_time": 298.2537636756897, "token_rate": 119195.14296106658, "time": "2024-10-22T09:12:42.395676"}
```
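If you want a quick look at the training log without `describe.py`, the JSONL fields shown above can be read with a few lines of Python. This is a minimal sketch based only on the sample record, not the project's own tooling:

```python
import json
from pathlib import Path


def summarize_log(path: str, window: int = 100) -> None:
    """Print the latest step, recent mean loss, and token rate from a log.jsonl file."""
    records = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    recent = records[-window:]
    mean_loss = sum(r["loss"] for r in recent) / len(recent)
    last = records[-1]
    print(f"step={last['step']}  loss_{window}={mean_loss:.3f}  token_rate={last['token_rate']:.0f}")


summarize_log("models/kl3m-embedding-005/log.jsonl")
```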
Sample Eval Line (`eval.jsonl`):

```json
{"step": 2600, "mean": 6.590894358307123, "median": 6.974860191345215, "std": 1.9348315678504489, "min": 0.1022321879863739, "p5": 3.3413278245925904, "p95": 8.781183547973633, "max": 13.027746200561523, "num_samples": 1000, "svd_mean_ratio_1": 2.2945302575826645, "svd_median_ratio_1": 2.4049798250198364}
```
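The evaluation history can be reduced to a (step, mean loss) series in the same way; again a sketch, and the `eval.jsonl` path is assumed to sit next to `log.jsonl` in the model directory:

```python
import json
from pathlib import Path


def eval_curve(path: str) -> list[tuple[int, float]]:
    """Return (step, mean eval loss) pairs from an eval.jsonl file, sorted by step."""
    rows = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    return sorted((row["step"], row["mean"]) for row in rows)


# Print the five most recent evaluation points (path is illustrative)
for step, mean_loss in eval_curve("models/kl3m-embedding-005/eval.jsonl")[-5:]:
    print(f"step={step:>7d}  mean_eval_loss={mean_loss:.3f}")
```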
This ALEA project is released under the MIT License. See the LICENSE file for details.
If you encounter any issues or have questions about using this ALEA project, please open an issue on GitHub.
To learn more about ALEA and its software and research projects like KL3M, visit the ALEA website.