# Pretrain TinyLlama

This tutorial will walk you through pretraining TinyLlama.

> **Tip:** To get started with zero setup, clone the TinyLlama studio on Lightning AI.

 

## What's TinyLlama?

TinyLlama is architecturally the same as Meta AI's Llama 2, but it has only 1.1B parameters and is instead trained for multiple epochs on a mix of the SlimPajama and Starcoder datasets.

Here is a quick fact sheet:

| Name | Description |
|------|-------------|
| Parameters | 1.1B |
| Model Size | Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size: 5632 |
| Sequence Length | 2048 |
| Learning Rate | 4e-4 |
| Learning Rate Schedule | Cosine with 2000 warmup steps |
| Training Data | SlimPajama (893 GB), Starcoder (290 GB) |
| Combined Dataset Size | Around 950B tokens |
| Total Tokens During Training | 3 trillion (3 epochs) |
| Time to complete training | ~4 weeks with 64 A100 GPUs |
| Model FLOPs Utilization (MFU) | 52% |

(This table was sourced from the TinyLlama authors' README.)
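
If you are curious how these numbers map onto a LitGPT model configuration, the sketch below builds the same architecture programmatically. It is a minimal sketch only: it assumes `Config` and `GPT` are importable from the top-level `litgpt` package, and the preset name `tiny-llama-1.1b` may differ in your installed version (check `litgpt/config.py`).

```python
# A minimal sketch, assuming `Config` and `GPT` are exported by the litgpt package
# and the TinyLlama preset is registered as "tiny-llama-1.1b" (see litgpt/config.py).
from litgpt import Config, GPT

# Option 1: load the preset by name.
config = Config.from_name("tiny-llama-1.1b")

# Option 2: spell out the fact-sheet values explicitly (Llama-style flags assumed).
config = Config(
    block_size=2048,            # sequence length
    vocab_size=32000,           # Llama 2 SentencePiece vocabulary
    n_layer=22,                 # layers
    n_head=32,                  # attention heads
    n_query_groups=4,           # grouped-query attention
    n_embd=2048,                # embedding size
    intermediate_size=5632,     # MLP hidden size
    bias=False,                 # no bias terms, as in Llama 2
    norm_class_name="RMSNorm",  # Llama-style normalization
    mlp_class_name="LLaMAMLP",  # gated SwiGLU MLP
    rotary_percentage=1.0,      # rotary embeddings on the full head dimension
    parallel_residual=False,
)

model = GPT(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```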

 

## Download datasets

You can download the data using git lfs:

```bash
# Make sure you have git-lfs installed (https://git-lfs.com):
sudo apt install git-lfs
git clone https://huggingface.co/datasets/cerebras/slimpajama-627b data/slimpajama-raw
git clone https://huggingface.co/datasets/bigcode/starcoderdata data/starcoderdata-raw
```

Around 1.2 TB of disk space is required to store both datasets.
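
Before cloning, it can be worth confirming that the target volume actually has that much free space. A minimal sketch using only the Python standard library (it assumes the `data/` directory from the commands above already exists):

```python
# A minimal sketch: check free disk space before downloading ~1.2 TB of raw data.
import shutil

free_bytes = shutil.disk_usage("data").free  # assumes data/ exists on the target volume
free_tb = free_bytes / 1e12
print(f"Free space: {free_tb:.2f} TB")
if free_tb < 1.2:
    print("Warning: less than the ~1.2 TB needed to store both raw datasets.")
```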

 

## Prepare the datasets for training

In order to start pretraining with LitGPT on these datasets, you need to read, tokenize, and write the data into binary chunks. This leverages the litdata optimization pipeline and streaming dataset.
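
To give a sense of what the prepare scripts below do under the hood: they pass a tokenization function to litdata's `optimize()` pipeline, which writes token ids out as binary chunks that a streaming dataset can read efficiently during training. Here is a simplified, illustrative sketch of that pattern; the real scripts additionally handle parquet/jsonl parsing, filtering, and splits, and the paths and the `tokenize_fn` helper below are assumptions, not the scripts' actual code.

```python
# A simplified sketch of the litdata optimization pipeline the prepare scripts build on.
# Paths, the `tokenize_fn` helper, and chunk sizes are illustrative assumptions.
from functools import partial
from pathlib import Path

from litdata import optimize
from litgpt import Tokenizer


def tokenize_fn(filepath, tokenizer=None):
    # Read one raw text file and yield its token ids as a single sequence.
    text = Path(filepath).read_text(encoding="utf-8", errors="ignore")
    yield tokenizer.encode(text, eos=True)


if __name__ == "__main__":
    tokenizer = Tokenizer(Path("checkpoints/meta-llama/Llama-2-7b-hf"))
    optimize(
        fn=partial(tokenize_fn, tokenizer=tokenizer),
        inputs=[str(p) for p in Path("data/my-raw-text").rglob("*.txt")],
        output_dir="data/my-optimized",
        chunk_bytes="64MB",  # size of each binary chunk written to disk
        num_workers=8,
    )
```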

First, install additional dependencies for preprocessing:

```bash
pip install '.[all]'
```

You will need to have the tokenizer config available:

```bash
litgpt download \
  --repo_id meta-llama/Llama-2-7b-hf \
  --access_token your_hf_token \
  --tokenizer_only true
```

Then, run the preprocessing script for each dataset and split. You will require 1.1 TB of disk space for Starcoder and 2.5 TB of space for the SlimPajama dataset.

Starcoder:

```bash
python litgpt/data/prepare_starcoder.py \
  --input_dir data/starcoderdata-raw \
  --output_dir data/starcoder \
  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
```

SlimPajama:

```bash
python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/validation \
  --output_dir data/slimpajama/val \
  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/test \
  --output_dir data/slimpajama/test \
  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/train \
  --output_dir data/slimpajama/train \
  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
```

If you want to run on a small slice of the datasets first, pass the flag `--fast_dev_run=true` to the commands above. The above assumes that you will be using the same tokenizer as used in Llama/TinyLlama, but any trained SentencePiece tokenizer with a 32000 vocabulary size will do here.
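
If you do swap in your own tokenizer, here is a quick sanity check of the vocabulary size using the `sentencepiece` package directly; the path below assumes the download location from the earlier step.

```python
# A minimal check of the tokenizer's vocabulary size, assuming the download path above.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(
    model_file="checkpoints/meta-llama/Llama-2-7b-hf/tokenizer.model"
)
print(sp.vocab_size())  # expected: 32000 for the Llama 2 / TinyLlama tokenizer
```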

 

## Pretraining

Running the pretraining script with its default settings requires at least 8 A100 GPUs.

```bash
litgpt pretrain --config config_hub/pretrain/tinyllama.yaml
```

The script will save checkpoints periodically to the folder `out/`. By default, the `pretrain` script will pretrain the model with FSDP in `bfloat16` mixed precision and gradient accumulation.

Note that `pretrain` is not actually a model-specific training script, so feel free to try other configurations or change the model type and size by passing a different string to the model name argument, for example:

```bash
litgpt pretrain --model_name Gemma-2b
```

The currently supported model names can be listed by executing `litgpt pretrain` without any additional arguments.

Keep in mind that training on a single machine will take weeks. To speed up the process, you'll need access to a cluster. Once you're on a cluster, you can follow the multi-node launch instructions to run the script across machines.

The script exposes several hyperparameters you can tweak through the command line.

For instance, `--train.micro_batch_size` should be adjusted so the process makes full use of the available GPU memory. For more tips on avoiding out-of-memory issues, please also see the more detailed Dealing with out-of-memory (OOM) errors guide.
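
For context on how the batch-size knobs interact: the global batch size equals the micro-batch size times the number of devices times the number of gradient-accumulation steps, so lowering `--train.micro_batch_size` to fit GPU memory leaves the optimization unchanged as long as the global batch size stays fixed. The numbers below are illustrative only, not the values from `tinyllama.yaml`:

```python
# Illustrative arithmetic only; the values are assumptions, not the config's defaults.
global_batch_size = 512  # what the optimizer effectively steps on
micro_batch_size = 4     # what each GPU holds in memory at once
devices = 8              # GPUs on a single machine

gradient_accumulation_steps = global_batch_size // (micro_batch_size * devices)
print(gradient_accumulation_steps)  # 16 forward/backward passes per optimizer step
```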

Lastly, logging is kept minimal in the script, but for long-running experiments we recommend switching to a proper experiment tracker. As an example, we included WandB (set `--logger_name=wandb`) to show how you can integrate any experiment tracking framework. For reference, here are the loss curves for our reproduction.

 

## Resume training

The checkpoints saved during pretraining contain all the information to resume if needed. Simply rerun the script with the `--resume` argument added:

```bash
litgpt pretrain \
  --config config_hub/pretrain/tinyllama.yaml \
  --resume out/pretrain/tiny-llama/step-00060500
```

**Important:** Each checkpoint is a directory. Point `--resume` to the directory, not to the `lit_model.pth` file inside of it.
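
If you prefer not to hard-code the step number, here is a small sketch that locates the most recent checkpoint directory to pass to `--resume`; the path and the zero-padded `step-*` naming are taken from the command above.

```python
# A small helper sketch: find the latest `step-*` checkpoint directory to resume from.
# Lexicographic sorting works because the step numbers are zero-padded.
from pathlib import Path

checkpoint_root = Path("out/pretrain/tiny-llama")
checkpoints = sorted(p for p in checkpoint_root.glob("step-*") if p.is_dir())
if checkpoints:
    print(
        "litgpt pretrain --config config_hub/pretrain/tinyllama.yaml "
        f"--resume {checkpoints[-1]}"
    )
else:
    print("No checkpoints found yet.")
```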

 

## Export checkpoints

After training is completed, you can convert the checkpoint to a format that can be loaded for evaluation, inference, finetuning, etc.:

```bash
litgpt convert pretrained_checkpoint \
  --checkpoint_dir out/pretrain/tiny-llama/step-00060500 \
  --output_dir checkpoints/tiny-llama/final
```

After conversion, the output folder will contain these files:

```
checkpoints/tiny-llama/final
├── model_config.yaml
├── lit_model.pth
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
```

You can then use this checkpoint folder to run evaluation, inference, or finetuning, or to process the checkpoint further.
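
As a quick smoke test of the exported checkpoint, you can load it through LitGPT's Python API. This is a minimal sketch assuming the `LLM.load` interface available in recent LitGPT releases; the prompt is arbitrary.

```python
# A quick smoke test of the converted checkpoint, assuming LitGPT's Python API
# (`LLM.load`) exists in your installed version.
from litgpt import LLM

llm = LLM.load("checkpoints/tiny-llama/final")
print(llm.generate("The capital of France is", max_new_tokens=20))
```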

 

## Project templates

The following Lightning Studio templates provide LitGPT pretraining projects in reproducible environments with multi-GPU and multi-node support:  

- Prepare the TinyLlama 1T token dataset
- Pretrain LLMs - TinyLlama 1.1B
- Continued Pretraining with TinyLlama 1.1B