This repository contains everything needed to build your own LLM from scratch: data collection, data processing, a tokenizer, model architectures, and training scripts. Just follow the instructions. Inspired by Karpathy's nanoGPT and Shakespeare generator, I made this repository to build my own LLM.
This repo contains:
- Data Collector: directory containing a web scraper, in case you want to gather the data from scratch instead of downloading it.
- Data Processing: directory containing code to pre-process certain kinds of files, e.g. converting parquet files to .txt and .csv, and appending files together.
- Models: contains all the code needed to train a model of your own: a BERT model, a GPT model, and a Seq-2-Seq model, along with tokenizer and run files.
Before setting up SmallLanguageModel, ensure that you have the following prerequisites installed:
- Python 3.8 or higher
- pip (Python package installer)
Follow these steps to train your own tokenizer or generate outputs from the trained model:
1. Clone this repository:

   ```shell
   git clone https://github.com/shivendrra/SmallLanguageModel-project
   cd SmallLanguageModel-project
   ```

2. Install dependencies:

   ```shell
   pip install -r requirements.txt
   ```

3. Train: read training.md for more information and follow it.
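Before following training.md, it can help to see the basic idea behind a tokenizer. Below is a minimal character-level tokenizer sketch in the nanoGPT style; the names `encode`/`decode` and the toy corpus are illustrative assumptions, not the repo's actual API, which lives in the Models directory.

```python
# Minimal character-level tokenizer sketch (nanoGPT-style).
# This is NOT the repo's tokenizer; see training.md for the real procedure.
text = "hello small language model"  # stand-in for your training corpus

chars = sorted(set(text))                      # the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> string

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of token ids back to a string."""
    return "".join(itos[i] for i in ids)
```

A round trip should be lossless for any string made of characters seen in the corpus, e.g. `decode(encode("hello")) == "hello"`.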
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.
MIT License. See License.md for more info.