Classification of Imbalanced Data with LLM (Large Language Model)
This project builds on the TabLLM and T-Few projects and their papers.
As an undergraduate researcher, I did not have the luxury of a GPU with over 30 GB of memory, and I did not want to spend money on Colab. So I had to modify a lot of code and package versions to make everything work on free-tier Colab, and even locally with smaller LLMs.
If you have questions, please open an issue on GitHub, or email me at contactjameshan@gmail.com.
- This code runs on Google Colab Free Tier and follows its package versions.
- Formatted with the Python Black formatter.
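
Since the whole setup exists to fit into the free-tier GPU, it is worth confirming what the runtime actually gives you before training. A quick check with plain PyTorch (nothing project-specific):

```python
import torch

# Confirm a GPU is attached and see how much memory the runtime offers.
# Free-tier Colab typically provides a ~16 GB T4, well below the 30+ GB
# the unmodified TabLLM/T-Few setup assumes.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB")
else:
    print("No GPU attached; change the Colab runtime type to GPU.")
```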
Repository structure:

- `/.old` : Old attempt for imbalanced LLM; idea testing.
- `/bin` : Shell scripts to run the project.
- `/configs` : Configuration data, related to `/src/utils/Config.py`.
- `/Datasets` : Raw CSV datasets (not included; get them from the TabLLM project).
- `/Datasets-serialized` : Serialized datasets (not included; get them from the TabLLM project).
- `/exp_out` : Your training results (not included).
- `/pretrained_checkpoints` : Saved models (not included).
  - If using model T0 or T0_3b, get the checkpoint file from the TabLLM project, turn on `load_model()` in `EncoderDecoder()`, and add the file here (see the sketch after this list).
- `/src` : Your source code.
- `/templates` : Prompt templates for the datasets (see the tutorial below).
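
A minimal sketch of what "turn on `load_model()`" amounts to, assuming a T-Few-style `EncoderDecoder` module and a checkpoint saved at `pretrained_checkpoints/t03b.pt` (the function body, path, and call site here are illustrative, not the project's exact code):

```python
import torch

def load_model(model, checkpoint_path="pretrained_checkpoints/t03b.pt"):
    """Load a saved T0/T0_3b checkpoint into the model's weights (illustrative)."""
    # map_location="cpu" avoids spiking GPU memory while the weights are read.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    # strict=False tolerates keys added or removed by parameter-efficient tuning.
    model.load_state_dict(state_dict, strict=False)
    return model
```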
This tutorial is modified from the TabLLM project.
We will run everything on Google Colab Free Tier.
For my case, we will use the stroke-prediction-dataset.
- Run `Make_Datset.ipynb`, or run `create_external_datasets.py --dataset stroke`.
- Go to `evaluate_external_dataset` and add your dataset name to the `args_datasets` variable (see the sketch after these steps).
- Make a new file called `template_<datasetName>` in the `templates` folder. Use the other templates as a reference.
- Go to `bin/few-shot-pretrained-100k.sh` and add your dataset to the `for dataset in <dataSetName>` loop.
- Run `TabLLM.ipynb`.
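
To make the `args_datasets` step concrete, here is a hypothetical sketch of the edit; every dataset name except stroke is a placeholder. For the shell-script step, the analogous change in `bin/few-shot-pretrained-100k.sh` is editing the loop header to read `for dataset in stroke`.

```python
# Hypothetical sketch of the args_datasets edit in evaluate_external_dataset.
# The surrounding names are placeholders; only the pattern matters.
args_datasets = [
    "income",   # datasets that were already wired up (illustrative)
    "heart",
    "stroke",   # <-- your new dataset, matching --dataset stroke above
]
```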
To build the result table for the stroke dataset:

```bash
python src/scripts/get_result_table.py -e t5_\* -d stroke
```
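
If the `t5_\*` glob does not match your run names, you can also aggregate scores by hand. A minimal sketch, assuming the T-Few convention of appending one JSON dict per evaluation to `exp_out/<experiment>/dev_scores.json` (check your own `exp_out` layout first; the file name and the `accuracy` key are assumptions carried over from T-Few):

```python
import json
from pathlib import Path

# Walk every stroke experiment folder under exp_out/ and report its best dev accuracy.
for score_file in sorted(Path("exp_out").glob("*stroke*/dev_scores.json")):
    lines = [l for l in score_file.read_text().splitlines() if l.strip()]
    scores = [json.loads(l) for l in lines]
    best = max(s.get("accuracy", 0.0) for s in scores)
    print(f"{score_file.parent.name}: best dev accuracy = {best:.3f}")
```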