wip
kenarsa committed May 2, 2024
1 parent d545340 commit a5e3e0a
Showing 2 changed files with 45 additions and 7 deletions.
47 changes: 41 additions & 6 deletions README.md
@@ -12,6 +12,7 @@ This repo is a minimalist and extensible framework for benchmarking different LL
- [C4 Perplexity](#c4-perplexity)
- [ARC](#arc)
- [Data](#data)
- [Quantization](#quantization)
- [C4](#c4)
- [ARC](#arc)
- [Models](#models)
@@ -25,31 +26,65 @@ This repo is a minimalist and extensible framework for benchmarking different LL

### GPTQ

[GPTQ](https://arxiv.org/abs/2210.17323) is arguably the most popular quantization technique for LLMs at the moment. It
is fairly powerful because it reconstructs the quantized weights to closely mimic their floating-point originals.
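The reconstruction idea is easiest to see against the much simpler baseline that GPTQ improves on: plain round-to-nearest uniform quantization. The sketch below is illustrative only (GPTQ additionally uses second-order statistics from calibration data to compensate rounding error); it quantizes a weight matrix per row to 4 bits and dequantizes it back:

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-row round-to-nearest quantization (the naive baseline)."""
    qmax = 2 ** (bits - 1) - 1
    # one scale per row, chosen so the largest magnitude maps to qmax
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
w_hat = quantize_rtn(w, bits=4)
err = float(np.abs(w - w_hat).max())  # bounded by half a quantization step
```

GPTQ's contribution is to pick the quantized values so that the layer's *outputs* on calibration data, not just the raw weights, stay close to the floating-point ones.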

### picoLLM Compression

picoLLM Compression is a quantization algorithm developed by Picovoice. What sets it apart is that it optimally
distributes bits (resources) within and across model parameters: given a target model size, it spreads the available
bits optimally across and within the model's parameters. Hence, picoLLM is an x-bit quantization technique.
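Picovoice has not published the algorithm's details, so the following is only a generic sketch of the *idea* of spending a fixed bit budget where it helps most: a greedy allocator that repeatedly gives one more bit to the parameter group whose quantization error drops the most. The error model (error ~ variance / 4^bits, as for uniform quantization) and all names are assumptions for illustration:

```python
import heapq

def allocate_bits(variances: list, total_bits: int) -> list:
    """Greedily hand out a fixed bit budget: each extra bit goes to the
    group whose modeled error (variance / 4**bits) shrinks the most."""
    bits = [0] * len(variances)
    # max-heap (negated) keyed by the error reduction from adding one bit
    heap = [(-(v - v / 4), i) for i, v in enumerate(variances)]
    heapq.heapify(heap)
    for _ in range(total_bits):
        _, i = heapq.heappop(heap)
        bits[i] += 1
        err = variances[i] / 4 ** bits[i]
        heapq.heappush(heap, (-(err - err / 4), i))
    return bits

# a high-variance group soaks up most of a 6-bit budget
allocate_bits([16.0, 1.0, 1.0], 6)  # → [4, 1, 1]
```

The point of the sketch is the shape of the problem: a uniform-bit scheme would give every group 2 bits here, while an error-aware allocator spends the budget unevenly.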

## Tasks

### C4 Perplexity

Perplexity is a language-modeling task. Because perplexity is very sensitive to quantization, it can surface quality
deterioration early on.
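Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to each token, so any fidelity lost to quantization shows up directly as a higher number. A minimal sketch:

```python
import math

def perplexity(token_logprobs: list) -> float:
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# a model assigning every token probability 1/4 has perplexity ≈ 4
perplexity([math.log(0.25)] * 10)
```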

### ARC

The [AI2 Reasoning Challenge (ARC) dataset](https://allenai.org/data/arc) is a multiple-choice dataset that measures a
model's ability to perform reasoning. The dataset is partitioned into two segments, easy and challenge; we run the
benchmark on both partitions and report the results separately.
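The commit does not show the evaluation code, but a common recipe for multiple-choice benchmarks like ARC is to score each candidate answer under the model and predict the highest-scoring one; accuracy is the fraction answered correctly. A sketch with a stand-in scorer (`toy_score` is purely illustrative — a real run would sum token log-probabilities from the LLM):

```python
def pick_answer(question: str, choices: list, score) -> int:
    """score(question, choice) -> model score; predict the argmax choice."""
    return max(range(len(choices)), key=lambda i: score(question, choices[i]))

def accuracy(examples: list, score) -> float:
    """examples: (question, choices, correct_index) triples."""
    hits = sum(pick_answer(q, c, score) == label for q, c, label in examples)
    return hits / len(examples)

toy_score = lambda q, c: len(c)  # stand-in: prefers the longest choice
accuracy([("q1", ["a", "bbb"], 1), ("q2", ["ccc", "d"], 0)], toy_score)  # → 1.0
```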

## Data

All the data needed to run the benchmark is already available under [res](res) for ease of use. If you wish to
reproduce it, change it, or see how it is curated, follow the sections below.
### Quantization

GPTQ and picoLLM both need a sample dataset to learn the characteristics of the model they quantize. We use 128
randomly selected sequences from the train portion of the [C4 dataset](https://huggingface.co/datasets/c4). Once you
have downloaded the dataset, run the following from the root of the repository to extract and normalize the data:

```console
python3 data/c4-normalize.py --repository-folder ${REPOSITORY_FOLDER} --normalized-folder ${TRAIN_FOLDER} --portion train
```
Replace `${REPOSITORY_FOLDER}` with the path to the downloaded dataset repository and `${TRAIN_FOLDER}` with a folder
to hold the normalized data.
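The actual logic lives in `data/c4-normalize.py`, which this diff does not show. As an assumed sketch only: the C4 repository ships each portion as gzipped JSON-lines shards whose records carry a `text` field, so a normalizer might unpack those into plain-text files roughly like this (the file naming and layout here are guesses):

```python
import gzip
import json
import os

def normalize(repository_folder: str, normalized_folder: str, portion: str) -> None:
    """Assumed sketch: turn C4's gzipped JSON-lines shards into plain text."""
    os.makedirs(normalized_folder, exist_ok=True)
    for name in sorted(os.listdir(repository_folder)):
        # C4 shards look like `c4-train.00000-of-01024.json.gz`
        if portion not in name or not name.endswith('.json.gz'):
            continue
        out_path = os.path.join(normalized_folder, name[:-len('.json.gz')] + '.txt')
        with gzip.open(os.path.join(repository_folder, name), 'rt', encoding='utf-8') as f, \
                open(out_path, 'w', encoding='utf-8') as o:
            for line in f:
                o.write(json.loads(line)['text'] + '\n')
```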

Then we sample 128 sequences from the normalized data:

```console
python3 data/c4-sample.py --dataset-folder ${TRAIN_FOLDER} --portion train
```
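Per the `data/c4-sample.py` arguments in this commit, sampling keeps only sequences above a minimum length and then draws a fixed-seed random sample (defaults: 128 sequences, 1024 * 8 minimum length, seed 666) so runs are reproducible. A condensed sketch of that selection logic (whether length is measured in characters or tokens is an assumption here):

```python
import random

def sample_sequences(sequences: list, num_sequences: int = 128,
                     min_sequence_length: int = 1024 * 8, seed: int = 666) -> list:
    """Keep sequences long enough, then draw a reproducible random sample."""
    eligible = [s for s in sequences if len(s) >= min_sequence_length]
    return random.Random(seed).sample(eligible, num_sequences)

texts = ['x' * 9000, 'short', 'y' * 10000, 'z' * 8500]
picked = sample_sequences(texts, num_sequences=2)  # 'short' can never be picked
```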

### C4

For the perplexity task we use 128 randomly selected snippets from the validation portion of the
[C4 dataset](https://huggingface.co/datasets/c4). Once you have downloaded the dataset, run the following from the
root of the repository to extract and normalize the data:

```console
python3 data/c4-normalize.py --repository-folder ${REPOSITORY_FOLDER} --normalized-folder ${VALIDATION_FOLDER} --portion validation
```

Then we sample 128 sequences from the normalized data:

```console
python3 data/c4-sample.py --dataset-folder ${VALIDATION_FOLDER} --portion valid
```

### ARC
5 changes: 4 additions & 1 deletion data/c4-sample.py
@@ -10,17 +10,20 @@
def main() -> None:
    parser = ArgumentParser()
    parser.add_argument('--dataset-folder', required=True)
    parser.add_argument('--portion', choices=['train', 'valid'], required=True)
    parser.add_argument('--num-sequences', type=int, default=128)
    parser.add_argument('--min-sequence-length', type=int, default=1024 * 8)
    parser.add_argument('--seed', type=int, default=666)

    args = parser.parse_args()

    dataset_folder = args.dataset_folder
    num_sequences = args.num_sequences
    min_sequence_length = args.min_sequence_length
    seed = args.seed
    portion = args.portion

    sample_folder = os.path.join(os.path.dirname(__file__), f'../res/c4-{portion}')
    if os.path.isdir(sample_folder):
        shutil.rmtree(sample_folder)
    os.makedirs(sample_folder)
