Extraction tool to parse MS Celeb dataset
The MS Celeb Dataset is a database of faces with 6,464,018 images.
Due to some error, the original dataset is gone. However, there is a torrent available for use here. It contains a TSV file with the images encoded as base64 strings.
This extraction tool reads through the TSV and places images of the same person in their respective folders. As it reads through the TSV file, it deletes the entries it has already read, so no extra disk space is needed to save the processed files.
The reasoning for this is:
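- Most libraries have built-in helper functions to parse such a structure, including PyTorch and Keras/TensorFlow
- Modern file systems index their directory entries (for example with hashes or B-trees), so if the path of a file is known, opening it takes close to O(1) time and does not require scanning the rest of the dataset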
- Storing the images as the original JPEG files reduces the size from 95 GB to 57 GB
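To illustrate the first point, here is a minimal sketch of how a one-folder-per-person layout can be turned into (path, label) pairs using only the standard library. The function name `index_image_folders` is hypothetical and not part of this tool; loaders such as `torchvision.datasets.ImageFolder` build a similar index from the same kind of layout.

```python
from pathlib import Path


def index_image_folders(root):
    """Return (samples, classes) for a root/person/image.jpg layout.

    Hypothetical helper for illustration only, not part of extractor.py.
    """
    root = Path(root)
    # One class per sub-directory, sorted so label indices are stable.
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    # Pair every JPEG with the index of the folder it lives in.
    samples = [
        (path, class_to_idx[name])
        for name in classes
        for path in sorted((root / name).glob("*.jpg"))
    ]
    return samples, classes
```

Because the label is encoded in the folder name, no separate annotation file has to be kept in sync with the images.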
pip install -r requirements.txt
Usage: extractor.py [OPTIONS] COMMAND [ARGS]...
Utility to help extract MS Celeb data into manageable files.
Options:
--help Show this message and exit.
Commands:
combine Combine clean_list_128Vec_WT051_P010.txt and...
process Read lines from the MS Celeb TSV file and save into a directory...
First, use the combine command to combine the two text files provided in the dataset. The reason for combining them is explained in the section "How to use C-MS-Celeb" at https://github.com/EB-Dodo/C-MS-Celeb. Note that the two txt files are not included in the torrent; they can be found at https://github.com/EB-Dodo/C-MS-Celeb/blob/master/clean_list.7z
Usage: extractor.py combine [OPTIONS]
Combine clean_list_128Vec_WT051_P010.txt and relabel_list_128Vec_T058.txt
together.
The output of this file is used by the process command.
Options:
--clean_list_128_path FILENAME Path of clean_list_128Vec_WT051_P010.txt
[required]
--relabel_list_128_path FILENAME
Path of relabel_list_128Vec_T058 [required]
--output_path FILE Path of output file [required]
--help Show this message and exit.
Then pass the generated combined txt file to the process command to start extracting lines from the TSV and saving them as JPEG files.
Usage: extractor.py process [OPTIONS]
Read lines from the MS Celeb TSV file and save into a directory structure.
The files will be put in this format:
root/person_x/xxx.jpg
root/person_x/xxy.jpg
root/person_x/xxz.jpg
root/person_y/123.jpg
root/person_y/817.jpg
root/person_y/some.jpg
!NOTE!: As this command reads the TSV, it will delete the lines already
read.
Options:
--tsv_location FILENAME Location of the entire MS Celeb tsv file
[required]
--output_dir PATH Output directory for images [required]
--combined_file_path FILE Location of the file generated by combine command
[required]
--chunk_size INTEGER Number of bytes to read from the tsv at once
--num_threads INTEGER
--help Show this message and exit.
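One way to delete lines as they are read without needing extra disk space is to consume the file from its end and truncate it after each chunk. The sketch below shows that idea under stated assumptions: it is not necessarily how extractor.py implements it, and the names `consume_from_end` and `handle_line` are hypothetical. The `chunk_size` argument plays a role comparable to a number of bytes read at once.

```python
import os


def consume_from_end(path, handle_line, chunk_size=1 << 20):
    """Call handle_line on every line of `path`, last line first,
    truncating the file after each chunk so consumed lines are freed.

    Illustrative sketch only; this permanently empties the input file.
    """
    with open(path, "r+b") as f:
        end = f.seek(0, os.SEEK_END)  # current logical end of the file
        size = chunk_size
        while end > 0:
            start = max(0, end - size)
            f.seek(start)
            chunk = f.read(end - start)
            if start == 0:
                cut = 0  # the chunk starts at a line boundary
            else:
                # Skip the (possibly partial) first line of the chunk. The
                # final byte is ignored because it may be the terminator.
                nl = chunk.find(b"\n", 0, len(chunk) - 1)
                if nl == -1:
                    size *= 2  # a single line is longer than the chunk
                    continue
                cut = start + nl + 1
            complete = chunk[cut - start:]
            for line in reversed(complete.split(b"\n")):
                if line:
                    handle_line(line)
            f.truncate(cut)  # drop the lines that were just handled
            end = cut
```

Truncating from the end is cheap because the file system only has to release the trailing blocks, whereas removing lines from the start would mean rewriting the whole file. Since the input is destroyed, a sketch like this should only be tried on a copy or a small sample such as `head.tsv`.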
Example:
python ms-celeb-extractor/extractor.py process --tsv_location=head.tsv --output_dir out --combined_file_path combined.txt
89it [00:03, 23.58it/s]
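For readers curious about what handling a single TSV row involves, here is a minimal sketch that decodes a base64 field and writes it into a per-person folder. The column positions (`label_col`, `image_col`), the function name `save_row`, and the hash-based file name are assumptions made for illustration; the real tool also consults the combined file produced by the combine command, which this sketch leaves out.

```python
import base64
import hashlib
from pathlib import Path


def save_row(line, output_dir, label_col=0, image_col=-1):
    """Decode one tab-separated row and save its image as a JPEG file.

    Assumes (hypothetically) that the person label is in `label_col`
    and the base64-encoded image is in `image_col`.
    """
    fields = line.rstrip("\n").split("\t")
    label = fields[label_col]
    image_bytes = base64.b64decode(fields[image_col])

    # One folder per person, created on first use.
    person_dir = Path(output_dir) / label
    person_dir.mkdir(parents=True, exist_ok=True)

    # Name the file after a hash of its content so that repeated rows
    # with identical image data map to the same file.
    name = hashlib.sha1(image_bytes).hexdigest()[:16] + ".jpg"
    out_path = person_dir / name
    out_path.write_bytes(image_bytes)
    return out_path
```

The bytes are written exactly as decoded, without re-encoding, so each saved file has the same size and quality as the JPEG that was embedded in the TSV.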
Feel free to open issues or pull requests.