ms-celeb-extractor

Extraction tool to parse MS Celeb dataset

The MS Celeb Dataset is a database of faces with 6,464,018 images.

Due to some error, the original dataset is gone. However, there is a torrent available for use here. It contains a tsv file with the images encoded as base64 strings.
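To make the "base64 strings in a tsv" idea concrete, here is a minimal sketch of decoding one row. The column positions (person identifier first, image payload last) and the sample identifier `m.0abc` are assumptions for illustration, not something read from extractor.py, so check them against the actual file before relying on them.

```python
import base64
from typing import Tuple

# Assumed column positions (0-based); not taken from extractor.py.
ENTITY_COL = 0   # identifier of the person
IMAGE_COL = -1   # base64-encoded JPEG bytes, assumed to be the last column


def decode_row(line: str) -> Tuple[str, bytes]:
    """Split one TSV line and decode its base64 image payload."""
    fields = line.rstrip("\n").split("\t")
    return fields[ENTITY_COL], base64.b64decode(fields[IMAGE_COL])


# Build a fake row so the sketch runs without the real dataset.
fake_jpeg = b"\xff\xd8\xff\xe0 not a real image \xff\xd9"
row = "\t".join(["m.0abc", "0", base64.b64encode(fake_jpeg).decode("ascii")]) + "\n"
person, image_bytes = decode_row(row)
```

The decoded bytes can be written straight to a .jpg file; no image library is needed just to extract them.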

This extraction tool reads through the tsv and places images of the same person in their respective folders. As it reads through the tsv file, it deletes the entries it has already read, so it needs no extra disk space to save the processed files.
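One way to delete entries while reading them is to consume the file from its end, because cutting off the tail of a file is a cheap truncate while removing lines from the front usually means rewriting the file. The sketch below illustrates that idea only; it is not confirmed to be how extractor.py works, and the function name `pop_tail_lines` is hypothetical.

```python
import os
import tempfile


def pop_tail_lines(path: str, chunk_size: int = 1 << 20) -> list:
    """Remove and return the complete lines found in the last chunk of a file."""
    with open(path, "r+b") as f:
        size = f.seek(0, os.SEEK_END)
        if size == 0:
            return []
        start = max(0, size - chunk_size)
        f.seek(start)
        chunk = f.read(size - start)
        cut = 0
        if start > 0:
            # The chunk may begin mid-line, so its first line stays on disk.
            newline = chunk.find(b"\n", 0, len(chunk) - 1)
            if newline == -1:
                raise ValueError("chunk_size is smaller than one line")
            cut = start + newline + 1
            chunk = chunk[newline + 1:]
        f.truncate(cut)  # drop everything that is about to be returned
    return chunk.splitlines()


# Demo on a small temporary file instead of the real tsv.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.tsv")
    with open(path, "wb") as f:
        f.write(b"".join(b"row%d\n" % i for i in range(5)))
    seen = []
    while os.path.getsize(path) > 0:
        seen.extend(pop_tail_lines(path, chunk_size=12))
    remaining = os.path.getsize(path)
```

Note that in this sketch the lines arrive in reverse chunk order, and an interrupted run loses at most the chunk that was being processed.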

The reasons for extracting into one folder per person are:

Most libraries have built-in helper functions to parse such a structure, including PyTorch and Keras/TensorFlow
Modern file systems index their directory entries (often by hash), so if the path of a file is known, locating it takes close to constant time
Storing the images as the original jpeg files gives a reduction in size from 95 GB to 57 GB
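As an illustration of the first point, helpers such as `torchvision.datasets.ImageFolder` or `tf.keras.utils.image_dataset_from_directory` expect one sub-folder per class. The standard-library sketch below mimics the indexing step they perform, so it runs without either framework installed; the function name `index_image_folders` is hypothetical.

```python
import os
import tempfile


def index_image_folders(root: str):
    """Return (class_names, samples) for a root/<class>/<image> layout."""
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = []
    for name in classes:
        folder = os.path.join(root, name)
        for fname in sorted(os.listdir(folder)):
            if fname.lower().endswith((".jpg", ".jpeg")):
                samples.append((os.path.join(folder, fname), class_to_idx[name]))
    return classes, samples


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for person, fname in [("person_x", "xxx.jpg"), ("person_x", "xxy.jpg"),
                          ("person_y", "123.jpg")]:
        os.makedirs(os.path.join(root, person), exist_ok=True)
        open(os.path.join(root, person, fname), "wb").close()
    classes, samples = index_image_folders(root)
    labels = [label for _, label in samples]
```

Each sample pairs a file path with an integer label, which is the form most training loops consume.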

Installing

pip install -r requirements.txt

Usage

Usage: extractor.py [OPTIONS] COMMAND [ARGS]...

  Utility to help extract MS Celeb data into manageable files.

Options:
  --help  Show this message and exit.

Commands:
  combine  Combine clean_list_128Vec_WT051_P010.txt and...
  process  Read lines from the MS Celeb TSV file and save into a directory...

First, use the combine command to combine the two text files (clean_list_128Vec_WT051_P010.txt and relabel_list_128Vec_T058.txt). The reason for combining them is explained in the section "How to use C-MS-Celeb" at https://github.com/EB-Dodo/C-MS-Celeb. Note that the two txt files are not included in the torrent; they are in https://github.com/EB-Dodo/C-MS-Celeb/blob/master/clean_list.7z

Usage: extractor.py combine [OPTIONS]

  Combine clean_list_128Vec_WT051_P010.txt and relabel_list_128Vec_T058.txt
  together.

  The output of this file is used by the process command.

Options:
  --clean_list_128_path FILENAME  Path of clean_list_128Vec_WT051_P010.txt
                                  [required]

  --relabel_list_128_path FILENAME
                                  Path of relabel_list_128Vec_T058  [required]
  --output_path FILE              Path of output file  [required]
  --help                          Show this message and exit.

Then pass the generated combined txt file to the process command to start extracting lines from the tsv and saving them as jpeg files.

Usage: extractor.py process [OPTIONS]

  Read lines from the MS Celeb TSV file and save into a directory structure.
  The files will be put in this format:

      root/person_x/xxx.jpg
      root/person_x/xxy.jpg
      root/person_x/xxz.jpg

      root/person_y/123.jpg
      root/person_y/817.jpg
      root/person_y/some.jpg

  !NOTE!: As this command reads the TSV, it will delete the lines already
  read.

Options:
  --tsv_location FILENAME    Location of the entire MS Celeb tsv file
                             [required]

  --output_dir PATH          Output directory for images  [required]
  --combined_file_path FILE  Location of the file generated by combine command
                             [required]

  --chunk_size INTEGER       Number of bytes to read from the tsv at once
  --num_threads INTEGER
  --help                     Show this message and exit.
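The output layout and the --num_threads option suggest decoding and writing images in parallel. The following is a rough sketch of that pattern under stated assumptions: the record tuples, the `save_image` helper, and the worker count are hypothetical and are not taken from extractor.py.

```python
import base64
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def save_image(output_dir: str, person: str, name: str, payload_b64: str) -> str:
    """Decode one base64 payload and write it to output_dir/person/name.jpg."""
    folder = os.path.join(output_dir, person)
    os.makedirs(folder, exist_ok=True)  # tolerates another thread creating it first
    path = os.path.join(folder, name + ".jpg")
    with open(path, "wb") as f:
        f.write(base64.b64decode(payload_b64))
    return path


# Hypothetical records: (person, image name, base64 payload).
payload = base64.b64encode(b"\xff\xd8 fake jpeg \xff\xd9").decode("ascii")
records = [
    ("person_x", "xxx", payload),
    ("person_x", "xxy", payload),
    ("person_y", "123", payload),
]

with tempfile.TemporaryDirectory() as out:
    with ThreadPoolExecutor(max_workers=4) as pool:
        written = list(pool.map(lambda r: save_image(out, *r), records))
    relative = sorted(os.path.relpath(p, out).replace(os.sep, "/") for p in written)
```

Threads fit here because the work is dominated by file writes, during which Python releases the global interpreter lock.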

Example:

python ms-celeb-extractor/extractor.py process --tsv_location=head.tsv --output_dir out --combined_file_path combined.txt
89it [00:03, 23.58it/s]

Contributing

Feel free to open issues or pull requests.

harveyslash/ms-celeb-extractor
