Chinese Mammography Database (CMMD dataset) Deep Learning Classification Pipeline

CraigMyles/cggm-mammography-classification

CMMD Mammography Classification Pipeline

This repository was created as part of my master's degree thesis.

The Chinese Mammography Database (CMMD) is a recently released, high-quality mammography dataset that has not yet been researched extensively because of its very recent release to The Cancer Imaging Archive (TCIA). This repository provides a deep learning pipeline that can be used either partially, for the CMMD data pre-processing alone, or as a full pipeline in which models can be swapped or optimised as needed. It is hoped that the research carried out in my dissertation and in this repository will help future researchers who want to make use of the CMMD dataset.

A .pdf version of my dissertation, including model results, will be uploaded once I have received marks and feedback on my submission. Until then, please feel free to make the most of this repository, and to contact me if you have any questions whatsoever.

Good Luck.

Features

  • CMMD Data Exploration
  • CMMD Metadata Pre-processing
  • Split Data to Benign/Malignant with Stratification by Patient
    • Train
    • Validate
    • Test
  • Data Augmentation
  • Training
    • Custom Model Definition
      • AlexNet
      • LeNet
    • Transfer Learning
      • ResNet50
      • VGG16
      • Xception
        • Fine Tuning
  • Testing
    • Model Evaluation
    • Model Predictions
    • Model Metrics
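
To illustrate what "stratification by patient" means in the split step above, here is a minimal standard-library sketch. It is not the code used in the notebooks: the function name `split_by_patient`, the 70/15/15 fractions, the seed, and the toy patient IDs are all assumptions made for the example, and it assumes one label per patient. The idea is that patients (not individual images) are assigned to Train, Validate, or Test, and that each class is split separately so the benign/malignant ratio is roughly preserved in every subset.

```python
import random
from collections import defaultdict


def split_by_patient(patient_labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Assign each patient ID to train/validate/test, stratified by label.

    patient_labels: dict mapping patient ID -> class label (one label per patient).
    fractions: train and validate shares; the remainder of each class goes to test.
    """
    rng = random.Random(seed)

    # Group patient IDs by class so each class is split independently.
    by_label = defaultdict(list)
    for patient_id, label in sorted(patient_labels.items()):
        by_label[label].append(patient_id)

    splits = {"train": [], "validate": [], "test": []}
    for label, ids in sorted(by_label.items()):
        rng.shuffle(ids)
        n_train = round(len(ids) * fractions[0])
        n_val = round(len(ids) * fractions[1])
        splits["train"] += ids[:n_train]
        splits["validate"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits


# Hypothetical toy data: 20 patients, 10 per class.
labels = {f"P{i:03d}": ("Malignant" if i % 2 else "Benign") for i in range(20)}
splits = split_by_patient(labels)
for name, ids in splits.items():
    print(name, len(ids))
```

Because the split is made on patient IDs, every image belonging to a patient lands in the same subset, which avoids leaking a patient's images between training and testing.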

Prerequisites

  • Python 3.8.10

  • CUDA 10.1

Note that this was written and tested on Linux (Ubuntu 20.04.3 LTS). Use on Windows or other operating systems is at your own risk.

Get required Python packages:

git clone https://github.com/CraigMyles/cggm-mammography-classification.git
cd cggm-mammography-classification
pip install -r requirements.txt

Get Dataset

The CMMD dataset can be freely downloaded from The Cancer Imaging Archive (TCIA). You must use the NBIA Data Retriever to download The Chinese Mammography Database (CMMD).

For download instructions, follow this guide: Downloading Data from the TCIA Data Portal Using the Data Retriever.

Download the CMMD manifest and clinicaldata file to a folder within your working directory.

E.g.

  • /path/to/my/dir/dataset/manifest-1616439774456/
  • /path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx
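
Before running anything, it can help to confirm that the download landed where expected. The following is a small, hedged sketch and not part of this repository: `check_dataset_layout` is a hypothetical helper, and it only assumes that the manifest folder contains `.dcm` files somewhere below it and that the clinical data file has an `.xlsx` extension.

```python
from pathlib import Path


def check_dataset_layout(manifest_dir, metadata_file):
    """Return a list of problems found; an empty list means the layout looks usable."""
    manifest_dir, metadata_file = Path(manifest_dir), Path(metadata_file)
    problems = []

    # The manifest folder should exist and hold DICOM files in its subfolders.
    if not manifest_dir.is_dir():
        problems.append(f"manifest folder not found: {manifest_dir}")
    elif not any(manifest_dir.rglob("*.dcm")):
        problems.append(f"no .dcm files under: {manifest_dir}")

    # The clinical data spreadsheet should exist and be an .xlsx file.
    if not metadata_file.is_file():
        problems.append(f"metadata file not found: {metadata_file}")
    elif metadata_file.suffix.lower() != ".xlsx":
        problems.append(f"expected an .xlsx file, got: {metadata_file.name}")
    return problems


if __name__ == "__main__":
    # Replace these example paths with your own.
    for problem in check_dataset_layout(
        "/path/to/my/dir/dataset/manifest-1616439774456/",
        "/path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx",
    ):
        print("WARNING:", problem)
```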

Run Instructions

To run the pipeline in its entirety, begin with file 0_ and run each notebook in sequential order until 5_.

If you only require the CMMD preprocessing and metadata handling, please run sections 0_0_Data_Exploration.ipynb, 0_Data_Exploration.ipynb, and 1_stratification_data_split.ipynb.

A collated main.py has been added which allows the pipeline to be run in its entirety when given a path to the manifest folder and the .xlsx metadata file. To run it from the command line, use the following command with your own paths.

python3 main.py --manifest_path "/path/to/my/dir/dataset/manifest-1616439774456/" \
    --metadata_path "/path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx"

This will run the entire pipeline with the Xception model and fine-tuning. For partial use of the program, comment out particular methods in main.py, or refer to the Jupyter Notebooks. Please note that the main.py script may take upward of 12 hours to run. Please ensure that there is at least 25 GB of storage available to hold the dataset and run the preprocessing scripts.
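
For anyone wrapping the command above in their own script, here is a rough sketch of how the two flags and the storage requirement could be handled. It is an illustration only and is not taken from main.py: only the flag names `--manifest_path` and `--metadata_path` come from this README, while `parse_args`, `has_free_space`, and the choice of 1 GB = 10^9 bytes are assumptions.

```python
import argparse
import shutil


def parse_args(argv=None):
    """Parse the two paths that the command-line call expects."""
    parser = argparse.ArgumentParser(description="Run the CMMD classification pipeline.")
    parser.add_argument("--manifest_path", required=True,
                        help="Manifest folder downloaded with the NBIA Data Retriever.")
    parser.add_argument("--metadata_path", required=True,
                        help="Path to the clinical data .xlsx file.")
    return parser.parse_args(argv)


def has_free_space(path, required_gb=25):
    """True if the filesystem holding `path` has at least `required_gb` GB free.

    Assumes 1 GB = 10**9 bytes.
    """
    return shutil.disk_usage(path).free >= required_gb * 10**9


# Example with explicit arguments; pass argv=None to read the real command line.
args = parse_args([
    "--manifest_path", "/path/to/my/dir/dataset/manifest-1616439774456/",
    "--metadata_path", "/path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx",
])
print("manifest:", args.manifest_path)
print("metadata:", args.metadata_path)
print("enough free space here:", has_free_space("."))
```

Checking free space up front is cheap and avoids discovering a full disk several hours into a long run.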

Classification Pipeline

[Figure: classification pipeline diagram]

Contact
