CMMD Mammography Classification Pipeline

This repository was created as part of my master's degree thesis.

The Chinese Mammography Database (CMMD) is a recently released, high-quality mammography dataset that has not yet been researched extensively because of its very recent release to The Cancer Imaging Archive (TCIA). This repository provides a deep learning pipeline that can be used either partially, for the CMMD data pre-processing only, or as a full pipeline in which models can be interchanged or optimised as needed. It is hoped that the research carried out in my dissertation and in this repository will aid future researchers aiming to make use of the CMMD dataset.

A .pdf version of my dissertation, including model results, will be uploaded upon receiving marks/feedback on my submission. Until then, please feel free to make the most of this repository, and to contact me if you have any questions.

Good Luck.

Features

CMMD Data Exploration
CMMD Metadata Pre-processing
Split Data to Benign/Malignant with Stratification by Patient
- Train
- Validate
- Test
Data Augmentation
Training
- Custom Model Definition
  - AlexNet
  - LeNet
- Transfer Learning
  - ResNet50
  - VGG16
  - Xception
    - Fine Tuning
Testing
- Model Evaluation
- Model Predictions
- Model Metrics

Prerequisites

Python 3.8.10
CUDA 10.1

Note that this was written on Linux (Ubuntu 20.04.3 LTS). Attempt on Windows or other operating systems at your own risk.

Clone the repository and install the required Python packages:

git clone https://github.com/CraigMyles/cggm-mammography-classification.git
cd cggm-mammography-classification
pip install -r requirements.txt

Get Dataset

The CMMD dataset can be freely downloaded from The Cancer Imaging Archive (TCIA). You must use the NBIA Data Retriever to download The Chinese Mammography Database (CMMD).

For download instructions, follow this guide: Downloading Data from the TCIA Data Portal Using the Data Retriever.

Download the CMMD manifest and clinicaldata file to a folder within your working directory.

E.g.

/path/to/my/dir/dataset/manifest-1616439774456/
/path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx
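
Before running anything, it may help to confirm that both downloads are where you expect them. The following is a minimal sketch and is not part of this repository: the helper name check_dataset_layout is hypothetical, and the paths are the placeholder paths from the example above.

```python
from pathlib import Path


def check_dataset_layout(manifest_dir, metadata_file):
    """Return a list of problems found with the downloaded CMMD files.

    Hypothetical helper for illustration; it is not part of this repository.
    An empty list means both paths look as expected.
    """
    problems = []
    manifest_dir = Path(manifest_dir)
    metadata_file = Path(metadata_file)

    # The manifest is a folder created by the NBIA Data Retriever download.
    if not manifest_dir.is_dir():
        problems.append(f"manifest folder not found: {manifest_dir}")

    # The clinical data is a single .xlsx spreadsheet.
    if not metadata_file.is_file():
        problems.append(f"clinical data file not found: {metadata_file}")
    elif metadata_file.suffix.lower() != ".xlsx":
        problems.append(f"expected an .xlsx file, got: {metadata_file.name}")
    return problems


if __name__ == "__main__":
    issues = check_dataset_layout(
        "/path/to/my/dir/dataset/manifest-1616439774456/",
        "/path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx",
    )
    for issue in issues:
        print(issue)
    if not issues:
        print("Dataset layout looks fine.")
```

Replace the two placeholder paths with your own before running it.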

Run Instructions

To run the pipeline in its entirety, begin with file 0_ and run each notebook in sequential order until 5_.

If you only require the CMMD preprocessing and metadata handling, run the notebooks 0_0_Data_Exploration.ipynb, 0_Data_Exploration.ipynb, and 1_stratification_data_split.ipynb.
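
To illustrate what a split stratified by patient means, here is a minimal standard-library sketch. It is not the code from 1_stratification_data_split.ipynb: the function name split_patients, the 70/15/15 fractions, the seed, and the assumption that each patient carries a single label are all illustrative choices. The idea is to split patient IDs rather than images, so that all images from one patient land in the same subset, while keeping the benign/malignant ratio similar across subsets.

```python
import random
from collections import defaultdict


def split_patients(patient_labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Split patient IDs into train/validate/test, stratified by label.

    patient_labels maps a patient ID to one label (e.g. "Benign"/"Malignant").
    Illustrative sketch only; the fractions and seed are assumptions.
    """
    # Group patient IDs by their label so each class is split separately.
    by_label = defaultdict(list)
    for patient_id, label in patient_labels.items():
        by_label[label].append(patient_id)

    rng = random.Random(seed)
    splits = {"train": [], "validate": [], "test": []}
    for label in sorted(by_label):
        ids = sorted(by_label[label])  # sort first so the shuffle is reproducible
        rng.shuffle(ids)
        n_train = round(len(ids) * fractions[0])
        n_val = round(len(ids) * fractions[1])
        splits["train"] += ids[:n_train]
        splits["validate"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits


if __name__ == "__main__":
    # Toy data: 20 hypothetical patients, 12 benign and 8 malignant.
    toy = {f"P{i:03d}": ("Benign" if i < 12 else "Malignant") for i in range(20)}
    for name, ids in split_patients(toy).items():
        print(name, len(ids))
```

Images would then be assigned to a subset by looking up their patient ID in the returned lists.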

A collated main.py has been added, which allows the pipeline to be run in its entirety when given a path to the manifest folder and the .xlsx metadata file. To run it from the command line, use the following command with your own paths:

python3 main.py --manifest_path "/path/to/my/dir/dataset/manifest-1616439774456/" \
    --metadata_path "/path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx"

This will run the entire pipeline with the Xception model and fine-tuning. For partial use of the program, comment out particular methods in main.py, or refer to the Jupyter Notebooks. Note that the main.py script may take upward of 12 hours to run. Ensure that at least 25GB of storage is available to hold the dataset and run the preprocessing scripts.
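
For readers adapting the command-line entry point, the sketch below shows one way the two flags could be parsed and the storage requirement checked up front. It is an assumption-laden illustration, not the actual contents of main.py: only the flag names --manifest_path and --metadata_path and the 25GB figure come from this README, while parse_args, free_space_gb and REQUIRED_FREE_GB are hypothetical names.

```python
import argparse
import shutil

REQUIRED_FREE_GB = 25  # storage figure quoted in this README


def parse_args(argv=None):
    """Parse the two paths the pipeline needs (hypothetical re-creation)."""
    parser = argparse.ArgumentParser(
        description="Run the CMMD classification pipeline."
    )
    parser.add_argument("--manifest_path", required=True,
                        help="Folder containing the downloaded CMMD manifest.")
    parser.add_argument("--metadata_path", required=True,
                        help="Path to the CMMD clinical data .xlsx file.")
    return parser.parse_args(argv)


def free_space_gb(path="."):
    """Free disk space at the given path, in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    # Explicit argument list for demonstration; pass None to read sys.argv.
    args = parse_args([
        "--manifest_path", "/path/to/my/dir/dataset/manifest-1616439774456/",
        "--metadata_path", "/path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx",
    ])
    print("manifest:", args.manifest_path)
    print("metadata:", args.metadata_path)
    if free_space_gb() < REQUIRED_FREE_GB:
        print(f"Warning: less than {REQUIRED_FREE_GB} GB free in the working directory.")
```

Checking free space before starting avoids discovering a full disk several hours into a long run.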

Classification Pipeline

Contact

Craig Myles (me@craig.im)
