This repository was created as part of my master's degree thesis.
The Chinese Mammography Database (CMMD) is a recently released mammography dataset which is rich in quality yet has not been researched extensively due to its very recent release to The Cancer Imaging Archive (TCIA). This repository provides a deep learning pipeline which can be used either partially, for CMMD data pre-processing, or as a full pipeline in which models can be interchanged or optimised as best suits. It is hoped that the research carried out within my dissertation and within this repository will act as an aid for future researchers aiming to make use of the CMMD dataset.
A PDF version of my dissertation, including model results, will be uploaded upon receiving marks/feedback on my submission. Until then, please feel free to make the most of this repository, and to contact me if you have any questions whatsoever.
Good Luck.
- CMMD Data Exploration
- CMMD Metadata Pre-processing
- Split Data to Benign/Malignant with Stratification by Patient
- Train
- Validate
- Test
- Data Augmentation
- Training
- Custom Model Definition
- AlexNet
- LeNet
- Transfer Learning
- ResNet50
- VGG16
- Xception
- Fine Tuning
- Custom Model Definition
- Testing
- Model Evaluation
- Model Predictions
- Model Metrics
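The patient-level stratified split listed above can be sketched as follows. This is an illustrative helper, not the repository's actual implementation: the function name `stratified_patient_split` and the 70/15/15 split fractions are assumptions. The key property it demonstrates is that all images from one patient land in exactly one subset, while the benign/malignant ratio stays roughly equal across subsets.

```python
import random
from collections import defaultdict

def stratified_patient_split(labels_by_patient, train_frac=0.7, val_frac=0.15, seed=42):
    """Split patient IDs into train/val/test so that no patient appears in
    more than one subset, keeping the benign/malignant ratio roughly equal.

    labels_by_patient: dict mapping patient ID -> "benign" or "malignant".
    """
    rng = random.Random(seed)
    # Group patients by class label, then split each class independently
    # so every subset inherits the overall class balance.
    by_label = defaultdict(list)
    for patient, label in labels_by_patient.items():
        by_label[label].append(patient)
    splits = {"train": [], "val": [], "test": []}
    for patients in by_label.values():
        patients.sort()          # deterministic base order before shuffling
        rng.shuffle(patients)
        n_train = int(len(patients) * train_frac)
        n_val = int(len(patients) * val_frac)
        splits["train"] += patients[:n_train]
        splits["val"] += patients[n_train:n_train + n_val]
        splits["test"] += patients[n_train + n_val:]
    return splits
```

Splitting by patient rather than by image matters because CMMD patients can have multiple views; an image-level split would leak the same patient into both train and test sets and inflate evaluation metrics.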
- Python 3.8.10
- CUDA 10.1
It should additionally be noted that this was written on Linux (Ubuntu 20.04.3 LTS). Attempt on Windows or other operating systems at your own risk.
Get required Python packages:
git clone https://github.com/CraigMyles/cggm-mammography-classification.git
cd cggm-mammography-classification
pip install -r requirements.txt
The CMMD dataset can be freely downloaded from The Cancer Imaging Archive (TCIA). You must use the NBIA Data Retriever to download The Chinese Mammography Database (CMMD).
For download instructions, follow this guide: Downloading Data from the TCIA Data Portal Using the Data Retriever.
Download the CMMD manifest and clinical data file to a folder within your working directory.
E.g.
- /path/to/my/dir/dataset/manifest-1616439774456/
- /path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx
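Before starting any preprocessing, it can help to confirm the downloaded files match the layout above. The `check_dataset_layout` helper below is a hypothetical sketch, not part of the repository:

```python
from pathlib import Path

def check_dataset_layout(manifest_path, metadata_path):
    """Sanity-check that the downloaded CMMD files are where the
    pipeline expects them: a manifest folder and a clinical data .xlsx."""
    manifest = Path(manifest_path)
    metadata = Path(metadata_path)
    if not manifest.is_dir():
        raise FileNotFoundError(f"Manifest folder not found: {manifest}")
    if metadata.suffix != ".xlsx" or not metadata.is_file():
        raise FileNotFoundError(f"Clinical data .xlsx not found: {metadata}")
    return True
```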
To run the pipeline in its entirety, begin with file 0_ and run each notebook in sequential order until 5_.
If you only require the CMMD preprocessing and metadata handling, please run the notebooks 0_0_Data_Exploration.ipynb, 0_Data_Exploration.ipynb, and 1_stratification_data_split.ipynb.
A collated main.py has been added which allows the pipeline to be run in its entirety when given a path to the manifest folder and the .xlsx metadata file. To run from the command line, use the following command with your relevant paths.
python3 main.py --manifest_path "/path/to/my/dir/dataset/manifest-1616439774456/" \
--metadata_path "/path/to/my/dir/dataset/CMMD_clinicaldata_revision.xlsx"
This will run the entire pipeline with the Xception model and fine-tuning. For partial use of the program, comment out particular method calls in main.py, or refer to the Jupyter notebooks. Please note that the main.py script may take upwards of 12 hours to run, and ensure that at least 25 GB of storage is available to hold the dataset and run the preprocessing scripts.
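The two flags shown above can be parsed with a small `argparse` setup. The sketch below mirrors the interface of the command, but is an assumption about main.py's internals, not a copy of them:

```python
import argparse

def build_parser():
    """Minimal parser mirroring the two paths the main.py command expects."""
    parser = argparse.ArgumentParser(description="Run the CMMD deep learning pipeline.")
    parser.add_argument("--manifest_path", required=True,
                        help="Path to the TCIA manifest folder.")
    parser.add_argument("--metadata_path", required=True,
                        help="Path to the CMMD clinical data .xlsx file.")
    return parser
```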
- Craig Myles (me@craig.im)