GitHub - dell-ai-engineering/BigDL-ImageProcessing-Examples: Notebook to train an AI model to detect diseases in Chest Xrays

Summary

In this notebook, we demonstrate how to build an end-to-end deep learning pipeline on Spark using Analytics Zoo for an image processing problem. Distributed Spark worker nodes train the deep learning model at scale. We used the Chest X-ray dataset released by the National Institutes of Health (NIH) to develop AI models that diagnose pneumonia, emphysema, and other thoracic pathologies from chest X-rays. Using the Stanford University CheXNet model as inspiration, we explore ways of developing accurate models for this problem on a distributed Spark cluster. We explore various neural network topologies to gain insight into which types of neural networks scale well in parallel and reduce training time from days to hours.

Refer to the white paper for more information on this study.

Environment

Python 2.7 or higher
JDK 8
Apache Spark 2.1.1 or higher
Jupyter Notebook 4.1, or spark-submit from the command-line interface (CLI)
BigDL 0.7.0 or higher
Analytics Zoo 0.4.0 or higher
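
The "or higher" requirements above can be checked with a small version comparison. The following is a minimal sketch, not part of the repository: the `MINIMUMS` table copies the numbers from the list, while the function names and the idea of passing installed versions as a plain dictionary are assumptions made for illustration. How the installed versions are discovered on a given cluster is left open.

```python
# Minimum versions copied from the Environment list; the keys are
# illustrative labels, not package names looked up by any tool.
MINIMUMS = {
    "python": "2.7",
    "spark": "2.1.1",
    "bigdl": "0.7.0",
    "analytics-zoo": "0.4.0",
}


def parse_version(text):
    """Turn '2.1.1' into (2, 1, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.strip().split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad with zeros so that '2.1' compares equal to '2.1.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_versions(installed):
    """Return the components that are missing or older than required."""
    return [
        name
        for name, minimum in sorted(MINIMUMS.items())
        if name not in installed or not meets_minimum(installed[name], minimum)
    ]
```

For example, `check_versions({"python": "3.6.8", "spark": "2.1.0", "bigdl": "0.7.0", "analytics-zoo": "0.4.0"})` reports only `spark`, because 2.1.0 is below the 2.1.1 minimum.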

Hardware Infrastructure

A Hadoop cluster with at least 4 nodes, with 170 GB of driver memory and 170 GB of executor memory.

Download and Install Analytics Zoo and BigDL

Download BigDL from this link and Analytics Zoo from this link.
Follow these documentation links for the detailed steps on how to install and configure BigDL and Analytics Zoo.

Run Jupyter Notebook

Run export SPARK_HOME=<the root directory of Spark> (for example, /opt/cloudera/parcels/SPARK2-2.1.0.cloudera2-1.cdh5.7.0.p0.171658/lib/spark2). Do not put spaces around the equals sign.
Run export ANALYTICS_ZOO_HOME=<the folder where you extracted the downloaded Analytics Zoo zip package> (for example, /usr/lib/zoo).
Run the following bash command to start the Jupyter notebook. Adjust the parameter settings as needed:

$ANALYTICS_ZOO_HOME/bin/jupyter-with-zoo.sh  \
    --master yarn \
    --num-executors 4 \
    --executor-cores 16 \
    --driver-memory 170g \
    --executor-memory 170g
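
Before launching, it can help to confirm that both environment variables point at real directories and that the launcher script exists. The following is a minimal sketch under stated assumptions: the function name `check_launch_env` is hypothetical, and the only layout it relies on is the `$ANALYTICS_ZOO_HOME/bin/<script>` path used in the command above.

```python
import os


def check_launch_env(env, script="jupyter-with-zoo.sh"):
    """Return a list of problems; an empty list means the launch can proceed."""
    problems = []
    for name in ("SPARK_HOME", "ANALYTICS_ZOO_HOME"):
        value = env.get(name, "")
        if not value:
            problems.append(name + " is not set")
        elif not os.path.isdir(value):
            problems.append(name + " is not a directory: " + value)

    # Only look for the launcher when ANALYTICS_ZOO_HOME itself is usable.
    zoo_home = env.get("ANALYTICS_ZOO_HOME", "")
    if zoo_home and os.path.isdir(zoo_home):
        launcher = os.path.join(zoo_home, "bin", script)
        if not os.path.isfile(launcher):
            problems.append("launcher not found: " + launcher)
    return problems
```

Calling `check_launch_env(os.environ)` checks the current shell session. Passing `script="spark-submit-with-zoo.sh"` checks the launcher used for batch submission instead.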

Run spark-submit

Run export SPARK_HOME=<the root directory of Spark> (for example, /opt/cloudera/parcels/SPARK2-2.1.0.cloudera2-1.cdh5.7.0.p0.171658/lib/spark2). Do not put spaces around the equals sign.
Run export ANALYTICS_ZOO_HOME=<the folder where you extracted the downloaded Analytics Zoo zip package> (for example, /usr/lib/zoo).
Run the following spark-submit command. Adjust the parameter settings as needed:

$ANALYTICS_ZOO_HOME/bin/spark-submit-with-zoo.sh \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 4 \
    --executor-cores 8 \
    --driver-memory 300g \
    --executor-memory 300g \
    path/to/python_file.py \
    batch_size \
    num_epochs \
    path/to/pretrained_model_file \
    path/to/dataset \
    path/to/save_the_model
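
The command passes five positional arguments after the Python file: batch size, number of epochs, pretrained model path, dataset path, and output path. The following is a minimal sketch of how a training script might read them. It is not the repository's actual script: the argument names and both helper functions are hypothetical. The rounding helper reflects the assumption that BigDL expects the batch size to be a multiple of the total number of executor cores; verify that against the BigDL documentation for your version.

```python
import argparse


def build_parser():
    """Positional arguments in the same order as the spark-submit command."""
    parser = argparse.ArgumentParser(description="Chest X-ray training job (illustrative)")
    parser.add_argument("batch_size", type=int)
    parser.add_argument("num_epochs", type=int)
    parser.add_argument("pretrained_model_path")
    parser.add_argument("dataset_path")
    parser.add_argument("save_path")
    return parser


def parse_job_args(argv):
    """Parse and sanity-check the five job arguments."""
    args = build_parser().parse_args(argv)
    if args.batch_size <= 0 or args.num_epochs <= 0:
        raise ValueError("batch_size and num_epochs must be positive integers")
    return args


def round_up_batch_size(batch_size, num_executors, executor_cores):
    """Round batch_size up to the next multiple of the total core count.

    Assumption: the framework wants batch sizes divisible by
    num_executors * executor_cores.
    """
    total_cores = num_executors * executor_cores
    # Ceiling division using integer arithmetic only.
    return -(-batch_size // total_cores) * total_cores
```

Inside a script, `parse_job_args(sys.argv[1:])` would read the values supplied on the command line. With the settings shown above (4 executors with 8 cores each, so 32 cores), `round_up_batch_size(1000, 4, 8)` gives 1024.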

Folders and files

ChestXray-Training.ipynb
ChestXray_Inference.ipynb
LICENSE
README.md
