Distributed Training Example with Intel® Optimization for Horovod*
Model Information
| Use Case | Framework | Model Repo | Branch/Commit/Tag | Optional Patch |
| --- | --- | --- | --- | --- |
| Training | TensorFlow | Tensorflow-Models | v2.8.0 | itex.yaml, itex_dummy.yaml, hvd_support_light.patch or hvd_support.patch |
Install the other required Python packages as shown below:
pip install gin gin-config tensorflow-addons tensorflow-model-optimization tensorflow-datasets
Model examples preparation
Model Repo
WORKSPACE=xxxx # set your workspace folder
git clone -b v2.8.0 https://github.com/tensorflow/models.git tensorflow-models
cd tensorflow-models
git apply path/to/hvd_support_light.patch # or path/to/hvd_support.patch
hvd_support_light.patch contains the minimum changes needed to run the model with Horovod:
- hvd.init(): initializes Horovod, including resource allocation.
- tf.config.experimental.set_memory_growth(): when memory growth is enabled, runtime initialization does not allocate all of the device memory up front.
- tf.config.experimental.set_visible_devices(): sets the list of visible devices.
- strategy_scope: removed, so the native TensorFlow distribution strategy is not used.
- hvd.DistributedOptimizer(): wraps the optimizer with Horovod's distributed optimizer.
- dataset.shard(): all workers run the same code but on different data; the dataset is split equally among the workers by worker index.
hvd_support.patch also adds the LARS optimizer (see the LARS paper).
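The effect of dataset.shard() can be pictured with a small pure-Python model. This is only a sketch of the documented tf.data rule that shard(num_shards, index) keeps the elements whose position modulo num_shards equals index. The helper `shard`, the sample IDs, and the 2-rank setup are hypothetical. In a Horovod job the two arguments would typically come from hvd.size() and hvd.rank(), but whether the patch passes exactly those values is an assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(elements: Sequence[T], num_shards: int, index: int) -> List[T]:
    """Hypothetical stand-in for tf.data.Dataset.shard(num_shards, index).

    Keeps the elements whose position modulo num_shards equals index.
    """
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    return [x for pos, x in enumerate(elements) if pos % num_shards == index]


# Illustrative 2-rank job over ten sample IDs (made-up data).
samples = list(range(10))
world_size = 2  # assumption: plays the role of hvd.size()
for rank in range(world_size):  # assumption: plays the role of hvd.rank()
    print(rank, shard(samples, world_size, rank))
```

Every rank runs the same code, yet each sees a disjoint slice of the data, and together the slices cover the whole dataset.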
Download Dataset
Download the ImageNet dataset from https://image-net.org/download-images.php
Note: The dataset is available only for non-commercial research and/or educational purposes.
Set Model Parameters
Export these parameters in your script or environment.
export PYTHONPATH=${WORKSPACE}/tensorflow-models
Choose itex.yaml or itex_dummy.yaml and set it as CONFIG_FILE; the model then runs with real data or dummy data, respectively. The default value is itex.yaml.
Set NUMBER_OF_PROCESS and PROCESS_PER_NODE according to the number of Horovod ranks you need. The default is a 2-rank task.
HVD command
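# Start from an empty model directory: create it if missing, otherwise recreate it.
if [ ! -d "$MODEL_DIR" ]; then
    mkdir -p "$MODEL_DIR"
else
    rm -rf "$MODEL_DIR" && mkdir -p "$MODEL_DIR"
fi

# Launch one training process per Horovod rank; --prepend-rank tags each log line with its rank.
mpirun -np $NUMBER_OF_PROCESS -ppn $PROCESS_PER_NODE --prepend-rank \
python ${PYTHONPATH}/official/vision/image_classification/classifier_trainer.py \
--mode=train_and_eval \
--model_type=resnet \
--dataset=imagenet \
--model_dir=$MODEL_DIR \
--data_dir=$DATA_DIR \
--config_file=$CONFIG_FILE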
Performance Data
[1] I0909 03:33:23.323099 140645511436096 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 0 and 100
[0] I0909 03:33:23.324534 140611700504384 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 0 and 100
[0] I0909 03:33:43.037004 140611700504384 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 100 and 200
[1] I0909 03:33:43.037142 140645511436096 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 100 and 200
[1] I0909 03:34:03.213994 140645511436096 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 200 and 300
[0] I0909 03:34:03.214127 140611700504384 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 200 and 300
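To summarize such a log, the TimeHistory lines can be grouped by the rank prefix that mpirun prepends. The following is a minimal sketch, assuming the line layout shown above. The helper `throughput_by_rank`, the regular expression, and the numbers in `sample_log` are hypothetical and are not measured results. Lines whose values are still the `xxxx` placeholder do not match and are skipped.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Captures the rank prefix and the examples/second value of a TimeHistory line.
PATTERN = re.compile(
    r"\[(?P<rank>\d+)\] .*TimeHistory: (?P<secs>[\d.]+) seconds, "
    r"(?P<eps>[\d.]+) examples/second between steps (?P<start>\d+) and (?P<end>\d+)"
)


def throughput_by_rank(lines: List[str]) -> Dict[int, List[float]]:
    """Collect the reported examples/second values for each rank."""
    result: Dict[int, List[float]] = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:  # non-matching lines (e.g. 'xxxx' placeholders) are ignored
            result[int(match.group("rank"))].append(float(match.group("eps")))
    return dict(result)


# Made-up values for illustration only; they are not real performance data.
sample_log = [
    "[0] I0909 03:33:23.324534 140611700504384 keras_utils.py:145] TimeHistory: 20.0 seconds, 1000.0 examples/second between steps 0 and 100",
    "[1] I0909 03:33:23.323099 140645511436096 keras_utils.py:145] TimeHistory: 20.0 seconds, 1000.0 examples/second between steps 0 and 100",
    "[0] I0909 03:33:43.037004 140611700504384 keras_utils.py:145] TimeHistory: 10.0 seconds, 2000.0 examples/second between steps 100 and 200",
    "[1] I0909 03:33:43.037142 140645511436096 keras_utils.py:145] TimeHistory: 10.0 seconds, 2000.0 examples/second between steps 100 and 200",
]

for rank, values in sorted(throughput_by_rank(sample_log).items()):
    print(f"rank {rank}: mean {mean(values):.1f} examples/second")
```

Whether the per-rank values should be summed into a job-wide throughput depends on how each rank computes its figure, so the sketch reports them per rank only.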