From 95b835b138c47d3e8bc7a33ced6ee8bfb59b8807 Mon Sep 17 00:00:00 2001 From: "Wang, Kai Lawrence" <109344418+wangkl2@users.noreply.github.com> Date: Fri, 22 Mar 2024 09:54:34 +0800 Subject: [PATCH] [Docs] Add itex performance document on Intel GPU (#2652) --- docs/README.md | 5 +- docs/build_docs/source/index.rst | 1 + docs/guide/performance.md | 235 +++++++++++++++++++++++++++++++ 3 files changed, 239 insertions(+), 2 deletions(-) create mode 100644 docs/guide/performance.md diff --git a/docs/README.md b/docs/README.md index 42a0664d4..70f10d238 100644 --- a/docs/README.md +++ b/docs/README.md @@ -17,8 +17,9 @@ Releases + Performance data Frequently asked questions - Contributing guidelines + Contributing guidelines @@ -122,4 +123,4 @@ * OpenXLA Support on GPU [Experimental] - Intel® Extension for TensorFlow\* adopts a uniform Device API PJRT as the supported device plugin mechanism to implement Intel GPU backend for OpenXLA experimental support. \ No newline at end of file + Intel® Extension for TensorFlow\* adopts a uniform Device API PJRT as the supported device plugin mechanism to implement Intel GPU backend for OpenXLA experimental support. diff --git a/docs/build_docs/source/index.rst b/docs/build_docs/source/index.rst index 3c31f0daa..b63a1f517 100644 --- a/docs/build_docs/source/index.rst +++ b/docs/build_docs/source/index.rst @@ -8,6 +8,7 @@ Welcome to Intel ® Extension for TensorFlow* documentation! 
get_started.md docs/guide/infrastructure.md docs/guide/features.rst + docs/guide/performance.md docs/install/installation_guide.rst examples/README.md docs/guide/practice_guide.md diff --git a/docs/guide/performance.md b/docs/guide/performance.md new file mode 100644 index 000000000..947947f65 --- /dev/null +++ b/docs/guide/performance.md @@ -0,0 +1,235 @@ +# Performance Data + +- [Overview](#overview) +- [Models](#models) + - [Training Workloads](#training-workloads) + - [Inference Workloads](#inference-workloads) +- [Training Accuracy Results](#training-accuracy-results) + - [Training Accuracy on 1-node of 4x Intel Data Center GPU Max 1550](#training-accuracy-on-1-node-of-4x-intel-data-center-gpu-max-1550) +- [Training Performance Results](#training-performance-results) + - [Training Performance on 1-node of 4x Intel Data Center GPU Max 1550](#training-performance-on-1-node-of-4x-intel-data-center-gpu-max-1550) + - [ResNet50v1-5 Training Performance Results](#resnet50v1-5-training-performance-results) + - [BERT-Large Phase2 Training Performance Results](#bert-large-phase2-training-performance-results) + - [Mask-RCNN Training Performance Results](#mask-rcnn-training-performance-results) + - [Medical Image 3D U-Net Training Performance Results](#medical-image-3d-u-net-training-performance-results) +- [Inference Performance Results](#inference-performance-results) + - [Inference Performance on 1x Intel Data Center GPU Flex 170](#inference-performance-on-1x-intel-data-center-gpu-flex-170) + - [ResNet50v1-5 Inference Performance Results](#resnet50v1-5-inference-performance-results) + - [EfficientNet-B0 Inference Performance Results](#efficientnet-b0-inference-performance-results) + - [EfficientNet-B3 Inference Performance Results](#efficientnet-b3-inference-performance-results) + - [Mask-RCNN Inference Performance Results](#mask-rcnn-inference-performance-results) + - [Stable Diffusion v1-4 Inference Performance 
Results](#stable-diffusion-v1-4-inference-performance-results)
- [Configuration](#configuration)
  - [Software Configuration](#software-configuration)
    - [Software Configuration for Intel Max 1550 GPU](#software-configuration-for-intel-max-1550-gpu)
    - [Software Configuration for Intel Flex 170 GPU](#software-configuration-for-intel-flex-170-gpu)
  - [Hardware Configuration](#hardware-configuration)
    - [Hardware Configuration for Intel Max 1550 GPU](#hardware-configuration-for-intel-max-1550-gpu)
    - [Hardware Configuration for Intel Flex 170 GPU](#hardware-configuration-for-intel-flex-170-gpu)
- [Additional Performance Data for Intel AI Data Center Products](#additional-performance-data-for-intel-ai-data-center-products)

## Overview

This document presents training and inference performance and accuracy results for several popular AI workloads benchmarked with Intel® Extension for TensorFlow\* on Intel GPUs. You can reproduce these results by following the guidelines in [examples](../../examples/README.md).

## Models

The following tables link, for each model, to the original code repository and to a step-by-step guide for running it on Intel GPUs.
### Training Workloads

|Model|Original Model Repo|ITEX Step-by-Step Guide|
|-|-|-|
|ResNet50v1.5|[TensorFlow-Models/ResNet50v1.5](https://github.com/tensorflow/models/tree/v2.14.0/official/legacy/image_classification/)|[Resnet50 train on Intel GPU](../../examples/train_resnet50/README.md)|
|BERT-Large|[DeepLearningExamples/BERT](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/LanguageModeling/BERT/)|[Accelerate BERT-Large Pretraining on Intel GPU](../../examples/pretrain_bert/README.md)|
|Mask-RCNN|[DeepLearningExamples/Mask-RCNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN/)|[Accelerate Mask R-CNN Training on Intel GPU](../../examples/train_maskrcnn/README.md)|
|3D-UNet|[DeepLearningExamples/3D-UNet](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Segmentation/UNet_3D_Medical/)|[Accelerate 3D-UNet Training for medical image segmentation on Intel GPU](../../examples/train_3d_unet/README.md)|

### Inference Workloads

|Model|Original Model Repo|ITEX Step-by-Step Guide|
|-|-|-|
|ResNet50v1.5|[Intel-Reference-Models/ResNet50v1.5](https://github.com/IntelAI/models/tree/v3.1.0/models_v2/tensorflow/resnet50v1_5/inference/gpu/)|[ResNet50v1.5 Model Inference with Intel® Extension for TensorFlow\*](https://github.com/IntelAI/models/tree/v3.1.0/models_v2/tensorflow/resnet50v1_5/inference/gpu/)|
|EfficientNet-B0|[Keras-Applications/EfficientNet](https://keras.io/api/applications/efficientnet/)|Use the same code and instructions as in the original model repo|
|EfficientNet-B3|[Keras-Applications/EfficientNet](https://keras.io/api/applications/efficientnet/)|Use the same code and instructions as in the original model repo|
|Mask-RCNN|[DeepLearningExamples/Mask-RCNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN/)|Use the same code and instructions as in the original model repo|
|Stable Diffusion v1-4|[KerasCV/Stable-Diffusion](https://github.com/keras-team/keras-cv/tree/master/keras_cv/models/stable_diffusion)|[Stable Diffusion Inference for Text2Image on Intel GPU](../../examples/stable_diffussion_inference/README.md)|

## Training Accuracy Results

### Training Accuracy on 1-node of 4x Intel Data Center GPU Max 1550

The following table shows the BERT-Large throughput, training loss, and time-to-train (TTT) results for both the pre-training and fine-tuning phases on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU).

||Pre-training Phase1|Pre-training Phase2|Fine-Tuning|
|-|-|-|-|
|**Dataset**|[Wikipedia](https://dumps.wikimedia.org/) and [BookCorpus](https://yknzhu.wixsite.com/mbweb/)|[Wikipedia](https://dumps.wikimedia.org/) and [BookCorpus](https://yknzhu.wixsite.com/mbweb/)|[SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) 1.1|
|**Maximum Sequence Length**|128|512|384|
|**Data Type**|BF16|BF16|BF16|
|**Throughput (sequences/sec)**|3265.35|699.25|523.55|
|**Time to Train (hours)**|39.32|20.40|0.67|
|**Loss**|1.6047|1.3870|0.6867|

## Training Performance Results

### Training Performance on 1-node of 4x Intel Data Center GPU Max 1550

The following tables show the performance numbers for several popular training workloads on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU). For each workload, we benchmark both FP32 training and BF16 automatic mixed precision (AMP) training on 1 stack of a single Max 1550, on both stacks of a single Max 1550, and on 4x Max 1550 (8 stacks in total), to showcase the performance boost and scalability delivered by Intel® Extension for TensorFlow\* and Intel® Optimization for Horovod\*.
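The derived columns in the training tables that follow are plain throughput ratios: the AMP speedup divides BF16 throughput by TF32 throughput at the same scale, and weak scaling divides each configuration's throughput by its 1-stack baseline. A minimal sketch of these calculations (the helper names are illustrative, not part of ITEX), using the ResNet50v1-5 numbers reported below:

```python
# Derived metrics in the training tables are simple throughput ratios.
# Helper names here are illustrative only.

def amp_speedup(bf16_tput: float, tf32_tput: float) -> float:
    """BF16 (AMP) throughput relative to TF32 throughput at the same scale."""
    return bf16_tput / tf32_tput

def weak_scaling(tput: float, one_stack_tput: float) -> float:
    """Throughput relative to the 1-stack baseline of the same data type."""
    return tput / one_stack_tput

# ResNet50v1-5 BF16 throughputs (images/sec) from the table below
one_stack, two_stack, four_gpus = 1766.53, 3461.86, 12278.32

print(round(amp_speedup(one_stack, 918.96), 2))      # 1.92 -> "1.92x" speedup
print(round(weak_scaling(two_stack, one_stack), 2))  # 1.96 weak scaling
print(round(weak_scaling(four_gpus, one_stack), 2))  # 6.95 weak scaling
```

The same ratios reproduce the speedup and weak-scaling columns of the BERT-Large, Mask-RCNN, and 3D U-Net tables.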
> **Note**: For each workload below, the `1x Max 1550 w/ 1-Stack` result is the minimum of the throughputs measured on the two stacks of a single GPU: two instances are launched simultaneously, each stack running the workload independently without distributed training.

#### ResNet50v1-5 Training Performance Results

|GPUs|Ranks|Local Batch Size: FP32, BF16|Training Steps|Throughput w/ TF32 (images/sec)|Throughput w/ BF16 (images/sec)|Throughput Speedup w/ AMP|Weak Scaling w/ TF32|Weak Scaling w/ BF16|
|-|-|-|-|-|-|-|-|-|
|1x Max 1550 w/ 1-Stack|1|256, 512|5000|918.96|1766.53|1.92x|1.00|1.00|
|1x Max 1550 w/ 2-Stack|2|256, 512|5000|1762.76|3461.86|1.96x|1.92|1.96|
|4x Max 1550|8|256, 256|5000|NA|12278.32|NA|NA|6.95|

#### BERT-Large Phase2 Training Performance Results

|GPUs|Ranks|Local Batch Size x Accumulation Steps|Training Steps|Throughput w/ TF32 (sequences/sec)|Throughput w/ BF16 (sequences/sec)|Throughput Speedup w/ AMP|Weak Scaling w/ TF32|Weak Scaling w/ BF16|
|-|-|-|-|-|-|-|-|-|
|1x Max 1550 w/ 1-Stack|1|32 x 30|20|36.22|93.22|2.57x|1.00|1.00|
|1x Max 1550 w/ 2-Stack|2|32 x 30|20|74.40|182.57|2.45x|2.05|1.96|
|4x Max 1550|8|32 x 30|20|NA|692.11|NA|NA|7.42|

#### Mask-RCNN Training Performance Results

|GPUs|Ranks|Local Batch Size|Training Steps|Throughput w/ BF16 (images/sec)|Weak Scaling w/ BF16|
|-|-|-|-|-|-|
|1x Max 1550 w/ 1-Stack|1|4|20|29.03|1.00|
|1x Max 1550 w/ 2-Stack|2|4|20|55.51|1.91|

#### Medical Image 3D U-Net Training Performance Results

|GPUs|Ranks|Local Batch Size|Training Steps|Throughput w/ BF16 (samples/sec)|Weak Scaling w/ BF16|
|-|-|-|-|-|-|
|1x Max 1550 w/ 1-Stack|1|1|1000|12.81|1.00|
|1x Max 1550 w/ 2-Stack|2|1|1000|23.56|1.84|
|4x Max 1550|8|1|1000|87.07|6.80|

## Inference Performance Results

### Inference Performance on 1x Intel Data Center GPU Flex 170

The following tables show the performance numbers for several popular inference workloads on 1x Intel® Data Center GPU Flex 170 (150W PCIe, 1-stack for each GPU).

> **Note**: Inference in online mode runs the workload with a batch size of 1, while batch mode uses a larger batch size.
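Throughput in the inference tables can be converted into average time per batch (and, for online mode, per-image latency) as batch size divided by throughput. A small sketch of that arithmetic (the helper name is illustrative), using the ResNet50v1-5 inference numbers reported below:

```python
def batch_time_ms(batch_size: int, images_per_sec: float) -> float:
    """Average wall time to process one batch, in milliseconds."""
    return 1000.0 * batch_size / images_per_sec

# Online mode (batch size 1): throughput is the inverse of per-image latency
print(round(batch_time_ms(1, 435.01), 2))      # ~2.30 ms per image
# Batch mode (batch size 1024) trades per-image latency for throughput
print(round(batch_time_ms(1024, 9842.75), 1))  # ~104.0 ms per batch
```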
#### ResNet50v1-5 Inference Performance Results

|GPUs|Dataset|Image Size|Mode|Batch Size|Data Type|Inference Steps|Throughput (images/sec)|
|-|-|-|-|-|-|-|-|
|1x Flex 170|Dummy|224x224|Online|1|INT8|5000|435.01|
|1x Flex 170|Dummy|224x224|Batch|1024|INT8|5000|9842.75|

#### EfficientNet-B0 Inference Performance Results

|GPUs|Dataset|Image Size|Mode|Batch Size|Data Type|Inference Steps|Throughput (images/sec)|
|-|-|-|-|-|-|-|-|
|1x Flex 170|Dummy|224x224|Batch|64|FP16 (AMP)|50|3007.60|
|1x Flex 170|Dummy|224x224|Batch|128|FP16 (AMP)|50|3587.29|

#### EfficientNet-B3 Inference Performance Results

|GPUs|Dataset|Image Size|Mode|Batch Size|Data Type|Inference Steps|Throughput (images/sec)|
|-|-|-|-|-|-|-|-|
|1x Flex 170|Dummy|300x300|Batch|64|FP16 (AMP)|50|928.56|
|1x Flex 170|Dummy|300x300|Batch|128|FP16 (AMP)|50|968.83|

#### Mask-RCNN Inference Performance Results

|GPUs|Dataset|Mode|Batch Size|Data Type|Inference Steps|Throughput (images/sec)|
|-|-|-|-|-|-|-|
|1x Flex 170|COCO 2017|Online|1|FP16 (AMP)|5000|19.38|
|1x Flex 170|COCO 2017|Batch|16|FP16 (AMP)|312|43.02|

#### Stable Diffusion v1-4 Inference Performance Results

|GPUs|Dataset|Output Image Size|Mode|Batch Size|Data Type|Diffusion Steps|Throughput (iterations/sec)|Throughput Speedup w/ FP16|
|-|-|-|-|-|-|-|-|-|
|1x Flex 170|Text Prompt|512x512|Online|1|FP32|50|2.91|1.00x|
|1x Flex 170|Text Prompt|512x512|Online|1|FP16 (pure)|50|6.53|2.24x|

## Configuration

### Software Configuration

#### Software Configuration for Intel Max 1550 GPU

|Software Component|Version|
|-|-|
|GPU Driver|[736.25](https://dgpu-docs.intel.com/releases/stable_736_25_20231031.html)|
|Intel® oneAPI Base Toolkit|2024.0|
|TensorFlow|v2.14.0|
|Intel® Extension for TensorFlow\*|v2.14.0.1|
|Intel® Optimization for Horovod\*|v0.28.1.2|

#### Software Configuration for Intel Flex 170 GPU

|Software Component|Version|
|-|-|
|GPU Driver|[736.25](https://dgpu-docs.intel.com/releases/stable_736_25_20231031.html)|
|Intel® oneAPI Base Toolkit|2024.0|
|TensorFlow|v2.14.0|
|Intel® Extension for TensorFlow\*|v2.14.0.1|

### Hardware Configuration

#### Hardware Configuration for Intel Max 1550 GPU

|GPU System|4x Intel® Data Center GPU Max 1550|
|-|-|
|**Number of Nodes**|1|
|**Xe®-Cores per GPU**|128 in total across 2 stacks|
|**Memory Size per GPU**|128 GB HBM2e in total across 2 stacks|
|**TDP per GPU**|600W|
|**GPU ECC Setting**|OFF|
|**Server Board**|Intel® Denali Pass D50DNP1SBB|
|**OS**|SUSE Linux Enterprise Server 15 SP4|
|**Kernel**|5.14.21-150400.24.69-default|
|**CPU Model**|Intel® Xeon® Platinum 8480+ @ 2.00 GHz|
|**Number of Sockets**|2|
|**CPU Cores per Socket**|56|
|**Hyper Threading**|ON|
|**Turbo Boost**|ON|
|**Automatic NUMA Balancing**|Enabled|
|**CPU Frequency Governor**|Performance|
|**TDP per CPU**|350W|
|**Installed Memory**|1024GB (16x64GB 4800 MT/s DDR5)|
|**NIC**|1x Intel® Ethernet Controller X710 for 10GBASE-T|
|**Storage**|1x WD® WD_BLACK SN850X 2TB NVMe SSD|

#### Hardware Configuration for Intel Flex 170 GPU

|GPU System|1x Intel® Data Center GPU Flex 170|
|-|-|
|**Number of Nodes**|1|
|**Xe®-Cores per GPU**|32|
|**Memory Size per GPU**|16 GB GDDR6|
|**TDP per GPU**|150W|
|**GPU ECC Setting**|ON|
|**Server Board**|Intel® Whitley|
|**OS**|Ubuntu 22.04.3 LTS|
|**Kernel**|5.15.0-57-generic|
|**CPU Model**|Intel® Xeon® Gold 6336Y CPU @ 2.40GHz|
|**Number of Sockets**|2|
|**CPU Cores per Socket**|24|
|**Hyper Threading**|ON|
|**Turbo Boost**|ON|
|**Automatic NUMA Balancing**|Enabled|
|**CPU Frequency Governor**|Performance|
|**TDP per CPU**|185W|
|**Installed Memory**|128GB (8x16GB 3200 MT/s DDR4)|
|**NIC**|2x Intel® Ethernet Controller X710 for 10GBASE-T, 1x Intel® 82574L Gigabit Ethernet Controller|
|**Storage**|1x Intel® SSDSC2KG960G8, 1x Samsung® 870 EVO 1TB SSD|

## Additional Performance Data for Intel AI Data Center Products

You can find the latest performance data on other Intel® AI Data Center Products, such as 3rd, 4th, and 5th Gen Intel® Xeon® Scalable processors, at [Performance Data for Intel® AI Data Center Products](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/performance.html/).