Commit b2bad43

[Docs] Add Weight Prepack doc (#2649)

Lu Teng authored Mar 25, 2024
1 parent 009706e commit b2bad43
Showing 4 changed files with 61 additions and 3 deletions.
7 changes: 4 additions & 3 deletions docs/README.md
@@ -43,15 +43,16 @@
 <tbody>
 <tr>
 <td colspan="2" align="center"><a href="guide/environment_variables.md">Environment variables</a></td>
-<td colspan="2" align="center"><a href="guide/python_api.md">Python API</a></td>
-<td colspan="4" align="center"><a href="guide/advanced_auto_mixed_precision.md">Advanced auto mixed precision</a></td>
+<td colspan="2" align="center"><a href="guide/python_api.md">Python API</a></td>
+<td colspan="2" align="center"><a href="guide/advanced_auto_mixed_precision.md">Advanced auto mixed precision</a></td>
 <td colspan="2" align="center"><a href="guide/itex_fusion.md">Graph optimization</a></td>
 <td colspan="2" align="center"><a href="guide/threadpool.md">CPU Thread Pool</a></td>
+<td colspan="2" align="center"><a href="guide/weight_prepack.md">Weight prepack</a></td>
 </tr>
 <tr>
 <td colspan="2" align="center"><a href="guide/itex_ops.md">Custom operator</a></td>
 <td colspan="2" align="center"><a href="guide/itex_ops_override.md">Operator override</a></td>
-<td colspan="3" align="center"><a href="guide/INT8_quantization.md">INT8 quantization</a></td>
+<td colspan="2" align="center"><a href="guide/INT8_quantization.md">INT8 quantization</a></td>
 <td colspan="2" align="center"><a href="guide/XPUAutoShard.md">XPUAutoShard</a></td>
 <td colspan="2" align="center"><a href="guide/how_to_enable_profiler.md">GPU profiler</a></td>
 <td colspan="2" align="center"><a href="guide/launch.md">CPU launcher</a></td>
Binary file added docs/guide/images/prepack_workflow.png
Binary file added docs/guide/images/weight_reorder.png
57 changes: 57 additions & 0 deletions docs/guide/weight_prepack.md
@@ -0,0 +1,57 @@
# Online Weight Prepack

## Overview

In modern deep learning frameworks, **Weight Reorder** is widely used in inference models: it converts weights from [plain layout](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html#plain-data-formats) to [blocked layout](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html#blocked-layout) for better performance. Operations then run faster, but extra memory is needed to keep the original weight in plain layout. This stored plain-layout weight is called the **master weight**.

To reduce this memory footprint, the **Weight Prepack** graph optimization is introduced. Instead of creating a new blocked-layout weight alongside the master weight, it directly replaces the master weight with the reordered weight at runtime:

<div align="center">
<table>
<tr>
<td align="center">
<img src="images/weight_reorder.png" /></br>
Fig. 1 Weight reorder & weight prepack
</td>
</tr>
</table>
</div>
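
To make Fig. 1 concrete, here is a minimal NumPy sketch of the two schemes. The 4x4 tile blocking and the shapes are illustrative assumptions only; the blocked layout oneDNN actually picks depends on the primitive, data type, and hardware.

```python
import numpy as np

# Illustrative 4x4-tile blocking; NOT the layout oneDNN actually chooses.
def reorder_to_blocked(w, block=4):
    rows, cols = w.shape                 # assume both divisible by `block`
    return (w.reshape(rows // block, block, cols // block, block)
             .transpose(0, 2, 1, 3)      # make each 4x4 tile contiguous
             .copy())

master_weight = np.ones((8, 8), dtype=np.float32)  # plain (row-major) layout

# Weight reorder: keep the plain master weight AND a blocked copy.
blocked = reorder_to_blocked(master_weight)
footprint_reorder = master_weight.nbytes + blocked.nbytes   # 2x plain size

# Weight prepack: replace the master weight with the blocked tensor,
# so the plain copy can be freed.
master_weight = reorder_to_blocked(master_weight)
footprint_prepack = master_weight.nbytes                    # 1x plain size
```
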

There are two ways to prepack weights:
* **Offline:** prepack the weights in the original model before execution, using a third-party tool. This changes the original model stored on disk.
* **Online:** prepack the weights with a framework optimization pass at runtime. This does not change the original model stored on disk, so the model stays portable.

Intel® Extension for TensorFlow* provides **Online Weight Prepack**.

## Usage & Effect
This feature is **always enabled**; no additional actions are required.

The effect of the optimization depends on the proportion of reordered weights in the model. In typical [BERT-large](https://github.com/google-research/bert) inference, it can reduce the memory footprint by ~10%.
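
As a usage illustration, here is a minimal sketch of a CPU inference graph where the optimization can apply, assuming the `intel-extension-for-tensorflow[cpu]` package is installed; the model and shapes are made up:

```python
import tensorflow as tf
# With intel-extension-for-tensorflow installed, the plugin loads
# automatically; weight prepack needs no flag or API call.

dense = tf.keras.layers.Dense(1024)

@tf.function  # graph mode, so the compilation-phase pass can run
def infer(x):
    return dense(x)

# The Dense layer's MatMul weight is a prepack candidate on CPU.
out = infer(tf.random.uniform((32, 1024)))
```
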

## Workflow
**Weight Prepack** is a graph optimization, which means it happens only once, in the compilation phase. The graph optimizer traverses the graph to find the weights that need to be prepacked. For each such weight, a corresponding [oneDNN primitive](https://oneapi-src.github.io/oneDNN/dev_guide_basic_concepts.html#primitives) is created with a proxy shape to estimate the likely blocked layout. The estimated blocked-layout info is then recorded on that weight node in the graph and used for the real operation in the execution phase (sketched in pseudocode after Fig. 2).

<div align="center">
<table>
<tr>
<td align="center">
<img src="images/prepack_workflow.png" width="70%" /></br>
Fig. 2 Weight prepack workflow
</td>
</tr>
</table>
</div>

1. Replacing the master weights with the reordered weights is the key step in reducing the memory footprint.
2. If weight prepack succeeds, the separate step that reorders the master weights is eliminated; this also reduces the number of operators during execution and improves performance.
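
The pass can be summarized with the pseudocode below. This is a hypothetical sketch of the logic described above, not the actual Intel® Extension for TensorFlow* implementation; helpers such as `find_matmul_weights` and `make_matmul_primitive` are invented for illustration.

```python
# Hypothetical pseudocode; none of these helpers are real ITEX/oneDNN APIs.
def weight_prepack_pass(graph):
    for weight in find_matmul_weights(graph):          # candidates to prepack
        proxy_shape = guess_input_shape(weight)        # real shape unknown here
        primitive = make_matmul_primitive(proxy_shape, weight.shape)
        layout = primitive.expected_weight_layout()    # estimated blocked layout
        weight.set_attr("prepacked_layout", layout)    # recorded on the node
    return graph

# Execution phase: the recorded layout is applied once and the blocked tensor
# replaces the master weight; if the estimate turns out wrong, the weight is
# reordered again to the layout the real primitive requires.
```
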

**NOTE:** The estimation of the blocked layout may not always be accurate (see [Limitation](#limitation) for more details). If the estimation fails, the prepacked weight becomes the new master weight and is reordered to the blocked layout actually required in the execution phase.

## Limitation
* A proxy shape is used to estimate the blocked layout, because the real shape is not available in the compilation phase. If the estimated layout doesn't match the real one in the execution phase, the optimization has no effect and the application behaves the same as without it.
* Only available for matrix multiplication ([MatMul](https://www.tensorflow.org/api_docs/cc/class/tensorflow/ops/mat-mul)) and related operators on CPU.
* May not work with [dynamic shapes](https://chromium.googlesource.com/external/github.com/tensorflow/tensorflow/+/r0.10/tensorflow/g3doc/resources/faq.md#tensor-shapes), since the required blocked layout can change between iterations (see the sketch below).
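
A sketch of the dynamic-shape case (illustrative model; the `None` batch dimension is what makes the shape dynamic):

```python
import tensorflow as tf

dense = tf.keras.layers.Dense(256)

# The unspecified batch dimension makes the input shape dynamic, so the
# blocked layout the MatMul primitive wants can differ between calls and
# the layout estimated at compile time may not be reusable.
@tf.function(input_signature=[tf.TensorSpec([None, 256], tf.float32)])
def infer(x):
    return dense(x)

infer(tf.zeros((1, 256)))
infer(tf.zeros((64, 256)))   # different runtime shape
```
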

## Reference
* [oneDNN Documentation - Understanding Memory Formats](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html#understanding-memory-formats)
* [TensorFlow - Frequently Asked Questions](https://chromium.googlesource.com/external/github.com/tensorflow/tensorflow/+/r0.10/tensorflow/g3doc/resources/faq.md#frequently-asked-questions)
