Model Optimization
Post-training model optimization is the process of improving the performance of a deep learning model by applying techniques at various levels, such as hardware, software, and network architecture. At the hardware level, models are optimized to run on specific accelerators (e.g., TPUs), while at the software level they are adapted to execute on plugin architectures with accelerated performance (e.g., OpenVINO Runtime). Furthermore, techniques such as network pruning reduce the size of a network and thereby increase inference speed; a minimal sketch of this idea follows below. Another popular technique is model quantization.
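To make the pruning idea concrete, here is a minimal, framework-agnostic sketch of magnitude-based weight pruning: the fraction of weights with the smallest absolute values is zeroed out. The function name and the use of NumPy are illustrative assumptions, not part of any specific toolkit mentioned above.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # Magnitude threshold: the k-th smallest absolute weight value.
    threshold = np.partition(flat, k - 1)[k - 1]
    # Keep only weights whose magnitude exceeds the threshold.
    mask = np.abs(weights) > threshold
    return weights * mask

# Example: prune 50% of a random weight matrix.
w = np.random.randn(4, 4).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.5)
print(f"non-zeros before: {np.count_nonzero(w)}, after: {np.count_nonzero(w_pruned)}")
```

In practice, pruned models are usually fine-tuned afterwards to recover accuracy, and the speedup depends on whether the runtime can exploit the resulting sparsity.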
This technique refers to converting the floating-point weights of a network to compact, lower-precision representations. The reduction in precision shrinks the model size and thus increases inference speed, typically with little degradation in accuracy. Along with the model's weights, the operations themselves can also be converted to the corresponding precision. Using this technique, deep learning models can be converted from 16-bit or 32-bit floating-point precision to 8-bit integer precision, potentially improving inference speed by up to a factor of four.
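As a rough illustration of the float32-to-int8 conversion described above, the sketch below applies uniform affine quantization to a weight tensor and then dequantizes it to measure the approximation error. This is a generic textbook scheme, not the exact procedure used by any particular runtime; the helper names are hypothetical.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Uniform affine quantization of float32 weights to int8."""
    w_min, w_max = float(w.min()), float(w.max())
    # Map the observed value range onto the 256 int8 levels.
    scale = (w_max - w_min) / 255.0 or 1.0  # guard against a zero range
    zero_point = round(-w_min / scale) - 128  # integer that w_min maps to -128
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
print(f"size: {w.nbytes} B -> {q.nbytes} B (4x smaller)")
print(f"max abs error: {np.abs(w - w_hat).max():.6f}")
```

The 4x storage reduction here mirrors the potential inference speedup mentioned above, although the actual speedup depends on hardware support for int8 arithmetic.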