Deep Learning Model Compression and Optimization

7 min read

Large enterprises have vast computational requirements, often serving hundreds of thousands of users. To accelerate business operations and increase customer satisfaction, it has become increasingly necessary to put machine learning (ML) into production, increasingly in the form of deep learning models.

Deep learning models consume substantial computational resources during both training and inference. Training involves iterating over large datasets dozens or hundreds of times, resulting in sustained hardware load and high memory usage.

Currently, training on edge devices differs a lot from “normal” training on a cluster of GPUs in the cloud; usually it means only fine-tuning pre-trained models for a few epochs. On-device inference, on the other hand, is widely used in applications today.

Still, there are constraints that make it difficult to run inference of current neural networks on edge devices:

  • Storage size
    The weights of most current models take up from 50 to 500 MB, while the storage of current mobile phones ranges from 32 to 512 GB. A single model might not seem like a lot, but you have dozens of applications on your phone, many of which use more than one neural network, and that storage usage adds up. Additionally, the application stores run by companies like Google and Apple impose limits on the maximum size of an application.
  • CPU utilization
    A mobile phone CPU is shared by many applications simultaneously and is not as powerful as a personal computer CPU. Neural network inference requires a lot of computation, which can make the phone run slower and lead to a bad user experience.
  • Energy consumption
    As previously noted, neural network inference requires a lot of computation, which also means high CPU utilization. The higher the CPU usage, the more energy is consumed, and a mobile phone can run out of battery in just a few hours.

Given these constraints, it is critical to optimize the computational demand of deep learning models and compress their weights.

Compression and optimization methods

In the last several years, artificial intelligence (AI) researchers have made great strides in developing neural network compression and optimization methods. Some of these efforts are focused on the time and computational efficiency of the deep learning models, while others are developed to shrink the size of the neural network.

Knowledge distillation

This method transfers knowledge from a large “teacher” model that cannot be used on edge devices due to its size or computational cost to a smaller “student” model. The result is a small neural network with comparable performance. However, this method requires training the student from scratch, and its effectiveness varies from task to task.
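In practice, the teacher-student transfer is usually implemented as an extra loss term. A minimal PyTorch sketch, following the common temperature-scaled formulation (the temperature `T` and weighting `alpha` are illustrative hyperparameters, not values from this article):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine soft teacher targets with the usual hard-label loss.

    T is the softmax temperature; alpha balances the two terms.
    """
    # KL divergence between softened student and teacher distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

During training, the teacher runs in inference mode and only the student's parameters are updated against this combined loss.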

Weight sharing

Weight sharing, sometimes described as a form of quantization, reduces the number of distinct weight values by gathering them into clusters; all weights in a cluster then share a single value. Such optimization results in approximately the same performance while reducing the size of the network by 10 to 15 times.
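The clustering step can be as simple as a one-dimensional k-means over a layer's weight tensor. A minimal sketch (real toolkits use faster, smarter variants; the function and its defaults are illustrative):

```python
import torch

def share_weights(w, n_clusters=16, n_iters=20):
    """Cluster a weight tensor so it contains only n_clusters distinct values."""
    flat = w.flatten()
    # Initialize centroids evenly across the weight range.
    centroids = torch.linspace(flat.min().item(), flat.max().item(), n_clusters)
    for _ in range(n_iters):
        # Assign each weight to its nearest centroid.
        assign = (flat[:, None] - centroids[None, :]).abs().argmin(dim=1)
        # Move each centroid to the mean of its assigned weights.
        for k in range(n_clusters):
            mask = assign == k
            if mask.any():
                centroids[k] = flat[mask].mean()
    # Replace every weight with its cluster's shared value.
    return centroids[assign].reshape(w.shape)
```

With 16 clusters, each weight can be stored as a 4-bit index into a small codebook of shared values, which is where the large size reduction comes from.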


Pruning

Pruning involves deleting neurons, whole channels, or even layers that have the lowest influence on the result. This method is popular because it simultaneously decreases the model size and speeds up inference.


Quantization

Quantization, or numerical precision change, is the most popular technique to reduce the size of a network. The main idea behind this method is that the weights of a neural network do not necessarily need 32-bit precision and can be reduced to 16, 8, or even 4 bits, resulting in a high compression rate and a speedup.


Binarization

Binarization is an extreme case of quantization in which each weight is represented by a single bit, i.e., one of two values. Its compression rate and speedup are among the highest of all compression and optimization techniques. However, this technique should be used carefully since it can drastically decrease the accuracy of the model.
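One common formulation (the sign-plus-scale scheme popularized by XNOR-Net; other variants exist) keeps only the sign of each weight along with a single full-precision scaling factor per tensor:

```python
import torch

def binarize(w):
    """Sign-plus-scale binarization: every weight becomes +alpha or -alpha."""
    alpha = w.abs().mean()        # one full-precision scaling factor
    return alpha * torch.sign(w)  # signs need only 1 bit each to store
```

Only the 1-bit signs and the scalar `alpha` need to be stored, and multiplications can be replaced by sign flips, which is the source of the large speedup.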



To test different compression methods, we applied them to three different CNN classifiers: MobileNetV2, ResNet50, and SqueezeNet1.1. The ImageNet dataset was used to benchmark the accuracy of these models.

PyTorch Quantization

There are three types of quantization in PyTorch: dynamic quantization, static quantization, and quantization-aware training (QAT).

Dynamic quantization

This is the simplest form of quantization to apply. The weights are quantized ahead of time, while the activations are quantized dynamically during inference. The main limitation of this method is that it currently supports only a few layer types (fully connected, LSTM, and RNN layers), which makes it of little use for compressing CNNs and some other NN architectures. When compressing a CNN, only those supported layers will be quantized, which you can see from our results:

Compression table

MobileNetV2 and ResNet50 both have one fully-connected layer, which is why we see some compression in terms of the model size. For SqueezeNet1.1 we do not see that because there are no fully connected, LSTM, or RNN layers.
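Applying dynamic quantization takes a single call. A minimal sketch with a toy model (illustrative, not one of the benchmarked networks) that shows why only certain layers get compressed:

```python
import torch
import torch.nn as nn

# A toy model: dynamic quantization will only touch the Linear layer.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),  # left in float32 (unsupported layer type)
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 10),      # replaced by an int8 dynamic layer
)

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 3, 32, 32)
print(quantized(x).shape)  # torch.Size([1, 10])
```

The convolution runs in float as before; only the fully connected layer's weights are stored in int8, which mirrors the partial compression seen in the table above.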

Static quantization

Static quantization, or post-training quantization, quantizes the weights ahead of time, just like dynamic quantization. In addition, it involves a calibration step performed on a subset of the data. Calibration determines how the different activations should be quantized at inference time.

In practice, with static quantization, we were able to obtain a model compression rate of 3 to 4 times, along with an inference speedup of 1.2 to 3 times, depending on the NN architecture:

NN Architecture
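The eager-mode static quantization workflow in PyTorch is: insert quant/dequant stubs, attach a qconfig, prepare, calibrate, convert. A minimal sketch with a toy model (assumes an x86 machine, hence the "fbgemm" backend):

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """Toy CNN wrapped with quant/dequant stubs for eager-mode quantization."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> int8
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> float
    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = SmallNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)

# Calibration: run a few representative batches so the inserted
# observers can record activation ranges.
for _ in range(4):
    prepared(torch.randn(1, 3, 32, 32))

quantized = torch.quantization.convert(prepared)
```

The calibration loop is where "how the different activations should be quantized" is decided: the observers pick per-layer scale and zero-point values from the data they see.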

Quantization aware training (QAT)

During QAT, the model behaves during training as it will with quantized weights and activations. All computations are performed with floating-point weights and activations, but after each operation the values are rounded to the nearest quantized level ("fake quantization"). After training, the weights and activations are actually quantized.

This method yields the best accuracy in comparison to dynamic and static quantization, but it requires training from scratch.
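The QAT workflow mirrors static quantization, except preparation happens before training so that fake quantization runs in every forward pass. A minimal sketch with a toy model and throwaway regression data (assumes the "fbgemm" backend):

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.quantization.DeQuantStub()
    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = SmallNet()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
model.train()
prepared = torch.quantization.prepare_qat(model)

# Training loop: the forward pass rounds values to simulate int8,
# so the weights learn to tolerate quantization error.
opt = torch.optim.SGD(prepared.parameters(), lr=0.01)
for _ in range(10):
    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = nn.functional.mse_loss(prepared(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, convert to a real int8 model.
quantized = torch.quantization.convert(prepared.eval())
```

Because the network trains against the rounding error it will see at inference time, the converted model typically loses less accuracy than one quantized after the fact.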

PyTorch Pruning

In PyTorch, there are two types of pruning: local and global. In local pruning, neurons are pruned based on the statistics of one selected layer; in global pruning, based on the statistics of the whole network.

Pruning in PyTorch is performed by assigning a zero value to the weights and biases of the selected neurons. In practice, that means the model architecture remains the same, with no actual compression or speedup out of the box.
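A minimal sketch using the built-in `torch.nn.utils.prune` module (the layer and the 50% ratio are illustrative), which shows the zero-masking behavior described above:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(64, 64)

# Local, unstructured pruning: zero out the 50% of this layer's weights
# with the smallest absolute value (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.5)

# The mask is applied on the fly during forward passes;
# prune.remove makes the zeros permanent in the weight tensor.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean()
print(f"{sparsity:.0%} of weights are zero")  # 50% of weights are zero
```

Note that the tensor keeps its original shape and dtype; to turn the sparsity into real compression or speedup, the zeros must be exploited downstream, e.g. by sparse storage formats or sparse kernels.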

CoreML quantization

CoreML is Apple’s framework for working with deep learning models on mobile devices. Python’s coremltools package enables you to quantize CoreML models through a simple API. Weights can be quantized to 16 bits, 8 bits, 7 bits, and so on, down to 1 bit. When quantizing to 8 bits or fewer, you can choose one of several quantization modes, and there is additional flexibility in selecting which layers to quantize.

In practice, we have found out that quantization to 16 and 8 bits with CoreML leads to the same accuracy and inference time, but the model size is reduced proportionally to the number of bits:


There are also toolkits like TensorFlow Lite (quantization, pruning, and weight sharing), Microsoft NNI (quantization and pruning), and Distiller from IntelLabs (quantization, pruning, knowledge distillation, and more), with which you can compress and optimize your model.

SoftServe has proven expertise in helping organizations solve their most pressing deep learning model challenges including data engineering, machine learning, and production deployment for inference at scale.

LET’S TALK about how SoftServe experts can accelerate your business with the power of deep learning model compression and optimization.

Want to learn more about how mobile deep learning models improve user experience by moving computations to devices? Read the next blog in our mobile AI series.

Helpful resources

PyTorch blog on quantization

CoreML quantization article