[Paper Reading] Model Compression and Acceleration for Deep Neural Networks

Ya-Liang Allen Chang
May 23, 2019

Problem Definition

A summary of the different approaches for compressing and accelerating deep neural networks, following the survey by Cheng et al. (2018).

Methods

Four categories:

  1. Parameter pruning and sharing: The parameter pruning and sharing-based methods explore the redundancy in the model parameters and try to remove the redundant and noncritical ones.
  2. Low-rank factorization: Low-rank factorization-based techniques use matrix/tensor decomposition to estimate the informative parameters of the deep convolutional neural networks (CNNs).
  3. Transferred/compact convolutional filters: The transferred/compact convolutional filters-based approaches design special structural convolutional filters to reduce the storage and computation complexity.
  4. Knowledge distillation (KD): The KD methods learn a distilled model and train a more compact neural network to reproduce the output of a larger network.

Parameter pruning and sharing

Quantization and binarization

Network quantization compresses the original network by reducing the number of bits required to represent each weight.
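To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) of 8-bit uniform quantization and XNOR-Net-style sign binarization of a weight tensor; the function names and the per-tensor scaling factor are assumptions for illustration.

```python
import numpy as np

def binarize(weights):
    """Binarize weights to {-alpha, +alpha} with alpha = mean(|W|).

    This mirrors the XNOR-Net-style trick: each full-precision weight is
    replaced by its sign times one per-tensor scaling factor, so storage
    drops to roughly 1 bit per weight.
    """
    alpha = np.abs(weights).mean()   # per-tensor scaling factor (assumed here)
    return alpha * np.sign(weights)

def quantize_uniform(weights, num_bits=8):
    """Uniformly quantize weights to 2**num_bits levels over [min, max]."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (2 ** num_bits - 1)
    codes = np.round((weights - w_min) / scale)   # integer codes to store
    return codes * scale + w_min                  # de-quantized values

w = np.random.randn(64, 128).astype(np.float32)
print(np.abs(w - quantize_uniform(w, num_bits=8)).max())  # small rounding error
print(np.unique(binarize(w)))                             # two values: -alpha, +alpha
```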

Drawbacks

  • The accuracy of such binary nets is significantly lowered when dealing with large CNNs such as GoogLeNet.
  • Existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss.

Pruning and sharing

Drawbacks

  • Pruning with l1 or l2 regularization requires more iterations to converge.
  • All pruning criteria require manual setup of sensitivity for layers, which demands fine-tuning of the parameters and could be cumbersome for some applications.
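As a concrete (hypothetical) illustration of the pruning idea, here is a minimal magnitude-based pruning sketch in NumPy; the single global threshold is my own simplification, and the manual per-layer sensitivity setup the drawbacks mention is exactly what this toy version glosses over.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights so a `sparsity` fraction is pruned.

    Real pipelines pick a separate threshold per layer (the manual
    sensitivity setup noted above); here one global quantile is used.
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold   # True = keep, False = prune
    return weights * mask, mask

w = np.random.randn(256, 256)
pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"fraction kept: {mask.mean():.2f}")   # ~0.10
```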

Designing the structural matrix

Drawbacks

  • The structural constraint will cause loss in accuracy, since the constraint might bring bias to the model.
  • Finding a proper structural matrix is difficult: there is no theoretical way to derive one.
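For intuition, here is a small NumPy sketch (my own example) of one popular structural matrix, the circulant matrix: storing only its first column gives O(n) parameters instead of O(n^2), and the matrix-vector product reduces to an FFT.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix defined by first column c with vector x.

    Circulant matrices are diagonalized by the Fourier basis, so the
    product is a circular convolution: O(n log n) time and O(n) storage
    instead of O(n^2) for a dense projection.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

n = 8
c = np.random.randn(n)   # the only parameters that need storing
x = np.random.randn(n)

# Sanity check against the explicit dense circulant matrix.
C = np.column_stack([np.roll(c, j) for j in range(n)])
print(np.allclose(C @ x, circulant_matvec(c, x)))   # True
```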

Low-rank factorization and sparsity

Drawbacks

  • The implementation is not easy, since it involves a decomposition operation that is computationally expensive.
  • Current methods perform low-rank approximation layer by layer and thus cannot perform global parameter compression, which is important because different layers hold different information.
  • Factorization requires extensive model retraining to achieve convergence compared with the original model.
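As a toy illustration (not from the survey), truncated SVD applied to a dense layer's weight matrix shows both the parameter savings and the approximation error that the retraining mentioned above must then recover.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as U_r @ V_r with U_r (m x rank), V_r (rank x n).

    Replacing a dense layer y = W @ x with y = U_r @ (V_r @ x) cuts both
    parameters and multiply-adds from m*n to rank*(m + n); fine-tuning is
    then needed to recover accuracy.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

W = np.random.randn(1024, 512)
U_r, V_r = low_rank_factorize(W, rank=64)
print(W.size, U_r.size + V_r.size)                        # 524288 vs 98304
print(np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))  # relative error
```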

Transferred/compact convolutional filters

Drawbacks

  • These methods can achieve competitive performance for wide/flat architectures (like VGGNet) but not for narrow/special ones (like GoogLeNet and ResNet).
  • The transfer assumptions are sometimes too strong to guide the algorithm, making the results unstable on some data sets.
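Here is a minimal PyTorch sketch of the transferred-filter idea; the horizontal flip used as the fixed transform T(W) is my own choice for illustration, not the survey's. Only the base filters are learned, yet the layer emits twice the output channels.

```python
import torch
import torch.nn.functional as F

def transferred_conv(x, base_filters):
    """Convolve with base filters plus their horizontally flipped copies.

    Only `base_filters` are learned parameters; the flipped copies come
    from a fixed transform T(W), so the layer yields twice the output
    channels at the storage cost of one filter bank.
    """
    flipped = torch.flip(base_filters, dims=[3])       # T(W): mirror along width
    bank = torch.cat([base_filters, flipped], dim=0)   # double the output channels
    return F.conv2d(x, bank, padding=1)

x = torch.randn(1, 3, 32, 32)
w = torch.randn(8, 3, 3, 3)           # 8 learned base filters (out, in, kH, kW)
print(transferred_conv(x, w).shape)   # torch.Size([1, 16, 32, 32])
```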

KD

Drawbacks

  • KD can only be applied to classification tasks with a softmax loss function, which hinders its usage.
  • The model assumptions are sometimes too strict to make the performance competitive with other types of approaches.
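To show the softmax dependence concretely, here is a minimal PyTorch sketch of the classic Hinton-style distillation loss (the temperature and mixing values are illustrative assumptions, and this particular loss is the standard KD recipe rather than anything prescribed by the survey).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style knowledge distillation loss (illustrative T and alpha).

    The student matches the teacher's temperature-softened softmax outputs
    via KL divergence, mixed with ordinary cross-entropy on the hard labels.
    The reliance on softmax outputs is exactly why KD is tied to
    classification tasks, as noted in the drawbacks above.
    """
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

student_logits = torch.randn(16, 10)
teacher_logits = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))
print(distillation_loss(student_logits, teacher_logits, labels))
```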

Other types of approaches

Benchmarks, evaluation, and databases

Discussion and challenges

Technique challenges

Possible solutions

Reference

Cheng, Yu, et al. “Model compression and acceleration for deep neural networks: The principles, progress, and challenges.” IEEE Signal Processing Magazine 35.1 (2018): 126–136.
