[Paper Reading] Dpp-net: Device-aware progressive search for pareto-optimal neural architectures
Problem Definition
Optimize for both device-related (e.g., inference time and memory usage) and device-agnostic (e.g., accuracy and model size) objectives for neural architecture search (NAS)
Introduction
- Approaches for NAS can mainly be categorized into two branches: those based on Reinforcement Learning (RL) and those based on Genetic Algorithms (GA)
- Searching for neural architectures optimized for multiple objectives has posed an even more significant challenge. To this end, new architectures leveraging novel operations have been developed to achieve higher computing efficiency than conventional convolution
- How to automatically search for network architectures that jointly achieve high accuracy and other objectives (e.g., inference time, model size) conforming to device-related constraints remains a critical yet less-addressed question
Contributions
- Propose DPP-Net: Device-aware Progressive Search for Pareto-optimal Neural Architectures given multiple device-related (e.g., inference time and memory usage) and device-agnostic (e.g., accuracy and model size) objectives
- DPP-Net is an efficient search algorithm that finds network architectures on the Pareto front (Fig. 1) of the multi-objective space, exploring the trade-offs among these objectives
Related Work
RL-based approach
- NAS uses the REINFORCE algorithm to learn a “controller” RNN that generates a sequence of actions representing the architecture of a CNN
- NASNet further improves NAS by replacing REINFORCE with Proximal Policy Optimization (PPO) and by searching for the architecture of a “block” that is repeatedly stacked to form the complete model
- Other improvements
GA-based approach
Other approaches
- Monte Carlo Tree Search (MCTS)
- Sequential Model-based Optimization (SMBO)
- …
Architecture search with multiple objectives
Handcrafted models with multiple objectives
- ShuffleNet
- CondenseNet
Method
Search Architecture
- The overall architecture (e.g., how many cells are connected, initial output feature map size, growth rate) is fixed before searching; the only component to search is the cell structure. This follows the heuristic of searching for a “block”
- Each cell to be searched consists of multiple layers of two types — normalization (Norm) and convolutional (Conv) layers
- Progressively add layers following the Norm-Conv-Norm-Conv order (Fig.3(a)-Right).
- The operations available for Norm (yellow boxes) and Conv (green boxes) layers are drawn from two fixed operation sets (listed in the paper)
Search Space
- The search space covers well-designed efficient operations (e.g., Depth-wise Convolution, Learned Group Convolution) to take advantage of empirical knowledge from designing efficient CNNs
- This not only ensures the robustness and efficiency of the searched architectures but also reduces the training time of each searched model, and therefore the search time as well
- The number of operations in the Norm set is 3 and the number of operations in the Conv set is 6
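To make this concrete, below is a minimal sketch (not the paper's code) that encodes a cell as a sequence of (Norm, Conv) choices and enumerates every possible one-pair extension, mirroring the progressive Norm-Conv growth described above; the operation names are placeholders rather than the paper's actual operation sets.

```python
from itertools import product

# Placeholder operation sets: the paper uses 3 Norm ops and 6 Conv ops,
# but the names here are hypothetical.
NORM_OPS = ["norm_op_1", "norm_op_2", "norm_op_3"]
CONV_OPS = ["conv_op_1", "conv_op_2", "conv_op_3",
            "conv_op_4", "conv_op_5", "conv_op_6"]

def expand_cell(cell):
    """Given a cell encoded as a list of (norm, conv) pairs, return every
    cell obtained by appending one more Norm-Conv pair."""
    return [cell + [(n, c)] for n, c in product(NORM_OPS, CONV_OPS)]

# Grow cells progressively from an empty cell, two Norm-Conv pairs deep.
cells = [[]]
for _ in range(2):
    cells = [child for cell in cells for child in expand_cell(cell)]
print(len(cells))  # 324 = (3 * 6) ** 2 possible two-pair cells
```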
Search Algorithm
- Adopt the Sequential Model-Based Optimization (SMBO) algorithm to navigate through the search space efficiently
Three steps:
- Train and Mutation: train K ℓ-layer models and acquire their accuracies after N epochs. Meanwhile, mutate each ℓ-layer model into (ℓ+1)-layer models by exploring all possible combinations of the added layer.
- Update and Inference: the Train and Mutation step generates a large number of candidate models, usually far more than can be trained directly. A surrogate function is therefore used to predict a network's accuracy from its architecture. The surrogate is updated with the evaluated accuracies (outputs) and architectures (inputs) of the K ℓ-layer models from the Train and Mutation step, and is then used to predict the accuracies of the mutated (ℓ+1)-layer models. Using a surrogate avoids the time-consuming training needed to obtain a network's true accuracy, at the cost of a small regression error.
- Model Selection: considers not only the accuracy of the models but also device-aware characteristics, including QoS (Quality of Service) and hardware requirements (e.g., memory size), which are critical metrics on mobile and embedded devices. Multiple hard constraints µ and soft constraints ξ are set. A hard constraint µ is the minimal requirement of a model; any model that fails a hard constraint is removed from the candidate list. A soft constraint ξ is treated as one of the objectives to be optimized, and the surviving candidates are eventually selected using Pareto-optimality selection (a sketch of one full iteration is given below).
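A minimal sketch of one search iteration under the three steps above, assuming hypothetical helpers `train_and_evaluate`, `update_surrogate`, `predict_accuracy`, and `pareto_select` (none of these names come from the paper), and reusing `expand_cell` from the search-space sketch earlier:

```python
def search_step(top_cells, K, hard_constraints):
    """One progressive step: grow l-layer cells into (l+1)-layer candidates."""
    # Train and Mutation: train the K current cells for N epochs each and
    # record their measured accuracies.
    measured = [(cell, train_and_evaluate(cell)) for cell in top_cells]
    candidates = [child for cell in top_cells for child in expand_cell(cell)]

    # Update and Inference: fit the surrogate on the measured
    # (architecture, accuracy) pairs, then score the untrained candidates.
    update_surrogate(measured)
    scored = [(cell, predict_accuracy(cell)) for cell in candidates]

    # Model Selection: drop candidates violating any hard constraint mu
    # (e.g., a memory budget), then keep the Pareto front over predicted
    # accuracy and the soft objectives xi (e.g., inference time),
    # truncated back to K cells for the next iteration.
    feasible = [(cell, acc) for cell, acc in scored
                if all(meets(cell) for meets in hard_constraints)]
    return pareto_select(feasible, k=K)
```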
Pareto Optimality
With Pareto optimization there typically exists a set of optimal solutions rather than a single one. A solution is Pareto optimal if none of its objectives can be improved without worsening at least one other objective, and the set of Pareto-optimal solutions forms the Pareto front
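As an illustration of this definition, a small sketch that computes the Pareto front of a candidate set, where each candidate is a tuple of objective values and all objectives are assumed to be minimized (accuracy would be negated or replaced by error rate):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Example objectives: (error rate, inference time in ms)
candidates = [(0.08, 12.0), (0.06, 20.0), (0.07, 15.0), (0.09, 11.0), (0.07, 25.0)]
print(pareto_front(candidates))
# [(0.08, 12.0), (0.06, 20.0), (0.07, 15.0), (0.09, 11.0)]; (0.07, 25.0) is dominated
```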
Surrogate Function
- The surrogate function needs to learn efficiently from a few data points and handle variable-sized inputs (models with different numbers of layers).
- Choose a Recurrent Neural Network (RNN)
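A minimal PyTorch-style sketch of such a surrogate, assuming each layer's operation is embedded and the final LSTM hidden state is regressed to an accuracy; the layer sizes and the choice of an LSTM cell are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AccuracySurrogate(nn.Module):
    """Predict accuracy from a variable-length sequence of operation ids."""
    def __init__(self, num_ops, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_ops, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, op_ids):            # op_ids: (batch, num_layers)
        x = self.embed(op_ids)            # (batch, num_layers, embed_dim)
        _, (h, _) = self.rnn(x)           # h: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)  # accuracy in [0, 1]

# Cells with different numbers of layers simply yield sequences of
# different length, which the RNN handles without architectural changes.
surrogate = AccuracySurrogate(num_ops=9)        # 3 Norm ops + 6 Conv ops
pred = surrogate(torch.tensor([[0, 3, 1, 4]]))  # one 4-layer cell
```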
Experiments and Results
Dataset: search architectures on CIFAR-10, and then test on ImageNet
Conclusion
This paper proposed DPP-Net, the first device-aware neural architecture search approach. The searched architectures outperform state-of-the-art handcrafted mobile CNNs