[Paper Reading] Dpp-net: Device-aware progressive search for pareto-optimal neural architectures
Problem Definition
Optimize for both device-related (e.g., inference time and memory usage) and device-agnostic (e.g., accuracy and model size) objectives for neural architecture search (NAS)
Introduction
- Approaches for NAS can mainly be categorized into two branches: those based on Reinforcement Learning (RL) and those based on Genetic Algorithms (GA)
- Searching for neural architectures optimized for multiple objectives has posed an even more significant challenge. To this end, new architectures leveraging novel operations have been developed to achieve higher computing efficiency than conventional convolution
- How to automatically search for network architectures that jointly achieve high accuracy and other objectives (e.g., inference time, model size) conforming to device-related constraints remains a critical yet less-addressed question
Contributions
- Propose DPP-Net: Device-aware Progressive Search for Pareto-optimal Neural Architectures given multiple device-related (e.g., inference time and memory usage) and device-agnostic (e.g., accuracy and model size) objectives
- DPP-Net is an efficient search algorithm that finds network architectures on the Pareto front (Fig. 1) of the multi-objective space, exploring the trade-offs among these objectives
Related Work
RL-based approach
- NAS uses the REINFORCE algorithm to learn a “controller” RNN that generates a sequence of actions representing the architecture of a CNN
- NASNet further improves NAS by replacing REINFORCE with Proximal Policy Optimization (PPO) and by searching for the architecture of a “block” that is repeatedly stacked to form the complete model
- Other improvements
GA-based approach
Other approaches
- Monte Carlo Tree Search (MCTS)
- Sequential Model-based Optimization (SMBO)
- …
Architecture search with multiple objectives
Handcrafted models with multiple objectives
- ShuffleNet
- CondenseNet
Method
Search Architecture
- The overall architecture (e.g., how many cells are connected, initial output feature map size, growth rate) is fixed before searching; the only component to search is the cell structure. This follows the heuristic of searching for a “block”
- Each cell to be searched consists of multiple layers of two types — normalization (Norm) and convolutional (Conv) layers
- Progressively add layers following the Norm-Conv-Norm-Conv order (Fig.3(a)-Right).
- The operations available for Norm (yellow boxes) and Conv (green boxes) layers are drawn from two fixed operation sets (listed in the paper)
Search Space
- The search space covers well-designed efficient operations (e.g., Depth-wise Convolution, Learned Group Convolution) to take advantage of empirical knowledge from designing efficient CNNs
- This not only ensures the robustness and efficiency of the searched architectures but also reduces the training time of each searched model, and therefore the search time as well
- The number of operations in the Norm set is 3 and the number of operations in the Conv set is 6
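To make this concrete, below is a minimal sketch (not the paper's code) that encodes a cell as a sequence of (Norm, Conv) choices and enumerates every possible one-pair extension, mirroring the progressive Norm-Conv growth described above; the operation names are placeholders rather than the paper's actual operation sets.

```python
from itertools import product

# Placeholder operation sets: the paper uses 3 Norm ops and 6 Conv ops,
# but the names here are hypothetical.
NORM_OPS = ["norm_op_1", "norm_op_2", "norm_op_3"]
CONV_OPS = ["conv_op_1", "conv_op_2", "conv_op_3",
            "conv_op_4", "conv_op_5", "conv_op_6"]

def expand_cell(cell):
    """Given a cell encoded as a list of (norm, conv) pairs, return every
    cell obtained by appending one more Norm-Conv pair."""
    return [cell + [(n, c)] for n, c in product(NORM_OPS, CONV_OPS)]

# Grow cells progressively from an empty cell, two Norm-Conv pairs deep.
cells = [[]]
for _ in range(2):
    cells = [child for cell in cells for child in expand_cell(cell)]
print(len(cells))  # 324 = (3 * 6) ** 2 possible two-pair cells
```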
Search Algorithm
- Adopt the Sequential Model-Based Optimization (SMBO) algorithm to navigate through the search space efficiently
Three steps:
- Train and Mutation: train K ℓ-layer models and acquire their accuracies after N epochs. Meanwhile, mutate each ℓ-layer model into (ℓ+1)-layer models by exploring all possible combinations of the added layer.
- Update and Inference: the Train and Mutation step generates a large number of candidate models, usually far more than can be trained directly. A surrogate function is therefore used to predict a network's accuracy from its architecture. The surrogate is updated with the evaluated accuracies (outputs) and architectures (inputs) of the K ℓ-layer models from the Train and Mutation step, and is then used to predict the accuracies of the mutated (ℓ+1)-layer models. Using a surrogate avoids the time-consuming training needed to obtain a network's true accuracy, at the cost of a small regression error.
- Model Selection: considers not only the accuracy of the models but also device-aware characteristics, including QoS (Quality of Service) and hardware requirements (e.g., memory size), which are critical metrics on mobile and embedded devices. Multiple hard constraints µ and soft constraints ξ are set. A hard constraint µ is the minimal requirement of a model; any model that fails a hard constraint is removed from the candidate list. A soft constraint ξ is treated as one of the objectives to be optimized, and the surviving candidates are eventually selected using Pareto-optimality selection (a sketch of one full iteration is given below).
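A minimal sketch of one search iteration under the three steps above, assuming hypothetical helpers `train_and_evaluate`, `update_surrogate`, `predict_accuracy`, and `pareto_select` (none of these names come from the paper), and reusing `expand_cell` from the search-space sketch earlier:

```python
def search_step(top_cells, K, hard_constraints):
    """One progressive step: grow l-layer cells into (l+1)-layer candidates."""
    # Train and Mutation: train the K current cells for N epochs each and
    # record their measured accuracies.
    measured = [(cell, train_and_evaluate(cell)) for cell in top_cells]
    candidates = [child for cell in top_cells for child in expand_cell(cell)]

    # Update and Inference: fit the surrogate on the measured
    # (architecture, accuracy) pairs, then score the untrained candidates.
    update_surrogate(measured)
    scored = [(cell, predict_accuracy(cell)) for cell in candidates]

    # Model Selection: drop candidates violating any hard constraint mu
    # (e.g., a memory budget), then keep the Pareto front over predicted
    # accuracy and the soft objectives xi (e.g., inference time),
    # truncated back to K cells for the next iteration.
    feasible = [(cell, acc) for cell, acc in scored
                if all(meets(cell) for meets in hard_constraints)]
    return pareto_select(feasible, k=K)
```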
Pareto Optimality
With Pareto optimization there typically exists a set of optimal solutions rather than a single one. A solution is Pareto optimal if none of its objectives can be improved without worsening at least one other objective, and the set of Pareto-optimal solutions forms the Pareto front
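As an illustration of this definition, a small sketch that computes the Pareto front of a candidate set, where each candidate is a tuple of objective values and all objectives are assumed to be minimized (accuracy would be negated or replaced by error rate):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Example objectives: (error rate, inference time in ms)
candidates = [(0.08, 12.0), (0.06, 20.0), (0.07, 15.0), (0.09, 11.0), (0.07, 25.0)]
print(pareto_front(candidates))
# [(0.08, 12.0), (0.06, 20.0), (0.07, 15.0), (0.09, 11.0)]; (0.07, 25.0) is dominated
```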
Surrogate Function
- The surrogate function needs to learn efficiently from a few data points and handle variable-sized inputs (models with different numbers of layers).
- Choose a Recurrent Neural Network (RNN)
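A minimal PyTorch-style sketch of such a surrogate, assuming each layer's operation is embedded and the final LSTM hidden state is regressed to an accuracy; the layer sizes and the choice of an LSTM cell are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AccuracySurrogate(nn.Module):
    """Predict accuracy from a variable-length sequence of operation ids."""
    def __init__(self, num_ops, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_ops, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, op_ids):            # op_ids: (batch, num_layers)
        x = self.embed(op_ids)            # (batch, num_layers, embed_dim)
        _, (h, _) = self.rnn(x)           # h: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)  # accuracy in [0, 1]

# Cells with different numbers of layers simply yield sequences of
# different length, which the RNN handles without architectural changes.
surrogate = AccuracySurrogate(num_ops=9)        # 3 Norm ops + 6 Conv ops
pred = surrogate(torch.tensor([[0, 3, 1, 4]]))  # one 4-layer cell
```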
Experiments and Results
Dataset: search architectures on CIFAR-10, and then test on ImageNet
Conclusion
This paper proposed DPP-Net, the first device-aware neural architecture search approach. The searched architectures outperform state-of-the-art handcrafted mobile CNNs