[Paper Reading] DPP-Net: Device-Aware Progressive Search for Pareto-Optimal Neural Architectures

Ya-Liang Allen Chang
5 min read · May 9, 2019


Problem Definition

Optimize neural architecture search (NAS) for both device-related objectives (e.g., inference time and memory usage) and device-agnostic objectives (e.g., accuracy and model size)

Introduction

  • Approaches for NAS can mainly be categorized into two branches: those based on Reinforcement Learning (RL) and those based on Genetic Algorithms (GA)
  • Searching for neural architectures optimized for multiple objectives has posed an even more significant challenge. To this end, new architectures leveraging novel operations have been developed to achieve higher computing efficiency than conventional convolution
  • How to automatically search for network architectures that jointly consider high accuracy and other objectives (e.g., inference time and model size) to conform to device-related constraints remains a critical yet under-addressed question

Contributions

  • Propose DPP-Net: Device-aware Progressive Search for Pareto-optimal Neural Architectures given multiple device-related (e.g., inference time and memory usage) and device-agnostic (e.g., accuracy and model size) objectives
  • DPP-Net is an efficient search algorithm that finds network architectures on the Pareto front (Fig. 1) of the multi-objective space, exploring the trade-offs among these objectives

Related Work

RL-based approach

  • The original NAS uses the REINFORCE algorithm to train a “controller” RNN that generates a sequence of actions representing the architecture of a CNN
  • NASNet further improves NAS by replacing REINFORCE with Proximal Policy Optimization (PPO) and by searching for the architecture of a “block” that is repeatedly stacked to form a complete model
  • Other improvements

GA-based approach

Other approaches

  • Monte Carlo Tree Search (MCTS)
  • Sequential Model-based Optimization (SMBO)

Architecture search with multiple objectives

Handcrafted models with multiple objectives

  • ShuffleNet
  • CondenseNet

Method

Search Architecture

  • The overall architecture (e.g., how many cells are connected, the initial output feature map size, and the growth rate) is fixed before searching; the only component to search is the cell structure. This follows the heuristic of searching for a repeatable “block”
  • Each cell to be searched consists of multiple layers of two types: normalization (Norm) and convolutional (Conv) layers
  • Layers are added progressively following the Norm-Conv-Norm-Conv order (Fig. 3(a), right)
  • The operations available for Norm (yellow boxes in Fig. 3) and Conv (green boxes) layers come from two fixed operation sets; a sketch of the progressive growth follows this list
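
A minimal sketch of how such a cell could be represented and grown one layer at a time, assuming a cell is simply a list of (layer-kind, operation) pairs; the function name `grow_cell` and the placeholder operation names are illustrative, not from the paper's code:

```python
# Sketch only: a cell as a list of (kind, op) pairs, grown in Norm-Conv-Norm-Conv order.
def grow_cell(cell, norm_ops, conv_ops):
    """Return every one-layer extension of `cell`, alternating Norm and Conv layers."""
    next_kind = "norm" if len(cell) % 2 == 0 else "conv"   # Norm, Conv, Norm, Conv, ...
    choices = norm_ops if next_kind == "norm" else conv_ops
    return [cell + [(next_kind, op)] for op in choices]

# Example: growing an empty cell by two layers with toy operation sets.
one_layer = grow_cell([], ["norm_a"], ["conv_a", "conv_b"])                 # 1 candidate
two_layer = [c for base in one_layer
               for c in grow_cell(base, ["norm_a"], ["conv_a", "conv_b"])]  # 2 candidates
```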

Search Space

  • The search space covers well-designed efficient operations (e.g., Depth-wise Convolution, Learned Group Convolution) to take advantage of empirical knowledge from designing efficient CNNs
  • This not only ensures the robustness and efficiency of the searched architectures but also reduces the training time of each searched model, and therefore the overall search time
  • The Norm set contains 3 operations and the Conv set contains 6 (an illustrative definition follows)
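
To make those sizes concrete, here is an illustrative definition of the two sets; only the set sizes (3 and 6) and the depth-wise and learned group convolutions come from the paper, while the remaining operation names are placeholders:

```python
# Placeholder operation sets; only the sizes (3 Norm ops, 6 Conv ops) are from the paper.
NORM_OPS = ["norm_op_1", "norm_op_2", "norm_op_3"]
CONV_OPS = ["depthwise_separable_conv",   # efficient operation named in the search space
            "learned_group_conv",         # from CondenseNet, also named in the search space
            "conv_op_3", "conv_op_4", "conv_op_5", "conv_op_6"]

# Branching factor: a cell with L Norm-Conv pairs (2*L layers) has 18^L possible configurations.
L = 4
print((len(NORM_OPS) * len(CONV_OPS)) ** L)   # 18**4 = 104976
```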

Search Algorithm

  • Adopt the Sequential Model-Based Optimization (SMBO) algorithm to navigate the search space efficiently

Three steps:

  • Train and Mutation: train K ℓ-layer models and obtain their accuracies after N epochs. Meanwhile, mutate each ℓ-layer model into (ℓ+1)-layer candidates by exploring all possible one-layer extensions.
  • Update and Inference: the Train and Mutation step generates far more candidate models than can feasibly be trained and evaluated. A surrogate function is therefore used to predict a network's accuracy from its architecture. The surrogate is updated with the measured accuracies (outputs) and the architectures (inputs) of the K ℓ-layer models from the Train and Mutation step; once updated, it predicts the accuracies of the mutated (ℓ+1)-layer models. Using a surrogate avoids the time-consuming training needed to obtain a network's true accuracy, at the cost of a slight regression error.
  • Model Selection: considers not only the accuracy of the models but also device-aware characteristics, including QoS (Quality of Service) and hardware requirements (e.g., memory size), which are critical metrics on mobile and embedded devices. Multiple hard constraints µ and soft constraints ξ are set. A hard constraint µ is the minimal requirement a model must meet; any model that violates it is removed from the candidate list. A soft constraint ξ, on the other hand, is treated as one of the objectives to be optimized, and the final models are selected using Pareto-optimality selection (a sketch of the full search loop follows this list).
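
Putting the three steps together, here is a minimal sketch of one round of the progressive search loop as described above. `train_briefly`, `meets_hard_constraints`, and `measure_soft_objectives` are hypothetical helpers; `grow_cell` and `pareto_front` refer to the sketches in the Search Architecture and Pareto Optimality sections, and the surrogate's `update`/`predict` interface is likewise an assumption:

```python
def search_round(cells, surrogate, K, norm_ops, conv_ops):
    """Train the current cells, mutate them, predict with the surrogate, then select."""
    # Train and Mutation: train the K current l-layer cells for N epochs ...
    trained = [(cell, train_briefly(cell)) for cell in cells]       # (cell, measured accuracy)
    # ... and enumerate every (l+1)-layer mutation of each cell.
    candidates = [m for cell, _ in trained
                    for m in grow_cell(cell, norm_ops, conv_ops)]

    # Update and Inference: fit the surrogate on the measured accuracies,
    # then predict accuracies for the (far more numerous) mutated candidates.
    surrogate.update([c for c, _ in trained], [acc for _, acc in trained])
    predictions = [surrogate.predict(c) for c in candidates]

    # Model Selection: drop candidates violating hard constraints (mu), then keep the
    # Pareto front over predicted accuracy and the soft-constraint objectives (xi).
    scored = [(cell, (pred,) + measure_soft_objectives(cell))       # all objectives: larger is better
              for cell, pred in zip(candidates, predictions)
              if meets_hard_constraints(cell)]
    return [cell for cell, _ in pareto_front(scored)][:K]           # carry at most K cells forward
```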

Pareto Optimality

With Pareto optimization, there typically exist multiple optimal solutions. A solution is Pareto optimal if none of its objectives can be improved without worsening at least one other objective, and the solutions that achieve Pareto optimality form the Pareto front (a minimal dominance check is sketched below).
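
A minimal, generic sketch of Pareto-front selection, assuming every objective has already been rescaled so that larger is better (e.g., negated inference time); this is a standard dominance check, not the paper's exact implementation:

```python
def dominates(a, b):
    """True if objective vector `a` is no worse than `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scored):
    """Keep the (item, objectives) pairs that no other candidate dominates."""
    return [(item, obj) for item, obj in scored
            if not any(dominates(other, obj) for _, other in scored if other is not obj)]

# Example with two objectives (accuracy, negative inference time in ms):
models = [("m1", (0.93, -12.0)), ("m2", (0.91, -8.0)), ("m3", (0.90, -15.0))]
print(pareto_front(models))   # m1 and m2 survive; m3 is dominated by m1
```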

Surrogate Function

  • The surrogate function must learn efficiently from only a few data points and handle variable-sized inputs (models with different numbers of layers)
  • A Recurrent Neural Network (RNN) is chosen to satisfy both requirements (a minimal sketch follows)
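
A minimal PyTorch sketch of such an RNN surrogate, assuming each layer's operation is embedded as a token and the final hidden state is regressed to a predicted accuracy; the embedding size, hidden size, and choice of an LSTM here are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class RNNSurrogate(nn.Module):
    """Predicts a model's accuracy from its variable-length sequence of layer operations."""
    def __init__(self, num_ops, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_ops, embed_dim)        # one token per operation
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)                 # regress predicted accuracy

    def forward(self, op_ids):                               # op_ids: (batch, num_layers)
        _, (h, _) = self.rnn(self.embed(op_ids))             # final hidden state summarizes the cell
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)   # accuracy in [0, 1]

# Example: predict accuracy for two 4-layer cells encoded as operation indices.
surrogate = RNNSurrogate(num_ops=9)                          # 3 Norm + 6 Conv operations
cells = torch.tensor([[0, 3, 1, 5], [2, 4, 0, 8]])
print(surrogate(cells))                                      # untrained, so outputs are arbitrary
```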

Experiments and Results

Dataset: search architectures on CIFAR-10, then transfer the searched architectures to ImageNet for evaluation

Conclusion

This paper proposed DPP-Net, the first device-aware neural architecture search approach. It outperforms state-of-the-art handcrafted mobile CNNs.
