Abstract
We propose a new building block, IdleBlock, which naturally prunesconnections within the block. To fully utilize the IdleBlock we break thetradition of monotonic design in stateoftheart networks, and introducinghybrid composition with IdleBlock. We study hybrid composition on MobileNet v3and EfficientNetB0, two of the most efficient networks. Without any neuralarchitecture search, the deeper "MobileNet v3" with hybrid composition designsurpasses possibly all stateoftheart image recognition network designed byhuman experts or neural architecture search algorithms. Similarly, thehybridized EfficientNetB0 networks are more efficient than previousstateoftheart networks with similar computation budgets. These resultssuggest a new simpler and more efficient direction for network design andneural architecture search.
Quick Read (beta)
Hybrid Composition with IdleBlock: More Efficient Networks for Image Recognition
Abstract
We propose a new building block, IdleBlock, which naturally prunes connections within the block. To fully utilize the IdleBlock we break the tradition of monotonic design in stateoftheart networks, and introducing hybrid composition with IdleBlock. We study hybrid composition on MobileNet v3 [10] and EfficientNetB0 [24], two of the most efficient networks. Without any neural architecture search, the deeper “MobileNet v3” with hybrid composition design surpasses possibly all stateoftheart image recognition network designed by human experts or neural architecture search algorithms. Similarly, the hybridized EfficientNetB0 networks are more efficient than previous stateoftheart networks with similar computation budgets. These results suggest a new simpler and more efficient direction for network design and neural architecture search.
1 Introduction
Convolutional Neural Networks (CNN) have dominated computer vision tasks in recent years. Since AlexNet [14], the computer vision community has sought improved CNN designs to make the backbone network more powerful and more efficient. Remarkable single branch backbones include Network in Network [16], VGGNet [22], ResNet [8], DenseNet [12], ResNext [27], MobileNet v1/v2/v3 [9, 21, 10], and ShuffleNet v1/v2 [30, 19]. In recent years, backbone networks with multiple resolutions have also attracted attention from the research community. To learn with multiple resolutions, researchers design complex connections inside a block to handle information exchanges of different resolutions. Some efficient examples of this approach include the MultiGridConv [13], OctaveConv [5], and HRNet [23]. All these contributions have significantly developed the philosophy of backbone network design.
To design more efficient CNNs, there are two mainstream efforts: Neural Architecture Search (NAS) and Network Pruning (NP). The motivation of neural architecture search is: given a constraint on computation resources, NAS attempts to automatically determine the best network connections, block designs and hyperparameters. Hyperparameter searching is a classic topic in machine learning, and in our paper we refer NAS as specially to searching over the connections and block design of neural network. The motivation of network pruning is: given a pretrained network, an automatic algorithm is able to remove unimportant connections, to reduce computation and parameters.
In contrast to NAS over connections and NP, EfficientNet [24] provides a joint hyperparameter for a backbone: the depth scaling factor $d$, width scaling factor $w$ and input resolution scaling factor $r$, which is referred as compound scaling factors. Based on a variant of MobileNet v3, these jointlysearched scaling factors make the EfficientNet family 5$\times $ to 10$\times $ more efficient than all previous backbones in measured of computation cost (MAdds) or the number of parameters.
In this work, We revisit the modern workflow of obtaining efficient convolutional networks, and partition the problem into two separate steps. In the first step, we design a network architecture, and in the second step, we prune connections from the network.
In the first step, we note a common pattern in both human expertdesigned architectures and searched architectures: for each backbone, the architecture is characterised by the design of a normal block and a reduction block. At the beginning of each stage, we insert a reduction block, and repeatedly stack the normal block. We repeat each stage multiple times, and for each stage we may have different number of normal blocks. We call this design pattern Monotonous Design (Figure 3). For example, ResNet monotonically repeats a Bottleneck block, ShuffleNet monotonically repeats a ShuffleBlock, MobileNet v2/v3 and EfficientNet monotonically repeats and Inverted Residual Block (MBBlock), NASNet repeats a Normal Cell, and FBNet repeats a variant of MBBlock with different hyperparameters. For all the stateofart networks, to our best knowledge, the blocks are guaranteed to have full information exchange. In the second step, some connections are pruned, and it is not be guaranteed that each block has full information exchange.
In this paper, we seek to design a more efficient network for image recognition tasks, by considering pruning in the network design step. We create a new block design methodology: Idle. In the Idle design, a subspace of the input is not transformed: it is simply “idle” and passed directly to the output (Figure 2). We also break the monotonous design constraint in stateoftheart architectures. We call our nonmonotonous composition method Hybrid Composition (HC) (Figure 4).
The results are initially as expected — if we monotonically compose a network with IdleBlock, we obtain a pruned network with acceptable accuracy loss. If we use hybrid composition with IdleBlock (and MBBlock), we significantly reduce the accuracy loss while substantially saving computation. The surprising result is this: leveraging the pruned computation budget from IdleBlock to go deeper with hybrid composition achieves new stateofart network architectures — without complex multiresolution design or neural architecture search.
The paper is organized as follows. In Section 2, We discuss modern convolutional network block design, and introduce the Idle design and IdleBlock. In Section 3, we introduce Hybrid Composition (HC) with IdleBlock. In Section 4 we study applying HC with IdleBlock to MobileNet v3 and EfficientNetB0, which leads to new stateoftheart network architectures under equivalent computation budgets. We also conduct ablation studies on our design choice. Section 5 contains conclusion and future work.
2 Idle & IdleBlock: Motivation and Discussion
In this section, we will revisit key milestones in convolution building block design. We will then introduce our Idle design and Idled version of MBBlock: IdleBlock.
2.1 Bottleneck Block & Inverted Residual Block
The Inverted Residual Block (MBBlock) (Figure 6) is the fundamental building block behind success of MobileNet v2/v3 and EfficientNet. In MBBlock, a pointwise convolution will expand input dimension by a factor of $r$, apply a depthwise convolution to the expanded features, then apply another pointwise convolution to squeeze the high dimension features into a lower dimension (usually the same as input to allow residual addition). The key difference between the Bottleneck block [8, 9] design (Figure 5 and MBBlock is — Bottleneck applies a spatial transformation ($k\times k$ convolution or $k\times k$ depthwise convolution) on a narrowed feature map, and MBBlock applies a spatial transformation on an expanded feature map. The Bottleneck design is proposed in ResNet, where $3\times 3$ convolutions are the most computationally intensive component. With $3\times 3$ convolutions, narrowed feature maps significantly reduce computation cost. However, when depthwise convolutions are used, the spatial transformation is no longer a computation bottleneck. A $k\times k$ depthwise convolution corresponds to running a linear regression with ${k}^{2}$ input elements repeatly in each channel. Intuitively, with more channels, the spatial transformation capacity will be higher. Empirically, with similar computational budgets (MAdds), MobileNet v1 in Bottleneck design achieves 70.6% Top 1 accuracy on ImageNet [6], while MobileNet v2 (1.4) achieves 74.7% Top 1 accuracy [21]. The accuracy gap between MobileNet v1 and v2 indicates that for depthwise convolutions, an expanded feature map is helpful to improve a network’s expression power.
2.2 ShuffleBlock v1/v2
ShuffleBlock v1/v2 are depthwise convolution blocks that heavily rely on the channel shuffle operation — a lightweight information exchange operation.
ShuffleBlock v1 (Figure 7) is inherited from Bottleneck block. A grouped pointwise convolution is used to replace the first pointwise convolution in Bottleneck block to narrow the channel dimension. A channel shuffle operator is followed by grouped pointwise convolution operator to provide information exchange between groups. The block topology is the same as the Bottleneck block.
ShuffleBlock v2 (Figure 8) is a following work of ShuffleBlock v1. Similar to Bottleneck block and ShuffleBlock v1, ShuffleBlock v2’s input and output are expanded feature maps. To further reduce computation cost, the group pointwise convolution which is used for narrowing feature map is removed. Instead, expanded feature map is split into two equal channel narrow feature maps: one narrowed feature map is transformed with a special Bottleneck block without internal dimension change, the other feature map keeps the same until it is shuffled with transformed feature map.
However, the channel shuffle operation is not efficiently realizable on many heterogeneous hardware, such as new neural network accelerators. For example on iPhone equipped with Apple Neural Engine (ANE) using CoreML, networks without channel shuffle operations run up to 10 times slower than equivalent networks without channel shuffles.
2.3 ”Idle” Design
When we think about design principles for an efficient block structure, attempting to reduce computation (MAdds) is an obvious direction. For example, to reduce computation in a $k\times k$ spatial convolution, depthwise convolutions can be used to replace the original convolution operator (as in MBBlock).
When we optimize directly for modern hardware platforms, there are more challenges than simply reducing MAdds. For example, posttraining pruning will reduce computation, but without very specialized implementations and high levels of sparsity, it is hard to realize these computational improvements in practice. We seek to introduce a structured pruned topology into our block design, which allows us to achieve realworld speed ups. With this insight, we create a new design pattern: Idle, which aims to directly pass a subspace of the input to the output tensor, without any transformations. Figure 1 demonstrates motivation of Idle and network pruning. In Idle design, we introduce an “idle” factor $\alpha \in (0,1)$, which is also can be viewed as pruning factor. Given an input tensor $x$ with $C$ channels, the tensor will be sliced to two branches: one “active” branch ${x}_{1}$ containing $C\cdot (1\alpha )$ channels, and outputting a tensor ${y}_{1}$ with $C\cdot (1\alpha )$ channels; and the other “idle” branch ${x}_{2}$ contains $C\cdot \alpha $ channels, copied directly to the output tensor $y$ with $C$ channels directly.
Contrasted with Residual Connections
With residual connections, the input tensor $x$ is added to the output tensor. In the Idle design, $C\cdot \alpha $ channels of input tensor $x$ will be copied into the output tensor directly, without any elementwise addition.
Contrasted with Dense Connections
In a densely connected block, the entire input $x$ is part of the output $y$, and entire input $x$ in used in the block transformation. In Idle design, only $C\cdot \alpha $ channels of $x$ are copied into the output tensor directly, and the other $C\cdot (1\alpha )$ channels are used in the block transformation.
Contrasted with ShuffleBlock v2
The Idle design is closely related to ShuffleBlock v2 but with a few differences. First, ShuffleBlock v2 requires expanded feature map as input and outputs an expanded feature map; Idle is a design pattern doesn’t require on input/output dimension. Second, in ShuffleBlock v2, all information will be exchanged by using channel shuffle operator; In Idle design, we intend to don’t apply any change for a subspace of input. Third, Idle design’s motivation is to design a prepruned structure, ShuffleBlock v2 aims to obtain a narrowed feature map for spatial transformation. We can view ShuffleBlock v2 in a perspective of Idled Bottleneck block with extra channel shuffle.
2.4 IdleBlock
We summarize a few intuitive and empirical lessons we learned from ShuffleBlock v1/v2 and MBBlock:

1.
We need to apply the depthwise convolution on an expanded feature map. (MobileNet v1 vs MobileNet v2)

2.
Grouped convolutions are not necessary. (ShuffleNet v1 vs ShuffleNet v2)

3.
The channel shuffle operation is not friendly to various accelerators, and should be avoided.
Based on these lessons, we introduce IdleBlock, an Idled version of MBBlock.
There are two variants of IdleBlock. If we use the concatenation function $\text{concat}({y}_{1},{x}_{2})$ when constructing the output tensor from the two splits, we refer to the it as an LIdleBlock (Figure 9), and if we use the concatenation function $\text{concat}({x}_{2},{y}_{1})$, we refer to it as an RIdleBlock. If an information exchange block is strictly followed by an IdleBlock, the LIdleBlock and RIdleBlock are equivalent. When stacking two or more IdleBlocks, a mix of L/RIdleBlock units is different to a monotonic composition of L/RIdleBlock units.
A key distinction of mixed composition is the enhanced receptive field of the stacked output. Using two stacked IdleBlocks as example. Given inputs with receptive field ${R}_{0}$, if we monotonically stack two RIdleBlocks, the first output branch’s receptive field will remain as ${R}_{0}$ as the first branch is idle, while the second output branch’s receptive field is ${R}_{0}\cdot {k}^{2}$ as the second branch of input is updated by two $k$ by $k$ depthwise convolution operations. If we mix an L and RIdleBlock, L/RIdleBlock, both of the output branches will be updated with one $k$ by $k$ depthwise convolution operator, which will change the receptive field of both output branches to ${R}_{0}\cdot k$.
Given an MBBlock with input tensor shape $(1,C,H,W)$, expansion factor $r$, stride $s$, and depthwise convolution kernel $k$, the theoretical computation complexity is
$\sum}($  $2\cdot r\cdot {C}^{2}\cdot HW,$  //1x1 expand  
$2\cdot r\cdot C\cdot ({k}^{2}\cdot {\displaystyle \frac{HW}{{s}^{2}}}),$  //k by k depthwise  (1)  
$2\cdot r\cdot {C}^{2}\cdot {\displaystyle \frac{HW}{{s}^{2}}})$  //1x1 squeeze 
For IdleBlock, the theoretical computation complexity is
$\sum}($  $2\cdot r\cdot {C}^{2}\cdot (1\alpha )\cdot HW,$  //1x1 expand  
$2\cdot r\cdot C\cdot ({k}^{2}\cdot {\displaystyle \frac{HW}{{s}^{2}}}),$  //k by k depthwise  (2)  
$2\cdot r\cdot {C}^{2}\cdot (1\alpha )\cdot {\displaystyle \frac{HW}{{s}^{2}}})$  //1x1 squeeze 
By replacing a MBBlock with an equivalent IdleBlock, we save:
$2\cdot \alpha \cdot r\cdot {C}^{2}\cdot (HW+{\displaystyle \frac{HW}{{s}^{2}}})$  (3) 
3 Hybrid Composition Networks
In the previous section we introduced the IdleBlock. In this section we introduce Hybrid Composition (HC), a novel nonmonotonic network composition method.
In hybrid composition, at each stage of the network, we use a nonmonotonic composition of multiple types building blocks. This is only possible if the different blocks have identical constraints on the input and output dimensions. For example, ShuffleBlock v1/v2 can be hybridized with a Bottleneck block, because both of the blocks accept a narrow input and output an expanded tensor. A Bottleneck block is not able to be hybridized with MBBlock, because these two blocks contains different bottleneck designs and the input and output dimensions are very different. Hybrid composition attempts to jointly utilize the different properties of multiple building blocks, which monotonically design is not able to do.
For a monotonic MBBlock network, such as MobileNet v2/v3 and EfficientNet, the first pointwise convolution operator in MBBlock is only used to expand the input dimension for the depthwise operation.
In our case with IdleBlock, both IdleBlock and MBBlock satisfy the input and output constraints for hybrid composition. Moreover, once we hybridize IdleBlock with MBBlock, the first pointwise convolution operator in MBBlock will be able to help us exchange information of two branches in IdleBlock, without an explicit channel shuffle operation as in ShuffleBlock.
However, hybrid composition introduces another challenge. If a network stage contains $n$ MBBlock units, there are ${2}^{n}$ candidate combinations of MBBlock and IdleBlock placements in the Idle network, but we seek to explore only explore a small subset of these candidates.
To solve the challenge, we explore three configurations for the hybrid composition of MBBlock with IdleBlock: Maximum, None and Adjacent.
Maximum configuration
We only use MBBlock as reduction block, or when the network changes the output dimension. All other MBBlocks will be replaced by IdleBlock. This is the computational lower bound of hybrid composition with IdleBlock, as we need at least one MBBlock unit to exchange information in each stage.
None configuration
In this configuration, none of any blocks in the stage will be replaced with IdleBlock. This is the the computational upper bound of hybrid composition with IdleBlock.
Adjacent configuration
Adjacent configuration is a greedy strategy. We iteratively replace each MBBlock that has an MBBlock input with an IdleBlock, stopping when we reach our specified computational budget.
When we set $\alpha =0.5$ in the IdleBlock during the repeated stage (stride=1), we notice that according to Equation 3, the theoretical computation cost of one MBBlock is roughly equal to cost of two IdleBlocks. This means that in the Adjacent configuration, to keep same computation complexity, we can replace one MBBlock with two IdleBlocks — resulting in an approximately 30% increase in overall network depth with same computation budget. We denote this special case as ”Adjacent +1 IdleBlock” configuration.
Intuitively, a neural architecture search over the space of block placement policies should fall (in terms of accuracy and computation cost) between the lower bound and upper bounds described here (Maximum and None).
4 Experiments
We present experimental results on ImageNet 2012 classification dataset [6] to demonstrate the effectiveness of hybrid composition with IdleBlocks. We set pruning factor $\alpha $ to 0.5 for all our experiments. We focus on studying MobileNet v3 and EfficientNetB0, two stateofart efficient image recognition architectures. We report execution times on Intel Xeon CPUs and ARM Cortex CPUs. We also conduct ablation studies to validate our design decisions. In this section, we use ”Top1” to refer ”Top1 Accuracy” on ImageNet.
4.1 Training Setup
We use MXNet [2] and GluonCV [7] for both single node training and distributed training. Our source code and pretrained models can be found at: URL.
Single Node
We provide MobileNet v3 experiment results on single node with 8 NVIDIA V100 GPU. We use fp16/32 mixedprecision during training. The batch size is set to 256 for each GPU (2048 in total). We use a Nesterov optimizer with learning rate set to 2.6 and weight decay set to 3.0e5 for all models. All models are trained with 360 epoches. We warm up training in the first 5 epochs.
Distributed
We provide both MobileNet v3 and EfficientNetB0 experiment results in distributed setting. All models are trained with 8 nodes, contains 64 V100 in total. Without special note, batch size is set to 128 per GPU. We use LBSGD [15] to optimize all models, with initial learning rate 0.8, momentum 0.9, L2 weight decay 3.0e5. All models are trained with 80 epochs (equivalent to 640 epoches on single node). We warm up training in the first 5 epoches. All distributed models are trained with Mixup [29] augmentation.
4.2 Inference Setup
We evaluate the inference time on Google Cloud Engine in single threaded and multithreaded (4 threads) settings, on an Intel SkyLake CPU @ 2.0GHz. For ARM Cortex CPUs, we evaluate on an ARM CortexA53 @ 600MHz. All models are compiled with TVM [3] and AutoTVM [4], which provides substantial latency improvements on these architectures.
4.3 Result on MobileNet v3
Config  MAdds  Params  #MBBlock  #IdleBlock  Top1  ARM  x86 

(M)  (M)  (%)  1T/4T (ms)  1T/4T (ms)  
None (baseline)  219.4  6.44  15  0  75.20/75.11${}^{\star}$  780/230  20.1/10.1 
\hdashlineMaximum  168.5  5.51  5  10  74.67/74.42${}^{\star}$  685/201  19.5/10.3 
Adjacent  185.9  5.64  10  5  75.03/74.86${}^{\star}$  711/207  18.9/9.9 
Adjacent + 1 IdleBlock  224.4  6.06  10  10  75.58/75.82${}^{\star}$  844/246  22.0/11.9 
Adjacent + 1 IdleBlock L/R  224.4  6.06  10  10  75.66/76.04${}^{\mathrm{\star}}$  844/245  22.2/11.8 
Model  MAdds (M)  Params (M)  Top1 

MobileNet v3 [10]  219  5.4  75.2 
FBNetA [26]  249  4.3  73.0 
FBNetB [26]  295  4.5  74.1 
MobileNet v2 [21]  300  3.4  72.0 
Proxyless [1]  320  4.1  74.6 
FBNetC [26]  375  5.5  74.9 
EfficientNetB0 [24]  390  5.3  76.3 
NASNetB [31]  488  5.3  72.5 
NASNetC [31]  558  4.9  74.5 
NASNetA [31]  564  5.3  74.0 
RandomWiredWS [28]  583  5.6  74.7 
AmoebaNet [20]  580  6.4  75.7 
MobileNet v2 (1.4) [21]  585  6.9  74.7 
PNAS [17]  588  5.1  74.2 
ShuffleNet v2 [19]  591  7.4  74.9 
DARTS [18]  595  4.9  73.1 
HC(M=5, I=10)  169  5.5  74.7 
HC(M=10, I=10)  224  6.0  75.7 
HC(M=15, I=10)  299  7.3  76.8 
HC(M=15, I=20)  380  8.1  77.2 
HC(M=5, I=10)${}^{\star}$  169  5.5  74.4 
HC(M=10, I=10)${}^{\star}$  224  6.0  76.0 
HC(M=15, I=10)${}^{\star}$  299  7.3  77.0 
HC(M=15, I=20)${}^{\star}$  380  8.1  77.5 
We first study hybrid composition with IdleBlock on MobileNet v3. We evaluate different hybridizing configurations (Table 1). With HC and $\alpha =0.5$ IdleBlock, if we seek to prune the network, we can prune up to 23% of the MAdds with only approximately 0.7 percentage point Top1 accuracy loss. If we seek similar MAdds, we can improve network accuracy by approximately 0.4 to 0.9 percentage point under different training settings. If we go deeper with HC, the deeper networks are potentially more efficient than all stateofart human expertdesigned network, and all stateofart searched networks (Table 2).
4.4 Result on EfficientNetB0
Config  MAdds  Params  #MBBlock  #IdleBlock  Top1  ARM  x86 

(M)  (M)  (%)  1T/4T (ms)  1T/4T (ms)  
None  385.3  5.07  16  0  77.05  1250/345  28.7/14.7 
\hdashlineMaximum  299.7  4.21  7  9  76.34  1066/298  27.2/13.8 
Adjacent  334.4  4.54  11  5  76.61  1137/316  28.3/14.6 
Adjacent + 1 IdleBlock  397.81  5.46  11  10  77.60  1373/384  32.2/17.2 
Adjacent + 1 IdleBlock L/R  397.81  5.46  11  10  77.25  1372/383  32.1/17.4 
Model  MAdds (M)  Params (M)  Top1 

EfficientNetB0 [24]  390  5.3  76.3 
EfficientNetB1 [24]  700  7.8  78.8 
EfficientNetB2 [24]  1000  9.2  79.8 
EfficientNetB3 [24]  1800  12  81.1 
ResNet50 v1${}^{\mathrm{\u25c6}}$  4100  26.0  77.3 
ResNeXt50 32x4${}^{\mathrm{\u25c6}}$  4200  25.0  79.3 
ResNeXt101 32x4${}^{\mathrm{\u25c6}}$  7800  44.3  80.3 
HRNetW32C [23]  8310  41.2  79.5 
ResNet152 v1${}^{\mathrm{\u25c6}}$  11000  60.0  79.2 
SqueezeExciteNet [11]  21000  115.1  81.3 
EfficientNetB0${}^{\star}$  385  5.1  77.0 
HC(M=16, I=14)  569  7.1  79.0 
HC(M=16, I=28)  752  9.2  79.5 
HC(M=16, I=28)${}^{\mathrm{\square}}$  1532  9.2  81.0 
Similar to the MobileNetv3 experiment, we experiment with different configurations of hybrid composition on EfficientNetB0 (Table 3). With HC and $\alpha =0.5$ IdleBlock, if we seek to prune the network, we can prune up to 22% of the MAdds with only approx 0.7 percentage point Top1 accuracy loss, which is similar to observations on MobileNet v3. If we seek similar MAdds, we can improve network accuracy by approx 0.6 percentage point. If we go deeper with HC, the deeper networks are consistently more efficient. (Table 2).
4.5 Inference Result on x86/ARM CPU
Our current implementation, based on TVM, realizes approximately onethird of the theoretical speedups obtained from our new block structure across a range of hardware. Future work may involve representing the entire block structure in the tensor expression IR, which will allow more aggressive optimizations such as completely eliminating the concatenation, splitting, and various layout transformation operations that are present in our implementation. Nevertheless, we still obtain real speed up with same depth without these specialized optimization on very different hardware platforms (Figure 10, Figure 11).
4.6 Ablation Study
4.6.1 Impact of Channel Shuffle Operator
In Section 2 we mentioned that the channel shuffle operator is not friendly to heterogeneous hardware accelerators for inference, despite its minimal theoretical computation complexity. In this experiment, we want to quantitatively understand channel shuffle operator’s contribution in network accuracy. We refer IdleBlock with channel shuffle operator as Inverted Shuffle Block (ISB). The difference between ISB and ShuffleBlock v2 is: ISB is based on MBBlock while ShuffleBlock v2 is based on a special case of Bottleneck block.
Model  Madds  Params  Top1 

M=1, ISB=14  157.8  4.8  73.8/73.7${}^{\star}$ 
M=1, Idle=14  157.8  4.8  73.8/73.8${}^{\mathrm{\star}}$ 
HC(M=10, ISB=10)  299.8  7.3  76.4/76.6${}^{\star}$ 
HC(M=10, Idle=10)  299.8  7.3  76.8/76.7${}^{\mathrm{\star}}$ 
HC(M=15. ISB=20)  380.1  8.1  77.0/77.2${}^{\star}$ 
HC(M=15, Idle=20)  380.1  8.1  77.2/77.4${}^{\mathrm{\star}}$ 
We use MobileNetv3 to study this problem. First we replace 14 out of 15 MBBlock to ISB or IdleBlock, only keep the very first MBBlock. We also tried different hybrid composition configurations in Table 2. Detailed results can be found in Table. 5. This experiment suggests that channel shuffle operators are unnecessary in our design.
4.6.2 Deeper Net with Monotonous MBBlock vs Hybrid Composition with IdleBlock
By applying hybrid composition with IdleBlock, we can save computational cost for a fixed depth. In this experiment, we want to understand the the accuracy difference between monotonous stacking with MBBlock vs hybrid composition with IdleBlock when going deeper. We study on both MobileNet v3 and EfficientNetB0.
MboileNet v3  MAdds (M)  Params (M)  Top1 (%) 
M=15, I=10  299.8  7.26  76.8/77.0${}^{\mathrm{\star}}$ 
M=25, I=0  368.4  8.85  76.9/76.7${}^{\star}$ 
\hdashlineM=15, I=20  380.1  8.09  77.2/77.5${}^{\star}$ 
M=35, I=0  517.5  11.26  77.6/77.2${}^{\star}$ 
EfficientNet B0  MAdds (M)  Params (M)  Top1 (%) 
M=16, I=14  569.0  7.12  79.0 
M=30, I=0  715.94  8.32  79.2 
\hdashlineM=16, I=28  752.2  9.18  79.5 
M=44, I=0  1046.0  11.57  79.8 
From Table 6, we find stacking more MBBlock monotonically does not always guarantee better accuracy with different training methods. Stacking MBBlock usually costs 1.3X more MAdds to achieve the same depth, compared to hybrid composition with IdleBlock. The accuracy difference is approximately 0.3 percentage points at various depth settings.
4.6.3 Impact of L/RIdleBlock
Empirically, for shallow networks, a larger receptive field will improve accuracy. We use MobileNet v3 to study this phenomena. Result is listed in Table 7. We find L/RIdleBlock placement has an impact on the final accuracy. Most of our experiments empirically validate our hypothesis, as monotonically stacked IdleBlock create a larger receptive field. In one of one settings, L/R cross placement performs better than monotonically use IdleBlock. L/RIdle block placement introduce a new hyperparameter in network design, but this hyperparameter is substantially less sensitive than other hyperparameters.
MobileNet v3  Top1  Dist. Top1 

M=1, L/RIdle=14  73.09  73.48 
M=1, Idle=14  73.82  73.78 
\hdashlineM=5, L/RIdle=10  74.23  74.42 
M=5, Idle=10  74.67  74.42 
\hdashlineM=15, L/RIdle=20  76.38  77.57 
M=15, Idle=20  77.20  77.39 
5 Conclusion and Future work
In this paper, we introduce a new network design methodology inspired by network pruning: Idle and its MBBlock version IdleBlock. In order to utilize IdleBlock efficiently, we break the tradition of monotonous design in network stage and introduce hybrid composition of IdleBlock with MBBlock. Our empirical results suggest that using the pruned computation to make the network deeper with hybrid composition is more efficient than various stateofart neural architecture search methods for small models. Hybrid composition also makes intermediate and large networks more efficient. This may point to a simpler direction for future neural architecture search directions for more efficient image recognition network. The principles of Idle design may also be worth to extending to neural networks for other tasks and domains.
References
 [1] (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: Table 2.
 [2] (2015) Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. Cited by: §4.1.
 [3] (2018) $\{$tvm$\}$: An automated endtoend optimizing compiler for deep learning. In 13th $\{$USENIX$\}$ Symposium on Operating Systems Design and Implementation ($\{$OSDI$\}$ 18), pp. 578–594. Cited by: §4.2.
 [4] (2018) Learning to optimize tensor programs. In Advances in Neural Information Processing Systems, pp. 3389–3400. Cited by: §4.2.
 [5] (2019) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. arXiv preprint arXiv:1904.05049. Cited by: §1.
 [6] (2009) Imagenet: a largescale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.1, §4.
 [7] (2019) GluonCV and gluonnlp: deep learning in computer vision and natural language processing. arXiv preprint arXiv:1907.04433. Cited by: §4.1, Table 4.
 [8] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2.1.
 [9] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §2.1.
 [10] (2019) Searching for mobilenetv3. arXiv preprint arXiv:1905.02244. Cited by: Hybrid Composition with IdleBlock: More Efficient Networks for Image Recognition, §1, Table 2.
 [11] (2018) Squeezeandexcitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: Table 1, Table 4.
 [12] (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
 [13] (2017) Multigrid neural architectures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6665–6673. Cited by: §1.
 [14] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
 [15] (2019) Dynamic minibatch sgd for elastic distributed training: learning in the limbo of resources. arXiv preprint arXiv:1904.12043. Cited by: §4.
 [16] (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §1.
 [17] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: Table 2.
 [18] (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: Table 2.
 [19] (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: §1, Table 2.
 [20] (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: Table 2.
 [21] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §1, §2.1, Table 2.
 [22] (2014) Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
 [23] (2019) Deep highresolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212. Cited by: §1, Table 4.
 [24] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: Hybrid Composition with IdleBlock: More Efficient Networks for Image Recognition, §1, Table 2, Table 4.
 [25] (2013) Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: Table 3.
 [26] (2019) Fbnet: hardwareaware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: Table 2.
 [27] (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §1.
 [28] (2019) Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569. Cited by: Table 2.
 [29] (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §4.
 [30] (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §1.
 [31] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: Table 2.