Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks

  • 2019-11-21 20:32:04
  • Saurabh Singh, Shankar Krishnan
  • 22

Abstract

Batch Normalization (BN) is a highly successful and widely used batch dependent training method. Its use of mini-batch statistics to normalize the activations introduces dependence between samples, which can hurt the training if the mini-batch size is too small, or if the samples are correlated. Several alternatives, such as Batch Renormalization and Group Normalization (GN), have been proposed to address these issues. However, they either do not match the performance of BN for large batches, or still exhibit degradation in performance for smaller batches, or introduce artificial constraints on the model architecture. In this paper we propose the Filter Response Normalization (FRN) layer, a novel combination of a normalization and an activation function, that can be used as a drop-in replacement for other normalizations and activations. Our method operates on each activation map of each batch sample independently, eliminating the dependency on other batch samples or channels of the same sample. Our method outperforms BN and all alternatives in a variety of settings for all batch sizes. FRN layer performs $\approx 0.7-1.0\%$ better on top-1 validation accuracy than BN with large mini-batch sizes on Imagenet classification on InceptionV3 and ResnetV2-50 architectures. Further, it performs $>1\%$ better than GN on the same problem in the small mini-batch size regime. For object detection problem on COCO dataset, FRN layer outperforms all other methods by at least $0.3-0.5\%$ in all batch size regimes.

 


Saurabh Singh   Shankar Krishnan
Google Research
{saurabhsingh,skrishnan}@google.com


1 Introduction

Figure 1: Our method consistently outperforms other normalization methods, even at the largest batch size where other methods struggle in comparison to Batch Normalization (see inset). The figure reports the validation performance of ResNetV2-50 models trained using 8 GPUs with different batch sizes on ImageNet.

Batch normalization (BN) [batchnorm] is a cornerstone of current high performing deep neural network models and has been instrumental in the recent success and wide application of deep learning. One often discussed drawback of BN is its reliance on sufficiently large batch sizes [batchrenorm, groupnorm, evalnorm]. When trained with small batch sizes, as is common in many applications like object detection, BN exhibits a significant degradation in performance. The source of this issue has been attributed to the training and testing discrepancy arising from BN's reliance on stochastic mini-batches [evalnorm]. As a result, several approaches have been proposed that aim to ameliorate the issues due to stochasticity [batchrenorm, evalnorm] or offer alternatives [groupnorm, layernorm] by removing batch dependence. However, these approaches don't match the performance of BN for large batch sizes (Figure 1). Further, they either still exhibit a degradation in performance for smaller batch sizes, e.g. Batch Renormalization, or introduce constraints on the model architecture and size, e.g. Group Normalization requires the number of channels in a layer to be a multiple of an ideal group size, such as 32. In this work we propose the Filter Response Normalization (FRN) layer, consisting of a normalization and an activation function, that eliminates these shortcomings altogether. Our method does not have any batch dependence, as it operates on each activation channel (filter response) of each batch sample independently, and outperforms BN and alternatives in a wide variety of evaluation settings. For example, in Figure 1, the FRN layer outperforms other approaches by more than 1% at all batch sizes for ResNetV2-50 on ImageNet classification.

The reliance of BN on large batch sizes is prohibitive in a variety of ways. As pointed out by [groupnorm], this hinders the exploration of higher capacity models due to the significantly higher memory requirements resulting from the use of larger batch sizes. This imposes limitations on the performance of tasks that need to process larger inputs. For example, object detection and segmentation perform better with higher resolution inputs; similarly, video data inherently tends to be significantly higher dimensional. As a result, these systems are forced to trade off model capacity against the ability to train with larger batch sizes. As evidenced in Figure 1 and the experiments section, our method maintains a consistent performance across a range of batch sizes, making it a promising replacement for BN on these tasks.

The FRN layer consists of two novel components that work together to yield the high performance of our method: 1) a normalization method, referred to as Filter Response Normalization (FRN), that independently normalizes the responses of each filter for each batch sample by dividing them by the square root of their uncentered second moment, without performing any mean subtraction, and 2) a pointwise activation, termed Thresholded Linear Unit (TLU), that is parameterized by a learned rectification threshold, allowing for activations that are biased away from zero. The FRN layer outperforms BN by about 0.7-1.0% with large mini-batch sizes on Imagenet classification on InceptionV3 and ResnetV2-50 architectures. Further, it performs >1% better than Group Normalization on the same problem in the small mini-batch size regime. For the object detection problem on the COCO dataset, the FRN layer outperforms all other methods by at least 0.3-0.5% in all batch size regimes. Lastly, the FRN layer maintains a consistent performance across all the batch sizes that we tested. The proposed FRN layer does not rely on other batch elements or channels for normalization, yet outperforms BN and other alternatives for all batch sizes and in a variety of settings.

Contributions: The main contributions in this paper are the following:

  • Filter Response Normalization (FRN), a normalization method that enables models trained with per-channel normalization to achieve high accuracy.

  • The Thresholded Linear Unit (TLU), an activation function to use with FRN, resulting in a further improvement in accuracy that outperforms BN even at large batch sizes without any batch dependency. We refer to this combination as the FRN layer.

  • Several insights and practical considerations that lead to the success of the combination of FRN and TLU.

  • A detailed experimental study comparing popular normalization methods on large image classification and object detection tasks on a variety of real world architectures.

2 Related work

Normalization of training data has been known to aid in optimization. For example, whitening of inputs is a common practice for training shallow models such as Support Vector Machines and Logistic regression. Similarly, for training deep networks, normalization of inputs and intermediate representations has been recommended for efficient learning [lecun2012efficient, lecun1998efficient, glorot2010understanding]. Batch Normalization (BN) [batchnorm] aims to accelerate learning by stabilizing the intermediate feature distributions. BN normalizes each activation channel independently by using the mean and variance statistics computed for that channel over the entire mini-batch. This has been shown to accelerate learning and enable training of very deep neural network architectures. However, BN exhibits a dramatic degradation in performance when trained with smaller mini-batches [groupnorm, evalnorm]. Several approaches have been proposed to address this shortcoming, and can be grouped into two major categories: 1) Methods that reduce the train-test discrepancy in batch normalized models, 2) Sample based normalization methods that avoid batch normalization.

Methods reducing train-test discrepancy in batch normalization. [batchrenorm] notes that the discrepancy between the statistics that are used for normalization during training and testing may arise from the stochasticity due to small mini-batches and the bias due to non-iid samples. They propose Batch Renormalization (BR) to reduce this discrepancy by constraining the mini-batch moments to a specific range, limiting the variation in mini-batch statistics during training. A key benefit of this approach is that the test time evaluation scheme of a model trained with Batch Renormalization is exactly the same as that for a model trained with BN. On the other hand, Evalnorm [evalnorm] does not modify the training scheme. Instead, it proposes a correction to the normalization statistics to be used during evaluation. The major advantage of this method is that the model does not need to be retrained. However, both these methods still exhibit a degradation in performance for small mini-batches. Another approach is to engineer systems that can circumvent the issue by distributing larger batches across GPUs for tasks that require large inputs [peng2017megdet]. However, this approach requires considerable GPU infrastructure.

Methods avoiding normalization using mini-batches. Several approaches sidestep the issues encountered by BN by not relying on the stochastic mini-batch at all [layernorm, groupnorm, instancenorm]. Instead, the normalization statistics are computed from the sample itself. Layer Normalization (LN) [layernorm] computes the normalization statistics from the entire layer, i.e. using all the activation channels. In contrast, like BN, Instance Normalization (IN) [instancenorm] computes the normalization statistics for each channel independently, but only from the sample being normalized, as opposed to the entire batch, as BN does. IN was shown to be useful for style transfer applications, but was not successfully applied for recognition. Group Normalization (GN) [groupnorm] fills the middle ground between the two. It computes the normalization statistics over groups of channels. The ideal group size is experimentally determined. While GN doesn't show performance degradation for smaller batch sizes, it performs worse than BN for larger mini-batches (see Figure 1 here and Figure 1 in [groupnorm]). In addition, the size of groups required by GN imposes a constraint on the network size and architecture, as every normalized layer needs to have a number of channels that is a multiple of the ideal group size determined by GN.

Other approaches. Weight Normalization [salimans2016weight] proposes a reparameterization of the filters in terms of a direction and a scale and reports accelerated convergence. Normalization Propagation [arpit2016normalization] uses idealized moment estimates to normalize every layer. Refer to [ren2016normalizing] for a unifying view of various normalization approaches.

3 Approach

Our goal is to eliminate the batch dependency in the training of deep neural networks without sacrificing the performance gains of BN at large batch sizes. We start this section with the main details of our proposal. We will follow that with a discussion of the rationale behind our proposal.

3.1 Filter Response Normalization with Thresholded Activation

We will assume for the purpose of exposition that we are dealing with a feed-forward convolutional neural network. We follow the usual convention that the filter responses (activation maps) produced after a convolution operation are a 4D tensor $\boldsymbol{X}$ with shape $[B, W, H, C]$, where $B$ is the mini-batch size, $W, H$ are the spatial extents of the map, and $C$ is the number of filters used in the convolution. $C$ is also referred to as the number of output channels. Let $\boldsymbol{x} = \boldsymbol{X}_{b,:,:,c} \in \mathbb{R}^N$, where $N = W \times H$, be the vector of filter responses for the $c$-th filter for the $b$-th batch point. Let $\nu^2 = \sum_i x_i^2 / N$ be the mean squared norm of $\boldsymbol{x}$. Then we propose Filter Response Normalization (FRN) as follows:

$$\hat{\boldsymbol{x}} = \frac{\boldsymbol{x}}{\sqrt{\nu^2 + \epsilon}}, \tag{1}$$

where ϵ is a small positive constant to prevent division by zero errors. A few observations are in order about the normalization scheme we propose:

  • Similar to other normalization schemes, Filter Response Normalization removes the scaling effect of both the filter weights and pre-activations. This has been known [salimans2016weight] to remove noisy updates along the direction of the weights and reduce gradient covariance.

  • One of the main differences in our proposal is that we do not remove the mean prior to normalization. While mean subtraction was an important aspect of Batch Normalization, it is arbitrary and without real justification for normalization schemes that are batch independent.

  • Our normalization is done on a per-channel basis. This ensures that all filters (or rows of a weight matrix) have the same relative importance in the final model.

  • At first glance, Filter Response Normalization would appear very similar to Local Response Normalization (LRN) proposed in [Alexnet2012]. However, among other differences, LRN normalizes over adjacent channels at the same spatial location, while ours is a global normalization over the spatial extent.
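The observations above can be checked on a single channel. The following is a minimal sketch in plain Python, assuming the channel has been flattened into a list of N responses; the helper name `frn_normalize` and the sample values are illustrative and not taken from the paper.

```python
import math

def frn_normalize(x, eps=1e-6):
    """Normalize one channel of one sample as in eqn. (1): x / sqrt(nu^2 + eps)."""
    nu2 = sum(v * v for v in x) / len(x)   # mean of squared responses, no mean subtraction
    scale = 1.0 / math.sqrt(nu2 + eps)
    return [v * scale for v in x]

x = [1.0, 2.0, 3.0, 4.0]                   # illustrative filter responses (N = 4)
x_hat = frn_normalize(x)
x_hat_scaled = frn_normalize([10.0 * v for v in x])

# After FRN the mean squared value is close to 1, the output is (almost)
# unchanged when the input is rescaled, and the mean is NOT forced to zero.
mean_sq = sum(v * v for v in x_hat) / len(x_hat)
mean = sum(x_hat) / len(x_hat)
max_scale_diff = max(abs(a - b) for a, b in zip(x_hat, x_hat_scaled))
```

The nonzero `mean` is the bias away from zero that motivates the thresholded activation introduced in Section 3.1.1.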

[Figure 2 diagram: input $\boldsymbol{x}$ → FRN ($\nu^2 = \sum_i x_i^2 / N$, $y_i = \gamma \frac{x_i}{\sqrt{\nu^2 + \epsilon}} + \beta$) → $\boldsymbol{y}$ → TLU ($z_i = \max(y_i, \tau)$) → output $\boldsymbol{z}$. The FRN and TLU blocks together form the FRN Layer.]

Figure 2: A schematic of the proposed FRN Layer.

As with other schemes, we also perform an affine transform after normalization so that the network can undo the effects of the normalization:

$$\boldsymbol{y} = \gamma \hat{\boldsymbol{x}} + \beta, \tag{2}$$

where γ and β are learned parameters. The final addition to our FRN layer is the activation function.

3.1.1 Thresholded Linear Unit (TLU)

Lack of mean centering in Filter Response Normalization can lead to activations having an arbitrary bias away from zero. Such a bias, in conjunction with ReLU, can have a detrimental effect on learning and lead to poor performance and dead units. We propose to address this issue by augmenting ReLU with a learned threshold τ to yield TLU, defined as:

𝒛=max(𝒚,τ) (3)

Since max(𝒚,τ)=max(𝒚-τ,0)+τ=ReLU(𝒚-τ)+τ, the effect of TLU activation is the same as having a shared bias before and after ReLU. However, this does not appear to be identical to absorbing the biases in the previous and subsequent layers based on our experiments. We hypothesize that the form of TLU is more favorable for optimization. TLU significantly improves the performance of models using FRN (see creftypecap 5), outperforming BN and other alternatives, and leads to our method, FRN layer. Figure 2 shows the schematic for our proposed FRN layer.
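The identity used above is easy to confirm numerically. This is a small sketch in plain Python; the function names and the test values are illustrative assumptions rather than anything specified in the paper.

```python
def relu(v):
    return max(v, 0.0)

def tlu(y, tau):
    """Thresholded Linear Unit, eqn. (3): elementwise max(y, tau)."""
    return [max(v, tau) for v in y]

def tlu_via_relu(y, tau):
    """Equivalent form: shift by -tau, apply ReLU, shift back by +tau."""
    return [relu(v - tau) + tau for v in y]

y = [-2.0, -0.5, 0.0, 0.75, 3.0]           # illustrative pre-activations
same = all(
    abs(a - b) < 1e-12
    for tau in (-1.0, 0.0, 0.5)
    for a, b in zip(tlu(y, tau), tlu_via_relu(y, tau))
)
negative_threshold = tlu(y, -1.0)          # values above tau = -1 pass through unchanged
```

With `tau = 0.0` the function reduces to a plain ReLU, and with a negative threshold it lets moderately negative responses through, which is what allows activations that are biased away from zero.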

3.2 Gradients of FRN Layer

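In this section, we briefly derive expressions for the gradients that flow through the network in the presence of the FRN layer. Since all the transformations are performed channel-wise, we only derive the per-channel gradients below. Let us assume that somewhere in the network, the activations $\boldsymbol{x}$ are fed to the FRN layer and the output is $\boldsymbol{z}$ (following the transformations described in equations (1), (2), and (3)). Let $f(\boldsymbol{z})$ be the mapping that the network applies to $\boldsymbol{z}$, with gradients $\frac{\partial f}{\partial \boldsymbol{z}}$ flowing backwards. Note that the parameters $\gamma$, $\beta$ and $\tau$ are vectors of size num_channels, and so the per channel updates are scalar.

$$\frac{\partial z_i}{\partial \tau} = \begin{cases} 0, & \text{if } y_i \geq \tau \\ 1, & \text{otherwise} \end{cases} \tag{4}$$

Note that the gradients $\frac{\partial z_i}{\partial y_i}$ are just the same as above, but with the cases reversed. Then the gradient update to $\tau$ is of the form

$$\frac{\partial f}{\partial \tau} = \sum_{b=1}^{B} \left(\frac{\partial f}{\partial \boldsymbol{z}_b}\right)^{T} \frac{\partial \boldsymbol{z}_b}{\partial \tau}, \tag{5}$$

where $\boldsymbol{z}_b$ is the vector of per-channel activations of the $b$-th batch point. Gradients w.r.t. $\gamma$ and $\beta$ are as follows:

$$\left(\frac{\partial f}{\partial \gamma}, \frac{\partial f}{\partial \beta}\right) = \left(\sum_{b=1}^{B} \frac{\partial f}{\partial \boldsymbol{y}_b}^{T} \hat{\boldsymbol{x}}_b,\ \sum_{b=1}^{B} \frac{\partial f}{\partial \boldsymbol{y}_b}\right) \tag{6}$$

Using eqn. (2), we can see that $\frac{\partial f}{\partial \hat{\boldsymbol{x}}} = \gamma \frac{\partial f}{\partial \boldsymbol{y}}$. Finally, the gradients that flow back from the FRN layer can be written as

$$\frac{\partial f}{\partial \boldsymbol{x}} = \frac{1}{\sqrt{\nu^2 + \epsilon}} \left(I - \hat{\boldsymbol{x}} \hat{\boldsymbol{x}}^{T}\right) \frac{\partial f}{\partial \hat{\boldsymbol{x}}} \tag{7}$$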

We make a couple of observations about the gradients. Eqn. (5) suggests that part of the gradients that get suppressed in a regular ReLU activation are now used to update $\tau$, and in some sense are not wasted. Eqn. (7) shows that the gradients with respect to $\boldsymbol{x}$ are orthogonal to $\boldsymbol{x}$ (provided $\epsilon = 0$) because $(I - \hat{\boldsymbol{x}} \hat{\boldsymbol{x}}^{T})$ projects out the component in the direction of $\hat{\boldsymbol{x}}$. This property is not unique to our normalization, but is known to help in reducing the variance of gradients during SGD and to benefit optimization [salimans2016weight].
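The orthogonality claim can be checked without relying on the closed form: with ϵ = 0 the normalized output does not change when the input is rescaled, so the directional derivative along the input must vanish. The sketch below uses central finite differences in plain Python; the linear readout `loss` and all numeric values are hypothetical stand-ins for the downstream mapping f.

```python
import math

def frn_normalize(x, eps=0.0):
    nu2 = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(nu2 + eps)
    return [v * scale for v in x]

def loss(x, w, eps=0.0):
    # Hypothetical downstream function f: a fixed linear readout of x_hat.
    return sum(wi * xi for wi, xi in zip(w, frn_normalize(x, eps)))

def numerical_grad(x, w, eps=0.0, h=1e-6):
    """Central finite-difference estimate of d loss / d x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((loss(xp, w, eps) - loss(xm, w, eps)) / (2.0 * h))
    return grad

x = [0.5, -1.5, 2.0, 1.0]                  # illustrative activations
w = [0.3, -0.7, 1.1, 0.2]                  # illustrative readout weights
g = numerical_grad(x, w)                   # gradient with eps = 0
dot = sum(xi * gi for xi, gi in zip(x, g)) # inner product <x, df/dx>
grad_norm = math.sqrt(sum(gi * gi for gi in g))
```

Here `dot` is zero up to finite-difference error while `grad_norm` is clearly nonzero, so the gradient has no component along the input. For ϵ > 0 the same check gives a small nonzero inner product.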

3.3 Parameterizing ϵ

In our discussion so far, we have assumed that the filter responses have a large spatial extent of size $N = W \times H$. However, there are situations in real networks like InceptionV3 [inceptionv3] and VGG-A [vggnet], where some layers produce 1×1 activation maps. In this setting ($N = 1$), for a small value of ϵ, the proposed normalization in eqn. (1) turns into a sign function (see Figure 3), and has very small gradients almost everywhere. This will invariably affect the learning adversely. In contrast, higher values of ϵ lead to variants of a smoother soft sign function that are more amenable to learning. An appropriate value of ϵ becomes crucial for models that are fully connected or lead to 1×1 activation maps. Empirically, we turn ϵ into a learnable parameter (initialized at $10^{-4}$) for such models. For other models, we use a fixed constant value of $10^{-6}$. In our experiments, we show that the learnable parameterization is useful for training the InceptionV3 model, where the Auxiliary logits head produces 1×1 activation maps, and for the VGG-A [vggnet] architecture, which uses fully connected layers.

Figure 3: Effect of ϵ on normalized activations for the case of N=1. For very small values of ϵ, FRN turns into a step function while for higher values it behaves like a softsign function, allowing the gradients to flow. Having a learnable epsilon is crucial in models with fully connected layers or low-dimensional activation maps.

Since $\epsilon > 0$, we explored two alternative parameterizations to enforce this constraint: absolute value and exponential. While both trained well, the absolute value parameterization $\epsilon = 10^{-6} + |\epsilon_l|$ ($\epsilon_l$ being a learned parameter) produced consistently better empirical results. Parameterizations of this form are also preferable because the gradient magnitudes for $\epsilon_l$ are independent of the value of $\epsilon$.
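The N = 1 behaviour and the absolute value parameterization can be sketched in a few lines of plain Python. The function names are illustrative, and the larger value of the learned parameter (1e-2) is chosen only to make the smoothing visible; it is an assumption, not a value reported in the paper.

```python
import math

def frn_single_activation(x, eps):
    """FRN for a 1x1 activation map (N = 1), where nu^2 = x^2."""
    return x / math.sqrt(x * x + eps)

def epsilon_from_learned(eps_l, eps_min=1e-6):
    """Absolute value parameterization: eps = 1e-6 + |eps_l|, always positive."""
    return eps_min + abs(eps_l)

xs = [-1.0, -0.1, -0.01, 0.01, 0.1, 1.0]

# Small fixed eps: the output saturates to roughly +/-1 even for tiny inputs.
near_sign = [frn_single_activation(x, 1e-6) for x in xs]

# Larger eps from a learned parameter (illustrative value): a smoother,
# softsign-like curve that keeps small inputs small.
soft = [frn_single_activation(x, epsilon_from_learned(1e-2)) for x in xs]

# The sign of the learned parameter does not matter; eps stays positive.
eps_init = epsilon_from_learned(-1e-4)
```

In this sketch an input of 0.01 maps to about 0.995 with the small ϵ but only about 0.1 with the larger one, which is the contrast drawn in Figure 3.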

3.4 Mean Centering

Batch Normalization was proposed to counter the effects of internal covariate shift during training of a deep neural network. The solution was to keep the statistics of the distribution of activations over the data set invariant; as a practical matter, the authors chose to normalize the first and second moments of the mini-batch at each step. Batch independent alternatives that include mean centering are not justified by any particular consideration, and seem to be merely a legacy of BatchNorm.

Consider the example of Instance Normalization (IN). Using the same notation as Section 3.1, IN computes the normalized activations using the channel statistics $\mu = \sum_i x_i / N$ and $\sigma^2 = \sum_i (x_i - \mu)^2 / N$ as follows:

$$\hat{\boldsymbol{x}} = \frac{\boldsymbol{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} \tag{8}$$

As the size of the activation map decreases (as is common in the layers closer to the output which are subject to downsampling, or due to the presence of fully connected layers), IN produces zero activations. Layer and Group Normalization are ways to circumvent this issue by normalizing across (all or subset of) channels. Since individual filters are responsible for specific channel activations, normalizing across channels causes complicated interaction in the filter updates. Hence, it appears that the only principled approach is to normalize each channel of the activation map separately without resorting to mean centering. This also has a desirable effect of removing the relative scaling between filters, which has been known to greatly aid in optimization.
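The collapse of Instance Normalization on small activation maps is easiest to see in the extreme case N = 1. The following plain-Python sketch compares the two normalizations on a single-element channel; the helper names and input values are illustrative.

```python
import math

def instance_norm(x, eps=1e-6):
    """Instance Normalization of one channel, eqn. (8): (x - mu) / sqrt(var + eps)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def frn_normalize(x, eps=1e-6):
    """Filter Response Normalization of one channel, eqn. (1): no mean subtraction."""
    nu2 = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(nu2 + eps) for v in x]

single = [3.0]                              # a 1x1 activation map (N = 1)
in_out = instance_norm(single)              # mean equals the value, so the output is 0
frn_out = frn_normalize(single)             # keeps the sign, magnitude close to 1
in_out_other = instance_norm([-7.5])        # any single value collapses to 0 under IN
```

Whatever the input, IN returns zero when N = 1 because the mean equals the only element, whereas FRN still passes on the sign of the response.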

A negative impact of not performing mean centering is that activations can have a bias arbitrarily far from zero, rendering the ReLU activation less than ideal. We mitigate this issue by introducing the Thresholded Linear Unit (TLU) in Section 3.1.1. Empirically, the combination of uncentered normalization with the TLU activation outperforms BN and all other alternatives.

3.5 Implementation

FRN is easy to implement in automatic differentiation frameworks. We provide an example implementation using the Python API for Tensorflow in Listing 1.

Listing 1: Tensorflow implementation of FRN layer
import tensorflow as tf

def FRNLayer(x, tau, beta, gamma, eps=1e-6):
  # x: Input tensor of shape [BxHxWxC].
  # tau, beta, gamma: Variables of shape [1, 1, 1, C].
  # eps: A scalar constant or learnable variable.
  # Compute the mean of squared activations per channel (nu^2 in eqn. (1)).
  nu2 = tf.reduce_mean(tf.square(x), axis=[1, 2], keepdims=True)
  # Perform FRN. tf.math.rsqrt is used because it is available in both
  # TF 1.x and TF 2.x; tf.abs keeps a learnable eps positive.
  x = x * tf.math.rsqrt(nu2 + tf.abs(eps))
  # Return after applying the affine transform and the TLU non-linearity.
  return tf.maximum(gamma * x + beta, tau)

4 Experiments

We evaluate our method extensively on two tasks: 1) image classification on Imagenet, and 2) object detection on COCO. While image classification is the de facto standard for evaluation, object detection typically requires high resolution inputs and is particularly constrained by the large batch size requirements of BN. On Imagenet classification we show that our method outperforms other normalization methods on three different network architectures. Further, our method does this consistently at all batch sizes we experimented with. Finally, we validate the performance of our method on object detection, where it outperforms other normalization methods on all batch sizes as well.

4.1 Learning Rate Schedule

Since FRN does not do mean centering, we empirically found that certain architectures are more sensitive to the choice of initial learning rate. Setting a high initial rate causes large updates that lead to large activations in the early part of the training and result in a slowdown in the learning. This is due to the $\frac{1}{\sqrt{\nu^2 + \epsilon}}$ factor in the gradient $\frac{\partial f}{\partial \boldsymbol{x}}$ (see eqn. (7)). This happens more often in architectures that employ several max pooling layers, like VGG-A. We address this by using a ramp-up in the learning rate that slowly increases the learning rate from 0 to the peak value during an initial warmup phase. Since all our experiments use a cosine learning rate decay schedule, we use a cosine ramp-up schedule as well. Ramping up the learning rate in a warmup phase is quite common and frequently used in training [resnets, resnetsv2, Imagenet2017].
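A schedule of this shape can be sketched as follows. The paper does not give the exact functional form of the cosine ramp-up or the warmup length, so the half-cosine used here and the `warmup_steps` value are assumptions made for illustration; the 300K total steps and the peak rate of 0.1 at batch size 256 follow the training setup described in Section 4.2.

```python
import math

def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Cosine ramp-up from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        # Assumed ramp-up shape: half-cosine rising from 0 to peak_lr.
        return peak_lr * 0.5 * (1.0 - math.cos(math.pi * step / warmup_steps))
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

PEAK_LR = 0.1          # 0.1 x batch_size / 256 at batch size 256
TOTAL_STEPS = 300_000  # training length stated in Section 4.2
WARMUP_STEPS = 5_000   # hypothetical warmup length, not given in the paper

lr_start = learning_rate(0, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS)
lr_mid_warmup = learning_rate(WARMUP_STEPS // 2, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS)
lr_peak = learning_rate(WARMUP_STEPS, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS)
lr_end = learning_rate(TOTAL_STEPS, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS)
```

The rate starts at zero, reaches half of the peak midway through the warmup, hits the peak when the warmup ends, and decays to zero at the final step.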

4.2 ImageNet Classification

Dataset: We evaluate our method on the ImageNet classification dataset [imagenet] consisting of 1000 classes. We train on the 1.28M training images and report results on the 50k validation images. For all models in this section, we resize the images to 299×299 and use data augmentation from [szegedy2017inception] at training time.

Model architectures: We provide comparisons on three different model architectures: 1) ResnetV2-50 [resnetsv2]: a popular model with identity shortcuts, 2) InceptionV3 [szegedy2016rethinking]: a high performing model without identity shortcuts and fully connected layers, and 3) VGG-A [vggnet]: a feed forward model with a mix of convolutional and fully connected layers. For all models using GN we use a group size of 32. However, since VGG-A does not use a multiple of 32 filters in all layers, we increase the number of filters to the nearest multiple.

Training: We follow the training setup used by [resnets]. All models are trained using synchronous SGD across 8 GPUs for 300K steps. Gradients are computed by averaging across all GPUs. For BatchNorm, the normalization statistics are computed per GPU. This setup is common for multi-GPU training using synchronous SGD in Tensorflow and PyTorch. An initial learning rate of 0.1×batch_size/256 and a cosine decay schedule are used. We follow [resnets, resnetsv2] for other implementation details. Results are reported using two image classification metrics: 1) top-1 accuracy, which measures the accuracy using the highest scoring class (top-1 prediction), and 2) top-5 accuracy, which measures the accuracy using the top-5 scoring classes.

Table 1: FRN layer outperforms BN and other normalization methods for large batch size on Imagenet Classification for ResnetV2-50 [resnetsv2] and InceptionV3 [szegedy2016rethinking].

| Method | ResnetV2-50 top-1 | ResnetV2-50 top-5 | InceptionV3 top-1 | InceptionV3 top-5 |
|---|---|---|---|---|
| Batchnorm | 76.21 | 92.98 | 78.24 | 94.07 |
| BatchRenorm | 75.85 | 92.90 | 78.19 | 94.01 |
| Groupnorm | 75.67 | 92.70 | 78.14 | 93.98 |
| Layernorm | 72.75 | 91.19 | 76.75 | 93.37 |
| Instancenorm | 71.63 | 90.46 | 73.93 | 91.60 |
| FRN layer [Ours] | 77.21 | 93.57 | 78.95 | 94.49 |

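Table 2: Effect of mini-batch size used for normalization on ImageNet classification for ResnetV2-50 [resnetsv2]. The first block reports top-1 accuracy and the second block top-5 accuracy; Δ is the difference between FRN layer and Groupnorm.

| Images per GPU (top-1) | 32 | 16 | 8 | 4 | 2 |
|---|---|---|---|---|---|
| Batchnorm | 76.21 | 75.55 | 74.04 | 71.96 | 65.09 |
| Renorm | 75.85 | 75.96 | 75.59 | 74.18 | 70.75 |
| Groupnorm | 75.67 | 75.77 | 76.14 | 76.02 | 76.20 |
| FRN layer [Ours] | 77.21 | 77.10 | 77.16 | 77.18 | 77.33 |
| Δ | +1.54 | +1.33 | +1.02 | +1.16 | +1.13 |

| Images per GPU (top-5) | 32 | 16 | 8 | 4 | 2 |
|---|---|---|---|---|---|
| Batchnorm | 92.98 | 92.81 | 92.12 | 90.98 | 86.51 |
| Renorm | 92.90 | 92.98 | 92.80 | 92.10 | 89.81 |
| Groupnorm | 92.70 | 92.72 | 92.89 | 92.87 | 92.92 |
| FRN layer [Ours] | 93.62 | 93.59 | 93.60 | 93.49 | 93.61 |
| Δ | +0.92 | +0.87 | +0.71 | +0.62 | +0.69 |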

Comparison with normalization methods: In Table 1 we compare our method with various normalization methods for the regular batch size of 32 images/GPU. This results in an effective batch size of 32×8=256 and is the most favorable configuration for BN. This is the strongest baseline for image classification, and all the alternatives to BN have struggled in this setting, underperforming BN. Even for this large batch size, FRN outperforms all the methods, including BN, by a healthy margin on both architectures, indicating that batch dependent training is not necessary for high performance. At this large batch size, the next best performing normalization schemes are BN and BatchRenorm, both of which are batch normalized methods, followed by the other sample based normalization methods. Figure 4 compares the training and validation accuracy curves for various normalization methods using the ResnetV2-50 architecture. We observe that the FRN layer achieves both higher training and validation accuracies than BN, indicating that the removal of stochastic batch dependence eases optimization, allowing the model to train better. The generalization gap, i.e. the difference between training and validation accuracy, has also increased; however, improved optimization results in a net performance gain on validation. In comparison, GN also achieves lower training error than BN but performs worse on validation.

Effect of small number of images per GPU: We study the impact of the mini-batch size used for normalization (images/GPU) on the performance of various methods in Figure 1 and Table 2. All methods are trained with 8 GPUs with five different total batch sizes of 16, 32, 64, 128, and 256, divided into an equal number of images per GPU, leading to 2, 4, 8, 16, and 32 images/GPU. BN is known to degrade in performance when the batch size is small [batchrenorm, evalnorm], as is evident in Figure 1. GroupNorm (GN) exhibits a more consistent performance, underperforming BN only at the largest batch size. Batch renormalization outperforms GN at the largest two batch sizes but shows a degradation in performance for the smaller batch sizes. Our method, FRN, consistently outperforms all the normalization methods at all batch sizes.
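The batch configurations above follow from simple arithmetic. The sketch below tabulates images per GPU and the initial learning rate for each total batch size, assuming that `batch_size` in the rule 0.1×batch_size/256 refers to the total batch across all 8 GPUs; that reading, and the helper name, are assumptions.

```python
NUM_GPUS = 8        # all models are trained on 8 GPUs
BASE_LR = 0.1       # learning rate at the reference batch size
BASE_BATCH = 256    # reference batch size in the scaling rule

def per_gpu_and_lr(total_batch_size):
    """Return (images per GPU, initial learning rate) for a total batch size."""
    images_per_gpu = total_batch_size // NUM_GPUS
    # Assumption: the scaling rule uses the total batch size.
    initial_lr = BASE_LR * total_batch_size / BASE_BATCH
    return images_per_gpu, initial_lr

configs = {b: per_gpu_and_lr(b) for b in (16, 32, 64, 128, 256)}
images_per_gpu = [configs[b][0] for b in (16, 32, 64, 128, 256)]
```

Under this reading the smallest configuration normalizes over only 2 images per GPU for BN, while FRN's statistics are unaffected because they are computed per sample and per channel.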

Figure 4: Comparison of training and validation curves of various normalization method for Imagenet Classification using ResnetV2-50 model.

Analyzing the effect of FRN and TLU: In Table 3 we perform a detailed ablation study of the effect of FRN and TLU. We combine them with various normalization methods, namely BatchNorm (BN), GroupNorm (GN), LayerNorm (LN) and InstanceNorm (IN), and train models for each combination for two high performing, but different, model architectures: ResnetV2-50 and InceptionV3. We either replace the ReLU activation with TLU, or modify the normalization technique to suppress mean centering and divide by uncentered second moments instead of the variance (eqn. (1) instead of eqn. (8)). The corresponding normalizations are named with an FRN suffix in Table 3; for example, GN becomes GFRN, LN becomes LFRN, etc. For BN, we just replaced the activation function without changing the normalization technique, and we observe no significant difference in performance. We note, however, that IN benefits from the use of FRN (IN+ReLU vs. FRN+ReLU), resulting in a 3.61 point top-1 gain for ResnetV2-50. Adding TLU leads to another 1.97 points gain (FRN + TLU). Similar improvements are observed for InceptionV3. In fact, similar improvement trends can be seen for GN and LN as well. This experimental result suggests that both FRN and TLU are critical for the high performance of our method and provide complementary gains.
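The two gains quoted above can be recomputed directly from the top-1 columns of Table 3. The short sketch below does this for both architectures; the dictionary layout is only an illustrative way to hold the reported numbers.

```python
# Top-1 accuracies copied from Table 3 (per-channel normalization rows only).
top1 = {
    "ResnetV2-50": {"IN + ReLU": 71.63, "FRN + ReLU": 75.24, "FRN + TLU": 77.21},
    "InceptionV3": {"IN + ReLU": 73.93, "FRN + ReLU": 77.98, "FRN + TLU": 78.95},
}

def ablation_gains(scores):
    """Gain from dropping mean centering (IN -> FRN), then from ReLU -> TLU."""
    gain_frn = round(scores["FRN + ReLU"] - scores["IN + ReLU"], 2)
    gain_tlu = round(scores["FRN + TLU"] - scores["FRN + ReLU"], 2)
    return gain_frn, gain_tlu

resnet_gains = ablation_gains(top1["ResnetV2-50"])
inception_gains = ablation_gains(top1["InceptionV3"])
```

For ResnetV2-50 this reproduces the 3.61 and 1.97 point gains stated in the text; for InceptionV3 the same two steps contribute 4.05 and 0.97 points.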

Table 3: Ablation of our method on Imagenet Classification for ResnetV2-50 [resnetsv2] and InceptionV3 [szegedy2016rethinking]. We evaluate various combinations of our method with existing normalizations. Combinations that include one of our proposals are marked as . Our method, FRN + TLU, at the bottom is marked as [Ours].
Method ResnetV2-50 InceptionV3
P@1 P@5 P@1 P@5
BN + ReLU 76.21 92.98 78.24 94.07
BN + TLU 76.03 92.94 78.22 94.13
GN + ReLU 75.67 92.70 78.14 93.98
GN + TLU 76.59 93.16 78.50 94.18
GFRN + ReLU 75.93 92.65 78.16 94.03
GFRN + TLU 76.44 92.80 78.18 94.05
LN + ReLU 72.75 91.19 76.75 93.37
LN + TLU 73.99 91.60 77.21 93.48
LFRN + ReLU 75.03 92.50 77.62 93.65
LFRN + TLU 76.17 92.89 78.12 94.02
IN + ReLU 71.63 90.46 73.93 91.60
IN + TLU 71.72 90.53 74.81 92.01
FRN + ReLU 75.24 92.65 77.98 94.02
FRN + TLU [Ours] 77.21 93.57 78.95 94.49
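
The gains quoted in the ablation discussion can be recomputed directly from the top-1 columns of the table above. The short sketch below does this for the IN + ReLU, FRN + ReLU and FRN + TLU rows; the dictionary layout and helper name are illustrative assumptions.

```python
# Top-1 values copied from the table above: (ResnetV2-50, InceptionV3).
TOP1 = {
    "IN + ReLU": (71.63, 73.93),
    "FRN + ReLU": (75.24, 77.98),
    "FRN + TLU": (77.21, 78.95),
}
ARCHS = ("ResnetV2-50", "InceptionV3")


def gain(before, after, arch_index):
    """Top-1 difference between two rows, rounded to two decimals."""
    return round(TOP1[after][arch_index] - TOP1[before][arch_index], 2)


for i, arch in enumerate(ARCHS):
    frn_gain = gain("IN + ReLU", "FRN + ReLU", i)
    tlu_gain = gain("FRN + ReLU", "FRN + TLU", i)
    print(arch, "FRN gain:", frn_gain, "additional TLU gain:", tlu_gain)
```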

Models with Fully Connected (FC) layers: FC layers are a pathological case for normalization methods, especially for per-sample methods (GN, LN, IN, FRN), since the number of activations to be normalized over is relatively small. As a result, normalization layers are typically not applied after FC layers. In this section we evaluate the effect of applying normalization after all the layers, irrespective of whether they are FC or convolutional layers. Note that FC layers are the most challenging scenario for FRN, since we are normalizing over a single activation (N=1). We report results for two architectures where the output of FC layers is normalized: 1) InceptionV3 in creftypecap 1 and 2) VGG-A in Table 4. Note that while ResnetV2-50 also has an FC layer after the global pooling to produce logits, normalization is performed before pooling and is thus not relevant here. InceptionV3 has fully connected layers in an auxiliary logits branch, while VGG-A has them in the main network. FRN outperforms all other normalization methods on both architectures, even in this challenging scenario.

While training InceptionV3 and VGG-A, it was crucial to use learning rate rampup (see Section 4.1) and a learned ϵ (see Section 3.3) for FRN to achieve peak performance. Without rampup, FRN underperformed other methods on InceptionV3 and failed to learn entirely on VGG-A. Other methods were not significantly affected. We discovered that without the rampup phase, the output of the max pooling layers grew to very large magnitudes in the first few steps. This saturates the normalized activations (see creftypecap 3) and prevents learning due to poor flow of gradients.
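
One way to see the saturation described above is the single-activation case (N=1). Assuming FRN divides by the square root of the uncentered second moment plus |ϵ|, the second moment of one value x is x², so the normalized output is x / sqrt(x² + |ϵ|). This approaches sign(x) once |x| is large relative to sqrt(|ϵ|), where the output barely changes with x. The sketch below is illustrative only; the function name and the ϵ values are assumptions, not the paper's settings.

```python
import math


def frn_single_activation(x, eps):
    """FRN for N = 1: the second moment of a single value is x**2."""
    return x / math.sqrt(x * x + abs(eps))


# Compare a very small eps with a larger one (both values are illustrative).
for eps in (1e-6, 1.0):
    outputs = [frn_single_activation(x, eps) for x in (0.1, 1.0, 100.0)]
    print("eps =", eps, [round(v, 3) for v in outputs])
```

With the small ϵ even modest inputs are pushed close to ±1, whereas the larger ϵ keeps small inputs in a graded range; very large inputs saturate in both cases, which is consistent with the need to keep activation magnitudes under control early in training.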

Interestingly, for VGG-A, BN performs worse than ‘No normalization’ at the default learning rate of 0.01. In Table 4 we also report results for models trained with a higher learning rate of 0.1. A rampup phase was useful for all the models at this learning rate. However, the ‘No normalization’ model eventually diverges, while BN shows instability in training (even with rampup) and performs significantly worse than other methods. In contrast, both FRN and GN benefit from training at the higher learning rate and yield improved performance, with FRN outperforming GN.

Table 4: Model with fully connected layer. We provide a comparison on Imagenet Classification for the VGG-A model that uses two fully connected layers. Top half shows the results training with an initial learning rate of 0.01 (the default rate). Bottom half shows the results for training with a higher learning rate of 0.1. The base model diverges at this rate, while the model with Batchnorm exhibits instability. FRN and Groupnorm train well, with FRN outperforming all others.
Method Learning rate P@1 P@5
No normalization 0.01 69.04 88.99
Batchnorm 0.01 67.82 88.11
Groupnorm 0.01 69.35 89.12
FRN 0.01 70.04 89.42
No normalization 0.1 Diverged Diverged
Batchnorm 0.1 62.61 84.56
Groupnorm 0.1 69.94 89.57
FRN 0.1 71.66 90.69

Comparison of TLU with related variants: In Table 5 we compare TLU with three related variants for ResnetV2-50 on ImageNet. All four correspond to different combinations of a scale κ and a bias τ used to compute the threshold. First, observe that TLU, despite having a less general form, outperforms the others. Second, all variants with a learnable threshold outperform BN, which doesn’t benefit from it. We conclude that a learnable threshold is necessary for high performance in conjunction with FRN; however, it doesn’t need to be input dependent. Interestingly, while two of the variants correspond to commonly known activations, ReLU and Parametric ReLU (PReLU) [he2015delving], the third, more general form, termed Affine-TLU, is a combination of PReLU and TLU; it outperforms the previous two and, to the best of our knowledge, has not been explored. Note that Affine-TLU is different from Maxout [goodfellow2013maxout], which computes the maximum across groups of channels and, unlike Affine-TLU, results in a reduced number of channels.

Table 5: Comparison of activations in conjunction with FRN on Imagenet Classification for ResnetV2-50. We observe that learnable threshold is key to high performance of our method in comparison to BN, which doesn’t benefit from it.
Method P@1 P@5
BN + max(x,0) (ReLU) 76.21 92.98
BN + max(x,τ) (TLU) 76.03 92.94
FRN + max(x,0) (ReLU) 75.24 92.65
FRN + max(x,κx) (PReLU) [he2015delving] 76.43 93.30
FRN + max(x,κx+τ) (Affine-TLU) 76.71 93.32
FRN + max(x,τ) (TLU) 77.21 93.57
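
The four activations compared above differ only in how the threshold is formed from a scale κ and a bias τ. The sketch below writes each one as a scalar function, following the formulas in the table, and checks that Affine-TLU reduces to the others for particular parameter values. The function names are illustrative, and κ and τ would be learned per channel in practice.

```python
def relu(x):
    # max(x, 0): fixed threshold at zero.
    return max(x, 0.0)


def tlu(x, tau):
    # max(x, tau): learned, input-independent threshold.
    return max(x, tau)


def prelu(x, kappa):
    # max(x, kappa * x): input-dependent threshold, no bias.
    return max(x, kappa * x)


def affine_tlu(x, kappa, tau):
    # max(x, kappa * x + tau): input-dependent threshold with a bias.
    return max(x, kappa * x + tau)


x = -2.0
print(relu(x), tlu(x, -0.5), prelu(x, 0.25), affine_tlu(x, 0.25, -0.5))
# Special cases of the general form:
print(affine_tlu(x, 0.0, -0.5) == tlu(x, -0.5))    # kappa = 0 gives TLU
print(affine_tlu(x, 0.25, 0.0) == prelu(x, 0.25))  # tau = 0 gives PReLU
print(tlu(x, 0.0) == relu(x))                      # tau = 0 gives ReLU
```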

4.3 Object Detection on COCO

Next, we evaluate our method on the task of Object Detection (OD) and demonstrate that it consistently outperforms other normalization methods at all the batch sizes we evaluated on. Since OD frameworks are typically trained with high resolution inputs, they are limited to using small mini-batch sizes. This constraint makes OD an ideal evaluation benchmark for sample based normalization methods that enable training with small batch sizes.

Experimental setup. We perform experiments on the COCO dataset [coco] with 80 object classes. We train using the train2017 set, and evaluate on the 5k images in the val2017 (minival) split. We report the standard COCO evaluation metrics of mean average precision with different IoU thresholds, namely AP, AP50 and AP75 [coco].

Model: We use the RetinaNet [lin2017focal] object detection framework. RetinaNet is a unified single stage detector that comprises three conceptual components: 1) a backbone network, with an off-the-shelf architecture, that acts as a convolutional feature extractor for a given high resolution input image, 2) a convolutional object classification sub-network that acts on the features extracted by the backbone network, and 3) a convolutional bounding box regression sub-network. We use a ResnetV1-101 Feature Pyramid Network backbone [lin2017feature] and resize the input images to 1024×1024.

Table 6: Object detection results on COCO. Our method, FRN, outperforms other methods for all batch sizes. Note that while BN shows a dramatic drop in performance for smaller batch sizes, FRN exhibits a comparatively smaller degradation and consistently outperforms GN, which also exhibits similarly stable performance. Note that BN* models were trained by fine-tuning an Imagenet pre-trained model, while the others are trained from scratch.
Method AP AP50 AP75
imgs/gpu 8 4 2 8 4 2 8 4 2
BN* 38.3 37.1 32.9 57.2 55.4 49.1 41.5 40.4 35.9
BN 38.7 37.9 30.2 56.6 55.2 44.5 42.1 41.4 32.5
GN 39.3 39.0 38.7 57.8 57.5 56.9 42.6 42.3 41.8
FRN 39.6 39.5 39.1 58.5 58.4 57.5 43.1 43.3 42.3

Training: To simplify experimentation and evaluation, we compare all methods only on models trained from scratch. We justify this choice based on conclusions from [he2019rethinking] that, by training longer, models trained from scratch can catch up with models trained by fine-tuning pre-trained models. To ensure this, we start with a baseline fine-tuned model, trained by us at the largest batch size of 64, that achieves an AP of 38.3 in 25K training steps (BN*, Table 6) and is close to the corresponding result of 39.1 reported in [lin2017focal]. Next, we empirically find the nearest multiple of 25K steps that achieves similar accuracy when training from scratch to be 125K steps (BN, Table 6). We set 125K as the base number of training steps for the largest batch size. We train our models using 8 GPUs and experiment with batch sizes in {64, 32, 16}, leading to {8, 4, 2} images per GPU respectively. For a smaller batch size M we set the number of training steps to 125000×64/M and the learning rate to base_lr×M/64. We report the best performance using base_lr ∈ {0.01, 0.05, 0.1}. All models are trained using a momentum of 0.9 and a weight decay of 4×10⁻⁴.
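
The scaling rule above can be worked through for the three batch sizes. The sketch below applies the stated formulas (steps = 125000×64/M, learning rate = base_lr×M/64) with 8 GPUs; the function and constant names are illustrative, and base_lr = 0.1 is just one of the three values the text lists.

```python
BASE_STEPS = 125_000  # training steps at the largest batch size
BASE_BATCH = 64       # largest total batch size
NUM_GPUS = 8


def schedule(batch_size, base_lr):
    """Training steps and learning rate for a total batch size M."""
    steps = BASE_STEPS * BASE_BATCH // batch_size
    lr = base_lr * batch_size / BASE_BATCH
    return {
        "imgs_per_gpu": batch_size // NUM_GPUS,
        "steps": steps,
        "lr": lr,
    }


for m in (64, 32, 16):
    print(m, schedule(m, base_lr=0.1))
```

Halving the batch size doubles the number of steps and halves the learning rate, so every configuration processes the same total number of images.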

Comparison of normalization methods: In Table 6 we observe that FRN outperforms both BN and GN at all batch sizes, further validating our results in the previous section. In agreement with the observations from creftypecap 2, both FRN and GN achieve higher accuracy than BN at the evaluated batch sizes. FRN outperforms BN by 0.9 AP points at the largest batch size, and this gap widens to 8.9 AP points at the smallest batch size. Further, FRN consistently achieves higher accuracy than GN.

Effect of batch size: BN exhibits a dramatic degradation in performance, dropping by 8.5 AP points for the model trained from scratch as the number of images per GPU is reduced from 8 to 2. In comparison, both FRN and GN show relatively stable accuracy and degrade by at most 0.6 AP points. Interestingly, the fine-tuned BN* model for the smallest batch size performs 2.7 AP points better than the corresponding BN model trained from scratch, indicating that longer training at this batch size is detrimental to the performance of BN. In contrast, FRN maintains a consistent lead for all the metrics across all batch sizes.
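
The AP differences quoted in this section follow directly from the AP columns of the object detection table. The sketch below recomputes them; the dictionary layout and helper names are illustrative assumptions.

```python
# AP values from the object detection table, ordered by images/GPU: (8, 4, 2).
AP = {
    "BN*": (38.3, 37.1, 32.9),
    "BN": (38.7, 37.9, 30.2),
    "GN": (39.3, 39.0, 38.7),
    "FRN": (39.6, 39.5, 39.1),
}


def drop(method):
    """AP lost when going from 8 images/GPU to 2 images/GPU."""
    largest, _, smallest = AP[method]
    return round(largest - smallest, 1)


def gap(method_a, method_b, index):
    """AP lead of method_a over method_b at one batch size."""
    return round(AP[method_a][index] - AP[method_b][index], 1)


for name in AP:
    print(name, "drop:", drop(name))
print("FRN over BN at 8 imgs/GPU:", gap("FRN", "BN", 0))
print("FRN over BN at 2 imgs/GPU:", gap("FRN", "BN", 2))
print("BN* over BN at 2 imgs/GPU:", gap("BN*", "BN", 2))
```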

5 Conclusion

In this paper we proposed the FRN layer, a novel combination of Filter Response Normalization (FRN) and a Thresholded activation (TLU) function that eliminates the need for batch dependent training. It outperforms BN in a variety of settings and exhibits a consistently high performance in large as well as small batch training. Further, FRN also outperforms Group Normalization, a leading sample based normalization alternative to BN, in all the explored settings. We also demonstrated the success of FRN even in the pathological case of fully connected layers which are typically not normalized. However, since different normalization methods have been successful in different problem domains, e.g. Layer Normalization has been successful in NLP, we leave exploration of these areas with FRN as future work.

Acknowledgement. We would like to thank Vivek Rathod for help with object detection experiments.

References