Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks
Abstract
Batch Normalization (BN) is a highly successful and widely used batch-dependent training method. Its use of minibatch statistics to normalize the activations introduces dependence between samples, which can hurt training if the minibatch size is too small or if the samples are correlated. Several alternatives, such as Batch Renormalization and Group Normalization (GN), have been proposed to address these issues. However, they either do not match the performance of BN for large batches, or still exhibit degradation in performance for smaller batches, or introduce artificial constraints on the model architecture. In this paper we propose the Filter Response Normalization (FRN) layer, a novel combination of a normalization and an activation function, that can be used as a drop-in replacement for other normalizations and activations. Our method operates on each activation map of each batch sample independently, eliminating the dependency on other batch samples or channels of the same sample. Our method outperforms BN and all alternatives in a variety of settings for all batch sizes. The FRN layer performs $\approx 0.7$-$1.0\%$ better on top-1 validation accuracy than BN with large minibatch sizes on Imagenet classification with the InceptionV3 and ResnetV2-50 architectures. Further, it performs $>1\%$ better than GN on the same problem in the small minibatch size regime. For object detection on the COCO dataset, the FRN layer outperforms all other methods by at least $0.3$-$0.5\%$ in all batch size regimes.
1 Introduction
Batch normalization (BN) [batchnorm] is a cornerstone of current high-performing deep neural network models and has been instrumental in the recent success and wide application of deep learning. One often-discussed drawback of BN is its reliance on sufficiently large batch sizes [batchrenorm, groupnorm, evalnorm]. When trained with small batch sizes, as is common in many applications like object detection, BN exhibits a significant degradation in performance. The source of this issue has been attributed to the training and testing discrepancy arising from BN's reliance on stochastic minibatches [evalnorm]. As a result, several approaches have been proposed that aim to ameliorate the issues due to stochasticity [batchrenorm, evalnorm] or offer alternatives [groupnorm, layernorm] by removing batch dependence. However, these approaches do not match the performance of BN for large batch sizes (Figure 1). Further, they either still exhibit a degradation in performance for smaller batch sizes, e.g. Batch Renormalization, or introduce constraints on the model architecture and size, e.g. Group Normalization requires the number of channels in a layer to be a multiple of an ideal group size, such as 32. In this work we propose the Filter Response Normalization (FRN) layer, consisting of a normalization and an activation function, that eliminates these shortcomings altogether. Our method does not have any batch dependence, as it operates on each activation channel (filter response) of each batch sample independently, and it outperforms BN and alternatives in a wide variety of evaluation settings. For example, in Figure 1, the FRN layer outperforms other approaches by more than $1\%$ at all batch sizes for ResnetV2-50 on ImageNet classification.
The reliance of BN on large batch sizes is prohibitive in a variety of ways. As pointed out by [groupnorm], it hinders the exploration of higher capacity models because larger batch sizes require significantly more memory. This limits the performance of tasks that need to process larger inputs. For example, object detection and segmentation perform better with higher resolution inputs; similarly, video data inherently tends to be significantly higher dimensional. As a result, these systems are forced to trade off model capacity against the ability to train with larger batch sizes. As evidenced in Figure 1 and the experiment section, our method maintains consistent performance across a range of batch sizes, making it a promising replacement for BN on these tasks.
The FRN layer consists of two novel components that work together to yield the high performance of our method: 1) a normalization method, referred to as Filter Response Normalization (FRN), that independently normalizes the responses of each filter for each batch sample by dividing them by the square root of their uncentered second moment, without performing any mean subtraction, and 2) a pointwise activation, termed Thresholded Linear Unit (TLU), that is parameterized by a learned rectification threshold, allowing for activations that are biased away from zero. The FRN layer outperforms BN by roughly $0.7$-$1.0\%$ with large minibatch sizes on Imagenet classification with the InceptionV3 and ResnetV2-50 architectures. Further, it performs $>1\%$ better than Group Normalization on the same problem in the small minibatch size regime. For object detection on the COCO dataset, the FRN layer outperforms all other methods by at least $0.3$-$0.5\%$ in all batch size regimes. Lastly, the FRN layer maintains consistent performance across all the batch sizes that we tested. The proposed FRN layer does not rely on other batch elements or channels for normalization, yet outperforms BN and other alternatives for all batch sizes and in a variety of settings.
Contributions: The main contributions in this paper are the following:

• Filter Response Normalization (FRN), a normalization method that enables models trained with per-channel normalization to achieve high accuracy.
• The Thresholded Linear Unit (TLU), an activation function to use with FRN, resulting in a further improvement in accuracy and outperforming BN even at large batch sizes without any batch dependency. We refer to this combination as the FRN layer.
• Several insights and practical considerations that lead to the success of the combination of FRN and TLU.
• A detailed experimental study comparing popular normalization methods on large image classification and object detection tasks on a variety of real-world architectures.
2 Related work
Normalization of training data has been known to aid in optimization. For example, whitening of inputs is a common practice for training shallow models such as Support Vector Machines and Logistic Regression. Similarly, for training deep networks, normalization of inputs and intermediate representations has been recommended for efficient learning [lecun2012efficient, lecun1998efficient, glorot2010understanding]. Batch Normalization (BN) [batchnorm] aims to accelerate learning by stabilizing the intermediate feature distributions. BN normalizes each activation channel independently by using the mean and variance statistics computed for that channel over the entire minibatch. This has been shown to accelerate learning and enable training of very deep neural network architectures. However, BN exhibits a dramatic degradation in performance when trained with smaller minibatches [groupnorm, evalnorm]. Several approaches have been proposed to address this shortcoming, and they can be grouped into two major categories: 1) methods that reduce the train-test discrepancy in batch normalized models, and 2) sample-based normalization methods that avoid batch normalization.
Methods reducing train-test discrepancy in batch normalization. [batchrenorm] notes that the discrepancy between the statistics that are used for normalization during training and testing may arise from the stochasticity due to small minibatches and from bias due to non-iid samples. They propose Batch Renormalization (BR) to reduce this discrepancy by constraining the minibatch moments to a specific range, limiting the variation in minibatch statistics during training. A key benefit of this approach is that the test time evaluation scheme of a model trained with Batch Renormalization is exactly the same as that for a model trained with BN. On the other hand, Evalnorm [evalnorm] does not modify the training scheme. Instead, it proposes a correction to the normalization statistics to be used during evaluation. The major advantage of this method is that the model does not need to be retrained. However, both these methods still exhibit a degradation in performance for small minibatches. Another approach is to engineer systems that can circumvent the issue by distributing larger batches across GPUs for tasks that require large inputs [peng2017megdet]. However, this approach requires considerable GPU infrastructure.
Methods avoiding normalization using minibatches. Several approaches sidestep the issues encountered by BN by not relying on the stochastic minibatch at all [layernorm, groupnorm, instancenorm]. Instead, the normalization statistics are computed from the sample itself. Layer Normalization (LN) [layernorm] computes the normalization statistics from the entire layer, i.e. using all the activation channels. In contrast, Instance Normalization (IN) [instancenorm] computes the normalization statistics for each channel independently, as BN does, but only from the sample being normalized rather than from the entire batch. IN was shown to be useful for style transfer applications, but was not successfully applied for recognition. Group Normalization (GN) [groupnorm] fills the middle ground between the two. It computes the normalization statistics over groups of channels. The ideal group size is experimentally determined. While GN does not show performance degradation for smaller batch sizes, it performs worse than BN for larger minibatches (see Figure 1 here and Figure 1 in [groupnorm]). In addition, the size of groups required by GN imposes a constraint on the network size and architecture, as every normalized layer needs to have a number of channels that is a multiple of the ideal group size determined by GN.
Other approaches. Weight Normalization [salimans2016weight] proposes a reparameterization of the filters in terms of a direction and a scale and reports accelerated convergence. Normalization Propagation [arpit2016normalization] uses idealized moment estimates to normalize every layer. Refer to [ren2016normalizing] for a unifying view of various normalization approaches.
3 Approach
Our goal is to eliminate the batch dependency in the training of deep neural networks without sacrificing the performance gains of BN at large batch sizes. We start this section with the main details of our proposal. We will follow that with a discussion of the rationale behind our proposal.
3.1 Filter Response Normalization with Thresholded Activation
We will assume for the purpose of exposition that we are dealing with a feed-forward convolutional neural network. We follow the usual convention that the filter responses (activation maps) produced after a convolution operation are a 4D tensor $\bm{X}$ with shape $[B,W,H,C]$, where $B$ is the minibatch size, $W,H$ are the spatial extents of the map, and $C$ is the number of filters used in the convolution. $C$ is also referred to as the number of output channels. Let $\bm{x}={\bm{X}}_{b,:,:,c}\in {\mathbb{R}}^{N}$, where $N=W\times H$, be the vector of filter responses for the ${c}^{th}$ filter for the ${b}^{th}$ batch point. Let ${\nu}^{2}={\sum}_{i}{x}_{i}^{2}/N$ be the mean squared norm of $\bm{x}$. Then we propose Filter Response Normalization (FRN) as follows:
$\widehat{\bm{x}}=\frac{\bm{x}}{\sqrt{{\nu}^{2}+\epsilon}},$  (1)
where $\epsilon$ is a small positive constant to prevent division by zero. A few observations are in order about the normalization scheme we propose:

• Similar to other normalization schemes, Filter Response Normalization removes the scaling effect of both the filter weights and pre-activations. This has been known [salimans2016weight] to remove noisy updates along the direction of the weights and reduce gradient covariance.
• One of the main differences in our proposal is that we do not remove the mean prior to normalization. While mean subtraction was an important aspect of Batch Normalization, it is arbitrary and without real justification for normalization schemes that are batch independent.
• Our normalization is done on a per-channel basis. This ensures that all filters (or rows of a weight matrix) have the same relative importance in the final model.
• At first glance, Filter Response Normalization would appear very similar to Local Response Normalization (LRN) proposed in [Alexnet2012]. However, among other differences, LRN normalizes over adjacent channels at the same spatial location, while ours is a global normalization over the spatial extent.
As with other schemes, we also perform an affine transform after normalization so that the network can undo the effects of the normalization:
$\bm{y}=\gamma \widehat{\bm{x}}+\beta ,$  (2) 
where $\gamma $ and $\beta $ are learned parameters. The final addition to our FRN layer is the activation function.
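To make Eqs. (1) and (2) concrete, the following is a minimal per-channel sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `frn` and `affine`, the sample values, and the choices `gamma=1.5`, `beta=0.1` are all hypothetical. Each list stands for the $N$ responses of one filter for one batch sample.

```python
import math

def frn(x, eps=1e-6):
    """Eq. (1): divide one channel's responses by sqrt(nu^2 + eps)."""
    nu2 = sum(v * v for v in x) / len(x)   # uncentered second moment (no mean subtraction)
    scale = 1.0 / math.sqrt(nu2 + eps)
    return [v * scale for v in x]

def affine(x_hat, gamma, beta):
    """Eq. (2): learned per-channel scale and offset (scalars for one channel)."""
    return [gamma * v + beta for v in x_hat]

x = [3.0, -1.0, 2.0, 6.0]                            # hypothetical filter responses, N = 4
x_hat = frn(x)
mean_sq = sum(v * v for v in x_hat) / len(x_hat)     # close to 1 after normalization
y = affine(x_hat, gamma=1.5, beta=0.1)               # illustrative parameter values
x_hat_scaled = frn([10.0 * v for v in x])            # rescaling the input barely changes x_hat
```

Because the mean is not subtracted, `x_hat` generally has a nonzero mean, while its mean square is close to one and rescaling the input leaves it essentially unchanged.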
3.1.1 Thresholded Linear Unit (TLU)
Lack of mean centering in Filter Response Normalization can lead to activations having an arbitrary bias away from zero. Such a bias in conjunction with ReLU can have detrimental effect on learning and lead to poor performance and dead units. We propose to address this issue by augmenting ReLU with a learned threshold $\tau $ to yield TLU defined as:
$\bm{z}=\mathrm{max}(\bm{y},\tau )$  (3) 
Since $\mathrm{max}(\bm{y},\tau )=\mathrm{max}(\bm{y}-\tau ,0)+\tau =\mathrm{ReLU}(\bm{y}-\tau )+\tau $, the effect of the TLU activation is the same as having a shared bias before and after ReLU. However, based on our experiments, this does not appear to be identical to absorbing the biases in the previous and subsequent layers. We hypothesize that the form of TLU is more favorable for optimization. TLU significantly improves the performance of models using FRN (see Table 5), outperforming BN and other alternatives, and leads to our method, the FRN layer. Figure 2 shows the schematic for our proposed FRN layer.
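The identity between TLU and a shifted ReLU can be checked numerically. The sketch below is illustrative only; the helper names `tlu` and `tlu_via_relu` and the sample values are assumptions introduced here.

```python
def relu(v):
    return max(v, 0.0)

def tlu(y, tau):
    """Eq. (3): element-wise max(y, tau) with a per-channel threshold tau."""
    return [max(v, tau) for v in y]

def tlu_via_relu(y, tau):
    """Equivalent form: subtract tau, apply ReLU, add tau back."""
    return [relu(v - tau) + tau for v in y]

y = [-2.0, -0.5, 0.0, 0.7, 3.0]        # hypothetical affine outputs for one channel
max_gap = 0.0
for tau in (-1.0, 0.0, 0.5):
    direct = tlu(y, tau)
    shifted = tlu_via_relu(y, tau)
    # largest absolute difference between the two forms (floating-point rounding only)
    max_gap = max(max_gap, max(abs(a - b) for a, b in zip(direct, shifted)))
```

With `tau = 0` the TLU reduces to the ordinary ReLU; a negative `tau` lets negative activations through down to the threshold.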
3.2 Gradients of FRN Layer
In this section, we briefly derive expressions for the gradients that flow through the network in the presence of the FRN layer. Since all the transformations are performed channel-wise, we only derive the per-channel gradients below. Let us assume that somewhere in the network, the activations $\bm{x}$ are fed to the FRN layer and the output is $\bm{z}$ (following the transformations described in equations (1), (2), and (3)). Let $f(\bm{z})$ be the mapping that the network applies to $\bm{z}$, with gradients $\frac{\partial f}{\partial \bm{z}}$ flowing backwards. Note that the parameters $\gamma $, $\beta $ and $\tau $ are vectors of size num_channels, and so the per-channel updates are scalar.
$\frac{\partial {z}_{i}}{\partial \tau}=\begin{cases}0, & \text{if } {y}_{i}\ge \tau \\ 1, & \text{otherwise}\end{cases}$  (4)
Note that the gradients $\frac{\partial {z}_{i}}{\partial {y}_{i}}$ are just the same as above, but with the cases reversed. Then the gradient update to $\tau $ is of the form
$\frac{\partial f}{\partial \tau}=\sum_{b=1}^{B}{\left(\frac{\partial f}{\partial {\bm{z}}_{b}}\right)}^{T}\frac{\partial {\bm{z}}_{b}}{\partial \tau},$  (5)
where ${\bm{z}}_{b}$ is the vector of per-channel activations of the ${b}^{th}$ batch point. Gradients w.r.t. $\gamma $ and $\beta $ are as follows:
$\left(\frac{\partial f}{\partial \gamma},\frac{\partial f}{\partial \beta}\right)=\left(\sum_{b=1}^{B}{\left(\frac{\partial f}{\partial {\bm{y}}_{b}}\right)}^{T}{\widehat{\bm{x}}}_{b},\ \sum_{b=1}^{B}\frac{\partial f}{\partial {\bm{y}}_{b}}\right)$  (6)
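The per-channel parameter gradients in Eqs. (4) to (6) can be sketched as a single pass over the batch. This is a hedged illustration: the function name `frn_param_grads` and the toy inputs are assumptions, and the $\beta$ gradient is read here as the sum of $\partial f/\partial y_i$ over all elements of the channel, so that each per-channel update is a scalar.

```python
def frn_param_grads(x_hat_batch, grad_z_batch, gamma, beta, tau):
    """Accumulate scalar gradients for gamma, beta and tau of one channel.

    x_hat_batch:  list over batch points of normalized responses (lists of floats)
    grad_z_batch: matching list of upstream gradients df/dz
    """
    d_gamma = d_beta = d_tau = 0.0
    for x_hat, grad_z in zip(x_hat_batch, grad_z_batch):
        for xh, gz in zip(x_hat, grad_z):
            y = gamma * xh + beta
            if y >= tau:
                # dz/dy = 1 and dz/dtau = 0: the gradient passes on to y
                d_gamma += gz * xh
                d_beta += gz
            else:
                # dz/dy = 0 and dz/dtau = 1: the gradient goes to tau instead
                d_tau += gz
    return d_gamma, d_beta, d_tau

# Toy example with B = 2 batch points and N = 2 responses each (hypothetical values).
x_hat_batch = [[1.0, -1.0], [0.5, -1.5]]
grad_z_batch = [[1.0, 2.0], [3.0, 4.0]]
grads = frn_param_grads(x_hat_batch, grad_z_batch, gamma=1.0, beta=0.0, tau=0.0)
```

In this example every upstream gradient ends up in either the $\beta$ or the $\tau$ accumulator, so the two always add up to the total upstream gradient.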
Using eqn. (2), we can see that $\frac{\partial f}{\partial \widehat{\bm{x}}}=\gamma \frac{\partial f}{\partial \bm{y}}$. Finally, the gradients that flow back from the FRN layer can be written as
$\frac{\partial f}{\partial \bm{x}}=\frac{1}{\sqrt{{\nu}^{2}+\epsilon}}\left(I-\widehat{\bm{x}}{\widehat{\bm{x}}}^{T}\right)\frac{\partial f}{\partial \widehat{\bm{x}}}$  (7)
We make a couple of observations about the gradients. Eqn. (5) suggests that part of the gradients that get suppressed in a regular ReLU activation are now used to update $\tau $, and in some sense are not wasted. Eqn. (7) shows that the gradients w.r.t. $\bm{x}$ are orthogonal to $\bm{x}$ (provided $\epsilon = 0$) because $(I-\widehat{\bm{x}}{\widehat{\bm{x}}}^{T})$ projects out the component in the direction of $\widehat{\bm{x}}$. This property is not unique to our normalization, but is known to help in reducing the variance of gradients during SGD and to benefit optimization [salimans2016weight].
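The orthogonality property can be checked numerically without relying on the closed form in Eq. (7). The sketch below uses central finite differences; the downstream function `f`, its weights `w`, and the input values are hypothetical stand-ins for the rest of a network.

```python
import math

def frn(x, eps):
    nu2 = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(nu2 + eps) for v in x]

def f(x, eps):
    # Arbitrary smooth function of the normalized responses (hypothetical).
    w = [0.3, -1.2, 0.8, 2.0]
    return sum(wi * math.tanh(v) for wi, v in zip(w, frn(x, eps)))

def numerical_grad(x, eps, h=1e-6):
    """Central finite-difference estimate of df/dx."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp, eps) - f(xm, eps)) / (2.0 * h))
    return grad

x = [1.0, -2.0, 0.5, 3.0]
g0 = numerical_grad(x, eps=0.0)
dot0 = sum(gi * xi for gi, xi in zip(g0, x))   # approximately zero: gradient is orthogonal to x
g1 = numerical_grad(x, eps=1.0)
dot1 = sum(gi * xi for gi, xi in zip(g1, x))   # clearly nonzero once eps is large
```

With $\epsilon = 0$ the normalized output is invariant to rescaling $\bm{x}$, which is why the gradient has no component along $\bm{x}$; a large $\epsilon$ breaks that invariance.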
3.3 Parameterizing $\epsilon$
In our discussion so far, we have assumed that the filter responses have a large spatial extent of size $N=W\times H$. However, there are situations in real networks like InceptionV3 [inceptionv3] and VGG-A [vggnet], where some layers produce $1\times 1$ activation maps. In this setting ($N=1$), for a small value of $\epsilon$, the proposed normalization in Eq. (1) turns into a sign function (see Figure 3) and has very small gradients almost everywhere. This will invariably affect learning adversely. In contrast, higher values of $\epsilon$ lead to variants of a smoother soft sign function that are more amenable to learning. An appropriate value of $\epsilon$ becomes crucial for models that are fully connected or lead to $1\times 1$ activation maps. Empirically, we turn $\epsilon$ into a learnable parameter (initialized at ${10}^{-4}$) for such models. For other models, we use a fixed constant value of ${10}^{-6}$. In our experiments, we show that the learnable parameterization is useful for training the InceptionV3 model, where the auxiliary logits head produces $1\times 1$ activation maps, and for the VGG-A [vggnet] architecture, which uses fully connected layers.
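For $N=1$, Eq. (1) reduces to $\widehat{x}=x/\sqrt{x^{2}+\epsilon}$, whose derivative is $\epsilon/(x^{2}+\epsilon)^{3/2}$. The short sketch below tabulates both for a few values of $\epsilon$ to illustrate the sign-like and soft-sign-like regimes; the function names and sample points are assumptions made for illustration.

```python
import math

def frn_single(x, eps):
    """FRN for a 1x1 activation map (N = 1), where nu^2 = x^2."""
    return x / math.sqrt(x * x + eps)

def frn_single_grad(x, eps):
    """Derivative of frn_single with respect to x."""
    return eps / (x * x + eps) ** 1.5

samples = [-2.0, -0.5, 0.5, 2.0]
table = {
    eps: [(frn_single(x, eps), frn_single_grad(x, eps)) for x in samples]
    for eps in (1e-6, 1e-2, 1.0)
}
# For eps = 1e-6 the outputs sit very close to -1 or +1 and the gradients are tiny;
# for eps = 1.0 the outputs vary smoothly with x and the gradients are much larger.
```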
Since $\epsilon>0$, we explored two alternative parameterizations to enforce this constraint: absolute value and exponential. While both trained well, the absolute value parameterization $\epsilon={10}^{-6}+|{\epsilon}_{l}|$ (${\epsilon}_{l}$ being a learned parameter) produced consistently better empirical results. Parameterizations of this form are also preferable because the gradient magnitudes for ${\epsilon}_{l}$ are independent of the value of $\epsilon$.
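The remark about gradient magnitudes can be illustrated by comparing $d\epsilon/d\epsilon_{l}$ for the two parameterizations. The text does not spell out the exponential form, so $\epsilon=\exp(\epsilon_{l})$ below is an assumption, as are the function names.

```python
import math

EPS_MIN = 1e-6   # fixed floor from the absolute value parameterization

def eps_abs(eps_l):
    return EPS_MIN + abs(eps_l)

def d_eps_abs(eps_l):
    # Derivative of |eps_l|: magnitude 1 regardless of eps (subgradient +1 at 0).
    return 1.0 if eps_l >= 0 else -1.0

def eps_exp(eps_l):
    # Assumed exponential form; the source only names the parameterization.
    return math.exp(eps_l)

def d_eps_exp(eps_l):
    # Derivative equals eps itself, so it shrinks as eps gets small.
    return math.exp(eps_l)

abs_grads = [abs(d_eps_abs(v)) for v in (1e-4, -1e-1)]               # magnitude 1 in both cases
exp_grads = [d_eps_exp(math.log(target)) for target in (1e-4, 1e-1)]  # equal to the target eps
```

Under the assumed exponential form, a small $\epsilon$ such as ${10}^{-4}$ would scale the gradient reaching $\epsilon_{l}$ by the same small factor, whereas the absolute value form passes it through at unit magnitude.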
3.4 Mean Centering
Batch Normalization was proposed to counter the effects of internal covariate shift during training of a deep neural network. The solution was to keep the statistics of the distribution of activations over the data set invariant; as a practical matter, the authors chose to normalize the first and second moments of the minibatch at each step. Batch-independent alternatives that include mean centering are not justified by any particular consideration, and seem to be merely a legacy of BatchNorm.
Consider the example of Instance Normalization (IN). Using the same notation as Section 3.1, IN computes the normalized activations using the channel statistics $\mu ={\sum}_{i}{x}_{i}/N$ and ${\sigma}^{2}={\sum}_{i}{({x}_{i}-\mu )}^{2}/N$ as follows:
$\widehat{\bm{x}}=\frac{\bm{x}-\mu}{\sqrt{{\sigma}^{2}+\epsilon}}$  (8)
As the size of the activation map decreases (as is common in the layers closer to the output which are subject to downsampling, or due to the presence of fully connected layers), IN produces zero activations. Layer and Group Normalization are ways to circumvent this issue by normalizing across (all or subset of) channels. Since individual filters are responsible for specific channel activations, normalizing across channels causes complicated interaction in the filter updates. Hence, it appears that the only principled approach is to normalize each channel of the activation map separately without resorting to mean centering. This also has a desirable effect of removing the relative scaling between filters, which has been known to greatly aid in optimization.
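The collapse of IN on very small activation maps is easy to see in the extreme case $N=1$, where the single response equals its own mean. The following sketch contrasts Eq. (8) with Eq. (1); the function names and input values are illustrative assumptions.

```python
import math

def instance_norm(x, eps=1e-6):
    """Eq. (8): subtract the channel mean, divide by the channel standard deviation."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def frn(x, eps=1e-6):
    """Eq. (1): divide by the root of the uncentered second moment, no mean subtraction."""
    nu2 = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(nu2 + eps) for v in x]

in_out = instance_norm([2.5])    # 1x1 map: the output is always zero
frn_pos = frn([2.5])             # 1x1 map: the sign of the response is kept
frn_neg = frn([-2.5])
```

IN discards the response entirely in this case, while FRN still passes on its sign (and, with a larger $\epsilon$, a soft version of its magnitude).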
A negative impact of not performing mean centering is that activations can have a bias arbitrarily far from zero, rendering the ReLU activation less than ideal. We mitigate this issue by introducing the Thresholded Linear Unit (TLU) in Section 3.1.1. Empirically, the combination of uncentered normalization with the TLU activation outperforms BN and all other alternatives.
3.5 Implementation
FRN is easy to implement in automatic differentiation frameworks. We provide an example implementation using the Python API for TensorFlow in the accompanying listing (lst:implementation).
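The TensorFlow listing referred to above is not reproduced in this text. As a dependency-free stand-in, the sketch below applies Eqs. (1) to (3) to a nested-list tensor with layout $[B,W,H,C]$. It is a forward-pass illustration only, not the authors' code: the function name `frn_layer`, the explicit loops, and the example parameter values are assumptions, and a practical implementation would use vectorized tensor operations and automatic differentiation.

```python
import math

def frn_layer(X, gamma, beta, tau, eps=1e-6):
    """FRN + affine + TLU on X with layout [B][W][H][C] (nested lists of floats).

    gamma, beta, tau are per-channel parameter lists of length C.
    """
    B, W, H, C = len(X), len(X[0]), len(X[0][0]), len(X[0][0][0])
    Z = [[[[0.0] * C for _ in range(H)] for _ in range(W)] for _ in range(B)]
    for b in range(B):
        for c in range(C):
            # nu^2: mean of squares over the spatial extent of one activation map
            nu2 = sum(X[b][w][h][c] ** 2 for w in range(W) for h in range(H)) / (W * H)
            scale = 1.0 / math.sqrt(nu2 + eps)
            for w in range(W):
                for h in range(H):
                    y = gamma[c] * X[b][w][h][c] * scale + beta[c]   # Eqs. (1) and (2)
                    Z[b][w][h][c] = max(y, tau[c])                   # Eq. (3)
    return Z

# Hypothetical input with B = 2, W = 2, H = 1, C = 2.
X = [[[[1.0, 4.0]], [[-1.0, 0.0]]],
     [[[2.0, -3.0]], [[2.0, 3.0]]]]
gamma, beta, tau = [1.0, 1.0], [0.0, 0.0], [0.0, -0.5]
Z = frn_layer(X, gamma, beta, tau)
Z_first_only = frn_layer([X[0]], gamma, beta, tau)   # same sample processed on its own
```

Since every statistic is computed from a single sample and a single channel, the output for a sample does not depend on which other samples share its batch.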
4 Experiments
We evaluate our method extensively on two tasks: 1) image classification on Imagenet, and 2) object detection on COCO. While image classification is the de facto standard for evaluation, object detection typically requires high resolution inputs and is particularly constrained by the large batch size requirements of BN. On Imagenet classification we show that our method outperforms other normalization methods on three different network architectures. Further, our method does this consistently at all batch sizes we experimented with. Finally, we validate the performance of our method on object detection, where it outperforms other normalization methods at all batch sizes as well.
4.1 Learning Rate Schedule
Since FRN does not do mean centering, we empirically found that certain architectures are more sensitive to the choice of initial learning rate. Setting a high initial rate causes large updates that lead to large activations in the early part of training and result in a slowdown in learning. This is due to the $\frac{1}{\sqrt{{\nu}^{2}+\epsilon}}$ factor in the gradient $\frac{\partial f}{\partial \bm{x}}$ (see Eq. (7)). This happens more often in architectures that employ several max pooling layers, like VGG-A. We address this by using a ramp-up in the learning rate that slowly increases the learning rate from 0 to the peak value during an initial warm-up phase. Since all our experiments use a cosine learning rate decay schedule, we use a cosine ramp-up schedule as well. Ramping up the learning rate in a warm-up phase is quite common and frequently used in training [resnets, resnetsv2, Imagenet2017].
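One possible reading of a cosine ramp-up followed by cosine decay is sketched below. The exact functional form and the warm-up length are not given in the text, so the half-cosine shape, the function name `learning_rate`, and the step counts in the example are assumptions.

```python
import math

def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Cosine ramp-up from 0 to peak_lr, then cosine decay from peak_lr to 0."""
    if step < warmup_steps:
        # Half-cosine ramp: 0 at step 0, peak_lr at the end of the warm-up phase.
        return peak_lr * 0.5 * (1.0 - math.cos(math.pi * step / warmup_steps))
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Hypothetical configuration: 300K total steps with an assumed 5K-step warm-up.
peak, warmup, total = 0.1, 5000, 300000
lr_start = learning_rate(0, peak, warmup, total)
lr_mid_warmup = learning_rate(warmup // 2, peak, warmup, total)
lr_peak = learning_rate(warmup, peak, warmup, total)
lr_end = learning_rate(total, peak, warmup, total)
```

The ramp keeps early updates small, which is the behaviour the warm-up phase is meant to provide while activations are still poorly scaled.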
4.2 ImageNet Classification
Dataset: We evaluate our method on the ImageNet classification dataset [imagenet] consisting of 1000 classes. We train on the $\sim 1.28$M training images and report results on the 50k validation images. For all models in this section, we resize the images to $299\times 299$ and use data augmentation from [szegedy2017inception] at training time.
Model architectures: We provide comparisons on three different model architectures: 1) ResnetV2-50 [resnetsv2]: Popular model with identity shortcuts, 2) InceptionV3 [szegedy2016rethinking]: High performing model without identity shortcuts and fully connected layers and, 3) VGG-A [vggnet]: Feed-forward model with a mix of convolutional and fully connected layers. For all models using GN we use a group size of 32. However, since VGG-A does not use a multiple of 32 filters in all layers, we increase the number of filters to the nearest multiple.
Training: We follow the training setup used by [resnets]. All models are trained using synchronous SGD across 8 GPUs for 300K steps. Gradients are computed by averaging across all GPUs. For BatchNorm, the normalization statistics are computed per GPU. This setup is common for multi-GPU training using synchronous SGD in TensorFlow and PyTorch. An initial learning rate of $0.1\times \mathtt{\text{batch\_size}}/256$ and a cosine decay schedule are used. We follow [resnets, resnetsv2] for other implementation details. Results are reported using two image classification metrics: 1) top-1 accuracy, which measures the accuracy using the highest scoring class (top-1 prediction), and 2) top-5 accuracy, which measures the accuracy using the top-5 scoring classes.
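The learning rate rule and the two reported metrics can be sketched as follows. This is an illustration under assumptions: `batch_size` is taken to be the total batch size across GPUs, and the function names and toy scores are hypothetical.

```python
def initial_learning_rate(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule from the text: 0.1 * batch_size / 256."""
    return base_lr * batch_size / base_batch

def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

lr_256 = initial_learning_rate(256)   # 8 GPUs x 32 images, assuming total batch size
lr_16 = initial_learning_rate(16)     # 8 GPUs x 2 images, assuming total batch size

# Toy 3-class scores for three examples (hypothetical values).
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```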
| Method | ResnetV2-50 top-1 | ResnetV2-50 top-5 | InceptionV3 top-1 | InceptionV3 top-5 |
|---|---|---|---|---|
| Batchnorm | 76.21 | 92.98 | 78.24 | 94.07 |
| BatchRenorm | 75.85 | 92.90 | 78.19 | 94.01 |
| Groupnorm | 75.67 | 92.70 | 78.14 | 93.98 |
| Layernorm | 72.75 | 91.19 | 76.75 | 93.37 |
| Instancenorm | 71.63 | 90.46 | 73.93 | 91.60 |
| FRN layer [Ours] | 77.21 | 93.57 | 78.95 | 94.49 |
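| Metric | Images per GPU $\rightarrow$ | 32 | 16 | 8 | 4 | 2 |
|---|---|---|---|---|---|---|
| Top-1 | Batchnorm | 76.21 | 75.55 | 74.04 | 71.96 | 65.09 |
| Top-1 | Renorm | 75.85 | 75.96 | 75.59 | 74.18 | 70.75 |
| Top-1 | Groupnorm | 75.67 | 75.77 | 76.14 | 76.02 | 76.20 |
| Top-1 | FRN layer [Ours] | 77.21 | 77.10 | 77.16 | 77.18 | 77.33 |
| Top-1 | $\Delta$ (FRN layer minus Groupnorm) | +1.54 | +1.33 | +1.02 | +1.16 | +1.13 |
| Top-5 | Batchnorm | 92.98 | 92.81 | 92.12 | 90.98 | 86.51 |
| Top-5 | Renorm | 92.90 | 92.98 | 92.80 | 92.10 | 89.81 |
| Top-5 | Groupnorm | 92.70 | 92.72 | 92.89 | 92.87 | 92.92 |
| Top-5 | FRN layer [Ours] | 93.62 | 93.59 | 93.60 | 93.49 | 93.61 |
| Top-5 | $\Delta$ (FRN layer minus Groupnorm) | +0.92 | +0.87 | +0.71 | +0.62 | +0.69 |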
Comparison with normalization methods: In Table 1 we compare our method with various normalization methods for the regular batch size of 32 images/GPU. This results in an effective batch size of $32\times 8=256$ and is the most favorable configuration for BN. This is the strongest baseline for image classification, and all the alternatives to BN have struggled in this setting, underperforming BN. Even for this large batch size, FRN outperforms all the methods, including BN, by a healthy margin on both architectures, indicating that batch-dependent training is not necessary for high performance. At this large batch size, the next best performing normalization schemes are BN and BatchRenorm, both of which are batch-normalized methods, followed by the sample-based normalization methods. Figure 4 compares the training and validation accuracy curves for various normalization methods using the ResnetV2-50 architecture. We observe that the FRN layer achieves both higher training and higher validation accuracy than BN, indicating that removing the stochastic batch dependence eases optimization and allows the model to train better. The generalization gap, i.e. the difference between training and validation accuracy, has also increased; however, the improved optimization results in a net performance gain on validation. In comparison, GN also achieves lower training error than BN but performs worse on validation.
Effect of small number of images per GPU: We study the impact of the minibatch size used for normalization (images/GPU) on the performance of various methods in Figure 1 and Table 2. All methods are trained with 8 GPUs with five different total batch sizes of 16, 32, 64, 128, and 256, divided into an equal number of images per GPU, leading to 2, 4, 8, 16, and 32 images/GPU. BN is known to degrade in performance when the batch size is small [batchrenorm, evalnorm], as is evident in Figure 1. GroupNorm (GN) exhibits more consistent performance, underperforming BN only at the largest batch size. Batch Renormalization outperforms GN at the two largest batch sizes but shows a degradation in performance for the smaller batch sizes. Our method, FRN, consistently outperforms all the normalization methods at all batch sizes.
Analyzing the effect of FRN and TLU: In Table 3 we perform a detailed ablation study of the effect of FRN and TLU. We combine them with various normalization methods, namely BatchNorm (BN), GroupNorm (GN), LayerNorm (LN) and InstanceNorm (IN), and train models for each combination for two high performing, but different, model architectures: ResnetV2-50 and InceptionV3. We either replace the ReLU activation with TLU, or modify the normalization technique to suppress mean centering and divide by the uncentered second moment instead of the variance (Eq. (1) instead of Eq. (8)). The corresponding normalizations are named with an FRN suffix in Table 3; for example, GN becomes GFRN, LN becomes LFRN, etc. For BN, we just replaced the activation function without changing the normalization technique, and we observe no significant difference in performance. We note, however, that IN benefits from the use of FRN (IN + ReLU vs. FRN + ReLU), resulting in a 3.61-point top-1 gain for ResnetV2-50. Adding TLU leads to another 1.97-point gain (FRN + TLU). Similar improvements are observed for InceptionV3. In fact, similar improvement trends can be seen for GN and LN as well. This experimental result suggests that both FRN and TLU are critical for the high performance of our method and provide complementary gains.
Method  ResnetV2-50 top-1  ResnetV2-50 top-5  InceptionV3 top-1  InceptionV3 top-5
BN + ReLU  76.21  92.98  78.24  94.07
BN + TLU $\dagger$  76.03  92.94  78.22  94.13
GN + ReLU  75.67  92.70  78.14  93.98
GN + TLU $\dagger$  76.59  93.16  78.50  94.18
GFRN + ReLU $\dagger$  75.93  92.65  78.16  94.03
GFRN + TLU $\dagger$  76.44  92.80  78.18  94.05
LN + ReLU  72.75  91.19  76.75  93.37
LN + TLU $\dagger$  73.99  91.60  77.21  93.48
LFRN + ReLU $\dagger$  75.03  92.50  77.62  93.65
LFRN + TLU $\dagger$  76.17  92.89  78.12  94.02
IN + ReLU  71.63  90.46  73.93  91.60
IN + TLU $\dagger$  71.72  90.53  74.81  92.01
FRN + ReLU $\dagger$  75.24  92.65  77.98  94.02
FRN + TLU [Ours]  77.21  93.57  78.95  94.49
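The two modifications compared in this ablation can be illustrated on a single activation map. The following is a minimal sketch, not the authors' implementation: it assumes FRN divides by the square root of the mean squared activation plus a small constant, omits the learned scale and bias, and uses hypothetical names (`instance_norm`, `frn`, `tlu`) and an assumed `eps` value.

```python
import math

def instance_norm(x, eps=1e-6):
    """Per-channel normalization with mean centering and variance (IN-style)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def frn(x, eps=1e-6):
    """No mean centering; divide by the root of the uncentered second moment."""
    n = len(x)
    nu2 = sum(v * v for v in x) / n
    return [v / math.sqrt(nu2 + eps) for v in x]

def relu(x):
    """Fixed threshold at zero."""
    return [max(v, 0.0) for v in x]

def tlu(x, tau):
    """Threshold at tau (a learned per-channel parameter in the paper)."""
    return [max(v, tau) for v in x]

# One flattened activation map of a single sample and channel (toy values).
channel = [1.0, 2.0, 3.0, 4.0]
in_out = instance_norm(channel)   # zero mean, unit variance
frn_out = frn(channel)            # unit second moment, mean not removed
print(in_out)
print(frn_out)
print(tlu(frn_out, tau=-0.5))
```

Because FRN does not remove the mean, its output can sit at an offset from zero, which is one plausible reading of why replacing the fixed zero threshold of ReLU with a learned threshold helps in the FRN rows of the table.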
Models with Fully Connected (FC) layers: FC layers are a pathological case for normalization methods, especially for per-sample methods (GN, LN, IN, FRN), since the number of activations to be normalized over is relatively small. As a result, normalization layers are typically not applied after FC layers. In this section we evaluate the effect of applying normalization after all the layers, irrespective of whether they are FC or convolutional layers. Note that FC layers are the most challenging scenario for FRN, since we are normalizing over a single activation ($N=1$). We report results for two architectures where the output of FC layers is normalized: 1) InceptionV3 in Table 1 and 2) VGG-A in Table 4. Note that while ResnetV2-50 also has an FC layer after the global pooling to produce logits, normalization is performed before pooling and is thus not relevant here. InceptionV3 has fully connected layers in an auxiliary logits branch, while VGG-A has them in the main network. FRN outperforms all other normalization methods on both architectures, even in this challenging scenario.
While training InceptionV3 and VGG-A, it was crucial to use learning rate ramp-up (see Section 4.1) and a learned $\epsilon$ (see Section 3.3) for FRN to achieve peak performance. Without ramp-up, FRN underperformed other methods on InceptionV3 and failed to learn entirely on VGG-A. Other methods were not significantly affected. We discovered that without the ramp-up phase, the output of max pooling layers grew to very large magnitudes in the first few steps. This saturates the normalized activations (see creftypecap 3) and prevents learning due to poor flow of gradients.
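The saturation effect can be made concrete for the single-activation case. As a sketch, assume FRN normalizes by $\sqrt{\nu^2+\epsilon}$, where $\nu^2$ is the mean of the squared activations (this form is assumed here from the description of dividing by the uncentered second moment). With $N=1$ we have $\nu^2 = x^2$, so

$$\hat{x} = \frac{x}{\sqrt{x^2+\epsilon}}, \qquad \frac{d\hat{x}}{dx} = \frac{1}{\sqrt{x^2+\epsilon}} - \frac{x^2}{(x^2+\epsilon)^{3/2}} = \frac{\epsilon}{(x^2+\epsilon)^{3/2}}.$$

When $|x| \gg \sqrt{\epsilon}$, $\hat{x}$ approaches $\mathrm{sign}(x)$ and the derivative decays roughly as $\epsilon/|x|^3$, which is consistent with the poor gradient flow observed when the inputs to the normalization grow very large, and with the benefit of a learned $\epsilon$.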
Interestingly, for VGG-A, BN performs worse than ‘No normalization’ at the default learning rate of 0.01. In Table 4 we also report results for models trained with a higher learning rate of 0.1. A ramp-up phase was useful for all the models at this learning rate. However, the ‘No normalization’ model eventually diverges, while BN shows instability in training (even with ramp-up) and performs significantly worse than other methods. In contrast, both FRN and GN benefit from training at the higher learning rate and yield improved performance, with FRN outperforming GN.
Method  Learning rate  top-1  top-5
No normalization  0.01  69.04  88.99 
Batchnorm  0.01  67.82  88.11 
Groupnorm  0.01  69.35  89.12 
FRN  0.01  70.04  89.42 
No normalization  0.1  Diverged  Diverged 
Batchnorm  0.1  62.61  84.56 
Groupnorm  0.1  69.94  89.57 
FRN  0.1  71.66  90.69 
Comparison of TLU with related variants: In Table 5 we compare TLU with three related variants for ResnetV2-50 on ImageNet. All four correspond to different combinations of having a scale $\kappa$ and a bias $\tau$ to compute the threshold. First, observe that TLU, despite having a less general form, outperforms the others. Second, all variants with a learnable threshold outperform BN, which does not benefit from it. We conclude that a learnable threshold is necessary for high performance in conjunction with FRN; however, it does not need to be input dependent. Interestingly, while two of the variants correspond to commonly known activations, ReLU and Parametric ReLU (PReLU) [he2015delving], the third, more general form, termed AffineTLU, is a combination of PReLU and TLU that outperforms the previous two and, to the best of our knowledge, has not been explored. Note that AffineTLU is different from Maxout [goodfellow2013maxout], which computes the maximum across groups of channels and, unlike AffineTLU, results in a reduced number of channels.
Method  top-1  top-5
BN + $\mathrm{max}(x,0)$ (ReLU)  76.21  92.98 
BN + $\mathrm{max}(x,\tau )$ (TLU)  76.03  92.94 
FRN + $\mathrm{max}(x,0)$ (ReLU)  75.24  92.65 
FRN + $\mathrm{max}(x,\kappa x)$ (PReLU) [he2015delving]  76.43  93.30
FRN + $\mathrm{max}(x,\kappa x+\tau )$ (AffineTLU)  76.71  93.32
FRN + $\mathrm{max}(x,\tau )$ (TLU)  77.21  93.57
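The four activations in this comparison are special cases of one expression, $\mathrm{max}(x,\kappa x+\tau)$. The sketch below shows that relationship on scalars; the function names and the example values of $\kappa$ and $\tau$ are illustrative assumptions, and in practice both would be learned per channel.

```python
def affine_tlu(x, kappa=0.0, tau=0.0):
    """General form: max(x, kappa * x + tau)."""
    return max(x, kappa * x + tau)

def relu(x):
    """kappa = 0, tau = 0: fixed threshold at zero."""
    return affine_tlu(x, 0.0, 0.0)

def prelu(x, kappa):
    """tau = 0: input-dependent threshold kappa * x."""
    return affine_tlu(x, kappa, 0.0)

def tlu(x, tau):
    """kappa = 0: learned threshold that does not depend on the input."""
    return affine_tlu(x, 0.0, tau)

# Compare the variants on a negative and a positive input.
for x in (-2.0, 3.0):
    print(x, relu(x), prelu(x, 0.25), tlu(x, -1.0), affine_tlu(x, 0.25, -1.0))
```

Setting $\kappa=0$ removes the input dependence of the threshold, which is the TLU case that performs best in the table.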
4.3 Object Detection on COCO
Next, we evaluate our method on the task of Object Detection (OD) and demonstrate that it consistently outperforms other normalization methods at all the batch sizes we evaluated. Since OD frameworks are typically trained with high-resolution inputs, they are limited to small minibatch sizes. This constraint makes OD an ideal evaluation benchmark for sample-based normalization methods that enable training with small batch sizes.
Experimental setup. We perform experiments on the COCO dataset [coco] with 80 object classes. We train using the train2017 set and evaluate on the 5k images in the val2017 (minival) split. We report the standard COCO evaluation metrics of mean average precision at different IoU thresholds, namely AP, AP$^{50}$ and AP$^{75}$ [coco].
Model: We use the RetinaNet [lin2017focal] object detection framework. RetinaNet is a unified single-stage detector that comprises three conceptual components: 1) a backbone network, with an off-the-shelf architecture, that acts as a convolutional feature extractor for a given high-resolution input image, 2) a convolutional object classification subnetwork that acts on the features extracted by the backbone network, and 3) a convolutional bounding box regression subnetwork. We use a ResnetV1-101 Feature Pyramid Network backbone [lin2017feature] and resize the input images to 1024$\times$1024.
Method  AP  AP  AP  AP$^{50}$  AP$^{50}$  AP$^{50}$  AP$^{75}$  AP$^{75}$  AP$^{75}$
imgs/GPU  8  4  2  8  4  2  8  4  2
BN$^{*}$  38.3  37.1  32.9  57.2  55.4  49.1  41.5  40.4  35.9
BN  38.7  37.9  30.2  56.6  55.2  44.5  42.1  41.4  32.5
GN  39.3  39.0  38.7  57.8  57.5  56.9  42.6  42.3  41.8
FRN  39.6  39.5  39.1  58.5  58.4  57.5  43.1  43.3  42.3
Training: To simplify experimentation and evaluation, we compare all methods only on models trained from scratch. We justify this choice based on the conclusion of [he2019rethinking] that, by training longer, models trained from scratch can catch up with models fine-tuned from pretrained weights. To ensure this, we start with a baseline fine-tuned model, trained by us at the largest batch size of 64, that achieves an AP of 38.3 in 25K training steps (BN$^{*}$, Table 6) and is close to the corresponding result of 39.1 reported in [lin2017focal]. Next, we empirically find the nearest multiple of 25K steps that achieves similar accuracy when training from scratch to be 125K steps (BN, Table 6). We set 125K as the base number of training steps for the largest batch size. We train our models using 8 GPUs and experiment with batch sizes in {64, 32, 16}, leading to {8, 4, 2} images per GPU respectively. For a smaller batch size $M$ we set the number of training steps to $125000\times 64/M$ and the learning rate to $\mathtt{base\_lr}\times M/64$. We report the best performance using $\mathtt{base\_lr}\in \{0.01,0.05,0.1\}$. All models are trained using a momentum of 0.9 and a weight decay of $4\times 10^{-4}$.
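The scaling rule for steps and learning rate can be tabulated directly. This is a small sketch of the arithmetic described above; the function and constant names are hypothetical, and `base_lr=0.1` is just one of the three values mentioned.

```python
BASE_STEPS = 125_000   # training steps at the largest batch size
BASE_BATCH = 64        # largest total batch size
NUM_GPUS = 8

def schedule(batch_size, base_lr):
    """Steps and learning rate for a total batch size M = batch_size."""
    steps = BASE_STEPS * BASE_BATCH // batch_size   # 125000 * 64 / M
    lr = base_lr * batch_size / BASE_BATCH          # base_lr * M / 64
    return {
        "images_per_gpu": batch_size // NUM_GPUS,
        "steps": steps,
        "lr": lr,
    }

for m in (64, 32, 16):
    print(m, schedule(m, base_lr=0.1))
```

Halving the batch size doubles the number of steps and halves the learning rate, so every configuration processes the same total number of images.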
Comparison of normalization methods: In Table 6 we observe that FRN outperforms both BN and GN at all batch sizes, further validating our results in the previous section. In agreement with the observations from creftypecap 2, both FRN and GN achieve higher accuracy than BN at the evaluated batch sizes. FRN outperforms BN by 0.9 AP points at the largest batch size, and this gap widens to 8.9 AP points at the smallest batch size. Further, FRN consistently achieves higher accuracy than GN.
Effect of batch size: BN exhibits a dramatic degradation in performance, dropping by 8.5 AP points for the model trained from scratch as the number of images per GPU is reduced from 8 to 2. In comparison, both FRN and GN show relatively stable accuracy and degrade by at most 0.6 AP points. Interestingly, the fine-tuned BN$^{*}$ model for the smallest batch size performs 2.7 AP points better than the corresponding BN model trained from scratch, indicating that longer training at this batch size is detrimental to the performance of BN. In contrast, FRN maintains a consistent lead for all the metrics across all batch sizes.
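The AP differences quoted in the two paragraphs above follow from the AP columns of the COCO results table. The short check below recomputes them from the rounded table values; the dictionary layout and helper names are illustrative only.

```python
# AP values copied from the COCO results table, keyed by images per GPU.
ap = {
    "BN*": {8: 38.3, 4: 37.1, 2: 32.9},
    "BN":  {8: 38.7, 4: 37.9, 2: 30.2},
    "GN":  {8: 39.3, 4: 39.0, 2: 38.7},
    "FRN": {8: 39.6, 4: 39.5, 2: 39.1},
}

def drop(method):
    """AP lost when going from 8 to 2 images per GPU."""
    return round(ap[method][8] - ap[method][2], 1)

def gap(a, b, imgs):
    """AP lead of method a over method b at a given images-per-GPU setting."""
    return round(ap[a][imgs] - ap[b][imgs], 1)

print("drop:", {m: drop(m) for m in ("BN", "GN", "FRN")})
print("FRN over BN:", gap("FRN", "BN", 8), gap("FRN", "BN", 2))
print("BN* over BN at 2 imgs/GPU:", gap("BN*", "BN", 2))
```

Differences are rounded to one decimal because the table itself reports AP to one decimal place.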
5 Conclusion
In this paper we proposed the FRN layer, a novel combination of Filter Response Normalization (FRN) and a thresholded activation function (TLU) that eliminates the need for batch-dependent training. It outperforms BN in a variety of settings and exhibits consistently high performance in both large and small batch training. Further, FRN also outperforms Group Normalization, a leading sample-based normalization alternative to BN, in all the explored settings. We also demonstrated the success of FRN even in the pathological case of fully connected layers, which are typically not normalized. However, since different normalization methods have been successful in different problem domains (e.g. Layer Normalization in NLP), we leave the exploration of FRN in these areas as future work.
Acknowledgement. We would like to thank Vivek Rathod for help with object detection experiments.