Video Crowd Counting via Dynamic Temporal Modeling
Abstract
Crowd counting aims to estimate the number of people present in a crowded space at a given instant, and it plays an increasingly important role in public safety. Many promising solutions have been proposed for crowd counting in still images. As the applications of crowd counting continue to expand, applying the technique to video content has become an urgent problem. Although researchers have collected and labeled some video clips, little attention has been paid to the spatiotemporal characteristics of videos. To address this problem, this paper proposes a novel framework based on dynamic temporal modeling of the relationship between video frames. We model the relationship between adjacent features by constructing a set of dilated residual blocks for the crowd counting task, with each stage having a set of dilated temporal convolutions that generates an initial prediction, which is then refined by the next stage. We extract features from the density maps because we find that adjacent density maps share more similar information than the original video frames. We also propose a smaller basic network structure to balance computational cost against good feature representation. We conduct experiments with the proposed framework on five crowd counting datasets and demonstrate its superiority in effectiveness and efficiency over previous approaches.
I Introduction
The rapid development of surveillance devices has led to an explosive growth of images and videos, which creates a demand for analyzing visual content. In addition to object recognition, crowd counting, which focuses on estimating the number of people in a still image or a video clip, has received increasing interest in recent years. Many researchers have explored crowd counting in still images [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], while limited effort has been devoted to videos. Nevertheless, crowd counting in videos has many real-world applications, such as video surveillance, traffic monitoring, and emergency management.
Despite the attention that the crowd counting problem has received, it remains a challenging task. Challenges arise from the non-uniform distribution of people, complex illumination, camera distortion, and occlusion. Although many studies have proposed multi-column or multi-branch networks that learn more contextual information and achieve excellent performance [4, 7, 13], these methods ignore the temporal relations between nearby frames, even though crowd counting data are often collected from surveillance videos. Furthermore, regression-based and CNN-based frameworks have explored various models to generate density maps, while the strong correlation between neighboring density maps and video frames is also overlooked.
To cope with these difficulties, we propose a novel framework that takes advantage of the temporal information extracted from consecutive frames; the architecture is illustrated in Fig. 1. We model the relationship between adjacent features by constructing a multi-stage architecture, with each stage having a set of dilated temporal convolutions that generates an initial prediction, which is then refined by the next stage. We also introduce the density map as another branch of our architecture. The density map reports the distribution of people and can be regarded as an attention map. As we observe, adjacent video frames may have different visual content due to background changes and occlusion, while adjacent density maps are more similar to each other. The contextual information between consecutive frames and density maps benefits the estimate for the current frame. Comprehensive experiments on public datasets show the improvement obtained with the help of temporal and contextual information.
The main contributions of this work are summarized as follows.

• We propose a novel yet lightweight architecture combining both spatial and temporal features for crowd counting in videos, by dynamic temporal modeling of consecutive video frames.
• Our framework also utilizes information from density maps to boost accuracy. Neighboring density maps share more similar features than neighboring video frames.
• Extensive experiments and evaluations on benchmark datasets demonstrate the superior performance of our proposed method. Notably, we achieve state-of-the-art results on the video datasets compared with existing video-based methods. Further, our network achieves a crowd counting speed of 25 FPS on a moderate commercial CPU.
The rest of this paper is organized as follows. Section II introduces background of crowd counting in images and videos. Section III discusses the model design, network architecture and training process in detail. In Section IV, we demonstrate the qualitative and quantitative study of the proposed framework. We conclude our work in Section V.
II Related Work
II-A Crowd Counting in Still Images
Over the past few years, researchers have attempted to solve crowd counting in images using a variety of approaches. Early works focused on detection methods that recognize specific body parts or the full body using handcrafted features [15, 16]. However, detection-based methods have difficulty dealing with dense crowds because of occlusion. Some studies learned a mapping function from features to the number of people [17]. Furthermore, Lempitsky et al. [18] proposed local features for the density map to exploit spatial information. However, handcrafted features are not sufficient when facing clutter and low-resolution images.
Recently, convolutional neural networks (CNNs) have shown great success in computer vision. Inspired by this performance, many researchers have explored CNN-based methods for crowd counting. Zhang et al. [4] proposed a multi-column CNN with filters of different sizes to deal with variations in density. Similarly, an architecture with two parallel pathways of different receptive fields was introduced in [13] and achieved good results on benchmark datasets. The work in [5] used image pyramids to extract density signatures at multiple scales. Sam et al. [19] and Sindagi et al. [6] achieved remarkable results with multi-subnet structures. Li et al. [7] used dilated kernels to provide larger receptive fields and to replace pooling operations, further improving accuracy. To address the problem of limited training data, Kang et al. [20] introduced side information such as camera angle and height to boost network performance. Also motivated by limited training data, Liu et al. [9] augmented the data by collecting scene images from Google through keyword searches and query-by-example image retrieval, and then applied a learning-to-rank method. Shi et al. [8] argued that adapting previous methods to crowd counting from a single image is still in its infancy. They proposed the DConvNet structure, which can be trained end to end and is independent of the backbone fully convolutional network. Sam et al. [10] proposed a framework named TDFCNN, which uses top-down feedback to correct the initial prediction of a CNN that is very limited in detecting the spatial background of people. These methods are all designed for crowd counting in images, so treating videos as image sequences would ignore the important temporal information in videos.
II-B Crowd Counting in Videos
Fewer studies address crowd counting in videos than in still images. Brostow et al. [21] and Chan et al. [22] proposed to use Bayesian functions to detect individuals from motion information. Rodriguez et al. [23] further proposed optimizing an energy function that combines crowd density estimation and individual tracking. Chen et al. [24] proposed an error-driven iteration framework that aims to cope with noisy input videos.
Although these methods based on motion or handcrafted features showed satisfactory performance on pedestrian or football datasets, they still lack generalization ability when applied to extremely dense crowds. More recently, Xiong et al. [25] proposed the ConvLSTM framework to capture both spatial and temporal dependencies. This CNN-based method demonstrated its effectiveness on benchmark crowd counting datasets such as UCF_CC_50 [26] and UCSD [17]. However, due to the limited video training data and the variety of scenes, it is usually difficult to train complex and deep networks for effective crowd counting. In this paper, we propose a novel framework that considers both temporal information and density maps. Even with a lightweight network architecture, our method achieves promising results on multiple datasets, with the help of the auxiliary information extracted from temporal dependencies and density maps.
III Framework
In this section, we introduce dynamic temporal modeling with a convolutional network for crowd counting in videos. We describe the basic network in Section III-A and the architecture of dynamic temporal modeling in Section III-B. The implementation details are described in Section III-C.
III-A Basic Network
The basic network in our framework is a convolutional neural network for crowd counting in a still image or a single video frame. As mentioned in the previous section, both networks with multiple subnets and single-branch networks have been employed. Since we focus on video crowd counting in this paper, inference speed is an important issue, and our goal is to obtain competitive results with an architecture that is as small as possible. We therefore prefer a single-branch network with few parameters. We design a lightweight convolutional neural network (LCNN), whose overall structure is illustrated in Fig. 2. We avoid sophisticated components: the network consists of convolutional blocks with a kernel size of 3 and max-pooling layers, which favors network acceleration. The network has an end-to-end architecture that is easy to train. In our preliminary experiments, we find that using more convolutional layers with small kernels is more efficient than using fewer layers with larger kernels for crowd counting, which is consistent with observations from recent research on image recognition [27]. Max pooling is applied to each $2\times 2$ region, and the rectified linear unit (ReLU) is adopted as the activation function for its good performance in CNNs. To reduce the computational complexity, we limit the number of filters in each layer. Finally, we adopt filters with a size of 1 $\times $ 1 to generate the feature vector. As will be shown in the experiments, our model achieves results competitive with the state of the art with a small number of parameters. The overall network has about 0.03M parameters, and the experiments will show that it runs in real time on a CPU.
The loss function of LCNN is defined as
$$L(\mathrm{\Theta})=\frac{1}{2N}\sum _{i=1}^{N}{\left\|f({x}_{i},\mathrm{\Theta})-{F}_{i}\right\|}_{2}^{2},$$  (1) 
where $N$ is the number of training images, and ${F}_{i}$ is the ground truth density map of image ${x}_{i}$, and $f({x}_{i},\mathrm{\Theta})$ is the estimated density map parameterized with $\mathrm{\Theta}$ for ${x}_{i}$.
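As a minimal sketch of Eq. (1), the snippet below computes the loss in plain Python for density maps stored as nested lists. The function name `density_loss` and the toy maps are illustrative assumptions and are not taken from the authors' implementation.

```python
def density_loss(predicted_maps, ground_truth_maps):
    """Eq. (1): sum of squared L2 distances between maps, divided by 2N."""
    n = len(predicted_maps)
    total = 0.0
    for pred, truth in zip(predicted_maps, ground_truth_maps):
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                total += (p - t) ** 2
    return total / (2 * n)


# Toy example with two 2x2 density maps (hypothetical values).
pred = [[[0.5, 0.0], [0.0, 0.5]], [[1.0, 0.0], [0.0, 0.0]]]
truth = [[[0.5, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
loss = density_loss(pred, truth)
```

In a real training loop the same quantity would normally be computed on tensors by the deep learning framework; the loop form is used here only to make each term of the equation visible.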
III-B Dynamic Temporal Modeling
The choice of temporal modeling approach is important to the success of a video crowd counting system. Ideally, we want to capture both long-term and short-term frame correlations so that we can count accurately under any scene setting. However, video processing is time-consuming, and the training video data for crowd counting are limited. With these in mind, we design the dynamic temporal LCNN (DTLCNN) with dilated convolutions to fully utilize the context and content information of the video; the architecture is shown in Figure 1.
Formally, let $X=({x}_{1},\mathrm{\dots},{x}_{T})$ be a video with $T$ frames. Each frame ${x}_{i}$ goes through the LCNN to produce the corresponding density map $f({x}_{i})$, which is then transformed into a one-dimensional vector ${v}_{i}$. Vectors from several neighboring video frames are concatenated as the input of the first dilated block.
There are several alternative ways to model context with dilated convolutions, such as the dilated temporal convolution [28], dilation with dense connections [29], and the dilated residual unit [30]. In this paper, we employ the dilated residual layer design [30] for its computational efficiency.
Let ${\mathbf{w}}_{1,i}$ and ${b}_{1,i}$ be the filter weights and bias associated with the $i$th dilated residual layer, and let ${\mathbf{v}}_{i}$ be its input. The output at location $l$ after the 1D dilated convolution is defined as
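The gathering of neighboring frame vectors can be sketched as follows. This is an illustration only: the helper name `temporal_windows`, the window centred on the current frame, and the clamping of indices at the start and end of the video are assumptions, since the text does not specify how boundary frames are handled.

```python
def temporal_windows(vectors, num_frames=5):
    """Concatenate the vectors of num_frames neighbouring frames around each frame.

    vectors: list of per-frame 1D vectors (flattened density maps).
    Indices outside the video are clamped to the first/last frame (assumption).
    """
    if num_frames % 2 == 0:
        raise ValueError("num_frames must be odd so the window has a centre")
    half = num_frames // 2
    last = len(vectors) - 1
    windows = []
    for t in range(len(vectors)):
        window = []
        for offset in range(-half, half + 1):
            idx = min(max(t + offset, 0), last)
            window.extend(vectors[idx])
        windows.append(window)
    return windows


# Four frames, each with a one-element vector (hypothetical values).
windows = temporal_windows([[0], [1], [2], [3]], num_frames=3)
```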
$${\widehat{\mathbf{v}}}_{i}[l]=\sum _{\mathrm{\Delta}l\in {\mathcal{R}}_{d}}{\mathbf{w}}_{1,i}[\mathrm{\Delta}l]\cdot {\mathbf{v}}_{i}[l+\mathrm{\Delta}l]+{b}_{1,i},$$  (2) 
where ${\mathcal{R}}_{d}=\{-d,0,d\}$ defines the offsets of a 1D filter with kernel size 3 and dilation $d={2}^{i-1}$. The output of the whole dilated residual layer is
$${\mathbf{v}}_{i+1}={\mathbf{v}}_{i}+{\mathbf{w}}_{2,i}\cdot ReLU({\widehat{\mathbf{v}}}_{i})+{b}_{2,i},$$  (3) 
where ${\mathbf{v}}_{i+1}$ is the output of layer $i$, and ${\mathbf{w}}_{2,i}$ and ${b}_{2,i}$ are the weights and bias of the dilated convolution filters. A dilated residual block consists of three dilated residual layers, and we use this architecture to provide more context for the prediction at each frame. Furthermore, our model can capture dependencies between the current frame and the other frames of the video sequence, which helps smooth the prediction errors within the same sequence.
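To make Eqs. (2) and (3) concrete, the following single-channel sketch applies one dilated residual layer to a 1D sequence and stacks three layers with dilations 1, 2, and 4. The function names, the scalar form of ${\mathbf{w}}_{2,i}$, and the zero padding at the sequence boundaries are assumptions made for illustration; the paper does not state the padding scheme.

```python
def dilated_residual_layer(v, w1, b1, w2, b2, d):
    """Apply Eq. (2) followed by Eq. (3) to a 1D sequence v with dilation d.

    w1 holds the three taps for offsets (-d, 0, +d). w2 and b2 are scalars
    because this sketch uses a single channel. Out-of-range positions are
    treated as zero (assumed padding).
    """
    n = len(v)
    out = []
    for l in range(n):
        acc = b1
        for tap, offset in zip(w1, (-d, 0, d)):
            idx = l + offset
            if 0 <= idx < n:
                acc += tap * v[idx]
        relu = max(acc, 0.0)                # ReLU on the dilated response
        out.append(v[l] + w2 * relu + b2)   # residual connection of Eq. (3)
    return out


def dilated_residual_block(v, layer_params):
    """Stack layers i = 1, 2, ... with dilation d = 2**(i - 1)."""
    for i, (w1, b1, w2, b2) in enumerate(layer_params, start=1):
        v = dilated_residual_layer(v, w1, b1, w2, b2, d=2 ** (i - 1))
    return v


def receptive_field(num_layers, kernel_size=3):
    """Number of input positions that can influence one output position."""
    return 1 + sum((kernel_size - 1) * 2 ** (i - 1)
                   for i in range(1, num_layers + 1))


single = dilated_residual_layer([1.0, 2.0, 3.0, 4.0], (1.0, 1.0, 1.0), 0.0, 1.0, 0.0, d=1)
```

With three layers, `receptive_field(3)` evaluates to 15, which shows how doubling the dilation lets a block of only three small filters cover a comparatively long temporal span.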
In order to learn the parameters within the block, we use the loss function with two terms. The first is the MSE loss defined as
$${\mathcal{L}}_{mse}=\frac{1}{N}\sum _{i=1}^{N}{\left|{C}_{p}-{C}_{gt}\right|}^{2},$$  (4) 
where $N$ is the total number of video frames, ${C}_{p}$ is the predicted count, and ${C}_{gt}$ is the ground-truth count.
While the MSE loss already performs well, we observe that the predictions for some of the videos contain a few over-segmentation errors. To further improve the quality of the predictions, we use an additional smoothing loss to reduce such over-segmentation. Here a Smooth-${L}_{1}$ loss is employed:
$$  (5) 
The block loss function for a dilated residual block is a combination of these losses:
$${\mathcal{L}}_{block}={\mathcal{L}}_{mse}+\lambda {\mathcal{L}}_{SL1},$$  (6) 
where $\lambda $ is a model hyperparameter to determine the contribution of the different terms. Several blocks will be applied in the DTLCNN framework, and the loss function is the sum of ${\mathcal{L}}_{block}$ in each block.
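The combination in Eq. (6) can be sketched as below. The Smooth-$L_1$ term here uses the commonly used definition ($0.5x^{2}$ for $|x|<1$, and $|x|-0.5$ otherwise) applied to the difference between predicted and ground-truth counts; this is an assumption, as is the value of $\lambda$ in the example, and the exact form used by the authors may differ.

```python
def mse_loss(pred_counts, gt_counts):
    """Eq. (4): mean squared difference between predicted and true counts."""
    n = len(pred_counts)
    return sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / n


def smooth_l1(x):
    """Standard Smooth-L1 penalty (assumed form, see lead-in)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def smooth_l1_loss(pred_counts, gt_counts):
    n = len(pred_counts)
    return sum(smooth_l1(p - g) for p, g in zip(pred_counts, gt_counts)) / n


def block_loss(pred_counts, gt_counts, lam):
    """Eq. (6): MSE term plus lambda times the Smooth-L1 term."""
    return mse_loss(pred_counts, gt_counts) + lam * smooth_l1_loss(pred_counts, gt_counts)


def total_loss(per_block_preds, gt_counts, lam):
    """Sum of the block losses over all dilated residual blocks."""
    return sum(block_loss(preds, gt_counts, lam) for preds in per_block_preds)


# Hypothetical counts for two frames and a hypothetical lambda of 0.5.
example = block_loss([10.5, 22.0], [10.0, 20.0], lam=0.5)
```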
To exploit the context information more effectively, we normalize the output of the dynamic model to obtain a set of weight vectors. To preserve the context of the original video frames, we feed the consecutive video frames into the network again and weight them with the gain derived from the network output. The weight gain is obtained as follows,
$$W{O}_{j}=\frac{\sum _{i=1}^{m}{V}_{ij}}{\sum _{i=1}^{m}\sum _{j=1}^{n}{V}_{ij}}$$  (7) 
where ${V}_{ij}$ is the final output vector after the dynamic modeling, $m\times n$ is the vector size, and $W{O}_{j}$ is the information gain corresponding to the original video frame. We represent the information of the $n$ frames before and after the current frame in the original continuous video as $Fr=\{{F}_{t-n},\mathrm{\dots},{F}_{t},\mathrm{\dots},{F}_{t+n}\}$. The final output is computed by
$$count=Fr\times WO.$$  (8) 
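One possible reading of Eqs. (7) and (8) is sketched below. It assumes that $V$ has one column per frame in the temporal window, that $Fr$ holds one scalar count per frame, and that the product in Eq. (8) is a dot product; these interpretations, and the function names, are assumptions rather than details confirmed by the text.

```python
def weight_gain(V):
    """Eq. (7): column sums of V divided by the sum of all entries of V."""
    total = sum(sum(row) for row in V)
    if total == 0:
        raise ValueError("V must have a non-zero sum")
    num_cols = len(V[0])
    return [sum(row[j] for row in V) / total for j in range(num_cols)]


def weighted_count(frame_counts, weights):
    """Eq. (8), read as a dot product between per-frame values and weights."""
    if len(frame_counts) != len(weights):
        raise ValueError("one weight is required per frame")
    return sum(f * w for f, w in zip(frame_counts, weights))


# Hypothetical 2x3 output matrix and per-frame counts for a 3-frame window.
weights = weight_gain([[1.0, 2.0, 1.0], [1.0, 2.0, 1.0]])
count = weighted_count([20.0, 24.0, 28.0], weights)
```

Because the weights are normalized to sum to one, the result is a weighted average of the per-frame values in the window.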
III-C Implementation Details
Ground Truth Generation. There is significant diversity among different crowd counting datasets (see Figure 3). Thus, we use geometry-adaptive kernels to generate density maps from the ground truth. The geometry-adaptive kernels are defined as
$$F(x)=\sum _{i=1}^{{N}_{t}}\delta (x-{o}_{i})\times {G}_{{\sigma}_{i}}(x)$$  (9) 
Given object ${o}_{i}$ in the target set $\{{o}_{1},{o}_{2},\mathrm{\dots},{o}_{{N}_{t}}\}$, we use its $k$ nearest neighbors to determine ${d}_{i}$. For the pixel position $i$ in the image, we use a Gaussian kernel with parameter ${\sigma}_{i}=\beta {\overline{d}}_{i}$ to generate the density map $F(x)$.
In our experiments, we create density maps with a fixed kernel of 17 for the UCSD dataset and 15 for the others. Following previous work [31], we also use the Region of Interest (ROI) and the perspective map to create density maps for the WorldExpo’10 dataset.
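The geometry-adaptive construction of Eq. (9) can be sketched in plain Python as follows. The values `k=3` and `beta=0.3`, the fallback spread for an isolated annotation, and the normalization of each kernel over the image grid are assumptions chosen for illustration; the paper does not report these settings.

```python
import math


def mean_knn_distance(points, index, k):
    """Average distance from points[index] to its k nearest neighbours."""
    px, py = points[index]
    dists = sorted(math.hypot(px - qx, py - qy)
                   for j, (qx, qy) in enumerate(points) if j != index)
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else 0.0


def density_map(points, height, width, k=3, beta=0.3, fallback_sigma=4.0):
    """Place one normalized Gaussian per annotated head position (x, y)."""
    dmap = [[0.0] * width for _ in range(height)]
    for i, (px, py) in enumerate(points):
        sigma = beta * mean_knn_distance(points, i, k)
        if sigma <= 0.0:
            sigma = fallback_sigma  # single annotation: no neighbour distance
        kernel = [[math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        norm = sum(map(sum, kernel))  # each person contributes exactly 1
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / norm
    return dmap


# Three hypothetical head annotations in a 24x32 image.
heads = [(5, 5), (10, 8), (20, 12)]
dmap = density_map(heads, height=24, width=32)
estimated_count = sum(map(sum, dmap))
```

Since every kernel is normalized, summing the density map recovers the number of annotated people, which is what makes the map usable as a counting target.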
Data Augmentation. We choose the data augmentation according to the characteristics of the data. For the image datasets, where the number of samples is insufficient, we follow the augmentation method in [7]. Nine color patches are cut from each image at different positions, each with a size of $\frac{1}{4}$ of the original image. The first four patches contain the four quarters of the image without overlapping, while the other five patches are randomly cropped from the input image. After that, we mirror the patches to double the training set. We do not apply any data augmentation to the video datasets, as we would like our model to exploit more context information from the video frames.
Training Details. Our dynamic temporal model is implemented using PyTorch [32]. To train the LCNN, we first initialize the layers of the network from a Gaussian distribution with a standard deviation of 0.01 and then train the model separately on each dataset. We set the learning rate to ${10}^{-5}$ for all the datasets and use Adam [33] for training. For the training of DTLCNN, we also use the Adam optimizer, with a learning rate of 0.0005.
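The patch-based augmentation for image datasets can be sketched as below, with images represented as nested lists. The sketch reads the first four patches as the four non-overlapping quadrants of the image and draws the five random crops uniformly; both points, together with the helper names, are assumptions for illustration.

```python
import random


def crop(image, top, left, height, width):
    """Return the height x width patch whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def mirror(patch):
    """Flip a patch horizontally."""
    return [row[::-1] for row in patch]


def augment(image, rng):
    """Nine quarter-size patches (4 quadrants + 5 random) and their mirrors."""
    h, w = len(image), len(image[0])
    ph, pw = h // 2, w // 2  # half height and half width give 1/4 of the area
    patches = [crop(image, top, left, ph, pw)
               for top in (0, ph) for left in (0, pw)]
    for _ in range(5):
        top = rng.randint(0, h - ph)
        left = rng.randint(0, w - pw)
        patches.append(crop(image, top, left, ph, pw))
    return patches + [mirror(p) for p in patches]


# Toy 4x6 "image" whose pixel values encode their position.
image = [[r * 6 + c for c in range(6)] for r in range(4)]
samples = augment(image, random.Random(0))
```

Each input image therefore yields 18 training samples. In practice the ground-truth density map would be cropped and mirrored with the same coordinates so that image and target stay aligned.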
IV Experiments
Table I: Statistics of the crowd counting datasets.
| Dataset | Type | Resolution | Color | Num. | Max. | Min. | Avg. | Total |
|---|---|---|---|---|---|---|---|---|
| ShanghaiTech Part A | Image | Varied | RGB | 482 | 3139 | 33 | 501 | 241677 |
| ShanghaiTech Part B | Image | 768 $\times $ 1024 | RGB | 716 | 578 | 9 | 123 | 88488 |
| UCF_CC_50 | Image | Varied | Gray | 50 | 4543 | 94 | 1279 | 63974 |
| UCSD | Video | 158 $\times $ 238 | Gray | 2000 | 46 | 11 | 24.9 | 49885 |
| Mall | Video | 640 $\times $ 480 | RGB | 2000 | 53 | 11 | 31.2 | 62315 |
| WorldExpo | Video | 576 $\times $ 720 | RGB | 3980 | 253 | 1 | 50.2 | 199923 |

Num.: the number of images/video frames; Max. and Min.: the maximum and minimum numbers of people in the ROI of an image; Avg.: the average pedestrian count; Total: the total number of labeled pedestrians.
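Table II: Comparison of LCNN with previous methods on ShanghaiTech Part A and Part B (left) and UCF_CC_50 (right). A dash marks a value that is not reported.
| Method | Year | ShanghaiTech A MAE | ShanghaiTech A MSE | ShanghaiTech B MAE | ShanghaiTech B MSE | UCF MAE | UCF MSE | Params (M) | Pretrained Model |
|---|---|---|---|---|---|---|---|---|---|
| SwitchingCNN [19] | 2017 | 90.4 | 135.0 | 21.6 | 33.4 | 318.1 | 439.2 | 15.30 | VGG16 |
| CSRNet [7] | 2018 | 68.2 | 115.0 | 10.6 | 16 | 266.1 | 397.5 | 16.26 | VGG16 |
| L2R [9] | 2018 | 72.0 | 106.6 | 14.4 | 23.8 | 291.5 | 397.6 | 16.75 | VGG16 |
| ASD [13] | 2019 | 65.6 | 98.0 | 8.5 | 13.7 | 196.2 | 270.9 | 16.26 | VGG16 |
| DRSAN [34] | 2018 | 69.3 | 96.4 | 11.1 | 18.2 | 219.2 | 250.2 | 24.10 | - |
| SANet [35] | 2018 | 67.0 | 104.5 | 8.4 | 13.6 | 258.4 | 334.9 | 0.91 | - |
| icCNN [36] | 2018 | 68.5 | 116.2 | 10.7 | 16.0 | 260.9 | 365.5 | 16.82 | - |
| ACSCP [37] | 2018 | 75.7 | 102.7 | 17.2 | 27.4 | 291.0 | 404.6 | 5.10 | - |
| CPCNN [6] | 2017 | 73.6 | 106.4 | 20.1 | 30.1 | 298.8 | 320.9 | 68.40 | - |
| IGCNN [38] | 2018 | 72.5 | 118.2 | 13.6 | 21.1 | 291.4 | 349.4 | 4.70 | - |
| DConvNet [8] | 2018 | 73.5 | 112.3 | 18.7 | 26.0 | 288.4 | 404.7 | 16.62 | - |
| MCNN [4] | 2016 | 110.2 | 173.2 | 26.4 | 41.3 | 377.6 | 509.1 | 0.13 | - |
| HydraCNN [5] | 2016 | - | - | - | - | 333.7 | 425.2 | 0.56 | - |
| BSAD [39] | 2018 | - | - | 20.2 | 35.6 | 409.5 | 563.7 | 1.30 | - |
| TDFCNN [10] | 2018 | 97.5 | 145.1 | 20.7 | 32.8 | 354.7 | 491.4 | 1.15 | - |
| LCNN | - | 93.3 | 157.0 | 15.1 | 23.3 | 262.0 | 358.6 | 0.032 | - |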
We evaluate the proposed framework on five challenging benchmarks, i.e., ShanghaiTech [4], UCF_CC_50 [26], Mall [40], UCSD [17], and WorldExpo’10 [31]. Some statistics of these datasets are summarized in Table I. Since the ShanghaiTech and UCF_CC_50 datasets contain no time-related information, we focus on the basic network LCNN and image-level analysis for them. We evaluate the dynamic temporal modeling on the Mall, UCSD, and WorldExpo’10 datasets.
Following existing state-of-the-art methods, we use the mean absolute error (MAE) and mean squared error (MSE) to evaluate performance on the test sets, which are defined as
$$MAE=\frac{1}{N}\sum _{i=1}^{N}\left|{C}_{i}-{C}_{i}^{GT}\right|,$$  (10) 
$$MSE=\sqrt{\frac{1}{N}\sum _{i=1}^{N}{\left|{C}_{i}-{C}_{i}^{GT}\right|}^{2}}.$$  (11) 
Here $N$ is the number of test images, and ${C}_{i}$ and ${C}_{i}^{GT}$ are the estimated and ground-truth people counts in the $i$th image, respectively. We also report the number of network parameters (Params) for comparison.
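For reference, the two metrics of Eqs. (10) and (11) can be computed as in the sketch below. The function names and the sample counts are illustrative assumptions. Note that the MSE defined in Eq. (11) includes a square root, so it corresponds to what is often called the root mean squared error.

```python
import math


def mae(pred_counts, gt_counts):
    """Eq. (10): mean absolute error between estimated and true counts."""
    n = len(pred_counts)
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / n


def mse(pred_counts, gt_counts):
    """Eq. (11): square root of the mean squared count error."""
    n = len(pred_counts)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / n)


# Hypothetical counts for three test images.
predicted = [10, 20, 33]
ground_truth = [12, 20, 30]
mae_value = mae(predicted, ground_truth)
mse_value = mse(predicted, ground_truth)
```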
IV-A Results on Still Images
We first evaluate the performance of LCNN and compare it with several state-of-the-art approaches.
ShanghaiTech Dataset. Table II (left) summarizes the MAE and MSE on both parts of the ShanghaiTech dataset. We compare LCNN with several baselines and state-of-the-art approaches. Among them, the first group contains the state-of-the-art methods with pretrained models [19, 7, 9, 13] or more complex network designs [35, 36, 34, 37, 6, 38, 8]. Our results are comparable with these approaches, while the parameter size of LCNN is at least an order of magnitude smaller than that of all these methods. The second group contains several networks with compact structures, including MCNN [41], HydraCNN [5], BSAD [39], and TDFCNN [10]. From the table we see that LCNN achieves the lowest MAE in this group on both parts and the lowest MSE on Part B, while TDFCNN attains a lower MSE on Part A. Fig. 4(a) and (b) illustrate some crowd images, their predicted density maps, and the counting results using LCNN.
UCF_CC_50 Dataset. We also study the performance of LCNN on UCF_CC_50 against both the state-of-the-art and the compact approaches. Results are also given in Table II. Similar to the experiments on ShanghaiTech, LCNN shows better results than the other four approaches with compact networks. We also notice that SANet [35] has a small parameter size, obtained by using the Inception unit. We believe that LCNN may be complementary to such a structure; however, SANet still has roughly 30 times as many parameters as our model. Fig. 4(c) shows sample crowd images and their predicted results with LCNN on UCF_CC_50.
IV-B Results on Videos
There are a few parameters in DTLCNN, including the number of video frames for dynamic temporal modeling and dilated residual blocks. In this set of experiments, we use 5 video frames for the temporal modeling and 3 blocks as the default setting. The effect of these parameters will be evaluated in the next subsection.
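Table III: Results on the Mall (left) and UCSD (right) datasets. A dash marks a value that is not reported.
| Method | MALL MAE | MALL MSE | UCSD MAE | UCSD MSE |
|---|---|---|---|---|
| Gaussian process regression [17] | 3.72 | 20.1 | - | - |
| Ridge regression [40] | 3.59 | 19.0 | - | - |
| Cumulative attribute regression [42] | 3.43 | 17.7 | - | - |
| ConvLSTMnt [25] | 2.53 | 11.2 | 1.73 | 3.52 |
| ConvLSTM [25] | 2.24 | 8.5 | 1.30 | 1.79 |
| Bidirectional ConvLSTM [25] | 2.10 | 7.6 | 1.13 | 1.43 |
| DTLCNN | 2.03 | 2.6 | 1.08 | 1.41 |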
Mall Dataset. We now report results on the Mall dataset, as summarized in Table III (left). The experiments follow the same setting as [40], which uses the first 800 frames for training and the remaining 1,200 frames for testing. We compare DTLCNN with methods that also make use of spatial-temporal information, including the regression-based methods [17, 40, 42] and the LSTM-based methods [25]. As shown in the table, the proposed dynamic temporal modeling leads to an MAE of 2.03 and an MSE of 2.6, both lower than those of all baseline approaches, with a particularly large reduction in MSE. We list some predicted density maps as well as their corresponding counting results with DTLCNN in Fig. 5.
UCSD Dataset. Following the convention of existing works [17], we use frames 601 to 1400 as the training data and the remaining 1200 frames as the test data. We generate ground truth density maps with a fixed-spread Gaussian kernel. As the region of interest (ROI) and the perspective map are provided, the intensities of pixels outside the ROI are set to zero, and we also use the ROI to revise the last convolution layer. Results on the UCSD dataset are presented in Table III (right). Again, DTLCNN shows better results than the LSTM-based crowd counting approaches. Some counting results with DTLCNN on sample snippets are shown in Fig. 6.
WorldExpo’10 Dataset. The WorldExpo’10 dataset [31] consists of 3980 annotated frames from 1132 video sequences captured by 108 different surveillance cameras during the Shanghai WorldExpo in 2010. The training set includes 3,380 annotated frames from 103 scenes, while the test images are extracted from the other five scenes, with 120 frames per scene. Table IV lists the per-scene performance of DTLCNN and previous approaches. Here we also compare with two groups of approaches. The first contains methods with state-of-the-art performance [36, 8, 7, 37], and the second contains the temporal modeling approaches. Our results are comparable with the state of the art on four scenes (all except Scene 2), while the compact size and speed of our model may make it more suitable for inference. On average, DTLCNN is also clearly better than the LSTM-based methods (8.3 versus 10.6 to 11.9). The qualitative results on different scenes are illustrated in Fig. 7.
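Table IV: Per-scene results (S1 to S5) on the WorldExpo’10 dataset. A dash marks a value that is not reported.
| Method | S1 | S2 | S3 | S4 | S5 | Avg. | Params (M) |
|---|---|---|---|---|---|---|---|
| icCNN [36] | 17.0 | 12.3 | 9.2 | 8.1 | 4.7 | 10.3 | 16.82 |
| DConvNet [8] | 1.9 | 12.1 | 20.7 | 8.3 | 2.6 | 9.1 | 16.26 |
| CSRNet [7] | 2.9 | 11.5 | 8.6 | 16.6 | 3.4 | 8.6 | 16.26 |
| ACSCP [37] | 2.8 | 14.1 | 9.6 | 8.1 | 2.9 | 7.5 | 5.10 |
| ConvLSTMnt [25] | 8.6 | 16.9 | 14.6 | 15.4 | 4.0 | 11.9 | - |
| ConvLSTM [25] | 7.1 | 15.2 | 15.2 | 13.9 | 3.5 | 10.9 | - |
| BiConvLSTM [25] | 6.8 | 14.5 | 14.9 | 13.5 | 3.1 | 10.6 | - |
| DTLCNN | 2.8 | 18.1 | 9.6 | 7.5 | 3.6 | 8.3 | 0.047 |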
IV-C Ablation Study
In this section, we evaluate some parameters and alternative implementations of the proposed framework.
Number of Video Frames for Dynamic Modeling. We compare the performance of our framework with a varying number of video frames for dynamic modeling, as shown in Fig. 8(a). One intuitive way to add temporal information is to smooth the density maps or counts of neighboring frames. However, in some scenarios (such as WorldExpo’10), the MAE value is lower than using only single frames. We observe significant performance gains when the number of video frames considered increases from three to five. Using more than five frames does not improve performance further.
Number of Dilated Residual Blocks. We also evaluate the effect of the number of dilated residual blocks in the DTLCNN model. As shown in Fig. 8(b), the best trade-off is obtained with three dilated residual blocks. Compared with a single block, more blocks can boost performance. However, when the number gets larger, performance decreases in some cases. This is probably because complex neural networks tend to overfit when the amount of training data is limited.
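Table V: Comparison of temporal modeling approaches built on LCNN.
| Method | Dataset | MAE | MSE |
|---|---|---|---|
| LCNN + LSTM | UCSD | 1.21 | 1.69 |
| LCNN + LSTM | MALL | 2.23 | 3.80 |
| LCNN + BILSTM | UCSD | 1.11 | 1.48 |
| LCNN + BILSTM | MALL | 2.09 | 3.07 |
| DTLCNN | UCSD | 1.08 | 1.41 |
| DTLCNN | MALL | 2.03 | 2.60 |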
Temporal Modeling. We compare our dynamic temporal modeling approach with previous LSTM-based approaches by combining LCNN with them. As shown in Table V, the results of DTLCNN are better than those of LCNN with LSTM or bidirectional LSTM.
Timing. Recall that our goal is to build a compact model for effective crowd counting in videos, based on the proposed lightweight network. LCNN and DTLCNN have 0.032M and 0.047M parameters, respectively. For a video with a resolution of $320\times 240$ pixels, the DTLCNN model achieves a speed of 120 FPS on an Nvidia GTX TITAN X GPU and consumes less than 500 MB of GPU memory during inference. Our approach achieves real-time (25 FPS) crowd counting on a moderate Intel Core i5 desktop CPU.
V Conclusions
We propose DTLCNN, a new dynamic temporal modeling system built on the LCNN unit for crowd counting in videos. Its highlights are twofold: (1) a novel lightweight architecture that achieves good performance with a compact network, and (2) explicit modeling of temporal information from both crowd images and the predicted density maps. We show that leveraging the contextual information of the video content yields promising results for crowd counting. The runtime speed is 25 FPS on a moderate commercial CPU. For future work, we plan to deploy the proposed framework on edge computing devices to support rapid decision making in real-world scenarios.
References
 [1] T. Li, H. Chang, M. Wang, B. Ni, R. Hong, and S. Yan, “Crowded scene analysis: A survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 3, pp. 367–386, 2014.
 [2] B. Sheng, C. Shen, G. Lin, J. Li, W. Yang, and C. Sun, “Crowd counting via weighted vlad on a dense attribute feature map,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 8, pp. 1788–1797, 2016.
 [3] R. Stewart, M. Andriluka, and A. Y. Ng, “End-to-end people detection in crowded scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2325–2333.
 [4] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 589–597.
 [5] D. Oñoro-Rubio and R. J. López-Sastre, “Towards perspective-free object counting with deep learning,” in European Conference on Computer Vision (ECCV), 2016, pp. 615–629.
 [6] V. A. Sindagi and V. M. Patel, “Generating high-quality crowd density maps using contextual pyramid cnns,” in International Conference on Computer Vision (ICCV), 2017, pp. 1879–1888.
 [7] Y. Li, X. Zhang, and D. Chen, “CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1091–1100.
 [8] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M. Cheng, and G. Zheng, “Crowd counting with deep negative correlation learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5382–5390.
 [9] X. Liu, J. van de W., and A. D. Bagdanov, “Leveraging unlabeled data for crowd counting by learning to rank,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7661–7669.
 [10] D. Sam and R. V. Babu, “Top-down feedback for crowd counting convolutional neural network,” in AAAI Conference on Artificial Intelligence (AAAI), 2018.
 [11] D. Kang, Z. Ma, and A. B. Chan, “Beyond counting: Comparisons of density maps for crowd analysis tasks - counting, detection, and tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 5, pp. 1408–1422, 2018.
 [12] H. Zheng, Z. Lin, J. Cen, Z. Wu, and Y. Zhao, “Cross-line pedestrian counting based on spatially-consistent two-stage local crowd density estimation and accumulation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 3, pp. 787–799, 2018.
 [13] X. Wu, Y. Zheng, H. Ye, W. Hu, J. Yang, and L. He, “Adaptive scenario discovery for crowd counting,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 2382–2386.
 [14] J. Gao, Q. Wang, and X. Li, “PCC Net: Perspective crowd counting via spatial convolutional network,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
 [15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893.
 [16] M. Li, Z. Zhang, K. Huang, and T. Tan, “Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection,” in International Conference on Pattern Recognition (ICPR), 2008.
 [17] A. B. Chan, Z. J. Liang, and N. Vasconcelos, “Privacy preserving crowd monitoring: Counting people without people models or tracking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
 [18] V. Lempitsky and A. Zisserman, “Learning to count objects in images,” in Advances in Neural Information Processing Systems (NeurIPS), 2010, pp. 1324–1332.
 [19] D. B. Sam, S. Surya, and R. V. Babu, “Switching convolutional neural network for crowd counting,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5744–5752.
 [20] D. Kang, D. Dhar, and A. B. Chan, “Crowd counting by adapting convolutional neural networks with side information,” arXiv:1611.06748, 2016.
 [21] G. J. Brostow and R. Cipolla, “Unsupervised Bayesian detection of independent motion in crowds,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 594–601.
 [22] A. B. Chan and N. Vasconcelos, “Counting people with low-level features and Bayesian regression,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 2160–2177, 2012.
 [23] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert, “Density-aware person detection and tracking in crowds,” in International Conference on Computer Vision (ICCV), 2011, pp. 2423–2430.
 [24] S. Chen, A. Fern, and S. Todorovic, “Person count localization in videos from noisy foreground and detections,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1364–1372.
 [25] F. Xiong, X. Shi, and D.-Y. Yeung, “Spatiotemporal modeling for crowd counting in videos,” in International Conference on Computer Vision (ICCV), 2017, pp. 5151–5159.
 [26] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale counting in extremely dense crowd images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2547–2554.
 [27] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
 [28] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 156–165.
 [29] B. Xu, H. Ye, Y. Zheng, H. Wang, T. Luwang, and Y.-G. Jiang, “Dense dilated network for video action recognition,” IEEE Transactions on Image Processing, 2019.
 [30] Y. A. Farha and J. Gall, “MS-TCN: Multi-stage temporal convolutional network for action segmentation,” arXiv:1903.01945, 2019.
 [31] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 833–841.
 [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS Workshop, 2017.
 [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
 [34] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin, “Crowd counting using deep recurrent spatial-aware network,” in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 849–855.
 [35] X. Cao, Z. Wang, Y. Zhao, and F. Su, “Scale aggregation network for accurate and efficient crowd counting,” in European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
 [36] D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, and M. Srinivasan, “Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3618–3626.
 [37] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, “Crowd counting via adversarial cross-scale consistency pursuit,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5245–5254.
 [38] V. Ranjan, H. Le, and M. Hoai, “Iterative crowd counting,” in European Conference on Computer Vision (ECCV), 2018, pp. 270–285.
 [39] S. Huang, X. Li, Z. Zhang, F. Wu, S. Gao, R. Ji, and J. Han, “Body structure aware deep crowd counting,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1049–1059, 2018.
 [40] K. Chen, C. Loy, S. Gong, and T. Xiang, “Feature mining for localised crowd counting,” in British Machine Vision Conference (BMVC), vol. 1, no. 2, 2012, p. 3.
 [41] L. Zeng, X. Xu, B. Cai, S. Qiu, and T. Zhang, “Multi-scale convolutional neural networks for crowd counting,” in International Conference on Image Processing (ICIP), 2017, pp. 465–469.
 [42] K. Chen, S. Gong, T. Xiang, and C. Change Loy, “Cumulative attribute space for age and crowd density estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2467–2474.