Video Crowd Counting via Dynamic Temporal Modeling

  • 2019-07-04 03:07:22
  • Xingjiao Wu, Baohan Xu, Yingbin Zheng, Hao Ye, Jing Yang, Liang He

Abstract

Crowd counting aims to count the number of instantaneous people in a crowded space, which plays an increasingly important role in the field of public safety. More and more researchers have already proposed many promising solutions to the crowd counting task on the image. With the continuous extension of the application of crowd counting, how to apply the technique to video content has become an urgent problem. At present, although researchers have collected and labeled some video clips, less attention has been drawn to the spatiotemporal characteristics of videos. In order to solve this problem, this paper proposes a novel framework based on dynamic temporal modeling of the relationship between video frames. We model the relationship between adjacent features by constructing a set of dilated residual blocks for crowd counting task, with each phase having an expanded set of time convolutions to generate an initial prediction which is then improved by the next prediction. We extract features from the density map as we find the adjacent density maps share more similar information than original video frames. We also propose a smaller basic network structure to balance the computational cost with a good feature representation. We conduct experiments using the proposed framework on five crowd counting datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.

 


Xingjiao Wu, Baohan Xu, Yingbin Zheng, Hao Ye, Jing Yang, Liang He

X. Wu, J. Yang, and L. He are with East China Normal University, Shanghai 200062, China (e-mail: [email protected]; [email protected]; [email protected]).
B. Xu is with Jilian Technology Group (Video++), Shanghai 200023, China (e-mail: [email protected]).
Y. Zheng and H. Ye are with Videt Tech Ltd., Shanghai 201203, China (e-mail: [email protected]; [email protected]).

Crowd counting, video analysis, dynamic temporal modeling, spatiotemporal information.

I Introduction

The rapid development of surveillance devices has led to explosive growth in images and videos, which creates a demand for analyzing visual content. In addition to object recognition, crowd counting, which focuses on estimating the number of people in a still image or a video clip, has received increasing interest in recent years. Many researchers have explored crowd counting on still images [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], while limited effort has been devoted to videos. Nevertheless, crowd counting in videos has many real-world applications, such as video surveillance, traffic monitoring, and emergency management.

Despite the attention that the crowd counting problem has received, it remains a challenging task. Challenges arise from the non-uniform distribution of people, complex illumination, camera distortion, and occlusion. Although many studies have proposed multi-column or multi-branch networks to learn more contextual information and achieve excellent performance [4, 7, 13], these methods may ignore the temporal relations between nearby frames, even though crowd counting data are often collected from surveillance videos. Furthermore, regression-based and CNN-based frameworks have explored various models to generate a density map, while the strong correlation between neighboring density maps and video frames is also overlooked.

To cope with these difficulties, we employ a novel framework that takes advantage of the temporal information in consecutive frames; the architecture is illustrated in Fig. 1. We model the relationship between adjacent features by constructing a multi-stage architecture of the kind used for temporal segmentation tasks, with each stage having a set of dilated temporal convolutions to generate an initial prediction that is then refined by the next stage. We also introduce the density map as another branch of our architecture. The density map reports the distribution of people and can be regarded as an attention map. We observe that adjacent video frames may have different visual content due to background and occlusion, while adjacent density maps are more similar to each other. The contextual information between consecutive frames and density maps benefits the count estimate for the current frame. Comprehensive experiments on public datasets show the improvement brought by temporal and contextual information.

The main contributions of this work are summarized as follows.

  • We propose a novel but lightweight architecture combining both spatial and temporal features for crowd counting in videos, by dynamic temporal modeling of consecutive video frames.

  • Our framework also utilizes information from density maps to boost accuracy, since neighboring density maps share more similar features than neighboring video frames.

  • Extensive experiments and evaluations on benchmark datasets demonstrate the superior performance of our proposed method. Notably, we achieve state-of-the-art results on the video datasets compared with existing video-based methods. Further, our network achieves a crowd counting speed of 25 FPS on a moderate commercial CPU.

The rest of this paper is organized as follows. Section II introduces background of crowd counting in images and videos. Section III discusses the model design, network architecture and training process in detail. In Section IV, we demonstrate the qualitative and quantitative study of the proposed framework. We conclude our work in Section V.

Fig. 1: The Architecture of dynamic temporal modeling.

II Related Work

II-A Crowd Counting in Still Images

Over the past few years, researchers have attempted to solve crowd counting in images using a variety of approaches. Early works focused on detection methods that recognize specific body parts or the full body using hand-crafted features [15, 16]. However, detection-based methods have difficulty with dense crowds because of occlusion. Some studies instead learned a mapping function from features to the number of people [17]. Furthermore, Lempitsky et al. [18] proposed local features for the density map to exploit spatial information. However, hand-crafted features are not sufficient when images are cluttered or of low resolution.

Recently, convolutional neural networks have shown great success in computer vision. Inspired by this performance, many researchers have explored CNN-based methods for crowd counting. Zhang et al. [4] proposed a multi-column CNN with filters of different sizes to deal with variations in density. Similarly, an architecture with two parallel pathways of different receptive fields was introduced in [13] and achieved good results on benchmark datasets. The method in [5] used image pyramids to extract density signatures at multiple scales. Sam et al. [19] and Sindagi et al. [6] achieved remarkable results with multi-subnet structures. Li et al. [7] used dilated kernels to provide larger receptive fields and to replace pooling operations, further improving accuracy. To address the problem of limited training data, Kang et al. [20] introduced side information such as camera angle and height to boost network performance. Also motivated by limited training data, Liu et al. [9] augmented the data by collecting scene images from Google using keyword searches and query-by-example image retrieval, and then applied a learning-to-rank method. Shi et al. [8] argued that adapting previous methods to crowd counting from a single image is still in its infancy. They proposed the D-ConvNet structure, which can be trained end-to-end and is independent of the backbone fully convolutional network. Sam et al. [10] proposed a framework named TDF-CNN with top-down feedback to correct the initial prediction of the CNN, which is very limited in detecting the spatial background of people. These methods are all designed for image crowd counting; treating videos as image sequences would ignore the important temporal information in videos.

Fig. 2: The structure of the proposed lightweight convolutional neural network for crowd counting.

II-B Crowd Counting in Videos

Fewer researchers have studied crowd counting in videos than in still images. Brostow et al. [21] and Chan et al. [22] proposed Bayesian approaches that detect individuals using motion information. Rodriguez et al. [23] further proposed the optimization of an energy function combining crowd density estimation and individual tracking. Chen et al. [24] proposed an error-driven iteration framework to cope with noisy input videos.

Although these methods based on motion or hand-crafted features showed satisfactory performance on pedestrian or football datasets, they still lack generalization ability when applied to extremely dense crowds. More recently, Xiong et al. [25] proposed the ConvLSTM framework to capture both spatial and temporal dependencies. This CNN-based method demonstrated its effectiveness on benchmark crowd counting datasets such as UCF_CC_50 [26] and UCSD [17]. However, due to the limited video training data and the variety of scenes, it is usually difficult to train complex and deeper networks for effective crowd counting. In this paper, we propose a novel framework that considers both temporal information and density maps. Even with a lightweight network architecture, our method achieves promising results on multiple datasets with the help of the auxiliary information extracted from temporal dependencies and density maps.

III Framework

In this section, we will introduce the dynamic temporal modeling with the convolutional network for the crowd counting task in the video. We describe the basic network in Section III-A and architecture of dynamic temporal modeling in Section III-B. The implementation details will be described in Section III-C.

III-A Basic Network

The basic network in our framework is a convolutional neural network for crowd counting on a still image or a single video frame. As mentioned in the previous section, both networks with multiple subnets and single-branch networks have been employed. Since we focus on video crowd counting in this paper, inference speed is an important issue, and our goal is to use an architecture that is small enough while still achieving competitive results. A single-branch network with few parameters is therefore preferred. We design a lightweight convolutional neural network (LCNN), whose overall structure is illustrated in Fig. 2. We avoid sophisticated components: the network consists of convolutional blocks with a kernel size of 3 and max-pooling layers, which is good for network acceleration. The network has an end-to-end architecture that is easy to train. In our preliminary experiments, we find that using more convolutional layers with small kernels is more efficient than using fewer layers with larger kernels for crowd counting, which is consistent with observations from recent research on image recognition [27]. Max pooling is applied to each 2×2 region, and the rectified linear unit (ReLU) is adopted as the activation function for its good performance in CNNs. To reduce the computational complexity, we limit the number of filters in each layer. Finally, we adopt filters with a size of 1 × 1 to generate the feature vector. As will be shown in the experiments, our model achieves competitive results with a very small number of parameters. The overall network parameter size is 0.03M, and the experiments will show that it runs in real time on a CPU.

The loss function of LCNN is defined as

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| f(x_i, \Theta) - F_i \right\|_2^2, \tag{1}$$

where $N$ is the number of training images, $F_i$ is the ground truth density map of image $x_i$, and $f(x_i, \Theta)$ is the estimated density map for $x_i$, parameterized by $\Theta$.
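The loss in Eq. (1) can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' PyTorch code; the function name `lcnn_loss` and the toy values are hypothetical, and density maps are represented as nested lists for simplicity.

```python
def lcnn_loss(predicted_maps, ground_truth_maps):
    """Eq. (1): 1/(2N) * sum_i ||f(x_i) - F_i||_2^2 over N density maps.

    Each map is a 2-D list of floats; both arguments hold N maps of
    matching shapes.
    """
    n = len(predicted_maps)
    total = 0.0
    for pred, truth in zip(predicted_maps, ground_truth_maps):
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                total += (p - t) ** 2  # squared L2 norm, summed pixel-wise
    return total / (2 * n)


# Toy data (hypothetical values): two 2x2 density maps.
pred = [[[0.0, 1.0], [0.5, 0.5]], [[1.0, 1.0], [0.0, 0.0]]]
truth = [[[0.0, 0.0], [0.5, 0.5]], [[1.0, 0.0], [0.0, 0.0]]]
print(lcnn_loss(pred, truth))
```

Each map contributes one squared pixel error of 1 in this toy case, so the summed error is 2 and the loss is 2 / (2 · 2) = 0.5.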

Fig. 3: Ground-truth density map for different datasets.

III-B Dynamic Temporal Modeling

The selection of a temporal modeling approach is important to the success of the video crowd counting system. Ideally, we want a comprehensive collection of both long-term and short-term frame correlations so that we can have accurate counting under any scene setting. However, video processing is time-consuming and the training video dataset for crowd counting is also limited. With these in mind, we design the dynamic temporal LCNN (DT-LCNN) with the dilated convolution to fully utilize the context and content information of the video, and the architecture is shown in Figure 1.

Formally, let $X = (x_1, \dots, x_T)$ be a video with $T$ frames. Each frame $x_i$ goes through the LCNN to produce the corresponding density map $f(x_i)$, which is then transformed into a one-dimensional vector $v_i$. Vectors from several neighboring video frames are concatenated as the input of the first dilated block.
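One way to picture the grouping of neighboring frame vectors is a sliding window over the sequence. The sketch below is an illustration only: the helper name `neighbor_windows` is hypothetical, and the border handling (repeating the first or last vector) is an assumption, since the text does not state how sequence boundaries are treated.

```python
def neighbor_windows(frame_vectors, num_frames=5):
    """Concatenate each frame's vector with those of its temporal neighbours.

    frame_vectors: list of T equal-length lists (one flattened density
    map per frame). num_frames must be odd. Border frames reuse the
    first/last vector (an assumption of this sketch).
    """
    if num_frames % 2 == 0:
        raise ValueError("num_frames must be odd")
    half = num_frames // 2
    last = len(frame_vectors) - 1
    windows = []
    for t in range(len(frame_vectors)):
        window = []
        for offset in range(-half, half + 1):
            idx = min(max(t + offset, 0), last)  # clamp at the borders
            window.extend(frame_vectors[idx])
        windows.append(window)
    return windows
```

With `num_frames=5`, each output row holds the vectors of two frames before and two frames after the current one, matching the default of five frames used in the video experiments.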

There are a few alternative choices for modeling context with dilated convolution, such as dilated temporal convolution [28], dilation with dense connections [29], and the dilated residual unit [30]. In this paper, we employ the design of the dilated residual layer [30] for its computational efficiency.

Let $\mathbf{w}_{1,i}$ and $b_{1,i}$ be the filter weights and bias associated with the $i$-th dilated residual layer and $\mathbf{v}_i$ be its input. The output at location $l$ after the 1D dilated convolution is defined as

$$\hat{\mathbf{v}}_i[l] = \sum_{\Delta l \in \mathcal{D}} \mathbf{w}_{1,i}[\Delta l]\, \mathbf{v}_i[l + \Delta l] + b_{1,i}, \tag{2}$$

where $\mathcal{D} = \{-d, 0, d\}$ constructs 1D filters with a kernel size of 3 and $d = 2^{i-1}$. The output of the whole dilated residual layer is

$$\mathbf{v}_{i+1} = \mathbf{v}_i + \mathbf{w}_{2,i}\, \mathrm{ReLU}(\hat{\mathbf{v}}_i) + b_{2,i}, \tag{3}$$

where $\mathbf{v}_{i+1}$ is the output of layer $i$, and $\mathbf{w}_{2,i}$ and $b_{2,i}$ are the weights and bias of the dilated convolution filters. A dilated residual block consists of three dilated residual layers, and we use this architecture to provide more context for predicting the result at each frame. Furthermore, our model can capture dependencies between the current frame and other frames in the video sequence, which helps smooth the prediction errors within the same sequence.
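To make Eqs. (2) and (3) concrete, the following is a minimal single-channel sketch in pure Python. It is not the authors' implementation: the function names are hypothetical, the second filter is reduced to a scalar weight, the dilation is taken to double with each layer (1, 2, 4 for three layers), and positions outside the sequence are treated as zero (zero padding), which the text does not specify.

```python
def dilated_residual_layer(v, w1, b1, w2, b2, layer_index):
    """One dilated residual layer on a single-channel 1-D sequence.

    v: list of floats; w1: three weights for offsets (-d, 0, +d);
    b1, w2, b2: scalars. The dilation is d = 2 ** (layer_index - 1).
    Out-of-range positions count as zero (an assumption of this sketch).
    """
    d = 2 ** (layer_index - 1)
    n = len(v)
    out = []
    for l in range(n):
        # Eq. (2): dilated convolution with kernel size 3.
        acc = b1
        for weight, offset in zip(w1, (-d, 0, d)):
            pos = l + offset
            if 0 <= pos < n:
                acc += weight * v[pos]
        # Eq. (3): ReLU, second filter, and the residual connection.
        out.append(v[l] + w2 * max(acc, 0.0) + b2)
    return out


def dilated_residual_block(v, params):
    """Stack layers i = 1, 2, 3, ... so that the dilations are 1, 2, 4, ..."""
    for i, (w1, b1, w2, b2) in enumerate(params, start=1):
        v = dilated_residual_layer(v, w1, b1, w2, b2, layer_index=i)
    return v
```

Because of the residual connection, a layer whose second filter is zero returns its input unchanged, so each layer only has to learn a correction to the incoming per-frame features.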

In order to learn the parameters within the block, we use the loss function with two terms. The first is the MSE loss defined as

$$\mathcal{L}_{mse} = \frac{1}{N} \sum_{i=1}^{N} \left| C_p - C_{gt} \right|^2, \tag{4}$$

where $N$ is the total number of video frames, $C_p$ is the predicted count, and $C_{gt}$ is the ground-truth count.

While the MSE loss already performs well, we observe that the predictions for some of the videos contain a few over-segmentation errors. To further improve the quality of the predictions, we use an additional smoothing loss to reduce such errors. Here a Smooth-L1 loss is employed:

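$$\mathcal{L}_{SL1} = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} \frac{1}{2}(x_i - y_i)^2 & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - \frac{1}{2} & \text{otherwise} \end{cases} \tag{5}$$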

The block loss function for a dilated residual block is a combination of these losses:

$$\mathcal{L}_{block} = \mathcal{L}_{mse} + \lambda \mathcal{L}_{SL1}, \tag{6}$$

where $\lambda$ is a model hyper-parameter that determines the contribution of the different terms. Several blocks are applied in the DT-LCNN framework, and the overall loss function is the sum of $\mathcal{L}_{block}$ over all blocks.
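The combined block loss of Eqs. (4) to (6) can be sketched as follows. This is a hedged pure-Python illustration with hypothetical function names; it assumes that the Smooth-L1 inputs are the predicted and ground-truth counts per frame, and it leaves `lam` as a required argument because the text does not report the value of the hyper-parameter.

```python
def mse_loss(pred_counts, gt_counts):
    """Eq. (4): mean squared difference between predicted and true counts."""
    n = len(pred_counts)
    return sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / n


def smooth_l1_loss(xs, ys):
    """Eq. (5): quadratic for small errors, linear for large ones."""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        diff = abs(x - y)
        total += 0.5 * diff ** 2 if diff < 1 else diff - 0.5
    return total / n


def block_loss(pred_counts, gt_counts, lam):
    """Eq. (6): MSE plus lam times Smooth-L1 (inputs assumed to be counts)."""
    return mse_loss(pred_counts, gt_counts) + lam * smooth_l1_loss(pred_counts, gt_counts)


def total_loss(per_block_predictions, gt_counts, lam):
    """Sum of the block losses over all dilated residual blocks."""
    return sum(block_loss(pred, gt_counts, lam) for pred in per_block_predictions)
```

The Smooth-L1 term grows only linearly for errors of one person or more, so a single badly predicted frame influences the gradient less than it would under the squared error alone.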

To utilize the context information gain more effectively, we normalize the output of the dynamic model and obtain a set of weight vectors. To keep the context of the original video frames, we feed the consecutive video frames into the network again and weight them with the gain derived from the network output. The weight gain is obtained as follows,

$$WO_j = \frac{\sum_{i=1}^{m} V_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} V_{ij}}, \tag{7}$$

where $V_{ij}$ is the final output vector after the dynamic modeling, $m \times n$ is the vector size, and $WO_j$ is the information gain corresponding to the original video frame. We represent the information of the $n$ frames before and after the original frame of the continuous video as $F_r = \{F_{t-n}, \dots, F_t, \dots, F_{t+n}\}$. The final output is computed by

$$\mathrm{count} = F_r \times WO. \tag{8}$$
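A small numeric sketch may help with Eqs. (7) and (8). It is an interpretation rather than the authors' code: the function names are hypothetical, and the product in Eq. (8) is read as a dot product between one value per frame and one weight per frame, which assumes the two have the same length.

```python
def weight_gain(V):
    """Eq. (7): column sums of V divided by the sum of all entries."""
    total = sum(sum(row) for row in V)
    if total == 0:
        raise ValueError("V must have a non-zero sum")
    n_cols = len(V[0])
    return [sum(row[j] for row in V) / total for j in range(n_cols)]


def weighted_count(frame_values, weights):
    """Eq. (8), read as a dot product (an assumption of this sketch)."""
    if len(frame_values) != len(weights):
        raise ValueError("one weight per frame is required")
    return sum(f * w for f, w in zip(frame_values, weights))


# Toy example (hypothetical values): three frames, m = 2 rows in V.
V = [[1.0, 2.0, 1.0], [1.0, 2.0, 1.0]]
weights = weight_gain(V)
print(weights, weighted_count([20.0, 24.0, 28.0], weights))
```

By construction the weights sum to one, so the final count is a convex combination of the per-frame values whenever all entries of V are non-negative.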

III-C Implementation Details

Ground Truth Generation. There is significant diversity among different crowd counting datasets (see Figure 3). Thus, we use geometry-adaptive kernels to generate density maps from the ground truth. The geometry-adaptive kernels are defined as

$$F(x) = \sum_{i=1}^{N_t} \delta(x - o_i) \times G_{\sigma_i}(x). \tag{9}$$

Given object $o_i$ in the target set $\{o_1, o_2, \dots, o_{N_t}\}$, we calculate its $k$ nearest neighbors to determine $d_i$. For the pixel position $i$ in the image, we use a Gaussian kernel with parameter $\sigma_i = \beta \bar{d}_i$ to generate the density map $F(x)$.

In our experiments, we create density maps with a fixed kernel of 17 for the UCSD dataset and 15 for the others. Following previous work [31], we also use the Region of Interest (ROI) and the perspective map to create density maps for the WorldExpo’10 dataset.
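The geometry-adaptive density map of Eq. (9) can be sketched in pure Python as below. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the spread is taken as the mean distance to the k nearest annotated heads (the usual formulation in [4]), the defaults `beta=0.3` and `k=3` are placeholder values that this paper does not report, and each Gaussian is renormalized over the image so that the map sums to the head count.

```python
import math


def mean_knn_distance(points, index, k=3):
    """Mean distance from points[index] to its k nearest other points."""
    px, py = points[index]
    dists = sorted(math.hypot(px - qx, py - qy)
                   for j, (qx, qy) in enumerate(points) if j != index)
    nearest = dists[:k]
    if not nearest:
        raise ValueError("at least two annotated points are required")
    return sum(nearest) / len(nearest)


def density_map(points, height, width, beta=0.3, k=3):
    """Sum of per-head Gaussians with sigma_i = beta * mean kNN distance.

    points: list of distinct integer (x, y) head positions inside the
    image. beta and k are assumed values, not taken from the paper.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for i, (cx, cy) in enumerate(points):
        sigma = beta * mean_knn_distance(points, i, k)
        kernel = [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        norm = sum(map(sum, kernel))  # renormalise so each head adds exactly 1
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / norm
    return dmap
```

Integrating (summing) the resulting map recovers the number of annotated people, which is what lets a network trained on density maps be used for counting.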

Data Augmentation. We design the data augmentation according to the nature of each dataset. The image datasets suffer from an insufficient number of samples, so we follow the data augmentation method in [7]. Nine patches are cropped from each image at different positions, each 1/4 the size of the original image. The first four patches contain the four quarters of the image without overlapping, while the other five patches are randomly cropped from the input image. After that, we mirror the patches to double the training set. We do not apply any data augmentation to the video datasets, as we would like to preserve the context information of the video frames within our model.
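The cropping scheme described above can be sketched as follows. This is a minimal illustration with hypothetical helper names; it assumes that a patch of 1/4 size means half the height and half the width, and that mirroring is a horizontal flip. In practice the same crop and flip would also have to be applied to the ground-truth density map.

```python
import random


def patch_boxes(height, width, rng):
    """Return nine (top, left, patch_h, patch_w) boxes of quarter area.

    The first four boxes are the non-overlapping quadrants; the other
    five are placed uniformly at random using the supplied rng.
    """
    ph, pw = height // 2, width // 2
    boxes = [(0, 0, ph, pw), (0, pw, ph, pw), (ph, 0, ph, pw), (ph, pw, ph, pw)]
    for _ in range(5):
        top = rng.randint(0, height - ph)
        left = rng.randint(0, width - pw)
        boxes.append((top, left, ph, pw))
    return boxes


def crop_and_mirror(image, boxes):
    """Crop every box from a 2-D list image and add its horizontal mirror."""
    patches = []
    for top, left, ph, pw in boxes:
        patch = [row[left:left + pw] for row in image[top:top + ph]]
        patches.append(patch)
        patches.append([row[::-1] for row in patch])  # mirrored copy
    return patches
```

Each input image therefore yields 18 training patches: nine crops plus their mirrored versions.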

Training Details. Our dynamic temporal model is implemented using PyTorch [32]. To train the LCNN, we first initialize the layers of the network from a Gaussian distribution with a standard deviation of 0.01 and then train the model on each dataset. We set the learning rate to $10^{-5}$ for all the datasets and use Adam [33] for training. For the training of DT-LCNN, we also use the Adam optimizer, with a learning rate of 0.0005.

IV Experiments

TABLE I: Statistics of the datasets.
| Dataset | Type | Resolution | Color | Num. | Max. | Min. | Avg. | Total |
|---|---|---|---|---|---|---|---|---|
| ShanghaiTech Part A | Image | Varied | RGB | 482 | 3139 | 33 | 501 | 241677 |
| ShanghaiTech Part B | Image | 768 × 1024 | RGB | 716 | 578 | 9 | 123 | 88488 |
| UCF_CC_50 | Image | Varied | Gray | 50 | 4543 | 94 | 1279 | 63974 |
| UCSD | Video | 158 × 238 | Gray | 2000 | 46 | 11 | 24.9 | 49885 |
| Mall | Video | 640 × 480 | RGB | 2000 | 53 | 11 | 31.2 | 62315 |
| WorldExpo | Video | 576 × 720 | RGB | 3980 | 253 | 1 | 50.2 | 199923 |

Num.: the number of images/video frames; Max. and Min.: the maximum and minimum numbers of people in the ROI of an image; Avg.: the average pedestrian count; Total: the total number of labeled pedestrians.

TABLE II: Comparison with the state-of-the-art on ShanghaiTech and UCF_CC_50. The parameter size is measured in millions (M).
| Method | Year | ShanghaiTech A MAE | ShanghaiTech A MSE | ShanghaiTech B MAE | ShanghaiTech B MSE | UCF_CC_50 MAE | UCF_CC_50 MSE | Params (M) | Pre-trained model |
|---|---|---|---|---|---|---|---|---|---|
| Switching-CNN [19] | 2017 | 90.4 | 135.0 | 21.6 | 33.4 | 318.1 | 439.2 | 15.30 | VGG-16 |
| CSRNet [7] | 2018 | 68.2 | 115.0 | 10.6 | 16 | 266.1 | 397.5 | 16.26 | VGG-16 |
| L2R [9] | 2018 | 72.0 | 106.6 | 14.4 | 23.8 | 291.5 | 397.6 | 16.75 | VGG-16 |
| ASD [13] | 2019 | 65.6 | 98.0 | 8.5 | 13.7 | 196.2 | 270.9 | 16.26 | VGG-16 |
| DRSAN [34] | 2018 | 69.3 | 96.4 | 11.1 | 18.2 | 219.2 | 250.2 | 24.10 | - |
| SANet [35] | 2018 | 67.0 | 104.5 | 8.4 | 13.6 | 258.4 | 334.9 | 0.91 | - |
| ic-CNN [36] | 2018 | 68.5 | 116.2 | 10.7 | 16.0 | 260.9 | 365.5 | 16.82 | - |
| ACSCP [37] | 2018 | 75.7 | 102.7 | 17.2 | 27.4 | 291.0 | 404.6 | 5.10 | - |
| CP-CNN [6] | 2017 | 73.6 | 106.4 | 20.1 | 30.1 | 298.8 | 320.9 | 68.40 | - |
| IG-CNN [38] | 2018 | 72.5 | 118.2 | 13.6 | 21.1 | 291.4 | 349.4 | 4.70 | - |
| D-ConvNet [8] | 2018 | 73.5 | 112.3 | 18.7 | 26.0 | 288.4 | 404.7 | 16.62 | - |
| MCNN [4] | 2016 | 110.2 | 173.2 | 26.4 | 41.3 | 377.6 | 509.1 | 0.13 | - |
| Hydra-CNN [5] | 2016 | - | - | - | - | 333.7 | 425.2 | 0.56 | - |
| BSAD [39] | 2018 | - | - | 20.2 | 35.6 | 409.5 | 563.7 | 1.30 | - |
| TDF-CNN [10] | 2018 | 97.5 | 145.1 | 20.7 | 32.8 | 354.7 | 491.4 | 1.15 | - |
| LCNN | - | 93.3 | 157.0 | 15.1 | 23.3 | 262.0 | 358.6 | 0.032 | - |

We evaluate the proposed framework on five challenging benchmarks, i.e., ShanghaiTech [4], UCF_CC_50 [26], Mall [40], UCSD [17], and WorldExpo’10 [31]. Some statistics of these datasets are summarized in Table I. For the ShanghaiTech and UCF_CC_50 datasets, which contain no temporal information, we focus on the basic network LCNN and consider image-level analysis. We evaluate the dynamic temporal modeling on the Mall, UCSD, and WorldExpo’10 datasets.

Following existing state-of-the-art methods, we use the mean absolute error (MAE) and mean squared error (MSE) to evaluate the performance of the testing datasets, which are defined as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|, \tag{10}$$
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|^2. \tag{11}$$

Here $N$ is the number of testing images, and $C_i$ and $C_i^{GT}$ are the estimated and ground truth people counts in the $i$-th image, respectively. We also report the number of neural network parameters (Params) for comparison.
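The two metrics translate directly into code. The sketch below follows Eqs. (10) and (11) as written, so the MSE is the mean of squared errors without a square root; the function names and the toy counts are hypothetical.

```python
def mae(pred_counts, gt_counts):
    """Eq. (10): mean absolute error between estimated and true counts."""
    n = len(pred_counts)
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / n


def mse(pred_counts, gt_counts):
    """Eq. (11): mean squared error, as written (no square root)."""
    n = len(pred_counts)
    return sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / n


# Toy example (hypothetical counts for three test images).
predicted = [10, 20, 33]
ground_truth = [12, 20, 30]
print(mae(predicted, ground_truth), mse(predicted, ground_truth))
```

Here the absolute errors are 2, 0, and 3, giving an MAE of 5/3 and an MSE of 13/3; the MSE penalizes the largest error more heavily.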

IV-A Results on Still Images


Fig. 4: Qualitative results for the LCNN on ShanghaiTech and UCF_CC_50 datasets. Panels: (a) ShanghaiTech Part A, (b) ShanghaiTech Part B, (c) UCF_CC_50.

We first evaluate the performance of LCNN and compare it with several state-of-the-art approaches.

ShanghaiTech Dataset. The ShanghaiTech columns of Table II summarize the MAE and MSE on both parts of the ShanghaiTech dataset. We compare LCNN with several baselines and state-of-the-art approaches. The first group consists of state-of-the-art methods with pre-trained models [19, 7, 9, 13] or more complex network designs [35, 36, 34, 37, 6, 38, 8]. Our results are comparable with these approaches, while the parameter size of the LCNN is an order of magnitude smaller than that of any of them. The second group contains several networks with compact structures, including MCNN [4], Hydra-CNN [5], BSAD [39], and TDF-CNN [10]. From the table we see that LCNN outperforms all these approaches. Fig. 4(a) and (b) illustrate some crowd images, their predicted density maps, and the counting results using LCNN.

UCF_CC_50 Dataset. We also study the performance of LCNN on UCF_CC_50 against both state-of-the-art and compact approaches. Results are also given in Table II. Similar to the experiments on ShanghaiTech, LCNN shows better results than the other four approaches with a compact network. We also notice that the parameter size of SANet [35] is small as well, thanks to its Inception units. We believe that LCNN may be complementary to such a structure; however, SANet still has about 30x the parameters of our model. Fig. 4(c) shows sample crowd images and their predicted results with LCNN on UCF_CC_50.

IV-B Results on Videos

There are a few parameters in DT-LCNN, including the number of video frames for dynamic temporal modeling and dilated residual blocks. In this set of experiments, we use 5 video frames for the temporal modeling and 3 blocks as the default setting. The effect of these parameters will be evaluated in the next subsection.

TABLE III: Crowd counting results on Mall and UCSD.
| Method | Mall MAE | Mall MSE | UCSD MAE | UCSD MSE |
|---|---|---|---|---|
| Gaussian process regression [17] | 3.72 | 20.1 | - | - |
| Ridge regression [40] | 3.59 | 19.0 | - | - |
| Cumulative attribute regression [42] | 3.43 | 17.7 | - | - |
| ConvLSTM-nt [25] | 2.53 | 11.2 | 1.73 | 3.52 |
| ConvLSTM [25] | 2.24 | 8.5 | 1.30 | 1.79 |
| Bidirectional ConvLSTM [25] | 2.10 | 7.6 | 1.13 | 1.43 |
| DT-LCNN | 2.03 | 2.6 | 1.08 | 1.41 |

Mall Dataset. We now report results on the Mall dataset, as summarized in the Mall columns of Table III. The experiments follow the same setting as [40], which uses the first 800 frames for training and the remaining 1,200 frames for testing. We compare the DT-LCNN with methods that also make use of spatiotemporal information, including the regression-based methods [17, 40, 42] and the LSTM-based methods [25]. As shown in the table, the proposed dynamic temporal modeling leads to an MAE of 2.03 and an MSE of 2.6, which is significantly better than the baseline approaches. We list some predicted density maps as well as their corresponding counting results with DT-LCNN in Fig. 5.


Fig. 5: Qualitative results on the sample snippets of MALL dataset. Panels: (a) Snippet 1, (b) Snippet 2, (c) Snippet 3.

Fig. 6: Qualitative results with DT-LCNN on the sample snippets of UCSD dataset. Panels: (a) Snippet 1, (b) Snippet 2, (c) Snippet 3.

Fig. 7: Qualitative results on different scenes of WorldExpo’10 dataset. Panels: (a) Scene 1, (b) Scene 2, (c) Scene 3, (d) Scene 4, (e) Scene 5.

UCSD Dataset. Following the convention of existing works [17], we use frames 601-1400 as the training data and the remaining 1200 frames as the test data. We generate ground truth density maps with a fixed-spread Gaussian kernel. As the region of interest (ROI) and perspective map are provided, the intensities of pixels outside the ROI are set to zero, and we also use the ROI to revise the last convolution layer. Results on the UCSD dataset are presented in the UCSD columns of Table III. Again, DT-LCNN shows better results than the LSTM-based crowd counting approaches. Some counting results with DT-LCNN on the sample snippets are shown in Fig. 6.

WorldExpo’10 Dataset. The WorldExpo’10 dataset [31] consists of 3980 annotated frames from 1132 video sequences captured by 108 different surveillance cameras during the Shanghai WorldExpo in 2010. The training set includes 3,380 annotated frames from 103 scenes, while the testing images are extracted from five other scenes with 120 frames per scene. Table IV lists the per-scene performance of DT-LCNN and previous approaches. Here we also compare with two groups of approaches. The first contains methods with state-of-the-art performance [36, 8, 7, 37], and the second contains the temporal modeling approaches [25]. Our results are comparable with the state of the art in four of the five scenes (the exception being scene 2), while our model size and speed may be more suitable for inference. The results of DT-LCNN are also significantly better than those of the LSTM-based methods. The qualitative results on different scenes are illustrated in Fig. 7.

TABLE IV: The MAE of different scenes on the WorldExpo’10 dataset.
| Method | S1 | S2 | S3 | S4 | S5 | Avg. | Params (M) |
|---|---|---|---|---|---|---|---|
| ic-CNN [36] | 17.0 | 12.3 | 9.2 | 8.1 | 4.7 | 10.3 | 16.82 |
| D-ConvNet [8] | 1.9 | 12.1 | 20.7 | 8.3 | 2.6 | 9.1 | 16.26 |
| CSRNet [7] | 2.9 | 11.5 | 8.6 | 16.6 | 3.4 | 8.6 | 16.26 |
| ACSCP [37] | 2.8 | 14.1 | 9.6 | 8.1 | 2.9 | 7.5 | 5.10 |
| ConvLSTM-nt [25] | 8.6 | 16.9 | 14.6 | 15.4 | 4.0 | 11.9 | - |
| ConvLSTM [25] | 7.1 | 15.2 | 15.2 | 13.9 | 3.5 | 10.9 | - |
| Bi-ConvLSTM [25] | 6.8 | 14.5 | 14.9 | 13.5 | 3.1 | 10.6 | - |
| DT-LCNN | 2.8 | 18.1 | 9.6 | 7.5 | 3.6 | 8.3 | 0.047 |

Fig. 8: Evaluation of network parameters on the video datasets, i.e., counting results w.r.t. (a) the number of video frames and (b) block number.

IV-C Ablation Study

In this section, we evaluate some parameters and alternative implementations of the proposed framework.

Number of Video Frames for Dynamic Modeling. We compare the performance of our framework with a varying number of video frames for dynamic modeling, as shown in Fig. 8(a). One intuitive way to add temporal information is to smooth the density maps or counting numbers of neighboring frames. However, in some scenarios (such as WorldExpo’10), the MAE value is lower than using only single frames. We observe significant performance gains when the number of considered video frames increases from three to five. Using more than five frames does not improve performance further.

Number of Dilated Residual Blocks. We also evaluate the effect of the number of dilated residual blocks in the DT-LCNN model. As shown in Fig. 8(b), the best trade-off is obtained by using three dilated residual blocks. Compared to using a single block, more blocks can boost performance. However, when the number gets larger, performance decreases in some cases. This is probably because more complex neural networks tend to overfit when the scale of training data is limited.

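TABLE V: Comparison with other methods for temporal modeling.
| Method | Dataset | MAE | MSE |
|---|---|---|---|
| LCNN + LSTM | UCSD | 1.21 | 1.69 |
| LCNN + LSTM | MALL | 2.23 | 3.80 |
| LCNN + Bi-LSTM | UCSD | 1.11 | 1.48 |
| LCNN + Bi-LSTM | MALL | 2.09 | 3.07 |
| DT-LCNN | UCSD | 1.08 | 1.41 |
| DT-LCNN | MALL | 2.03 | 2.60 |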

Temporal Modelling. We compare our dynamic temporal modeling approach with previous LSTM based approaches by incorporating LCNN with them. As shown in Table V, the results of DT-LCNN are better than LCNN with LSTM or Bi-directional LSTM.

Timing. Recall that our goal is to build a compact model for effective crowd counting in videos, based on the proposed lightweight network. The parameter counts of LCNN and DT-LCNN are 0.032M and 0.047M, respectively. For a video with a resolution of 320×240 pixels, the DT-LCNN model achieves a speed of 120 FPS on an Nvidia GTX TITAN X GPU, and during inference it consumes less than 500 MB of GPU memory. Our approach achieves real-time (25 FPS) crowd counting on a moderate Intel Core-i5 desktop CPU.

V Conclusions

We propose DT-LCNN, a new dynamic temporal modeling system built on the LCNN unit to solve crowd counting in videos. The highlights are two-fold: (1) a novel lightweight architecture that achieves good performance with a compact network, and (2) explicit modeling of temporal information with both crowd images and the predicted density maps. We show that by leveraging the contextual information of the video content, promising results are achieved for crowd counting. The runtime speed is 25 FPS on a moderate commercial CPU. For future work, we plan to deploy the proposed framework on edge computing devices to support rapid decision-making in real-world scenarios.

References

  • [1] T. Li, H. Chang, M. Wang, B. Ni, R. Hong, and S. Yan, “Crowded scene analysis: A survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 3, pp. 367–386, 2014.
  • [2] B. Sheng, C. Shen, G. Lin, J. Li, W. Yang, and C. Sun, “Crowd counting via weighted vlad on a dense attribute feature map,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 8, pp. 1788–1797, 2016.
  • [3] R. Stewart, M. Andriluka, and A. Y. Ng, “End-to-End people detection in crowded scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2325–2333.
  • [4] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 589–597.
  • [5] D. Oñoro Rubio and R. J. López-Sastre, “Towards perspective-free object counting with deep learning,” in European Conference on Computer Vision (ECCV), 2016, pp. 615–629.
  • [6] V. A. Sindagi and V. M. Patel, “Generating high-quality crowd density maps using contextual pyramid cnns,” in International Conference on Computer Vision (ICCV), 2017, pp. 1879–1888.
  • [7] Y. Li, X. Zhang, and D. Chen, “CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1091–1100.
  • [8] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M. Cheng, and G. Zheng, “Crowd counting with deep negative correlation learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5382–5390.
  • [9] X. Liu, J. van de Weijer, and A. D. Bagdanov, “Leveraging unlabeled data for crowd counting by learning to rank,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7661–7669.
  • [10] D. B. Sam and R. V. Babu, “Top-down feedback for crowd counting convolutional neural network,” in AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • [11] D. Kang, Z. Ma, and A. B. Chan, “Beyond counting: Comparisons of density maps for crowd analysis tasks-counting, detection, and tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 5, pp. 1408–1422, 2018.
  • [12] H. Zheng, Z. Lin, J. Cen, Z. Wu, and Y. Zhao, “Cross-line pedestrian counting based on spatially-consistent two-stage local crowd density estimation and accumulation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 3, pp. 787–799, 2018.
  • [13] X. Wu, Y. Zheng, H. Ye, W. Hu, J. Yang, and L. He, “Adaptive scenario discovery for crowd counting,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 2382–2386.
  • [14] J. Gao, Q. Wang, and X. Li, “PCC Net: Perspective crowd counting via spatial convolutional network,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
  • [15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893.
  • [16] M. Li, Z. Zhang, K. Huang, and T. Tan, “Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection,” in International Conference on Pattern Recognition (ICPR), 2008.
  • [17] A. B. Chan, Z. J. Liang, and N. Vasconcelos, “Privacy preserving crowd monitoring: Counting people without people models or tracking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
  • [18] V. Lempitsky and A. Zisserman, “Learning to count objects in images,” in Advances in Neural Information Processing Systems (NeurIPS), 2010, pp. 1324–1332.
  • [19] D. B. Sam, S. Surya, and R. V. Babu, “Switching convolutional neural network for crowd counting,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5744–5752.
  • [20] D. Kang, D. Dhar, and A. B. Chan, “Crowd counting by adapting convolutional neural networks with side information,” arXiv:1611.06748, 2016.
  • [21] G. J. Brostow and R. Cipolla, “Unsupervised Bayesian detection of independent motion in crowds,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 594–601.
  • [22] A. B. Chan and N. Vasconcelos, “Counting people with low-level features and Bayesian regression,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 2160–2177, 2012.
  • [23] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert, “Density-aware person detection and tracking in crowds,” in International Conference on Computer Vision (ICCV), 2011, pp. 2423–2430.
  • [24] S. Chen, A. Fern, and S. Todorovic, “Person count localization in videos from noisy foreground and detections,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1364–1372.
  • [25] F. Xiong, X. Shi, and D.-Y. Yeung, “Spatiotemporal modeling for crowd counting in videos,” in International Conference on Computer Vision (ICCV), 2017, pp. 5151–5159.
  • [26] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale counting in extremely dense crowd images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2547–2554.
  • [27] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
  • [28] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 156–165.
  • [29] B. Xu, H. Ye, Y. Zheng, H. Wang, T. Luwang, and Y.-G. Jiang, “Dense dilated network for video action recognition,” IEEE Transactions on Image Processing, 2019.
  • [30] Y. A. Farha and J. Gall, “MS-TCN: Multi-stage temporal convolutional network for action segmentation,” arXiv:1903.01945, 2019.
  • [31] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 833–841.
  • [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS Workshop, 2017.
  • [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
  • [34] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin, “Crowd counting using deep recurrent spatial-aware network,” in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 849–855.
  • [35] X. Cao, Z. Wang, Y. Zhao, and F. Su, “Scale aggregation network for accurate and efficient crowd counting,” in European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
  • [36] D. B. Sam, N. N. Sajjan, R. V. Babu, and M. Srinivasan, “Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3618–3626.
  • [37] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, “Crowd counting via adversarial cross-scale consistency pursuit,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5245–5254.
  • [38] V. Ranjan, H. Le, and M. Hoai, “Iterative crowd counting,” in European Conference on Computer Vision (ECCV), 2018, pp. 270–285.
  • [39] S. Huang, X. Li, Z. Zhang, F. Wu, S. Gao, R. Ji, and J. Han, “Body structure aware deep crowd counting,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1049–1059, 2018.
  • [40] K. Chen, C. Loy, S. Gong, and T. Xiang, “Feature mining for localised crowd counting,” in British Machine Vision Conference (BMVC), vol. 1, no. 2, 2012, p. 3.
  • [41] L. Zeng, X. Xu, B. Cai, S. Qiu, and T. Zhang, “Multi-scale convolutional neural networks for crowd counting,” in International Conference on Image Processing (ICIP), 2017, pp. 465–469.
  • [42] K. Chen, S. Gong, T. Xiang, and C. C. Loy, “Cumulative attribute space for age and crowd density estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2467–2474.