Autonomous Deep Learning: Continual Learning Approach for Dynamic Environments

  • 2019-10-08 15:02:06
  • Andri Ashfahani, Mahardhika Pratama
  • 0

Abstract

The feasibility of deep neural networks (DNNs) to address data streamproblems still requires intensive study because of the static and offlinenature of conventional deep learning approaches. A deep continual learningalgorithm, namely autonomous deep learning (ADL), is proposed in this paper.Unlike traditional deep learning methods, ADL features a flexible structurewhere its network structure can be constructed from scratch with the absence ofan initial network structure via the self-constructing network structure. ADLspecifically addresses catastrophic forgetting by having a different-depthstructure which is capable of achieving a trade-off between plasticity andstability. Network significance (NS) formula is proposed to drive the hiddennodes growing and pruning mechanism. Drift detection scenario (DDS) is putforward to signal distributional changes in data streams which induce thecreation of a new hidden layer. The maximum information compression index(MICI) method plays an important role as a complexity reduction moduleeliminating redundant layers. The efficacy of ADL is numerically validatedunder the prequential test-then-train procedure in lifelong environments usingnine popular data stream problems. The numerical results demonstrate that ADLconsistently outperforms recent continual learning methods while characterizingthe automatic construction of network structures.

 

Quick Read (beta)

Autonomous Deep Learning: Continual Learning Approach for Dynamic Environments thanks: This paper has been published in Proceedings of the 2019 SIAM International Conference on Data Mining. The source code is available in https://www.researchgate.net/publication/335757711_ADL_Code_mFile.

Andri Ashfahani
School of Computer Science and Engineering
Nanyang Technological University
Singapore, 639798
[email protected]
&Mahardhika Pratama
School of Computer Science and Engineering
Nanyang Technological University
Singapore, 639798
[email protected]
Equal contribution.
Abstract

The feasibility of deep neural networks (DNNs) to address data stream problems still requires intensive study because of the static and offline nature of conventional deep learning approaches. A deep continual learning algorithm, namely autonomous deep learning (ADL), is proposed in this paper. Unlike traditional deep learning methods, ADL features a flexible structure where its network structure can be constructed from scratch with the absence of initial network structure via the self-constructing network structure. ADL specifically addresses catastrophic forgetting by having a different-depth structure which is capable of achieving a trade-off between plasticity and stability. Network significance (NS) formula is proposed to drive the hidden nodes growing and pruning mechanism. Drift detection scenario (DDS) is put forward to signal distributional changes in data streams which induce the creation of a new hidden layer. Maximum information compression index (MICI) method plays an important role as a complexity reduction module eliminating redundant layers. The efficacy of ADL is numerically validated under the prequential test-then-train procedure in lifelong environments using nine popular data stream problems. The numerical results demonstrate that ADL consistently outperforms recent continual learning methods while characterizing the automatic construction of network structures.

 

Autonomous Deep Learning: Continual Learning Approach for Dynamic Environments thanks: This paper has been published in Proceedings of the 2019 SIAM International Conference on Data Mining. The source code is available in https://www.researchgate.net/publication/335757711_ADL_Code_mFile.


 A Preprint
Andri Ashfahanithanks: Equal contribution. School of Computer Science and Engineering Nanyang Technological University Singapore, 639798 [email protected] Mahardhika Pratama School of Computer Science and Engineering Nanyang Technological University Singapore, 639798 [email protected]

October 10, 2019

1 Background and Motivation

State-of-the-art theoretical studies show that the increase of depth of neural networks increases the representational and generalization power of neural networks (NNs) [1]. Nevertheless, the problem of data stream remains an uncharted territory of conventional deep neural networks (DNNs). Unlike conventional data stream methods built upon a shallow network structure [2, 3, 4], DNNs potentially offers significant improvement in accuracy and aptitude to handle unstructured data streams. Direct application of conventional DNNs for data stream analytic is often impossible because of their considerable computational and memory demand making them impossible for deployment under limited computational resources [5]. Ideally, the data streams should be handled in a sample-wise manner without any retraining phase to prevent the catastrophic forgetting problem in addition to scale up with the nature of continual environments [5, 6]. Another challenge comes from the fixed and static structure of traditional DNNs [7]. In other words, the network capacity has to be estimated before process runs. This trait does not mirror the dynamic and evolving characteristics of data streams.

The use of flexible structure with the growing and pruning mechanism has picked up research attention in DNN literature [4, 3, 8] where the key idea is to evolve the DNN’s structure on demand. Incremental learning of denoising autoencoder (DAE) realizes the structural learning mechanism via the network’s loss and the hidden unit merging mechanism [9]. The underlying drawback of this approach is located in the over-dependence on problem-dependent predefined thresholds in growing and merging hidden units. The elastic consolidation weight (ECW) [10] and the hedge backpropagation (HBP) [11] are proposed to train DNN in the online situation where the ECW method addresses the catastrophic forgetting problem by preventing the output weights of new task to be deviated too far from the old one, while the HBP realizes a direct connection of hidden layer to output layer which enables representation of different concepts in each layer. However, these approaches call for network initialization step and operates under a fixed capacity.

The progressive neural networks (PNN) [12], the dynamically expandable networks (DEN) [6] and incremental learning of DAE (DEVDAN) [13] are proposed to address limited network capacity and catastrophic forgetting problems. PNN creates a new network structure for every new task, DEN grows hidden nodes whenever the loss criteria are not satisfied, while DEVDAN is capable of growing and pruning the hidden units based on the estimation of network significance (NS). Nevertheless, the three approaches utilize a fixed-depth structure [7, 1]. It is understood from [14] that addition of network depth leads to more significant improvement of generalization power than addition of hidden unit because it boosts the network capacity more substantially. To the best of our knowledge, the three approaches have not been tested under the prequential test-then-train scenario which reflects a situation where data stream arrives without label [5].

2 Problem Formulation

Continual learning of evolving data streams is defined as learning approach of continuously generated data batches B[B1,,Bk,,BK] where the number of data batches K and the type of data distributions are unknown before the process runs. Bk can be either a single data point Bk=Xn or a particular size of data batch Bk=[X1,,Xt,,XT]T×n, where n and T denote the dimension of the input space and the number of data points in a batch, respectively. Note that the batch size often varies across different time stamps. In the data stream problems, data points come into picture with the absence of true class label [5]. The execution of labelling process is subject to the access of the ground truth or expert knowledge. In other words, a delay is expected while revealing the true class labels CT. The 0-1 encoding scheme can applied to obtain multi-output target matrix CT×m where m is the number of target. This issue limits the feasibility of cross-validation or direct train-test partition methods as an evaluation protocol because those methods assume that the overall data batches are fully observable and risks on loss of data temporal order [5, 15].

The data streams require DNN to handle Bk which may be originated from different data distributions, also known as the concept drift. Specifically, there may exist a change of joint-class posterior probability P(Ct,Xt)P(Ct-1,Xt-1). The concept drift is commonly classified into two types: real and covariate [7]. The real drift usually is more severe than the covariate drift because the input variations lead to the shift of decision boundary which decreases the classification performance. In addition, this leads to a model created by previous concept Bk-1 being outdated. This characteristic shares some relevance with the multi-task learning problem where each data batch Bk is of different tasks. Nevertheless, DNN differs from the multi-task approaches in which all data batches are to be processed by a single model rather than rely on task-specific classifiers. Another problem of data streams exists in achieving trade-off between plasticity and stability which increases the risk of suffering from catastrophic forgetting [16]. These demands call for an online DNN model which is capable to incrementally construct its network structure from scratch in respect to data streams distribution. In addition, a mechanism to flexibly reuse and retain the old knowledge, or to learn the new one should be embedded to prevent catastrophic forgetting.

3 Our Approach

A fully elastic deep neural network (DNN), namely Autonomous Deep Learning (ADL), is proposed in this paper. ADL features an open structure where not only its hidden nodes can be self-organized but also the hidden layers can be constructed under the lifelong learning paradigm. These mechanisms enable ADL to perform dynamic resource allocation which tracks the dynamic variation of data streams [12, 6]. The adaptation of network width is governed by network significance (NS) method which governs creation of new hidden units and pruning of inconsequential hidden units. The adaptation of network depth is driven by drift detection scenario (DDS) where a new hidden layer is added if a drift is identified. Every hidden layer embraces different concepts played in different time windows of data streams [17]. The complexity reduction mechanism in the hidden layer level is implemented through the hidden layer merging procedure which quantifies mutual information of hidden layers and coalesces those suffering from high mutual information [8]. A new DNN structure is introduced where it puts into perspective the different-depth concept. That is, every layer is connected to a softmax layer which produces a local output. The global output is obtained from aggregation of each local output using the dynamic voting scenario. The generalization power of ADL is evaluated under the prequential test-then-train protocol with only a single epoch where the data are first use to test before exploited to update the model.The major contributions are elaborated as follows:

Different-depth network structure.

Unlike traditional DNN structure, where the final output relies on the last hidden layer, ADL puts forward the different-depth structure where there exists a direct connection of each layer to the output layer by inserting a softmax layer in each hidden layer to produce a layer-specific output. The dynamic voting scheme is integrated to deliver the final classification decision where every layer is assigned with a voting weight adapted with different intensities in respect to layer’s relevance. This approach is capable of overcoming the catastrophic forgetting problem because a network structure is constructed as a complete summary of data distributions [12]. Moreover, the dynamic voting weight mechanism is designed with dynamic decaying rates in respect to the prequential error which enables the strongest layer to dominate the voting process.

Network width adaptation.

ADL features elastic network width which supports automatic generation of new hidden nodes and pruning of inconsequential nodes. This mechanism is controlled by the NS method [13] which estimates the network generalization power in terms of bias and variance. A new hidden node is added in the case of underfitting (high bias) while the pruning mechanism is activated in the case of overfitting (high variance). Another salient feature of NS is not dependent on the user-defined parameters which enables the plug-and-play operation. It uses an adaptive threshold which dynamically adapts to the bias and variance estimation. This work offers an extension of [13] for a deep network structure.

Network depth adaptation.

The drift detection scenario (DDS) is employed to self-organize the depth of network structure where the depth of network structure increases if a drift is signalled. This idea is supported by the fact that addition of hidden layer induces more active regions than addition of hidden units, thereby being able to rectify the high bias situation due to drift effectively [18]. Note that active region here refers to the amount of unique representation carried by a hidden layer. In other words, DDS guides ADL to arrive with the hidden layers carrying the different concepts of data streams. Furthermore, the DDS method detects the real drift - variation of input space causing variation of output space via the evaluation of accuracy matrix based on the Hoeffding’s bounds method [19]. ADL also implements the complexity reduction scenario shrinking the depth of network structure. This scenario is achieved by the analysis of mutual information across hidden layers. A hidden layer sharing high correlation is discarded. This concept follows [3] but here this concept is played under the context of DNN.

Solution of catastrophic forgetting.

The key property of ADL in addressing the catastrophic forgetting problem lies in the different-depth architecture which allows to accommodate new knowledge while revisiting old knowledge with ease [12]. Moreover, the final output is produced by the dynamic voting scheme which enables to flexibly give more emphasis either to the old knowledge or to the new ones. This is evident because each layer is assigned with unique voting weights which increases and decreases with different rates. Moreover, the parameter tuning process is localized to the most relevant concept, namely the winning layer while freezing other layers to assure stable old concepts - old concept is not perturbed.

4 ADL Learning Policy

This section explains the network structure of ADL and its learning policy, which is depicted in Figure 1.

Figure 1: The learning policy of ADL

4.1 Network Structure and Working Principle

The ADL is constructed by the multilayer perceptron (MLP). The first layer defines the input feature, while the intermediate layers consists of multiple linear transformations interspersed by sigmoid function. The hidden layers and the hidden nodes of ADL can be automatically constructed which are controlled by DDS and NS formula, respectively. ADL characterizes the different-depth structure formalized as follows:

C^=maxo=1,,mYo^;Y^=l=1Lβ(l).y(l) (1)
y(l)=s.max(Ws(l)h(l)+bs(l)),l=1,,L
h(l)=σ(W(l)h(l-1)+b(l)),h(0)=X

From (1), it is observed that every hidden layer h(l) has a connection to a unique classifier producing the multiclass probability y(l), where L is the number of hidden layers. The network parameters of l-th hidden layer are denoted as θ(l) that is, W(l)Rl×d, b(l)Rl, Ws(l)m×Rl, bs(l)m, where Rl and d is the number of hidden nodes and the number of input in the l-th hidden layer, respectively. It is worth noting that the dimension of those matrices is changing according to the evolution of hidden nodes. The hidden layer is assigned with a voting weight β(l) which is dynamically adjusted by a dynamic penalty and reward factor p(l). The voting weights are normalized, l=1Lβ(l)=1, to ensure the partition of unity. Finally, the predicted label is the class label embracing the highest Y^o,o=1,,m, obtained by combining the weighted hidden layer output, as per in (1).

ADL starts its learning process from scratch without initial structure. ADL here is simulated under the prequential test-then-train procedure where data stream is first used for the testing process followed by the training process. This scenario realizes the fact that data stream come unlabelled. ADL consists of two learning stages: the high level learning and the low level learning. The former one concerns on the evolution of hidden layer while the later stage focuses on the network parameters and the number of hidden nodes of the winning layer lw using SGD and NS formula, respectively, in a single-pass learning fashion. The winning layer is a hidden layer embracing the highest β. The voting weight is deemed as an appropriate indicator of the hidden layer performance since it is adjusted using dynamic factor. Generally, the low level learning enables ADL to learn new knowledge while retaining the old ones. Moreover, it helps ADL to handle the virtual-drift, that is a distributional change of the input space [7]. After executing the low level learning, the generalization performance of ADL is evaluated using the labelled data batch Bk=[Xk,Ck]T×(n+m).

The evaluation results are then exploited in the high level learning process which consists of three mechanisms. The first one is dynamic voting weight adaptation. Every y(l) will be penalized if it makes an incorrect prediction and, conversely, it will be rewarded if it makes a correct prediction using dynamic penalty and reward factor p(l). Secondly, hidden layer pruning scenario is carried out to discard the redundant hidden layer. It is defined as the lp-th hidden layer, y(lp), which is highly correlated to others yet it has low performance. The MICI method [3] is employed to explore the mutual correlation of hidden layer, γ(y(i),y(j)),i,j=1,,L,ij. The first two mechanisms enable ADL to ignore the less useful representations and to emphasize the useful ones while obtaining the predicted output C^. Lastly, network depth adaptation is conducted by executing DDS. This method monitors the statistics of accuracy matrix and categorizes the behaviour into three stages, i.e., stable, warning, and drift. A new hidden layer is constructed when a drift is confirmed. The voting weight of newly created layer β(L) and its decreasing factor p(L) are set to 1, while the network parameters θ(L) are initialized via the low level learning phase using the current data batch. The last adjustment aims to increase the generalization and representational power of ADL. Figure 2 exemplifies the overall incremental learning process of ADL where [n,m]=[3,2].

4.2 Network Width Adaptation

This policy is carried out in the low level learning process which consists of two mechanisms as follows.

Hidden node growing.

The hidden node growing mechanism is controlled by the NS formula which evaluates the generalization power of network structure formalized as the expectation of squared error under a normal distribution as per in (2). This expression leads us to the bias-variance formula as per in (3).

NS=-(C-y^(lw))2p(x)𝑑x (2)
NS=Var(y^(lw))+(Bias(y^(lw)))2, (3)
NS=(E[(y^(lw))2]-E[y^(lw)]2)+(E[y^(lw)]-C)2

The solution of (2) requires to calculate E[y^(ly)]. Note that y^(lw) is the deterministic function of X that is, the input of h(1). Therefore, the key to solve the definite integral in (2) is the solution of E[h(1)]=-σ(W(1)X+b(1))p(X)𝑑X. Suppose X possesses normal distribution, the probability density function p(X) is given as 12πσ2exp(-(X~-μ)22σ2), where μ and σ are the recursive mean and recursive standard deviation of data streams, X, which can be calculated easily. It is worth noting that a sigmoid function can be approximated by a probit function Φ(ξX) where Φ(X)=-infX𝒩(θ|0,1)𝑑θ and ξ=π8. The integral of probit function is another probit function [20], it yields:

E[h(1)]=σ(W(1)μ/(1+πσ2/8)+b(1)) (4)

Next, E[y^(lw)] can be obtained by generalizing (4) via lw times-forward-chaining operation. It yields:

E[y^(lw)]=s.max(Ws(lw)E[h(lw)]+bs(lw)) (5)
E[h(l)]=σ(W(l)E[h(l-1)]+b(l)),l=2,,lw (6)

After that, the bias can be calculated by substituting E[y^(lw)] to the bias term, (Bias(y^(lw)))2=(E[y^(lw)]-C)2. This approach is different from the loss function used in [6], because while approximating the generalization of DNN, the bias formula takes into account the influence of all past and future samples under the assumption of normal distribution. The high bias indicates the underfitting situation which can be circumvented by increasing the network capacity.

The hidden node growing condition is derived based on k-sigma rule concept adopted from the theory of statistical process control [21]. However, instead of using the binomial distribution to calculate the mean and variance, ADL directly utilizes the bias itself (Bias2) because the hidden node growing strategy evaluates the real-variable bias instead of the accuracy score. The high bias problem, triggering the construction of a hidden node in the lw-th layer, is formulated as follows:

μbiast+σbiastμbiasmin+πσbiasmin (7)

where π=1.3exp(-(Bias(y^(lw)))2)+0.7 and governs the confidence degree of sigma rule. It is designed that π is a function of Bias2 and revolves around [1,2]. A high bias triggers π to return a low confidence level - close to 1, realizing around 68.2% confidence degree. Conversely, a high confidence level - close to 2, equivalent to 95.2%, is generated by π when the bias is low. This provides a flexibility for the hidden nodes growing mechanism and eliminates the involvement of problem-specific threshold. μbiast,σbiast are the recursive mean and standard deviation of Bias2 up to t-th time instant, whereas μbiasmin,σbiasmin denote the minimum value of μbias,σbias up to t-th time instant but are reset whenever (7) is satisfied. Equation (7) signifies the existence of changing data distribution represented by the increase of network bias. The network bias should decrease or at least be stable when there is no drift in data streams. When (7) is satisfied, a hidden node is added in the lw-th hidden layer, Rlw=Rlw+1, and the new network parameters in the Rlw-th hidden node θRlw(lw) are initialized using Xavier initialization.

Hidden node pruning.

It is derived from the same principle of the hidden node growing mechanism, yet it exploits Var(y^(lw)) instead of (Bias(y^(lw)))2. A high variance, overfitting, should be handled by reducing the network complexity. Before calculating Var(y^(lw)), it is required to derive the expression of E[(y^(lw))2] and E[y^(lw)]2. The second expression can be obtained easily by applying squared operation to E[y^(lw)]. It is worth noting that (y^(lw))2=y^(lw)*y^(lw) is the IID variable. Therefore, E[(y^(lw))2] can be obtained by first calculating E[(h(1))2]=E[h(1)]*E[h(1)] and followed by forward-passing the result to lw-th hidden layer. It is similar to the way of calculating (5), yet it takes E[(h(1))2] as the initial input instead of E[(h(1))].

The hidden node pruning condition implements the same principal as the growing part where the statistical process control is adopted to identify the high variance problem, as per in (8). Unlike the growing condition in (7), the σvarmin-part is multiplied by 2 to avoid direct-pruning-after-growing problem. It is worth mentioning that the addition of a hidden node leads to the increase of network variance yet progressively decrease as the next information arrives. χ is designed similar to π yet it takes Var(y^(lw)) as the input instead of (Bias(y^(lw)))2. Consequently, the k sigma rule revolves in the range of [2,4] providing around 95.4% to 99.9% confidence level.

μvart+σvartμvarmin+2χσvarmin (8)

If (8) is satisfied, the pruning scenario is undertaken to remove the weakest hidden node in the lw-th hidden layer. The least significant hidden node can be observed by calculating (6). That is, the importance of all hidden nodes in the lw-th hidden layer. The pruning mechanism discarding the hidden node with the lowest E[h(lw)] is formalized as follows: Pruningmini=1,,RlwE[h(lw)]i. Consequently, the number of hidden nodes decreases to Rlw=Rlw-1 as an effort to address the overfitting dilemma. Note that a small E[h(lw)]i value indicates that i-th hidden node plays a small role in producing the output y(lw) and thus can be discarded without significance loss of accuracy. The concept of statistical contribution of hidden node can be categorized as performance estimation strategy of neural architecture search because it estimates the generalization power of the network on unseen data [22].

4.3 Network Depth Adaptation

ADL realizes the different-depth structure using the DDS as an effort to deal with the concept drift. It also utilizes the MICI method as the complexity reduction procedure. The following explain those mechanisms.

Hidden layer growing.

A new hidden layer is constructed if there is a concept change in the data streams. The DDS signals a drift status by monitoring the accuracy matrix. The drift situation signifies that the network is underfitting as indicated by low statistic of accuracy matrix meaning that the ADL’s performance deteriorates. In other words, the crafted knowledge alone cannot adequately describes the new data distribution. This dilemma can be solved by increasing the network capacity in two ways those are, hidden node growing or hidden layer expansion. The second option, however, augments the network capacity more significantly since the extension of depth increases the number of active regions [18] more than that expansion of network width. In addition, addition of network depth has been theoretically more meaningful than addition of neuron or hidden units [14].

The accuracy matrix FT×m stores the generalization performance of the testing phase. It records 1 if the misclassification happens C^tCt, whereas 0 is stored if ADL correctly classify an observation C^t=Ct. The switching point is determined by evaluating two accuracy matrices, F and Gcut×m, where cut is the hypothetical switching point which can be found using the following condition:

F^+ϵFG^+ϵG (9)

where [F^,G^] and [ϵF,ϵG] denote the statistics and the Hoeffding’s error bounds of [F,G]. The condition (9) spots a transition between two concepts where G^>F^. Note that G^ is expected to decrease or at least be constant in the stable phase. This strategy performs better while dealing with sudden drift, the most common type of drift, yet it is less sensitive to the gradual drift where change slowly appears because every sample is treated equally without any weights [19]. The Hoeffding’s error bounds are formulated as follows:

ϵF,G,H=(b-a)size2(size×cut)ln(1α) (10)

where size denotes the size of accuracy matrices and α denotes the significance level of Hoeffding’s bound. Note that α is statistically justifiable since it is associated to the confidence level 1-α. It is not classified as a problem-specific threshold because a high α provides a low confidence level whereas a low α returns a high one. The values a,b indicate the minimum and the maximum entries of the accuracy matrices F,G,H.

|H^-G^|>ϵD (11)
ϵW|H^-G^|<ϵD (12)

The condition (9) aims at finding the cutting point, cut, where the accuracy matrix G is not in the decreasing trend. Once it is spotted, the accuracy matrix, H(T-cut)×m, can be formed. This matrix is used as a reference whether the null hypothesis is valid or not. The null hypothesis evaluates the increase of statistics of accuracy matrix which verifies the drift condition. The drift status is signalled when the null hypothesis is rejected with the size of αD, as per in (12). Conversely, the warning status is returned when the null hypothesis is rejected with the size αW, formalized in (12), which aims to signal the gradual drift. The value of ϵW and [ϵD,ϵF,ϵG] can be calculated via (10) using αW and αD. If none of those conditions are satisfied, the stable condition is returned.

A drift condition (11) displays the phase where the empirical mean of G is lower than F indicating the evidence that the classification performance degenerates. This case signals the hidden layer growing procedure to increase the network depth, L=L+1. The newly created layer L is then trained in the low level learning phase using the current data batch Bk. Meanwhile, the network parameters of other hidden layers are frozen to preserve the old knowledge which prevents the catastrophic forgetting. The warning phase (12) indicates the transition situation where more observations are required to signal a concept drift. Because of this reason, the current data batch is stored in the buffer Bwarning=Bk and is exploited to initialize a new hidden layer if the drift occurs in the next timestamp k+1. The stable condition yields to the adjustment of the current structure via the low level learning phase and the deletion of the data in the buffer.

Hidden layer pruning.

ADL employs the hidden layer pruning mechanism to handle the redundancy across different hidden layers. This is achieved by analyzing the correlation of the output y(l) [3]. Based on the manifold learning concept, a redundant hidden layer embracing similar concept is expected not to inform an important representation of the given problem or at least very well covered by other hidden layers, because it does not open the manifold of learning problem to a unique representation. The MICI method is utilized to explore the correlation between two hidden layers it yields to the pruning condition, as follows:

γ(y(i),y(j))>δ,i,j=1,,L,ij (13)

δ is a user-defined threshold which is proportional to the maximum correlation index, where the lower the value the less pruning mechanism is executed. If (13) is satisfied, the pruning process encompasses the hidden layer with the lowest β, i.e. HLpruningminlp=i,jβ(lp). Note that β is expected as an appropriate indicator of a hidden layer performance because it is dynamically adjusted using dynamic decreasing factor. Consequently, the direct connection from lp-th hidden layer to the output Y^ is deleted yet that hidden layer still performs the forward-pass operation providing the representation h(lp). This strategy also accelerates the model update because the pruned hidden layer is ignored in the learning procedure. It can be regarded as the dropout scenario in the realm of deep learning [23], yet ADL relies on the similarity analysis (13) instead of the probabilistic approach. The illustration of incremental learning aspect of ADL is depicted in Figure 2.

Figure 2: Example of an automated DNN construction used by the proposed ADL

4.4 The Solution of Catastrophic Forgetting

Having a flexible structure embracing different-depth enables ADL to address the problem via two mechanisms elaborated in this section.

Dynamic voting weight adaptation.

Every voting weight β(l) is dynamically adjusted by a unique decreasing factor p(l)[0,1] which plays an important role while adapting to the concept drift. A high value of p(l) provides slow adaptation to the rapidly changing environment, yet it handles gradual or incremental drift very well. Conversely, a low value of p(l) gives frequent adaptation to sudden drift, yet it forfeits the stability while dealing with gradual drift where data samples embrace two distributions. This issue is handled by continuously adjusting p(l) to represent the performance of each hidden layer using a step size ζ, as per in (14). These are realized by setting p(l) to either (p(l)+ζ) or (p(l)-ζ) when the l-th hidden layer returns a correct prediction or incorrect one, respectively. This also considers the fact that the voting weight of a hidden layer embracing relevant representation should decrease slowly when making misclassification while that embracing irrelevant representation should increase slowly when returning the correct prediction.

p(l)=p(l)±ζ (14)
β(l)=min(β(l)(1+p(l)),1) (15)
β(l)=p(l).β(l) (16)

The reward and penalty scenario are carried out by increasing and decreasing the voting weight based on the performance of its respective hidden layers, y(l). The reward is given when a hidden layer returns a correct prediction, as per in (15). Conversely, a hidden layer is penalized if it makes an incorrect prediction, as per in (16). The reward scenario is capable of handling the cyclic drift by reactivating the hidden layer embracing a small β. Unlike its predecessors in [4, 3], the aims of reward and penalty scenario carried out here are to augment the impact of a strong hidden layer by providing a high reward and a low penalty and to diminish a weak hidden layer by giving a small reward and a high penalty. Note that ADL possesses different-depth structure where every hidden layer has a direct connection to the output. As a result, the classification decision should consider the relevance of each hidden layer based on the prequential error. This approach aligns with the DDS as a method to increase the network depth because it guarantees ADL to embrace a different concept in each hidden layer.

Winning layer adaptation.

SGD method is employed to adjust the network parameters of the winning layer, i.e. θ(lw), using labelled data batch Bk=(Xk,Ck)T×(n+m) in a single-pass manner. It is derived using the cross-entropy loss minimization. However, instead of using the global error derivative, ADL exploits the local one which is backpropagated from the winning layer y(lw). By this approach, each hidden layer is optimized based on a different objective which embraces a different concept. This enables ADL to improve its generalization power while reducing the risk of being suffered from catastrophic forgetting problem. Note that the parameter adjustment mechanism is executed under a dynamic network which consists of single hidden node in the beginning and can grow on demand.

5 Empirical Evaluation

This section outlines the empirical study of ADL in which it is compared against four algorithms.

Experimental setting.

DL is numerically validated using nine prominent data stream problems, i.e., Permuted MNIST [16], Weather [24], KDDCup [25], SEA [26], hyperplane [27], SUSY, Hepmass [28], RLCPS [29], and RFID localization [2]. The first five characterize non-stationary properties, while the others feature prominent characteristics in examining the performance of data stream algorithms: big size, high input feature, etc. ADL is compared against fixed-structure DNN to observe kind of improvement produced by ADL while embracing the flexible different-depth structure. The value of [αD,αW,δ,ζ] are set to [0.0001,0.0005,0.05,0.001] in all problems. It is worth noting that αD,αW determines the confidence level of Hoeffding bound 1-αD,W. The selected values 0.0001,0.0005 return very high confidence levels close to 100%. DNN network structure is initialized before the execution. ADL is compared to another deep stacked network embracing a flexible different-depth structure, that is DEVFNN (DFN) [8]. ADL is also compared against pEnsemble+ (pE+) [3] and pEnsemble (pE) [4] aims to present the improvement over an evolving ensemble structure.

Table 1: Numerical results of consolidated algorithm
Class. rate ET HL HN NoP
S ADL 78.26±2.8 2.5K 𝟐±0.6 614±40 (17±7)K
U pE+ 76.99±4.6 35K 19±6 9±3 230±80
S pE 74.44±2.4 14K 3±2 2±1 36±21
Y DFN 76.7±3.2 5K 29±12 24±12 (72±6.4)K
DNN 51.19±4.64 6K 23 637 5.7K
H. ADL 84.02±2.2 0.59K 𝟐±0.7 154±10 (1.7±1.3)K
M pE+ 82.3±2.2 7.6K 2±0.7 2±0.7 24±8
A pE 82.6±1.9 12K 2±0.7 2±0.7 24±8
S DFN 80.46±6.87 1.5K 10±5 8±5 (15±10)K
S DNN 50.03±2.41 0.8K 8 160 4K
R ADL 99.99±0.03 1.6K 1 58±2 690±20
L pE+ 99.8±0.3 12.6K 7±1 7±1 84±13
C pE 99.7±0.3 60K 50±15 50±15 600±190
P DFN 99.7±0.3 0.8K 1 1 128
S DNN 99.99±0.03 0.54K 1 58 0.7K
R ADL 99.11±𝟐 55.1 1 51±10 420±80
F pE+ 60.9±7.6 0.7K 2±0.8 1±0.5 44±14
I pE 60.4±6.7 0.5K 2±1 2±0.7 43±22
D DFN 93.5±5.8 74.8 10±3 9±3 (7.6±5)K
DNN 99.1±2.7 33.02 1 51 0.4K
P. ADL 81.62±11.5 26.3 1 22±6 (18±4.5)K
M pE+ NA NA NA NA NA
NI pE NA NA NA NA NA
S DFN NA NA NA NA NA
T DNN 78.8±12.7 18.5 1 20 16K
W ADL 74.48±5.19 3.1 1 8±1 93±15
e pE+ 78.8±𝟒 29.42 2±0.2 1 24±2
a pE 78.4±4.3 33.49 2 1 24
t DFN 78.6±4.3 7.8 3 1 318
h. DNN 71.38±8.74 1.98 1 8 90
K ADL 99.85±0.18 102.7 1 26±1.2 1120±50
D pE+ 96.7±6 0.86K 1 1 12
D pE 99.3±0.4 5.4K 1 1 12
C DFN 99.16±0.5 0.21K 1 1 1900
p. DNN 99.84±0.19 62.46 1 26 1000
S ADL 92.82±5.79 18 1 22±8 130±50
E pE+ 92±6 0.2K 5±2 2±1 60±19
A pE 92±5.7 0.18K 5±2 2±1 60±19
DFN 91.9±5.3 39.1 2 1 52
DNN 92.5±6.49 11.28 1 22 0.13K
H ADL 92.33±2.63 21.51 1 16±1 110±7
y pE+ 87.6±6.2 0.15K 5±1 3±0.5 55±11
p pE 91.8±1.9 68.2 4±4 3±2 23±45
e DFN 91.77±1.6 0.3K 2 1 76
r. DNN 92±3.28 13.39 1 16 0.11K

ET: execution time, HL: hidden layers, HN: hidden nodes, NoP: number of parameters

The performance of all algorithms are assessed using five performance metrics: classification rate, execution time (ET), HL, HN, and the number of parameters (NoP). The prequential evaluation is conducted in a single-pass mode to simulate real data stream environments. The numerical results are averaged across all time stamps except the execution time. HL is the number of hidden-layer-to-output connection in ADL, the number of ensemble in pEnsemble and pEnsemble+, and the number of stacked building unit in DEVFNN. HN represents the total nodes in ADL and DNN, while in the remainder methods it signifies the total fuzzy rule. All experiments are executed in the same computational environment to assure fair comparisons under MATLAB environments with the Intel(R) Xeon(R) CPU E5-1650 @3.20 GHz processor and 16 GB RAM.

Numerical results.

From Table 1, ADL delivers up to 68% performance improvement over consolidated algorithms in terms of accuracy. This also demonstrates that the fully elastic network of ADL, where the hidden node and the hidden layer can be added or discarded on demand, can arrive at appropriate complexity for a specific problem and is comparable to those three evolving algorithms. ADL delivers the fastest execution time compared to those evolving algorithms in most cases. This result is understood because ADL is built upon MLP, while those algorithms are constructed by the multi-classifier concept possessing high computational and space complexity. This enables ADL to execute the high dimensional data, permuted MNIST problem, which results in 3% improved accuracy over DNN, while the evolving algorithms are not scalable to deal with this problem. In terms of resolving the catastrophic forgetting, ADL delivers the most encouraging performance. The evidence can be seen from the numerical results of big datasets, SUSY and Hepmass, where ADL delivers the highest classification rate. These facts are reasonable since ADL characterizes different-depth structure supported by dynamic voting weight and winning layer adaptation which enables ADL to flexibly recall the previous knowledge or craft the new one.

6 Conclusion

This paper presents a novel self-organizing DNN, namely ADL. It possesses a flexible different-depth structure where the network structure can be automatically constructed from scratch with the absence of problem-specific user-defined parameters. The adaptation of network width is controlled by the estimation of bias and variance while the hidden layer can be deepened using the drift detection method. Possessing different-depth structure becomes the key characteristics of ADL to address catastrophic forgetting problem in the lifelong learning environment. It enables ADL to put more emphasis on the most relevant layer via dynamic voting scenario and winning layer adaptation. Our empirical evaluation has validated the effectiveness of ADL in dealing with non-stationary data streams under prequential test-then-train protocol. It also demonstrates the increase in performance over fixed structure DNN embracing the same network complexity. Future work inspired by this method should investigate the feasibility of ADL to handle unstructured data streams.

Acknowledgement

This work is supported by A*STAR-NTU-SUTD AI Partnership Grant No. RGANS1902.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012.
  • [2] Andri Ashfahani, Mahardhika Pratama, Edwin Lughofer, Qing Cai, and Huang Sheng. An Online RFID Localization in the Manufacturing Shopfloor, pages 287–309. Springer International Publishing, 2019.
  • [3] Mahardhika Pratama, Eric Dimla, Edwin Lughofer, Witold Pedrycz, and Tegoeh Tjahjowidodo. Online tool condition monitoring based on parsimonious ensemble+. IEEE Transactions Cybernetics, 2018.
  • [4] M. Pratama, W. Pedrycz, and E. Lughofer. Evolving ensemble fuzzy classifier. IEEE Transactions on Fuzzy Systems, pages 1–1, 2018.
  • [5] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 2018.
  • [6] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. ICLR, 2018.
  • [7] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Comput. Surv., 46(4):44:1–44:37, March 2014.
  • [8] M. Pratama, W. Pedrycz, and G. I. Webb. An Incremental Construction of Deep Neuro Fuzzy System for Continual Learning of Non-stationary Data Streams. ArXiv e-prints, August 2018.
  • [9] Guanyu Zhou, Kihyuk Sohn, and Honglak Lee. Online incremental feature learning with denoising autoencoders. Journal of Machine Learning Research, 22:1453–1461, 2012.
  • [10] Ferenc Huszár. Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences, 2018.
  • [11] D. Sahoo, Q. D. Pham, J. Lu, and S. C. Hoi. Online deep learning: Learning deep neural networks on the fly. arXiv preprint arXiv:1711.03705, abs/1711.03705, 2017.
  • [12] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016.
  • [13] M. Pratama, A. Ashfahani, Y. S. Ong, S. Ramasamy, and E. Lughofer. Autonomous Deep Learning: Incremental Learning of Denoising Autoencoder for Evolving Data Streams. ArXiv e-prints, September 2018.
  • [14] D. H. Wolpert. The power of depth for feed-forward neural networks. Journal of Machine Learning Research, 49:1–39, 2016.
  • [15] Anjin Liu, Jie Lu, Feng Liu, and Guangquan Zhang. Accumulating regional density dissimilarity for concept drift detection in data streams. Pattern Recognition, 76:256–272, 2018.
  • [16] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks, 2016. cite arxiv:1612.00796.
  • [17] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013.
  • [18] Guido F. Montúfar, Razvan Pascanu, KyungHyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2924–2932, 2014.
  • [19] Isvani Frías-Blanco, José del Campo-Ávila, Gonzalo Ramos-Jiménez, Rafael Morales-Bueno, Agustín Ortiz-Díaz, and Yailé Caballero-Mota. Online and non-parametric drift detection methods based on hoeffding’s bounds. IEEE Transactions on Knowledge and Data Engineering, 27(3):810–823, 2015.
  • [20] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
  • [21] João Gama, Ricardo Fernandes, and Ricardo Rocha. Decision trees for mining data streams. Intell. Data Anal., 10(1):23–45, January 2006.
  • [22] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.
  • [23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [24] G. Ditzler and R. Polikar. Incremental learning of concept drift from streaming imbalanced data. IEEE Trans. on Knowl. and Data Eng., 25(10):2283–2301, October 2013.
  • [25] Salvatore J. Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Philip K. Chan. Cost-based modeling for fraud and intrusion detection: Results from the jam project. In In Proceedings of the 2000 DARPA Information Survivability Conference and Exposition, pages 130–144. IEEE Computer Press, 2000.
  • [26] W. N. Street and Y-S Kim. A streaming ensemble algorithm (sea) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, pages 377–382, New York, NY, USA, 2001. ACM.
  • [27] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. Moa: Massive online analysis. J. Mach. Learn. Res., 11:1601–1604, August 2010.
  • [28] P. Baldi, Paul D. Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5:4308, 2014.
  • [29] Murat Sariyar, Andreas Borg, and Klaus Pommerening. Controlling false match rates in record linkage using extreme value theory. Journal of Biomedical Informatics, 44(4):648–654, 2011.