Block-Randomized Stochastic Proximal Gradient for Low-Rank Tensor Factorization

  • 2020-03-25 17:24:33
  • Xiao Fu, Shahana Ibrahim, Hoi-To Wai, Cheng Gao, Kejun Huang
  • 0

Abstract

This work considers the problem of computing the canonical polyadicdecomposition (CPD) of large tensors. Prior works mostly leverage data sparsityto handle this problem, which is not suitable for handling dense tensors thatoften arise in applications such as medical imaging, computer vision, andremote sensing. Stochastic optimization is known for its low memory cost andper-iteration complexity when handling dense data. However, exisitingstochastic CPD algorithms are not flexible enough to incorporate a variety ofconstraints/regularizations that are of interest in signal and data analytics.Convergence properties of many such algorithms are also unclear. In this work,we propose a stochastic optimization framework for large-scale CPD withconstraints/regularizations. The framework works under a doubly randomizedfashion, and can be regarded as a judicious combination of randomized blockcoordinate descent (BCD) and stochastic proximal gradient (SPG). The algorithmenjoys lightweight updates and small memory footprint. In addition, thisframework entails considerable flexibility---many frequently used regularizersand constraints can be readily handled under the proposed scheme. The approachis also supported by convergence analysis. Numerical results on large-scaledense tensors are employed to showcase the effectiveness of the proposedapproach.

 

Quick Read (beta)

Block-Randomized Stochastic Proximal Gradient for Low-Rank Tensor Factorization

Xiao Fu, , Shahana Ibrahim, , Hoi-To Wai, ,
Cheng Gao, and Kejun Huang,
This work is supported in part by National Science Foundation under Projects NSF ECCS-1608961, ECCS-1808159, III-1910118, the Army Research Office (ARO) under Proejct ARO W911NF-19-1-0247, and by the Chinese University of Hong Kong under the CUHK Direct Grant 4055113. X. Fu and S. Ibrahim are with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331, United States. email (xiao.fu,ibrahish)@oregonstate.edu H.-T. Wai is with the Department of Systems Engineering and Engineering Management, Shatin, Hong Kong. email: [email protected] C. Gao was with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331, United States. He is now with the University of Missouri - Columbia, Columbia, MO 65211. email: [email protected] K. Huang is with the Department of Computer and Information Science and Engineering, Gainesville, FL 32611, University of Florida. email: [email protected]
Abstract

This work considers the problem of computing the canonical polyadic decomposition (CPD) of large tensors. Prior works leverage data sparsity to handle this problem, which is not suitable for handling dense tensors that often arise in applications such as medical imaging, computer vision, and remote sensing. Stochastic optimization is known for its low memory cost and per-iteration complexity when handling dense data. However, existing stochastic CPD algorithms are not flexible to incorporate a variety of constraints/regularization terms that are of interest in signal and data analytics. Convergence properties of many such algorithms are also unclear. In this work, we propose a stochastic optimization framework for large-scale CPD with constraints/regularization terms. The framework works under a doubly randomized fashion, and can be regarded as a judicious combination of randomized block coordinate descent (BCD) and stochastic proximal gradient (SPG). The algorithm enjoys lightweight updates and small memory footprint. This framework entails considerable flexibility—many frequently used regularizers and constraints can be readily handled. The approach is supported by convergence analysis. Numerical results on large-scale dense tensors are presented to showcase the effectiveness of the proposed approach.

Large-scale tensor decomposition, canonical polyadic decomposition, stochastic gradient, Adagrad

I Introduction

Canonical polyadic decomposition (CPD) [previously known as parallel factor analysis (PARAFAC)] [1, 2, 3] is arguably the most popular low-rank tensor decomposition model. CPD has found applications in various fields, such as analytical chemistry [4], social network mining [5], hyperspectral imaging [6], topic modeling [7], and time series analysis [8]; also see [9, 10, 11] for more applications in communications.

Computing the CPD of a tensor, however, is a challenging optimization problem [12]. Many algorithms have been proposed through the years [13, 14, 15, 3]. To keep pace with the ever growing volume of available data, one pressing challenge is to compute CPD at scale. The classic alternating least squares (ALS) algorithm [3] has an elegant algorithmic structure, but suffers from a number of numerical issues [16, 17] and is hardly scalable. In recent years, many new CPD algorithms have appeared, triggered by the advances in big data analytics and first-order optimization [18, 19, 14, 13, 20]. Many of these algorithms leverage data sparsity to scale up CPD—by cleverly using the zero elements in huge tensors, computationally costly key operations in ALS (e.g., the matricized tensor times Khatri-Rao product (MTTKRP) operation) can be significantly simplified. Consequently, the classic ALS algorithm can be modified to handle CPD of huge and sparse tensors.

However, when the tensor to be factored is dense—i.e., when most entries of the tensor are nonzero—the sparsity-enabled efficient algorithms [13, 14, 18, 19, 6] are no longer applicable. Note that large and dense tensors arise in many timely and important applications such as medical imaging [21], hyperspectral imaging [6], and computer vision [22]. In fact, since big dense tensors typically cost a lot of memory (e.g., a dense tensor with a size of 2,000×2,000×2,000 occupies 57.52GB memory if saved as double-precision numbers), it is even hard to load them into the RAM of laptops, desktops, or servers.

Stochastic gradient (SG) method is a powerful tool for handling optimization problems involving a large amount of data, which is known for its low per-iteration memory and computational complexities [23]. A number of stochastic optimization based CPD algorithms have been proposed in the literature [24, 25, 26]. Specifically, The works in [24, 25] work in an iterative manner. In each iteration, the algorithm samples a random subset of the tensor entries and update the corresponding parts of the latent factors using the sampled data. The algorithms have proven quite effective in practice, and feature distributed implementation [25]. The challenge here is that every tensor entry only contains information of a certain row of the latent factors, and updating the entire latent factors may require a high complexity. This may lead to slow improvement of the latent factor estimation accuracy. More importantly, this update strategy loses the opportunity to incorporate constraints/regularization terms on the whole latent factors, since the sampled entries only contain partial information of them. This is undesired in practice, since prior information on the latent factors are critical for enhancing performance, especially in noisy cases.

Recently, a stochastic algorithm that ensures updating one entire latent factor in every iteration was proposed in [26]. Instead of sampling tensor entries, the algorithm works via sampling tensor fibers that contain information of the whole latent factors. This algorithm exhibits very good empirical performance when the tensor rank is low. However, this algorithm works with at least as many fibers as the tensor rank, which in some cases gives rise to high per-iteration complexity. In addition, the algorithm in [26] did not explicitly offer implementations that take into considerations of constraints or regularization terms—although this can be fixed with some modifications. Lastly, convergence properties of many stochastic CPD algorithms such as those in [24, 26] are unclear.

Contributions In this work, we propose a new stochastic algorithmic framework for computing the CPD of large-scale dense tensors. Our contributions include:

A Doubly Randomized Computational Framework for Large-Scale CPD. We propose an efficient and flexible computational framework for CPD of large dense tensors. Our method is a judicious combination of randomized block coordinate descent (BCD) [27, 28] and stochastic proximal gradient (SPG) [29, 30]. In each iteration, our algorithm first samples a mode from all modes of the tensor. Then, the algorithm samples some fibers of this mode and updates the corresponding latent factor via stochastic proximal operations. Such a combination exhibits an array of attractive features: It admits smaller per-iteration memory and computational complexities relative to the existing fiber sampling based method in [26], particularly in high-rank cases. More importantly, it is very flexible in terms of incorporating regularization terms and constraints on the latent factors.

Convergence Analysis. Both BCD and SPG are well studied topics in the optimization literature [31, 28, 27]. However, convergence properties of the proposed framework is not immediately clear due to the nonconvex nature of CPD. The existing block-randomized SGD (BR-SGD) framework [32] only considers convex optimization. Another work [33] considers nonconvex optimization, which adopts a Gauss-Seidel type block updating strategy without randomization. The conditions for convergence in [33] are not easy to check or guarantee in the context of tensor factorization. In contrast, we offer tailored convergence analyses of the proposed algorithm leveraging block randomization, and show that the proposed optimization strategy features sub-sequence convergence to a stationary point—which is a necessary condition for attaining local or global optima.

Implementation-friendly Adaptive Stepsize Scheduling. One of the most challenging aspects in stochastic optimization is selecting a proper stepsize schedule. To make the proposed algorithms friendly to use by practitioners, we propose a practical and adaptive stepsize schedule based on the celebrated Adagrad algorithm [34]. Adagrad is an adaptive stepsize selection method devised for single-block gradient descent. Nonetheless, we find through extensive simulations that it largely helps reducing the agonizing pain of tuning stepsize when implementing our multi-block algorithm for CPD. In addition, we show that the adaptive stepsize-based algorithm converges to a stationary point almost surely.

A quick demonstration of the effectiveness of the proposed algorithms is shown in Fig. 1, where the average mean squared error (MSE) of the estimated latent factors [cf. Eq. (22)] against the number of MTTKRP computed (which dominates the complexity) is plotted. One can see that the proposed algorithm largely outperforms a couple of state-of–the-art algorithms for constrained CPD. More thorough numerical results can be seen in Sec. VI.

Part of the work was submitted to ICASSP 2019 [35]. In this new version, we have included detailed convergence proofs and the new adaptive stepsize based algorithm. More extensive simulations and real-data experiments are also included.

Fig. 1: The proposed algorithms (AdaCPD and BrasCPD) exhibit low complexity for achieving high accuracy of the estimated latent factors. The tensor under test has a size of 100×100×100 and the rank is 10. The latent factors are constrained to be nonnegative. The baselines are two state-of-art constrained CPD algorithms AO-ADMM [13] and APG [14].

Notation. We follow the established conventions in signal processing. x, 𝒙, 𝑿, and 𝑿¯ denote scalar, vector, matrix, and tensor, respectively; denotes the Euclidean norm, i.e., 𝒙2 and 𝑿F, respectively; , , and denote outer product, Khatri-Rao product, and Hadamard product, respectively, unless otherwise specified; vec(𝑿) denotes the vectorization operator that concatenates the columns of 𝑿; 𝑿𝟎 means that all the entries of 𝑿 are nonnegative; and denote transpose and pseudo-inverse, respectively; |𝒞| denotes the cardinality of set 𝒞; λmax() denotes the largest eigenvalue of a matrix. Unless otherwise specified, we denote the total expectation by the subscript-less operator 𝔼[].

II Background

We first introduce some notions used in tensor algebra.

II-A Tensors and CPD

An Nth order tensor is an array whose entries are indexed by N coordinates; i.e., 𝑿¯(i1,,iN) denotes an element of the tensor 𝑿¯ with a size of I1×I2××IN. Like matrices, tensors can be represented as sum of rank-one components:

𝑿¯=f=1F𝑨(1)(:,f)𝑨(2)(:,f)𝑨(N)(:,f), (1)

where “” denotes the outer product of vectors, and 𝑨(n) is an In×F matrix that is often referred to as the mode-n latent factor. When F is the minimal integer that satisfies the expression in (1), the right hand side in (1) is called the canonical polyadic decomposition of the tensor 𝑿¯. At the entry level, the CPD can be expressed as

𝑿¯(i1,,iN)=f=1Fn=1N𝑨(n)(in,f), (2)

for in{1,,In}. The CPD of a tensor is essentially unique under mild conditions11 1 The latent factors 𝑨(n)’s that constitute the data 𝑿¯ are unique up to some trivial ambiguities like column permutations and scalings [2].. The CPD of a tensor can be obtained via minimizing a certain criterion as follows:

minimize{𝑨(n)}n=1Nf(𝑨(1),,𝑨(N)). (3)

A common optimization criterion for CPD in the literature is the least squares (LS) fitting criterion [3, 13, 14]:

f(𝑨(1),,𝑨(N))=𝑿¯-f=1F𝑨(1)(:,f)𝑨(N)(:,f)F2.

In the sequel, we will often use the shorthand notation f(𝜽) to denote f(𝑨(1),,𝑨(N)), where

𝜽=[vec(𝑨(1)),,vec(𝑨(N))].

Other criteria have also been considered, e.g., the Kullback-Leibler (KL) divergence [36] and robust fitting [37, 38] criteria—which serve for different purposes.

II-B Unfolding, ALS and MTTKRP

The matricization operation, or matrix unfolding of a tensor, has proven very useful in designing tensor factorization algorithms. The mode-n unfolding of a tensor is a Jn×In matrix where

𝑿¯(i1,,iN)=𝑿(n)(j,in),

and we have j=1+k=1,knN(ik-1)Jk and Jk=m=1,mnk-1Im [1]. The CPD representation in Eq. (1) can be expressed as

𝑿(n)=𝑯(n)𝑨(n), (4)

where the Jn×F matrix 𝑯(n) is defined as

𝑯(n)=𝑨(1)𝑨(n-1)𝑨(n+1)𝑨(N)=i=1,inN𝑨(i).

The elegant form of the unfoldings has enabled the famous alternating least squares (ALS) algorithm [3] for handling Problem (3) with the LS objective. Specifically, ALS solves the following cyclically for n=1,,N:

𝑨(n) argmin𝑨𝑿(n)-𝑯(n)𝑨F2. (5)

Problem (5) is nothing but a least squares problem that admits the following closed-form solution:

𝑨(n)((𝑯(n)𝑯(n))-1𝑯(n)𝑿(n)),

if rank(𝑯(n))=F. Note that (𝑯(n)𝑯(n))-1 is not difficult to compute by exploiting the Khatri-Rao structure of 𝑯(n) [13, 2, 1]. However, when the problem dimension is large (which often happens in applications such as medical imaging, remote sensing, and computer vision), solving the seemingly simple problem in (5) can be computationally prohibitive. The reason is that both 𝑿(n)(j=1,jnNIj)×In and 𝑯(n)(j=1,jnNIj)×F can be very large matrices. In particular, the so-called matricized tensor times Khatri-Rao product (MTTKRP) operation, i.e.,

𝖬𝖳𝖳𝖪𝖱𝖯:  𝑯(n)𝑿(n)

that happens in every iteration of ALS costs 𝒪(n=1NInF) flops (or, 𝒪(INF) if In=I). This is quite costly even if In is moderately large. Many works have considered fast algorithms for computing MTTKRP, but these methods are mainly enabled by judiciously exploiting sparsity of the tensor data [39, 18]. Computing MTTKRP for dense tensors has also been considered. Nonetheless, these works are often concerned with practical implementation schemes such as parallelization and memory-efficient computation strategies, but the number of computational flops required is naturally high for the dense tensor case; see, e.g., [40, 41].

In a lot of applications, some prior knowledge on the latent factors is known—e.g., in image processing, 𝑨(n)’s are normally assumed to be nonnegative [6]; in statistical machine learning, sometimes the columns of 𝑨(n) are assumed to be constrained within the probability simplex [36, 42]; i.e.,

𝟏𝑨(n)=𝟏,𝑨(n)𝟎. (6)

In those cases, the following criterion is often of interest:

minimize{𝑨(n)}n=1N f(𝑨(1),,𝑨(N)) (7)
subjectto 𝑨(n)𝒜n.

Compared to the unconstrained version, Problem (7) is even harder to handle. Some recent methods combine first-order constrained optimization and ALS [14, 13] to make the tensor factorization algorithms more flexible in handling constraints and regularization terms—but the complexity orders of those algorithms often scale similarly as that of ALS, since these algorithms do not avoid computing 𝑯(n)𝑿(n) that is the bottleneck for computing CPD.

II-C Stochastic Optimization

When the tensor is large and dense, working with the entire dataset could be computationally and memory-wise expensive. A popular workaround is to apply stochastic optimization—i.e., sampling parts of the data at random and use the sampled piece to update the latent factors. Using Eq. (2), Problem (3) with the LS objective is equivalent to the following:

minimize{𝑨(n)}(1/T)i1,,iNfi1,,iN(𝜽), (8)

where T=n=1NIn and

fi1,,iN(𝜽)=(𝑿¯(i1,,iN)-f=1Fn=1N𝑨(n)(in,f))2.

The objective function in (8) can be understood as an empirical risk [23]. Using this observation, the algorithms in [24, 25] randomly sample a subset of entries indexed by {(i1,,iN)} and update the pertinent parts of the latent factors (note that the (i1,,iN)th entry of tensor contains the information of 𝑨(n)(in,:) for n=1,,N) using the sampled entries of the tensor. For example, [25] uses a stochastic gradient (SG) based approach and update the 𝑨(n)(in,:)’s that are associated with the sampled entries. The sampling method in [24] is similar, while the update is not gradient-based but Gauss-Newton or ALS applied to the sampled set of entries (or, sub-tensors, to be precise). The upshot of this line of work is that the per-iteration complexity can be quite low.

Despite of such favorable complexity savings, the approaches in [24, 25] have a couple of limitations. One challenge is that many useful prior information cannot be incorporated in the algorithm. The reason is that these algorithms update part of the rows of 𝑨(n)’s, while many useful priors are defined w.r.t. the columns of the latent factors, e.g., the probability simplex constraint in (6) and the total variation constraint that is heavily used in image processing. For example, the algorithm in [24] samples subtensors 𝑿¯𝗌𝗎𝖻=𝑿¯(𝒮1,,𝒮N) (where 𝒮n{1,,In}) to update the corresponding 𝑨(n)(𝒮n,:)’s. Under such a scheme, it is hard to handle constraints like 𝟏𝑨(n)=𝟏,𝑨(n)𝟎 that are critical in statistical learning [42, 43, 44, 45]. Third, convergence properties of these methods are often unclear.

An alternative [26] is to leverage the tensor data structure by considering randomly sampled fibers of tensors. Note that a mode-n fiber of 𝑿¯ (cf. Fig. 2) is a row of the mode-n unfolding 𝑿(n) [1]. Now, assuming that one samples a set of mode-n fibers indexed by n{1,,Jn}, then 𝑨(n) can be updated by solving a ‘sketched version’ of Problem (5):

𝑨(n) argmin𝑨𝑿(n)(n,:)-𝑯(n)(n,:)𝑨F2, (9)

If |n|F, then the sketched system of linear equations 𝑿(n)(n,:)𝑯(n)(n,:)𝑨(n) can be over-determined. Hence, one can update 𝑨(n) by solving the |n| dimensional linear system

𝑨(n)𝑯(n)(n,:)𝑿(n)(n,:). (10)

Similar to the ALS algorithm, after updating 𝑨(n), the algorithm moves to mode-(n+1) fibers and repeats the same for updating 𝑨(n+1). The downside of this method is that it needs to sample at least F fibers for each update, and F can be larger than In in tensor decomposition. In addition, the work in [26] focused on unconstrained cases while did not offer implementations for constrained/regularized cases. This can be compensated by replacing (10) with a constrained least squares solver or certain constraint enforcing operations. However, convergence properties of doing so are also unclear.

Fig. 2: From left to right: mode-1,2,3 tensor fibers of a third-order tensor, respectively.

III Proposed Algorithm

In this work, we propose a new stochastic optimization strategy for CPD. Our method combines the insights from ALS and fiber sampling, but allows |n|F. This is instrumental in practice, since it is the key for achieving low per-iteration complexity. The proposed algorithm can easily handle a variety of constraints and regularizations that are commonly used in signal processing and data analytics—which is reminiscent of stochastic proximal gradient (SPG) [30, 46]. In addition, we provide convergence analyses to back up the proposed approach.

III-A Basic Idea: Unconstrained Case

We first consider Problem (3). Our idea is to apply SA while exploiting the tensor fiber structure. Specifically, at each iteration, we sample a set of mode-n fibers for a certain n as the method in [26] does. However, instead of exactly solving the least squares subproblems (5) for all the modes following a Gauss-Seidel manner in each iteration, we update 𝑨(n) using a doubly stochastic procedure. To be more precise, at iteration r, we first randomly sample a mode index n{1,,N}. Then, we randomly sample a set of mode-n fibers that is indexed by n{1,,Jn}. Let 𝑮(r)(I1++IN)×F such that

𝑮(r)=[(𝑮(1)(r)),,(𝑮(N)(r))],

where we have

𝑮(n)(r)=1|n|(𝑨(n)(r)𝑯(n)(n)𝑯(n)(n)-𝑿(n)(n)𝑯(n)(n))
𝑮(n)(r)=𝟎,nn, (11)

and we used the shorthand notations

𝑿(n)(n)=𝑿(n)(n,:),𝑯(n)(n)=𝑯(n)(n,:).

The latent variables are updated by

𝑨(n)(r+1)𝑨(n)(r)-α(r)𝑮(n)(r),n=1,,N. (12)

Observe that 𝑮(n)(r) is an estimate of the gradient applied to f(𝑨(1),,𝑨(N)) taken w.r.t. the mode-n variable 𝑨(n), and the update is an iteration of the SG algorithm with a minibatch size |n| for solving the problem in (5).

The proposed update is very efficient, since the most resource-consuming update 𝑯(n)𝑿(n) in algorithms such as those in [13, 14] is avoided. The corresponding part 𝑿(n)(n,:)𝑯(n)(n,:) costs only 𝒪(|n|FIn) flops—and |n| is under our control. Note that the first step in this procedure is different from standard ALS-type algorithms that update the block variables 𝑨(n) cyclically instead of updating a randomly sampled block. As we will show, this modification greatly simplifies our convergence analysis.

III-B Constrained and Regularized Case

As mentioned, there are many cases in practice where considering regularizations or constraints on 𝑨(n)’s can benefit the associated tasks. Since our framework updates an entire 𝑨(n) in each iteration, it is friendly for incorporating a large variety of commonly used constraints/regularizations—which is more flexible relative to the entry sampling based approaches in [24, 25]. The algorithm can be easily extended to handle the constrained/regularized case. Consider:

minimize{𝑨(n)}n=1N f(𝜽)+n=1Nhn(𝑨(n)) (13)

where hn(𝑨(n)) denotes a structure-promoting regularizer on 𝑨(n). Note that if 𝑨(n)𝒜n is desired, we can write hn() is defined as the indicator function of set 𝒜n:

hn(𝑨)=(𝒜n)={0,𝑨𝒜n,otherwise.

Using the same fiber sampling strategy as in the previous subsection, we update 𝑨(n) by

𝑨(n)(r+1) argmin𝑨(n)𝑨(n)-(𝑨(n)(r)-α(r)𝑮(n)(r))F2
             +hn(𝑨(n)), (14a)
𝑨(n)(r+1) 𝑨(n)(r),nn. (14b)

If hn() is a closed proper convex function, the update (14a) can be solved by applying the proximal operator of hn(), which is often denoted as

𝑨(n)(r+1)𝖯𝗋𝗈𝗑hn(𝑨(n)(r)-α(r)𝑮(n)(r)). (15)

Many hn()’s admit simple closed-form solutions for their respective proximal operators, e.g., when hn() is the indicator function of the nonnegative orthant and hn()=1; see Table I and more details in [13, 47]. The complexity of computing (15) is similar to that of the plain update in (12), and thus is also computationally efficient. An overview of the proposed algorithm can be found in Algorithm 1, which we name Block-Randomized SGD for CPD (BrasCPD).

\SetKwInOutInputinput \SetKwInOutOutputoutput \SetKwRepeatRepeatrepeatuntil \InputN-way tensor 𝑿¯I1××IN; rank F; sample size B, initialization {𝑨(n)(0)}, step size {α(r)}r=0, r0; \Repeatsome stopping criterion is reached uniformly sample n from {1,,N}, then sample n uniformly from {1,,Jn} with |n|=B; form the stochastic gradient 𝑮(r)(11); update 𝑨(n)(r+1)(14a), 𝑨(n)(r+1)𝑨(n)(r) for nn; rr+1; \Output{𝑨(n)(r)}n=1N
\algorithmcfname 1 BrasCPD
TABLE I: Proximal/projection operator of some frequently used regularizations and constraints.
h() prox./proj. solution complexity
1 soft-thresholding 𝒪(d)
2 re-scale 𝒪(d)
2,1 block soft-thresholding 𝒪(d)
0 hard-thresholding 𝒪(d)
(Δ) randomized pivot search [48] 𝒪(d) in expectation
(+) max 𝒪(d)
monotonic monotone regression [49] 𝒪(d)
unimodal unimodal regression [50] 𝒪(d2)

In the table, d is the number of optimization variables.

IV Convergence Properties

In this section, we offer tailored convergence analyses for BrasCPD. To this end, two most relevant works from the optimization literature are [14] and [32]. The work in [32] considers block-randomized SGD, but only for the convex case—while our problem is nonconvex. The work in [14] considers the Gauss-Seidel type block SGD (i.e., cyclically updating the blocks), instead of the block-randomized version as BrasCPD uses. There, convergence is established using a number of assumptions that are not easy to check or guarantee in the context of CPD, e.g., that the bias of the stochastic oracle is bounded. We will show that, by using the block-randomization strategy and the proposed stochastic oracle construction, such an assumption can be circumvented22 2 We should note that the major motivation for using block randomization strategy is theoretical guarantees—since our goal is a convergence-guaranteed algorithmic framework; in practice, we observe cyclically updating the latent factors works as well..

To facilitate our discussions, let us define ξ(r){1,,N} and 𝜻(r){1,,Jξ(r)} as the random variables (r.v.) responsible for selecting the mode and fibers in iteration r, respectively. These r.v.s are distributed as

𝖯𝗋(ξ(r)=n)=1N,𝖯𝗋(𝜻(r)=𝒮|ξ(r)=n)=1(JnB), (16)

where n{1,,N}, 𝒮Σ such that Σ is the set of all subsets of {1,,Jξ(r)} with size B. We observe

Fact 1

Denote B(r) as the filtration generated by the r.v.s {ξ(1),𝛇(1),,ξ(r-1),𝛇(r-1)} such that the rth iterate 𝛉(r) is determined conditioned on B(r). The stochastic gradient in (11) is an unbiased estimate for the full gradient w.r.t. 𝐀(ξ(r))

𝔼𝜻(r)[𝑮(ξ(r))(r)|(r),ξ(r)]=𝑨(ξ(r))f(𝜽(r)). (17)

The proof of the above is straightforward and thus skipped. Fact 1 says that our block stochastic gradient is an unbiased estimation for the “block gradient” 𝑨(n)f(𝜽(r)). This fact will prove quite handy in establishing convergence. The two-level sampling strategy (i.e., block sampling and fiber sampling, respectively) makes the gradient estimation w.r.t. 𝜽 unbiased up to a scaling factor (see Appendix A). This connection intuitively suggests that the proposed algorithm should behave similarly as an SG algorithm.

IV-A Unconstrained Case

We will use the following assumptions:

Assumption 1.

The stepsize schedule follows the Robbins-Monro rule [51]:

r=0α(r)=,r=0(α(r))2<.
Assumption 2.

The updates 𝑨(n)(r) are bounded for all n,r.

Assumption 1 is a principle for stepsize scheduling, which is commonly used in SG algorithms. Assumption 2 is considered a relatively strong assumption. In practice, there are several simple ways to make 𝑨(n)(r)’s bounded. One pragmatic modification is to change the objective to f(𝜽)+n=1Nhn(𝑨(n)+n=1Nλn𝑨(n)F2. Another method is as mentioned in [31, 13]. At iteration r, one may add a proximal term λn𝑨(n)-𝑨(n)(r)F2 to the cost function, which will effectively prevent 𝑨(n)(r+1) from being unbounded. Following both ways, the updates can still be handled by simple proximal operations for the h functions in Table I.

There are an array of problem structures that are useful for studying convergence of the algorithm.

Fact 2

For any 𝛉,𝛉¯ and mode n{1,,N}, there exists a constant L¯(n) such that

f(𝜽) f(𝜽¯)+𝑨(n)f(𝜽¯),𝑨-𝑨¯(n)
  +L¯(n)2𝑨-𝑨¯(n)F2, (18)

where 𝐀¯(n) and 𝐇¯(n) are extracted/constructed from 𝛉¯ following the respective definitions.

Eq. (2) holds because the objective function f(𝜽) w.r.t. 𝑨(n) is a plain least squares fitting criterion, which is known to have a Lipschitz continuous gradient—and the smallest Lipschitz constant is λmax(𝑯¯(n)𝑯¯(n)).

We have the following convergence property for BrasCPD in the unconstrained case:

Proposition 1

Consider the case where hn()=0 for all n and Assumptions 1-2 hold. The solution sequence produced by BrasCPD satisfies:

lim infr𝔼[f(𝜽(r))2]=0.

The proof is relegated to Appendix B. The above proposition implies that there exists a subsequence of the solution sequence that converges to a stationary point in expectation. The use of the expectation notion is due to the randomness in the algorithm. We should mention that the SG update and the block sampling step are essential for establishing convergence—and using the exact solution to (9) as in [26] may not have such convergence properties.

IV-B Constrained/Regularized Case

When hn()0, the gradient of the objective function of (13) may be undefined. In this case, a solution 𝜽(r) is stationary if 𝑷(n)(r)=𝟎, n, where

𝑷(n)(r)=1α(r)(𝑨(n)(r+1)-𝖯𝗋𝗈𝗑hn(𝑨(n)(r)-α(r)𝑨(n)f(𝜽(r))));

i.e., the condition is satisfied in a blockwise fashion [31, 33]. Hence, our goal of this section is to show 𝔼[𝑷(n)(r)2] vanishes for all n as r. Consider the following assumption:

Assumption 3.

There exists a sequence {σ(r)}r0 such that

𝔼𝜻(r)[𝑮(ξ(r))(r)-𝑨(ξ(r))f(𝜽(r))2|(r),ξ(r)](σ(r))2,
r=0(σ(r))2<,r=0α(r)(σ(r))2<, (19)

where {α(r)}r0 is the stepsize sequence following Assumption 1.

The BrasCPD produces a convergent solution sub-sequence:

Proposition 2

Assume that Assumptions 1-3 hold. Also assume that hn() is a closed proper convex function. Then, the solution sequence produced by BrasCPD satisfies

lim infr𝔼[𝑷(n)(r)2]=0,n.
Remark 1.

The convergence result in Proposition 2 inherits a drawback from single-block stochastic proximal gradient algorithms for nonsmooth nonconvex optimization. To be specific, the relatively strong assumption 3 is required to ensure convergence. Assumption 3 implies that the variance of the gradient estimation error 𝜹(ξ(r))(r)=𝑮(ξ(r))(r)-𝑨(ξ(r))f(𝜽(r)) converges to zero. This is not entirely trivial. One way to fulfill this assumption is to increase the minibatch size along the iterations, e.g., by setting [30, 33]:

|n(r)|=𝒪(r1+ϵ),ϵ>0.

Another popular way for achieving (19) is to use some advanced variance reduction techniques such as SVRG [46]—which may go beyond the scope of this paper and thus is left out of the discussion. Also notice that as the convergence analysis is pessimistic, in practice constant minibatch size works fairly well—as we will see soon.

V An Adaptive Stepsize Scheme

One may have noticed that the convergence theories in Propositions 1-2 do not specify the sequence α(r) except for the two constraints in Assumption 1. This often gives rise to agonizing tuning experience for practitioners when implementing stochastic algorithms.

Recently, a series of algorithms were proposed in the machine learning community for adaptive stepsize scheduling when training deep neural networks [52, 53, 54]. Most of these works are variants of the Adagrad algorithm [34]. The insight of Adagrad can be understood as follows: If one optimization variable has been heavily updated before, then it is given a smaller stepsize for the current iteration (and a larger stepsize otherwise). This way, all the optimization variables can be updated in a balanced manner. Adagrad was proposed for single-block algorithms, and this simple strategy admits many provable benefits under the context of convex optimization [34]. For our multi-block nonconvex problem, we extend the idea and propose the following updating rule: In iteration r, if ξ(r)=n, then, for all i{1,,In} and all f{1,,F}, we have

[𝜼(n)(r)]i,f η(b+t=1r[𝑮(n)(t)]i,f2)1/2+ϵ, (20a)
𝑨(n)(r+1) 𝖯𝗋𝗈𝗑hn(𝑨(n)(r)-𝜼(n)(r)𝑮(n)(r)), (20b)
𝑨(n)(r+1) 𝑨(n)(r),nn, (20c)

where η,b,ϵ>0. We note that b>0, ϵ>0 are technical conditions used for establishing theoretical convergence. In practice, setting b=ϵ=0 does not hurt the performance and we also observe a slight gain in runtime performance when ϵ=0. The Adagrad version of block-randomized CPD algorithm is very simple to implement. The algorithm is summarized in Algorithm 2, which is named AdaCPD.

As one will soon see, such a simple stepsize strategy is very robust to a large number of scenarios under test—i.e., in most of the cases, AdaCPD performs well without tuning the stepsize schedule. In addition, the AdaCPD algorithm works well for both the constrained and unconstrained case.

Proving convergence for nonconvex Adagrad-like algorithms is quite challenging [55, 56]. We show that:

Proposition 3

Assume hn()=0 for all n, and that Pr(ξ(r)=n)=1/N for all r and n. Under the Assumptions 1-2, the solution sequence produced by AdaCPD satisfies

𝖯𝗋(lim infrf(𝜽(r))2=0)=1.

Proposition 3 asserts that the algorithm converges almost surely. The proof is relegated to Appendix D. Our proof extends the idea from a recent paper [55] that focuses on using Adagrad for solving single-block nonconvex problems. As mentioned, our two-level sampling strategy makes our algorithm very similar to single-block SGD with a scaled gradient estimation (cf. Appendix A), and thus with careful modifications, the key proof techniques in [55] goes through. Nevertheless, we detail the proof for being self-containing.

\SetKwInOutInputinput \SetKwInOutOutputoutput \SetKwRepeatRepeatrepeatuntil \InputN-way tensor 𝑿¯I1××IN; rank F; sample size B , initialization {𝑨(n)(0)} r0; \Repeatsome stopping criterion is reached uniformly sample n from {1,,N}, then sample n uniformly from {1,,Jn} with |n|=B; form the stochastic gradient 𝑮(r)(11); determine the step size 𝜼(n)(r)(20a) update 𝑨(n)(r+1)(20b), 𝑨(n)(r+1)𝑨(n)(r) for nn; rr+1; \Output{𝑨(n)(r)}n=1N
\algorithmcfname 2 AdaCPD

VI Numerical Results

In this section, we use simulations and real-data experiments to showcase the effectiveness of the proposed algorithm.

VI-A Synthetic Data Simulations

VI-A1 Data Generation

Throughout this subsection, we use synthetic third-order tensors (i.e., N=3) whose latent factors are drawn from i.i.d. uniform distribution between 0 and 1—unless otherwise specified. This way, large and dense tensors can be created. For simplicity, we set In=I for all n and test the algorithms on tensors having different In’s and F’s. In some simulations, we also consider CPD for noisy tensors, i.e., factoring data tensors that have the following signal model:

𝒀¯=𝑿¯+𝑵¯,

where 𝑿¯ is the noiseless low-rank tensor and 𝑵¯ denotes the additive noise. We use zero-mean i.i.d. Gaussian noise with variance σN2 in our simulations, and the signal-to-noise ratio (SNR) (in dB) is defined as 𝖲𝖭𝖱=10log10(1n=1NIn𝑿¯2σN2).

VI-A2 Baselines

A number of baseline algorithms are employed as benchmarks. Specifically, we mainly use the AO-ADMM algorithm [57] and the APG algorithm [14] as our baselines since they are the most flexible algorithms with the ability of handling many different regularizations and constraints. We also present the results output by the CPRAND algorithm [26]. Note that we are preliminarily interested in constrained/regularized CPD. Because CPRAND operates without constraints, the comparison is not entirely fair (e.g., CPRAND can potentially attain smaller cost values since it has a much larger feasible set if other algorithms operate with constraints). Nevertheless, we employ it as a benchmark since it uses the same fiber sampling strategy as ours. To make the comparisons more comprehensive, we also offer a simple modification for CPRAND to incorporate constraints/regularization terms. Specifically, we apply the proximal operators associated with the constraint/regularization terms to the original CPRAND updates, and we denote this baseline as CPRAND-Prox. All the algorithms are initialized with the same random initialization; i.e., 𝑨(0)’s entries follow the uniform distribution between 0 and 1.

VI-A3 Parameter Setting

For BrasCPD, we set

α(r)=αrβ, (21)

where r is the number of iterations, β=10-6 and α typically takes a value in between 0.001 and 0.1, and we try multiple choices of α in our simulations. Our experience is that, under such settings, α is the main tuning parameter that affects the performance of BrasCPD. The batch size |n| is typically set to be below 25 throughout this section. For AdaCPD, we fix b=10-6, ϵ=0, and η=1 for all the simulations. For CPRAND, we follow the instruction in the original paper [26] and sample 10Flog2F fibers for each update.

VI-A4 Performance Metrics

To measure the performance, we employ two metrics. We mainly use the estimation accuracy for the latent factors, 𝑨(n) for n=1,,N, as our performance indicator. The accuracy is measured by the mean squared error (MSE) which is as defined in [58, 59]:

𝖬𝖲𝖤= (22)
minπ(f){1,,F}1Ff=1F𝑨(n)(:,π(f))𝑨(n)(:,π(f))2-𝑨^(n)(:,f)𝑨^(n)(:,f)222

where 𝑨^(n) denotes the estimate of 𝑨(n) and π(f)’s are under the constraint {π(1),,π(F)}={1,,F}—which is used to fix the intrinsic column permutation in CPD. We also use the cost function, i.e., 𝖼𝗈𝗌𝗍=(1/n=1NIn)×f(𝜽(r)) as a reference in some of the simulation tables.

Since the algorithms under test have very different operations and subproblem-solving strategies, it may be challenging to find an exactly unified complexity measure. In this section, we show the peformance of the algorithms against the number of MTTKRP operations 𝑯(n)𝑿(n) used, since 𝑯(n)𝑿(n) is the most costly step that dominates the complexity of all the algorithms under comparison. For the stochastic optimization/sketching based algorithms, we also test the MSE/cost value of the algorithms against runtime and the number of sampled entries for updating the latent factors. The latter is particularly meaningful under the stochastic settings, since it affects the communication overhead between the data storage units (e.g., the hard disks) and the computation units, e.g., CPUs and GPUs. All the simulations are conducted in Matlab. The results in this section are obtained from 50 trials with different randomly generated tensors.

VI-B Results

Fig. 1 in Sec. I has shown the MSE performance of the algorithms in a relatively small-size example, where In=I=100, F=10 and the nonnegativity constraints are used in the algorithms. In that simulation, we use |n|=20 so that every 500 iterations of the proposed algorithm compute a full MTTKRP. One can see that for this relatively easy case, all the algorithms can reach a good estimation accuracy for the latent factors. Nevertheless, the proposed methods exhibit remarkably higher efficiency.

Fig. 3 shows the MSEs of the estimated latent factors by the algorithms under I=300 and F=10 benchmarked by more baselines. The result is the median of 50 Monte Carlo trials; we use median here since mean is dominated by outlying trials, even if there is only one outlying trial. We set |n|=18 so that the proposed algorithms use 5,000 iterations to compute a full MTTKRP. All the algorithms use nonnegativity constraints except CPRAND. One can see that under this setting, the most competitive algorithms are CPRAND, CPRAND-Prox, and AdaCPD.

In Fig. 4, we increase the rank to be F=200. There are several observations in order: First, the stochastic algorithms (i.e., BrasCPD, AdaCPD, CPRAND and CPRAND-Prox) are much more efficient relative to the deterministic algorithms (AO-ADMM and APG). After 60 MTTKRPs computed, the stochastic algorithms often have reached a reasonable level of MSE. This is indeed remarkable, since 60 MTTKRPs are roughly equivalent to 20 iterations of AO-ADMM and APG. Second, CPRAND and CPRAND-Prox are not as competitive in this high-rank regime. In particular, BrasCPD with α=0.1 gives the most promising performance. However, the performance of BrasCPD is affected a bit significantly by the parameters α. One can see that using α=0.05 and α=0.01, the algorithm does not give promising results under this setting. Third, AdaCPD yields the second lowest MSEs after 60 MTTKRPs—while not using any parameter tuning.

Fig. 3: Median of MSEs against MTTKRP. I1=I2=I3=300 and F=10. 𝑨(n)𝟎.
Fig. 4: Median of MSEs against MTTKRP. I1=I2=I3=300 and F=200. 𝑨(n)𝟎.
Fig. 5: MSE against the number of sampled entries for I1=I2=I3=300, F=100.
Fig. 6: MSE against runtime (sec.) for I1=I2=I3=300, F=100.

Figs 5-6 show the MSEs of the stochastic algorithms (i.e., BrasCPD, AdaCPD, CPRAND and CPRAND-Prox) against the number of sampled entries and runtime, respectively. Here, we set I1=I2=I3=300 and F=100. One can see that BrasCPD and AdaCPD do not use many data samples to reach a good MSE level. This indicates that the communication overhead performance of the proposed algorithms is promising. In terms of runtime, CPRAND and CPRAND-Prox start with quick decrease of the MSE—they reach MSE10-2 within 50 seconds, while the proposed approaches reach this level of accuracy after 100 seconds. Nevertheless, the MSEs of CPRAND and CPRADN-Prox are somehow stuck at this level, but the proposed algorithms can reach a much better accuracy for estimating the latent factors. We would like to remark that the runtime performance of stochastic algorithms are affected by the programming language used (i.e., Matlab in this case). Typically, interpreted languages (e.g., Matlab and Python) are not specialized for handling “for” loops, which is heavily used in stochastic algorithms (especially when |n| is small). Hence, real-system implementations for these algorithms could be much faster.

Table II shows the mean and median of the MSEs and cost values output by the algorithms when the tensor rank varies under I=300. All the algorithms are stopped after 60 full MTTKRPs are used. One can see that BrasCPD exhibits a quite competitive MSE performance if a proper α is chosen, under the employed stepsize schedule in (21). However, one can see that when F changes, there is a risk that BrasCPD runs into numerical issues and yields unbounded solutions. This suggests that BrasCPD may need extra care for tuning its stepsize. In principle, when the problem setting changes, the “best” α of BrasCPD also changes. Our experience is that when |n| increases, using a properly scaled up α may help accelerate convergence. On the other hand, AdaCPD always outputs reasonably good results. More importantly, AdaCPD runs without tuning the stepsize parameters—which shows the power of the adaptive stepsize scheduling strategy. Another remark is that CPRAND and CPRAND-Prox work well when F=10 and F=50. In particular, CPRAND-Prox largely outperforms CPRAND when F=50, showing the benefit of explicitly considering constraints.

Table III shows the performance under different I’s when F=100. In general, when I increases, the performance of all the algorithms improves—with a fixed F, a larger I means more data and more “degrees of freedom” available, which normally leads to better performance. Again, BrasCPD with a proper α and AdaCPD in general outperform the baselines. One particular observation is that, although the mean and median MSEs of AdaCPD are both low, the median is sometimes much better than the mean (cf. the case when I=400), which indicates that there exist outlying trials (i.e., trials where AdaCPD does not produce very low MSE results). The median-mean gap is less often observed for other algorithms, e.g., BrasCPD and CPRAND. Fig. 7 may better illustrate the situation, where the histograms of the MSE (in dB) of the algorithms are shown. One can see that, although the worst-case result of AdaCPD is still acceptable (with MSE<10-4), the MSE of AdaCPD clearly has a larger variance compared to BrasCPD with α=0.1. This shows a trade-off between the easiness of stepsize scheduling and the risk of converging to less accurate solutions.

Table IV shows the median of the MSEs after 450 MTTKRPs of under same simulation setting (which is roughly equivalent to 150 iterations of AO-ADMM and APG). One can see that the performance of all the algorithms have improved. Note that due to resource limitations, we could not run 450 MTTKRPs for every table in this section and had to stop at 60 MTTKRPs. Hence, the capabilities of all the algorithms may not have fully shown up in those tables. Nonetheless, the 60-MTTKRP results can serve as a reference under limited time and computational resources.

Fig. 7: Histograms of the algorithms; I=300,F=100.
TABLE II: MSEs of the estimated latent factors by the algorithms under different F; I=300; all the algorithms are stopped after computing 60 MTTKRPs. “NaN” means the algorithm outputs unbounded solutions. 𝑨(n)𝟎.
Algorithm Metric F
10 50 100 200
BrasCPD (α=0.1) MSE Mean 1.97E-16 1.98E-11 6.12E-10 NaN
Median 1.70E-16 5.66E-12 3.82E-10 NaN
BrasCPD (α=0.05) MSE Mean 1.02E-10 1.33E-06 1.06E-04 4.08E-04
Median 3.26E-11 7.20E-07 1.46E-05 1.77E-04
BrasCPD (α=0.01) MSE Mean 2.80E-03 0.1108 0.1968 0.2599
Median 5.18E-04 0.1104 0.0000 0.2604
AdaCPD MSE Mean 2.30E-16 1.27E-04 3.54E-04 0.0175
Median 2.44E-16 5.43E-15 2.96E-07 9.86E-04
AO-ADMM MSE Mean 2.25E-02 0.1900 0.2667 0.3018
Median 1.21E-02 0.1886 0.2664 0.3018
APG MSE Mean 1.80E-01 0.2993 0.3254 0.3399
Median 1.83E-01 0.2995 0.3255 0.3398
CPRAND MSE Mean 2.00E-16 0.0037 0.0082 0.0475
Median 1.92E-16 0.0027 0.0075 0.0468
CPRAND-Prox MSE Mean 2.02E-16 0.0026 0.0073 0.0459
Median 1.92E-16 3.10E-11 0.0074 0.0453
BrasCPD (α=0.1) Cost Mean 1.66E-19 2.63E-12 1.64E-10 NaN
Median 4.31E-20 7.23E-13 1.03E-10 NaN
BrasCPD (α=0.05) Cost Mean 3.52E-12 1.84E-07 2.78E-05 3.01E-04
Median 1.15E-12 1.00E-07 3.65E-06 8.99E-05
BrasCPD (α=0.01) Cost Mean 1.49E-04 0.0226 0.0698 0.1620
Median 2.89E-05 0.0226 0.0701 0.1620
AdaCPD Cost Mean 9.45E-31 4.95E-05 1.64E-04 0.0142
Median 1.95E-31 8.86E-16 1.05E-07 8.43E-04
AO-ADMM Cost Mean 1.10E-03 0.0361 0.0869 0.1791
Median 8.10E-04 0.0359 0.0868 0.1790
APG Cost Mean 2.77E-02 0.2665 0.5636 1.1709
Median 2.79E-02 0.2659 0.5637 1.1715
CPRAND Cost Mean 1.88E-31 1.79E-10 0.0023 0.0329
Median 1.87E-31 1.59E-12 0.0027 0.0319
CPRAND-Prox Cost Mean 1.89E-31 0.0168 0.2128 129.4668
Median 1.88E-31 4.73E-12 0.1279 120.7575
TABLE III: Performance of the algorithms under various I’s, F=100. 𝑨(n)𝟎; all the algorithms are stopped after computing 60 MTTKRPs. 𝑨(n)𝟎.
Algorithm Metric I
100 200 300
BrasCPD (α=0.1) MSE Mean 0.1567 4.41E-04 6.12E-10
Median 0.1586 8.54E-05 3.82E-10
BrasCPD (α=0.05) MSE Mean 0.2378 0.0297 1.06E-04
Median 0.2364 0.0282 1.46E-05
BrasCPD (α=0.01) MSE Mean 0.2861 0.2786 0.1968
Median 0.2857 0.2786 0.0000
AdaCPD MSE Mean 0.1312 0.0040 3.54E-04
Median 0.1281 1.66E-04 2.96E-07
AO-ADMM MSE Mean 0.2396 0.2586 0.2667
Median 0.2395 0.2585 0.2664
APG MSE Mean 0.2961 0.3162 0.3254
Median 0.2961 0.3163 0.3255
CPRAND MSE Mean 0.1901 0.0139 0.0082
Median 0.1914 0.0134 0.0075
CPRAND-Prox MSE Mean 0.1873 0.0141 0.0073
Median 0.1879 0.0146 0.0074
BrasCPD (α=0.1) Cost Mean 0.0561 1.39E-04 1.64E-10
Median 0.0556 2.22E-05 1.03E-10
BrasCPD (α=0.05) Cost Mean 0.0752 0.0119 2.78E-05
Median 0.0744 0.0115 3.65E-06
BrasCPD (α=0.01) Cost Mean 0.1019 0.0881 0.0698
Median 0.1010 0.0881 0.0701
AdaCPD Cost Mean 0.0499 0.0018 1.64E-04
Median 0.0499 5.01E-05 1.05E-07
AO-ADMM Cost Mean 0.0736 0.0833 0.0869
Median 0.0735 0.0833 0.0868
APG Cost Mean 0.4248 0.5249 0.5636
Median 0.4251 0.5251 0.5637
CPRAND Cost Mean 0.0585 0.0046 0.0023
Median 0.0575 0.0048 0.0027
CPRAND-Prox Cost Mean 15.0076 1.2778 0.2128
Median 12.3764 0.9751 0.1279
TABLE IV: Median of the MSEs under various I’s, F=100. 𝑨n0; all the algorithms are stopped after computing 450 MTTKRPs.
Algorithm Metric I
100 200 300
BrasCPD (α=0.1) MSE 3.76E-06 4.74E-17 9.56E-17
BrasCPD (α=0.05) MSE 0.0016 1.01E-14 8.75E-17
BrasCPD (α=0.01) MSE 0.1957 6.13E-04 8.44E-08
AdaCPD MSE 9.51E-05 1.60E-07 2.43E-06
AO-ADMM MSE 3.45E-04 1.59E-04 1.44E-04
APG MSE 0.2615 0.2873 0.3006
CPRAND MSE 0.0230 0.0070 0.0073
CPRAND-Prox MSE 0.0260 0.0069 0.0065
TABLE V: Performance of the algorithms under various SNRs; all the algorithms are stopped after computing 60 MTTKRPs. I1=I2=I3=100, F=20. 𝑨(n)𝟎.
Algorithm Metric SNR
10 20 30 40
BrasCPD (α=0.1) MSE Mean 0.0433 0.0240 0.0271 0.0281
Median 0.0453 0.0195 0.0263 0.0257
BrasCPD (α=0.05) MSE Mean 0.1168 0.1135 0.1133 0.1093
Median 0.1232 0.1149 0.1121 0.1090
BrasCPD (α=0.01) MSE Mean 0.2135 0.2114 0.2104 0.2087
Median 0.2141 0.2121 0.2102 0.2078
AdaCPD MSE Mean 0.0179 0.0040 0.0017 1.39E-04
Median 0.0168 0.0036 7.02E-04 1.24E-04
AO-ADMM MSE Mean 0.1012 0.0925 0.0874 0.0905
Median 0.1009 0.0921 0.0870 0.0899
APG MSE Mean 0.2326 0.2291 0.2282 0.2284
Median 0.2319 0.2286 0.2279 0.2291
CPRAND MSE Mean 0.2049 0.0088 0.0021 0.0020
Median 0.2046 0.0077 7.36E-04 7.33E-05
CPRAND-Prox MSE Mean 0.1737 0.0108 0.0024 0.0017
Median 0.1715 0.0075 7.32E-04 7.44E-05
TABLE VI: Performance of the algorithms under various SNRs after computing 30 MTTKRPs. I1=I2=I3=100, F=20. 𝟏𝑨(n)=ρ𝟏, 𝑨(n)𝟎. ρ=100.
Algorithm Metric SNR
10 20 30 40
BrasCPD (α=0.1) MSE Mean 0.3946 0.3950 0.3939 0.4032
Median 0.3987 0.3993 0.3942 0.4062
BrasCPD (α=0.05) MSE Mean 0.0705 0.0067 6.80E-04 6.82E-05
Median 0.0705 0.0067 6.81E-04 6.80E-05
BrasCPD (α=0.01) MSE Mean 0.0169 0.0054 0.0017 0.0026
Median 0.0156 0.0017 2.34E-04 7.75E-05
AdaCPD MSE Mean 0.0202 0.0019 3.94E-04 6.13E-05
Median 0.0193 0.0019 3.95E-04 6.11E-05
AO-ADMM MSE Mean 0.0884 0.0759 0.0824 0.0772
Median 0.0881 0.0763 0.0823 0.0805
APG MSE Mean 0.2186 0.2149 0.2144 0.2146
Median 0.2188 0.2152 0.2157 0.2148
CPRAND MSE Mean 0.1937 0.0097 0.0034 0.0032
Median 0.1944 0.0077 7.44E-04 7.37E-05
CPRAND-Prox MSE Mean 0.1509 0.0670 0.0599 0.0602
Median 0.1519 0.0663 0.0591 0.0582

Tables V-VI show the estimation accuracy of the latent factors by the algorithms under different SNRs. Here, we use a low-rank case where I1=I2=I3=100 and F=20, under which CPRAND and CPRAND-Prox are more competitive. In a noisy environment, the ability of handling constraints/regularizations is essential for a CPD algorithm, since prior information on the latent factors can help improve estimation accuracy. Table V and Table VI test the cases where 𝑨(n) is elementwise nonnegative and the columns of 𝑨(n) reside in a scaled version of the probability simplex, respectively. One can see from the two tables that both BrasCPD (with a proper α) and AdaCPD work very well. In Table VI, one can see that BrasCPD again shows its sensitivity to the choice of α, with α=0.1 and 0.05 actually not working. One can also see that, in the low-SNR regime (SNR=10 and 20dB), BrasCPD with proper α and AdaCPD outperform the baselines. CPRAND-Prox also works well for the nonnegativity constraint case (especially for SNR=30 and 40dB), but not as promising for the simplex constraints.

VI-C Real-Data Experiment

In this subsection, we test our algorithm on a constrained tensor decomposition problem; i.e., we apply the proposed BrasCPD and AdaCPD to factor hyperspectral images. Hyperspectral images (HSIs) are special images with pixels measured at a large number of wavelengths. Hence, an HSI is usually stored as a third-order tensor with two spatial coordinates and one spectral coordinate. HSIs are dense tensors and thus are suitable for testing the proposed algorithms. We use sub-images of the Indian Pines dataset that has a size of 145×145×220 and the Pavia University dataset33 3 Both datasets are available online: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes that has a size of 610×340×103. We apply the stochastic algorithms to the datasets by fixing |n|=500 in this section.

TABLE VII: Performance of the algorithms on the Indian Pines dataset under different F’s.
Algorithm Metric F
10 20 30 40
BrasCPD (α=4) Cost 6.82E-04 4.68E-04 3.46E-04 3.58E-04
BrasCPD (α=3) Cost 6.86E-04 4.52E-04 3.54E-04 3.35E-04
BrasCPD (α=2) Cost 6.88E-04 6.11E-04 5.38E-04 4.11E-04
AdaCPD Cost 6.23E-04 4.56E-04 5.40E-04 5.23E-04
AO-ADMM Cost 7.70E-04 5.28E-04 5.07E-04 4.91E-04
APG Cost 0.0019 0.0019 0.0018 0.0019
CPRAND Cost 6.74E-04 4.82E-04 4.35E-04 4.08E-04
CPRAND-Prox Cost 0.1116 0.0021 0.0020 0.0021
TABLE VIII: Performance of the algorithms on the Pavia University dataset under different F’s.
Algorithm Metric F
50 100 200
BrasCPD (α=4) Cost 0.0031 0.0027 0.0013
BrasCPD (α=3) Cost 0.0033 0.0053 0.0031
BrasCPD (α=2) Cost 0.0044 0.0067 0.0059
AdaCPD Cost 0.0022 0.0013 0.0008
AO-ADMM Cost 0.0378 0.0425 0.0053
APG Cost 0.0074 0.0073 0.0031
CPRAND Cost 0.0033 0.0028 0.0034
CPRAND-Prox Cost 0.0103 0.0109 8.46E+03

Tables VII-VIII show the cost values of the nonnegativity constrained optimization algorithms under different ranks, after computing 120 MTTKRPs for all three modes, which corresponds to 120 iterations for AO-ADMM and APG (we use this “all-mode MTTKRP” in this section since the tensors are unsymmetrical and thus single-mode MTTKRPs cannot be directly translated to iterations in batch algorithms). One can see that the proposed algorithms show the same merits as we have seen in the simulations: BrasCPD can exhibit very competitive performance when α is properly chosen (e.g., when F=10 and α=5 for the Indian Pines dataset); in addition, AdaCPD gives consistently good performance without tuning the stepsize manually. Particularly, on the Pavia University dataset, AdaCPD gives much lower cost values compared to other algorithms. We also present the results of CPRAND and CPRAND-Prox. Note that since CPRAND does not have constraints, its cost value is naturally lower than the other methods. Hence, this baseline is only for reference. Both CPRAND and CPRAND-Prox work reasonably well for the two datasets, especially on Indian Pines under low rank. However, we note that for Pavia University, CPRAND-Prox does not converge well for the F=200 case.

Fig. 8 shows how the cost values change along with the iterations on the Pavia University data using F=100. One can see that BrasCPD (α=0.5) and AdaCPD reduce the cost value quickly in this case. After 120 iterations (equivalent to 120 all-mode full MTTKRPs), the batch algorithm APG eventually reaches the same cost value level of those of BrasCPD (α=0.5) and AdaCPD.

Fig. 8: Number of all-mode MTTKRPs v.s. cost values output by the algorithms when applied to the Pavia University dataset. F=100. Nonnegativity constraint is added .

VII Conclusion

To conclude, we proposed a block-randomized stochastic proximal gradient based CPD algorithmic framework for large-scale dense tensors. The framework works under a doubly stochastic manner, which randomly selects a mode and then samples a set of fibers for updating the associated latent factor. The framework has a series of nice features including being able to quickly improve estimation accuracy of the latent factors, being flexible with incorporating constraints and regularizations, and having rigorous convergence guarantees. We also proposed a practical and effective adaptive stepsize scheduling method that is reminiscent of recent advances in neural network training algorithms. Simulations and real-data experiments show that the proposed algorithms outperform a number of state-of-art constrained CPD algorithms when dealing with large dense tensors.

Appendix A Connection between f(𝜽(r)) and 𝑮(r)

Let n be any integer from {1,,N}, consider the following conditional expectation:

𝑮¯(n)(r)=𝔼ξ(r),𝜻(r)[𝑮(n)(r)|(r)]
=(a)𝔼ξ(r)[1(Jξ(r)B)(𝑨(ξ(r))(r)𝑯(ξ(r))𝑯(ξ(r))-𝑿(ξ(r))𝑯(ξ(r)))]
=(b)n=1Nδ(n-n)N(JnB)(𝑨(n)(r)𝑯(n)𝑯(n)-𝑿(n)𝑯(n))
=1N(JnB)(𝑨(n)(r)𝑯(n)𝑯(n)-𝑿(n)𝑯(n)) (23)

where δ() is the Dirac function. In the above chain, (a) is due to Fact 1 and (b) is obtained by evaluating the expectation with respect to the possible modes ξ(r)=n. The last equality shows that 𝑮¯(n)(r) is a scaled gradient of the objective function of (3) taken w.r.t. 𝑨(n)(r). The block sampling step together with fiber sampling entails us an easy way to estimate the full gradient w.r.t. all the latent factors in an unbiased manner.

Appendix B Proof of Proposition 1

To show Proposition 1, we will need the following [60, Lemma A.5]:

Lemma 1

Let {at}t and {bt}t be two nonnegative sequences such that bt is bounded, t=0atbt converges and t=0at diverges, then we have

lim infrbt=0.

Recall that ξ(r), 𝜻(r) are the random mode, fiber chosen at iteration r, respectively. Under Assumption 2, we have 𝑯(ξ(r))(r)22L(ξ(r))(r) where 𝑯(ξ(r))(r)=n=1,nξ(r)N𝑨(n)(r) and L(ξ(r))(r)<. Combining with Fact 2, we observe:

f(𝜽(r+1))-f(𝜽(r))𝑨(ξ(r))f(𝜽(r)),𝑨(ξ(r))(r+1)-𝑨(ξ(r))(r)
                                         +L2𝑨(ξ(r))(r+1)-𝑨(ξ(r))(r)2
=-α(r)𝑨(ξ(r))f(𝜽(r)),𝑮(ξ(r))(r)+(α(r))2L2𝑮(ξ(r))(r)2,

where we have denoted

L=maxr=0,,L(ξ(r))(r)<.

Taking expectation conditioned on the filtration (r) and the chosen mode index ξ(r), we have

𝔼𝜻(r)[f(𝜽(r+1))|(r),ξ(r)]-f(𝜽(r))
-α(r)𝑨(ξ(r))f(𝜽(r))2
  +(α(r))2L2𝔼𝜻(r)[𝑮(ξ(r))(r)2|(r),ξ(r)]
-α(r)𝑨(ξ(r))f(𝜽(r))2+(α(r))2LM2, (24)

where the first inequality used the assumption that L(ξ(r))(r)L and Fact 1, and the second inequality is a consequence of Assumption 2, as we observe:

𝑮(ξ(r))(r)=1B𝑨(ξ(r))(r)(𝑯(ξ(r))(r)(𝜻(r)))𝑯(ξ(r))(r)(𝜻(r))-𝑿(ξ(r))(𝜻(r))𝑯(ξ(r))(r)(𝜻(r)). (25)

As 𝑿(n) is bounded for all n, and all the 𝑨(n)(r) are bounded under Assumption 2, we have 𝑮(ξ(r))(r)2M for all n,r and some M<. Taking the expectation w.r.t. ξ(r) yields

𝔼ξ(r),ζ(r)[f(𝜽(r+1))|(r)]-f(𝜽(r))
-α(r)𝔼ξ(r)[𝑨(ξ(r))f(𝜽(r))2]+(α(r))2ML2. (26)

Note that 𝔼ξ(r)[𝑨(ξ(r))f(𝜽(r))2]=f(𝜽(r))2. Taking the total expectation (w.r.t. all random variables in (r)) gives

𝔼[f(𝜽(r+1))]-𝔼[f(𝜽(r))]
-α(r)𝔼[f(𝜽(r))2]+(α(r))2ML2. (27)

Summing up (27) from t=0 to t=r, we have

𝔼[f(𝜽(t+1))]-f(𝜽(0))
t=0r-α(t)𝔼[f(𝜽(t))2]+t=0r(α(t))2ML2.

Taking r, the above implies that

r=0α(r)𝔼[f(𝜽(r))2]
f(𝜽(0))-f(𝜽())+r=0(α(r))2ML2, (28)

where we have used f(𝜽)f(𝜽()), and f(𝜽()) denotes the global optimal value. Note that the right hand side above is bounded from above because r=0(α(r))2<. Hence, using Lemma 1 we conclude:

lim infr𝔼[f(𝜽(r))2]=0.

Appendix C Proof of Proposition 2

C-A Preliminaries

For the constrained case, let us denote Φ(𝜽)=f(𝜽)+n=1Nhn(𝜽) as the objective function. Unlike the unconstrained case where we measure convergence via observing if the gradient vanishes, the optimality condition of the constrained case is a bit more complicated. Consider the following optimization problem

minimize𝜽f(𝜽)+h(𝜽),

where f(𝜽) is continuously differentiable while h is convex but possibly nonsmooth. The deterministic proximal gradient algorithm for handling this problem is as follows:

𝜽(r+1)𝖯𝗋𝗈𝗑h(𝜽(r)-α(r)f(𝜽(r))).

Define 𝑷(r)=1α(r)(𝜽(r+1)-𝜽(r)), the update can also be represented as 𝜽(r+1)𝜽(r)-α(r)𝑷(r), which is analogous to the gradient descent algorithm. It can be shown that 𝑷(r)=𝟎 implies that the necessary optimality condition is satisfied, and thus 𝑷(r) can be considered as a “generalized gradient”. In the multi-block setting of (13), we define:

𝑷(n)(r)
=1α(r)(𝑨(n)(r+1)-𝖯𝗋𝗈𝗑hn(𝑨(n)(r)-α(r)𝑨(n)f(𝜽(r)))).

To show that the BrasCPD algorithm finds a stationary point, our goal is to show the subsequence convergence of 𝔼[𝑷(n)(r)] to zero for all n as r.

C-B Proof

Our update is equivalent to the following:

𝑨(n)(r+1)argmin𝑨(n) 𝑮(n)(r),𝑨(n)-𝑨(n)(r) (29)
+12α(r)𝑨(n)-𝑨(n)(r)2+hn(𝑨(n))

for a randomly selected n, i.e., the above is the proximal operator. For a given ξ(r), we have

hξ(r)(𝑨(ξ(r))(r+1))-hξ(r)(𝑨(ξ(r))(r))
-𝑮(ξ(r))(r),𝑨(ξ(r))(r+1)-𝑨(ξ(r))(r)-12α(r)𝑨(ξ(r))(r+1)-𝑨(ξ(r))(r)2

by the optimality of 𝑨(ξ(r))(r+1) for solving Problem (29).

By the block Lipschitz continuity of the smooth part (cf. Fact 2), we have

f(𝜽(r+1))-f(𝜽(r)) 𝑨(ξ(r))f(𝜽(r)),𝑨(ξ(r))(r+1)-𝑨(ξ(r))(r)
+L(ξ(r))(r)2𝑨(ξ(r))(r+1)-𝑨(ξ(r))(r)2,

where f denotes the smooth part in the objective function and

L(ξ(r))(r)=λmax((𝑯(ξ(r))(r))𝑯(ξ(r))(r))L.

Combining the two inequalities, we have

Φ(𝜽(r+1)) Φ(𝜽(r))
-α(r)𝑨(ξ(r))f(𝜽(r))-𝑮(ξ(r))(r),𝒑(ξ(r))(r)
+(L(α(r))22-α(r)2)𝒑(ξ(r))(r)2, (30)

where we have defined:

𝒑(ξ(r))(r)=1α(r)(𝑨(ξ(r))(r+1)-𝑨(ξ(r))(r)).

The inequality in (C-B) can be further written as

Φ(𝜽(r+1))-Φ(𝜽(r))
-α(r)𝑨(ξ(r))f(𝜽(r))-𝑮(ξ(r))(r),𝒑(ξ(r))(r)-𝑷(ξ(r))(r)
-α(r)𝑨(ξ(r))f(𝜽(r))-𝑮(ξ(r))(r),𝑷(ξ(r))(r)
+(L(α(r))22-α(r)2)𝒑(ξ(r))(r)2. (31)

Again, taking expectation conditioning on the filtration (r) and ξ(r), we can upper bound 1α(r)(𝔼𝜻(r)[Φ(𝜽(r+1))|(r),ξ(r)]-Φ(𝜽(r))) by

𝔼𝜻(r)[𝑨(ξ(r))f(𝜽(r))-𝑮(ξ(r))(r),𝑷(ξ(r))(r)-𝒑(ξ(r))(r)|(r),ξ(r)]
+(Lα(r)2-12)𝔼𝜻(r)[𝒑(ξ(r))(r)2|(r),ξ(r)], (32)

i.e., the second term on the right hand side of (C-B) becomes zero because of Fact 1. The first term of (C-B) can be bounded via the following chain of inequalities:

𝔼𝜻(r)[𝑨(ξ(r))f(𝜽(r))-𝑮(ξ(r))(r),𝑷(ξ(r))(r)-𝒑(ξ(r))(r)|(r),ξ(r)]
(a)𝔼𝜻(r)[𝜹(r)𝑷(ξ(r))(r)-𝒑(ξ(r))(r)|(r),ξ(r)]
(b)𝔼𝜻(r)[𝜹(r)2|(r),ξ(r)](σ(r))2 (33)

where (a) is due to the Cauchy-Schwartz inequality, and (b) is a consequence of the non-expansiveness of the proximal operator of convex hn(). Taking the total expectation, we have

𝔼[Φ(𝜽(r+1))]-𝔼[Φ(𝜽(r))] (34)
α(r)(σ(r))2+(L(α(r))22-α(r)2)𝔼[𝒑(ξ(r))(r)2].

Summing up the inequality from t=0 to t=r-1,

𝔼[Φ(𝜽(r))]-Φ(𝜽(0)) (35)
t=0rα(t)(σ(t))2+t=0r(L(α(t))22-α(t)2)𝔼[𝒑(ξ(t))(t)2].

Since α(r)<1/L, we have L(α(r))22-α(r)2<0, therefore,

t=0r(α(t)2-L(α(t))22)𝔼[𝒑(ξ(t))(t)2]
Φ(𝜽(0))-Φ(𝜽)+t=0rα(t)(σ(t))2, (36)

such that 𝜽argmin𝜽Φ(𝜽). Taking r, and by the assumption that r=0α(r)(σ(r))2<, we can conclude that

lim infr𝔼[𝒑(ξ(r))(r)2]=0,

using Lemma 1.

To complete the proof, we observe that

12𝔼[𝑷(ξ(r))(r)2]𝔼[𝒑(ξ(r))(r)2]+𝔼[𝒑(ξ(r))(r)-𝑷(ξ(r))(r)2]
𝔼[𝒑(ξ(r))(r)2]
+𝔼ξ(r),(r)[𝔼𝜻(r)[𝑮(ξ(r))(r)-𝑨(ξ(r))f(𝜽(r))2|(r),ξ(r)]]
𝔼[𝒑(ξ(r))(r)2]+(σ(r))2. (37)

where the last inequality is obtained via applying the nonexpansive property again. Note that both terms on the right hand side converge to zero. Hence, this relationship implies that

lim infr𝔼[𝑷(ξ(r))(r)2]=0.

Note that by our sampling strategy, we have

𝔼[𝑷(ξ(r))(r)2] =𝔼ξ(r),(r)[𝔼𝜻(r)[𝑷(ξ(r))(r)2|(r),ξ(r)]].

However, since 𝑷(ξ(r))(r) is independent of the random seed ζ(r), we have

𝔼[𝑷(ξ(r))(r)2] =𝔼(r)[𝔼ξ(r)[𝑷(ξ(r))(r)2|(r)]]
=𝔼(r)[n=1N1N𝑷(n)(r)2].

This proves the proposition.

Appendix D Proof of Proposition 3

The insight of the proof largely follows the technique for single-block Adagrad [55], with some careful modifications to multiple block updates. One will see that the block sampling strategy and the block-wise unbiased gradient estimation are key to apply the proof techniques developed in [55] to our case. To show convergence, let us first consider:

Lemma 2

[55] Let a0>0, ai0, i=1,,T and β>1. Then, we have

t=1Tat(a0+i=1tai)β1(β-1)a0β-1.

The proof is simple and elegant; see [55, Lemma 4].

Lemma 3

[55] Consider a random variable X. If 𝔼[X]<, then

𝖯𝗋(X<)=1.

Let us consider the block-wise again:

f(𝜽(r+1)) f(𝜽(r))+𝑨(ξ(r))f(𝜽(r)),𝑨(ξ(r))(r+1)-𝑨(ξ(r))(r)
+Lξ(r)(r)2𝑨(ξ(r))(r+1)-𝑨(ξ(r))(r)2. (38)

Plugging in our update rule under AdaCPD, one can see that

f(𝜽(r+1))f(𝜽(r))+𝑨(ξ(r))f(𝜽(r)),-𝜼(ξ(r))(r)𝑮(ξ(r))(r)
   +Lξ(r)(r)2𝜼(ξ(r))(r)𝑮(ξ(r))(r)2
=f(𝜽(r))-𝑨(ξ(r))f(𝜽(r)),𝜼(ξ(r))(r)𝑨(ξ(r))f(𝜽(r))
+𝑨(ξ(r))f(𝜽(r)),𝜼(ξ(r))(r)(𝑨(ξ(r))f(𝜽(r))-𝑮(ξ(r))(r))
+Lξ(r)(r)2𝜼(ξ(r))(r)𝑮(ξ(r))(r)2. (39)

Taking expectation w.r.t. 𝜻(r) (the random seed that is responsible for selecting fibers) conditioning on the filtration (r) and the selected block ξ(r), the middle term is zero—since the block stochastic gradient is unbiased [cf. Fact 1]. Hence, we have reached the following

𝔼𝜻(r)[f(𝜽(r+1))|(r),ξ(r)]𝔼𝜻(r)[f(𝜽(r))|(r),ξ(r)]
-𝔼𝜻(r)[𝑨(ξ(r))f(𝜽(r)),𝜼(ξ(r))(r)𝑨(ξ(r))f(𝜽(r))|(r),ξ(r)]
+Lξ(r)(r)2𝔼𝜻(r)[𝜼(ξ(r))(r)𝑮(ξ(r))(r)2|(r),ξ(r)]. (40)

Taking total expectation on both sides, we have

𝔼[f(𝜽(r+1))]𝔼[f(𝜽(r))]
-𝔼[𝑨(ξ(r))f(𝜽(r)),𝜼(ξ(r))(r)𝑨(ξ(r))f(𝜽(r))]
+𝔼[Lξ(r)(r)2𝜼(ξ(r))(r)𝑮(ξ(r))(r)2]. (41)

From the above inequality and the assumption that L(n)(r) is bounded from above by L, we can conclude that

r=0R𝔼[𝑨(ξ(r))f(𝜽(r)),𝜼(ξ(r))(r)𝑨(ξ(r))f(𝜽(r))]
f(𝜽(0))-f(𝜽())+r=0RL2𝔼[𝜼(ξ(r))(r)𝑮(ξ(r))(r)2]

by summing up all the inequalities in (40) from r=0 to R.

Taking R and observe that:

𝔼[r=0𝜼(ξ(r))(r)𝑮(ξ(r))(r)2] (42)
=r=0𝔼[(𝜼(ξ(r))(r+1)+𝜼(ξ(r))(r)-𝜼(ξ(r))(r+1))𝑮(ξ(r))(r)2]
=r=0𝔼[𝜼(ξ(r))(r+1)𝑮(ξ(r))(r)2]
+r=0𝔼[(𝜼(ξ(r))(r)-𝜼(ξ(r))(r+1))𝑮(ξ(r))(r)2]. (43)

Note that we have exchanged the order of the limits and expectations, since the expectation is taking on nonnegative terms. Using Lemma 2, one can easily show the first term above is bounded from above by C12ϵβ2ϵ, where 0<C1< is a constant. To see the second term is bounded, observe

𝔼[r=0i=1Jξ(r)f=1F([𝜼(ξ(r))(r)]i,f2-[𝜼(ξ(r))(r+1)]i,f2)[𝑮(ξ(r))(r)]i,f2]
=𝔼~[r=0f=1F1Nn=1Ni=1Jn([𝜼(n)(r)]i,f2-[𝜼(n)(r+1)]i,f2)[𝑮(n)(r)]i,f2]
𝔼~[n=1Ni=1Jnf=1F1Nmaxr0[𝑮(n)(r)]i,f2r=0([𝜼(n)(r)]i,f2-[𝜼(n)(r+1)]i,f2)]
𝔼~[n=1Ni=1Jnf=1F1Nmaxr0[𝑮(n)(r)]i,f2[𝜼(n)(0)]i,f2]
n=1Ni=1Jnf=1F2N[𝜼(n)(0)]i,f2𝔼~[maxr0[[𝑨(n)f(𝜽(r))]i,f2
+([𝑨(n)f(𝜽(r))]i,f-[𝑮(n)(r)]i,f)2]],

where 𝔼~ means taking expectation w.r.t. all the random variables except for ξ(r) for r=0,,, and the second inequality is due to the effect of the telescope summation. Since we have assumed that 𝑨(n)(r)’s are bounded, the right hand side is bounded from above. Therefore, we have reached the conclusion

𝔼~[r=0𝑨(ξ(r))f(𝜽(r)),𝜼(ξ(r))(r)𝑨(ξ(r))f(𝜽(r))]<.

Applying Lemma 3, one can see that

𝖯𝗋(r=0[𝜼(ξ(r))(r)]i,f[𝑨(ξ(r))f(𝜽(r))]i,f2<)=1.

Since 𝖯𝗋(ξ(r)=n)>0, one immediate result is that any n appears infinitely many times in the sequence r=0,,, according to the second Borel-Cantelli lemma. This leads to

𝖯𝗋(j=1[𝜼(n)(rj(n))]i,f[𝑨(n)f(𝜽(rj(n)))]i,f2<)=1,

holds for n=1,,N, where r1(n),,rj(n), is the subsequence of {r} such that block n is sampled for updating. Hence, with probability one there exists a subsequence r1(n),,r(n) such that at the corresponding iterations block n is sampled for updating. It is not hard to show that

j=1[𝜼(n)(rj(n))]i,f=,

by the assumption that 𝑨(n)(r) are all bounded. This directly implies that r=1[𝜼(n)(r)]i,f=. Together with Lemma 1, we have

𝖯𝗋(lim infr[𝑨(n)f(𝜽(r))]i,f2=0)=1,i,f.

References

  • [1] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
  • [2] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” IEEE Trans. Signal Process., vol. 65, no. 13, pp. 3551–3582, 2017.
  • [3] J. D. Carroll and J.-J. Chang, “Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition,” Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.
  • [4] B. Rao and K. Kreutz-Delgado, “An affine scaling methodology for best basis selection,” IEEE Trans. Signal Process., vol. 47, no. 1, pp. 187 –200, jan 1999.
  • [5] J. Sun, D. Tao, and C. Faloutsos, “Beyond streams and graphs: dynamic tensor analysis,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2006, pp. 374–383.
  • [6] C. I. Kanatsoulis, X. Fu, N. D. Sidiropoulos, and W.-K. Ma, “Hyperspectral super-resolution: A coupled tensor factorization approach,” IEEE Trans. Signal Process. to appear, 2018.
  • [7] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, “Tensor decompositions for learning latent variable models,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 2773–2832, 2014.
  • [8] A. Anandkumar, D. Hsu, and S. M. Kakade, “A method of moments for mixture models and hidden markov models,” in Conference on Learning Theory, 2012, pp. 33–1.
  • [9] N. D. Sidiropoulos, R. Bro, and G. B. Giannakis, “Parallel factor analysis in sensor array processing,” IEEE Trans. Signal Process., vol. 48, no. 8, pp. 2377–2388, Aug. 2000.
  • [10] N. D. Sidiropoulos and X.-Q. Liu, “Identifiability results for blind beamforming in incoherent multipath with small delay spread,” IEEE Trans. Signal Process., vol. 49, no. 1, pp. 228–236, Jan. 2001.
  • [11] X. Fu, N. D. Sidiropoulos, J. H. Tranter, and W.-K. Ma, “A factor analysis framework for power spectra separation and emitter localization,” IEEE Trans. Signal Process., vol. 63, no. 24, pp. 6581–6594, 2015.
  • [12] C. J. Hillar and L.-H. Lim, “Most tensor problems are np-hard,” Journal of the ACM (JACM), vol. 60, no. 6, p. 45, 2013.
  • [13] K. Huang, N. D. Sidiropoulos, and A. P. Liavas, “A flexible and efficient algorithmic framework for constrained matrix and tensor factorization,” IEEE Trans. Signal Process., vol. 64, no. 19, pp. 5052–5065, 2016.
  • [14] Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion,” SIAM Journal on imaging sciences, vol. 6, no. 3, pp. 1758–1789, 2013.
  • [15] A. P. Liavas and N. D. Sidiropoulos, “Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers,” IEEE Trans. Signal Process., vol. 63, no. 20, pp. 5450–5463, 2015.
  • [16] C. Navasca, L. De Lathauwer, and S. Kindermann, “Swamp reducing technique for tensor decomposition.” in EUSIPCO, 2008, pp. 1–5.
  • [17] P. Comon, X. Luciani, and A. L. De Almeida, “Tensor decompositions, alternating least squares and other tales,” Journal of Chemometrics: A Journal of the Chemometrics Society, vol. 23, no. 7-8, pp. 393–405, 2009.
  • [18] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos, “Gigatensor: scaling tensor analysis up by 100 times-algorithms and discoveries,” in Proc. ACM SIGKDD 2012, 2012, pp. 316–324.
  • [19] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, “Parcube: Sparse parallelizable tensor decompositions,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2012, pp. 521–536.
  • [20] D. Mitchell, N. Ye, and H. De Sterck, “Nesterov acceleration of alternating least squares for canonical tensor decomposition,” arXiv preprint arXiv:1810.05846, 2018.
  • [21] A. L. Alexander, J. E. Lee, M. Lazar, and A. S. Field, “Diffusion tensor imaging of the brain,” Neurotherapeutics, vol. 4, no. 3, pp. 316–329, 2007.
  • [22] A. Shashua and T. Hazan, “Non-negative tensor factorization with applications to statistics and computer vision,” in Proceedings of the 22nd international conference on Machine learning.   ACM, 2005, pp. 792–799.
  • [23] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
  • [24] N. Vervliet and L. De Lathauwer, “A randomized block sampling approach to canonical polyadic decomposition of large-scale tensors,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 2, pp. 284–295, 2016.
  • [25] A. Beutel, P. P. Talukdar, A. Kumar, C. Faloutsos, E. E. Papalexakis, and E. P. Xing, “Flexifact: Scalable flexible factorization of coupled tensors on hadoop,” in Proc. SIAM SDM 2014.   SIAM, 2014, pp. 109–117.
  • [26] C. Battaglino, G. Ballard, and T. G. Kolda, “A practical randomized CP tensor decomposition,” SIAM Journal on Matrix Analysis and Applications, vol. 39, no. 2, pp. 876–901, 2018.
  • [27] A. Beck and L. Tetruashvili, “On the convergence of block coordinate descent type methods,” SIAM journal on Optimization, vol. 23, no. 4, pp. 2037–2060, 2013.
  • [28] Y. Nesterov, “Efficiency of coordinate descent methods on huge-scale optimization problems,” SIAM Journal on Optimization, vol. 22, no. 2, pp. 341–362, 2012.
  • [29] S. Ghadimi and G. Lan, “Accelerated gradient methods for nonconvex nonlinear and stochastic programming,” Mathematical Programming, vol. 156, no. 1-2, pp. 59–99, 2016.
  • [30] ——, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
  • [31] M. Razaviyayn, M. Hong, and Z.-Q. Luo, “A unified convergence analysis of block successive minimization methods for nonsmooth optimization,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126–1153, 2013.
  • [32] H. Wang and A. Banerjee, “Randomized block coordinate descent for online and stochastic optimization,” arXiv preprint arXiv:1407.0107, 2014.
  • [33] Y. Xu and W. Yin, “Block stochastic gradient iteration for convex and nonconvex optimization,” SIAM Journal on Optimization, vol. 25, no. 3, pp. 1686–1716, 2015.
  • [34] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
  • [35] X. Fu, C. Gao, H.-T. Wai, and K. Huang, “Block-randomized stochastic proximal gradient for constrained low-rank tensor factorization,” in submitted to IEEE ICASSP 2019, 2019.
  • [36] E. C. Chi and T. G. Kolda, “On tensors, sparsity, and nonnegative factorizations,” SIAM Journal on Matrix Analysis and Applications, vol. 33, no. 4, pp. 1272–1299, 2012.
  • [37] S. A. Vorobyov, Y. Rong, N. D. Sidiropoulos, and A. B. Gershman, “Robust iterative fitting of multilinear models,” IEEE Trans. Signal Process., vol. 53, no. 8, pp. 2678–2689, Aug 2005.
  • [38] X. Fu, K. Huang, W.-K. Ma, N. Sidiropoulos, and R. Bro, “Joint tensor factorization and outlying slab suppression with applications,” IEEE Trans. Signal Process., vol. 63, no. 23, pp. 6315–6328, 2015.
  • [39] E. E. Papalexakis, U. Kang, C. Faloutsos, N. D. Sidiropoulos, and A. Harpale, “Large Scale Tensor Decompositions: Algorithmic Developments and Applications,” IEEE Data Engineering Bulletin, Special Issue on Social Media and Data Analysis, vol. 36, no. 3, pp. 59–66, Sep. 2013.
  • [40] N. Ravindran, N. D. Sidiropoulos, S. Smith, and G. Karypis, “Memory-efficient parallel computation of tensor and matrix products for big tensor decomposition,” in 2014 48th Asilomar Conference on Signals, Systems and Computers, Nov 2014, pp. 581–585.
  • [41] J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc, “An input-adaptive and in-place approach to dense tensor-times-matrix multiply,” in SC’15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.   IEEE, 2015, pp. 1–12.
  • [42] N. Kargas, N. D. Sidiropoulos, and X. Fu, “Tensors, learning, and’kolmogorov extension’for finite-alphabet random vectors,” arXiv preprint arXiv:1712.00205, 2017.
  • [43] S. Ibrahim and X. Fu, “Stochastic optimization for coupled tensor decomposition with applications in statistical learning,” in Proc. IEEE DSW 2019, 2019.
  • [44] N. Kargas and N. D. Sidiropoulos, “Learning mixtures of smooth product distributions: Identifiability and algorithm,” arXiv preprint arXiv:1904.01156, 2019.
  • [45] P. A. Traganitis, A. Pagès-Zamora, and G. B. Giannakis, “Blind multiclass ensemble classification,” IEEE Trans. Signal Process., vol. 66, no. 18, pp. 4737–4752, Sep. 2018.
  • [46] L. Xiao and T. Zhang, “A proximal stochastic gradient method with progressive variance reduction,” SIAM Journal on Optimization, vol. 24, no. 4, pp. 2057–2075, 2014.
  • [47] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in optimization, vol. 1, no. 3, pp. 123–231, 2013.
  • [48] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, “Efficient projections onto the 1-ball for learning in high dimensions,” in Proc. the 25th international conference on Machine learning.   ACM, 2008, pp. 272–279.
  • [49] J. B. Kruskal, “Nonmetric multidimensional scaling: a numerical method,” Psychometrika, vol. 29, no. 2, pp. 115–129, 1964.
  • [50] R. Bro and N. D. Sidiropoulos, “Least squares algorithms under unimodality and non-negativity constraints,” Journal of Chemometrics: A Journal of the Chemometrics Society, vol. 12, no. 4, pp. 223–247, 1998.
  • [51] H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers.   Springer, 1985, pp. 102–109.
  • [52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [53] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
  • [54] T. Dozat, “Incorporating nesterov momentum into adam,” 2016.
  • [55] X. Li and F. Orabona, “On the convergence of stochastic gradient descent with adaptive stepsizes,” arXiv preprint arXiv:1805.08114, 2018.
  • [56] X. Chen, S. Liu, R. Sun, and M. Hong, “On the convergence of a class of adam-type algorithms for non-convex optimization,” arXiv preprint arXiv:1808.02941, 2018.
  • [57] K. Huang, N. Sidiropoulos, E. Papalexakis, C. Faloutsos, P. Talukdar, and T. Mitchell, “Principled neuro-functional connectivity discovery,” in Proc. SIAM SDM 2015, 2015.
  • [58] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos, “Blind separation of quasi-stationary sources: Exploiting convex geometry in covariance domain,” IEEE Trans. Signal Process., vol. 63, no. 9, pp. 2306–2320, May 2015.
  • [59] L. D. Lathauwer and J. Castaing, “Blind identification of underdetermined mixtures by simultaneous matrix diagonalization,” IEEE Trans. Signal Process., vol. 56, no. 3, pp. 1096 –1105, Mar. 2008.
  • [60] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” The Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.

Xiao Fu (S’12-M’15) is an Assistant Professor in the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States. He received his Ph.D. degree in Electronic Engineering from The Chinese University of Hong Kong (CUHK), Hong Kong, in 2014. He was a Postdoctoral Associate in the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, United States, from 2014 to 2017. His research interests include the broad area of signal processing and machine learning. He received a Best Student Paper Award at ICASSP 2014. Two of his co-authored papers received Best Student Paper Awards at IEEE CAMSAP 2015 and IEEE MLSP 2019, respectively. He serves as the treasurer of IEEE Signal Processing Society Oregon Chapter. He also serves as a member of the EURASIP Technical Area Committee in Signal Processing for Multisensor Systems for the term of 2020-2023. He is a member of IEEE.

Shahana Ibrahim received her B.Tech. degree in Electronics and Communication Engineering from National Institute of Technology, Calicut, India, in 2012. She had been working as System Validation Engineer at Texas Instruments, Bengaluru, India, from 2012 to 2017. She received her M.S. degree in Electrical Engineering from Oregon State University, Corvallis, Oregon, United States, in 2019. She is currently pursuing her PhD degree in Electrical Engineering at Oregon State University, Corvallis, Oregon, United States. Her research interests are in the broad areas of statistical machine learning and signal processing.

Hoi-To Wai (S’11–M’18) received his PhD degree from Arizona State University (ASU) in Electrical Engineering in Fall 2017, B. Eng. (with First Class Honor) and M. Phil. degrees in Electronic Engineering from The Chinese University of Hong Kong (CUHK) in 2010 and 2012, respectively. He is an Assistant Professor in the Department of Systems Engineering & Engineering Management at CUHK. He has held research positions at ASU, UC Davis, Telecom ParisTech, Ecole Polytechnique, LIDS, MIT. Hoi-To’s research interests are in the broad area of signal processing, machine learning and distributed optimization, with a focus of their applications to network science. His dissertation has received the 2017’s Dean’s Dissertation Award from the Ira A. Fulton Schools of Engineering of ASU and he is a recipient of a Best Student Paper Award at ICASSP 2018.

Cheng Gao is a Computer Science PhD student at University of Missouri, Columbia, Missouri, United States. He is currently focusing his research on sparse learning and bioinformatics. He earned his Master’s degree in Electrical and Computer Engineering from Oregon State University, Corvallis, Oregon, United States, in 2019 with research focus on large-scale tensor decomposition algorithms. He received his Bachelor’s degree in Automation from Wuhan University of Technology, China in 2016.

Kejun Huang received his B.Eng. degree in communication engineering from the Nanjing University of Information Science and Technology, China, in 2010 and his Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, in 2016. He is an assistant professor in the Department of Computer and Information Science and Engineering at the University of Florida, Gainesville. He was a postdoctoral associate in the Department of Electrical and Computer Engineering at the University of Minnesota, Minneapolis, from 2016 to 2018. His research interests include machine learning, signal processing, optimization, and statistics. He is a Member of the IEEE.