Abstract
Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
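To make the fusion idea concrete, the following is a minimal NumPy sketch of how the dominant ALS kernel (the matricized-tensor times Khatri-Rao product, MTTKRP) for several decompositions of the same three-way tensor might be merged into a single matrix product. It is an illustration under stated assumptions, not the Concurrent ALS library's implementation: the function names `khatri_rao`, `mttkrp_mode1`, and `fused_mttkrp_mode1` are hypothetical, only the mode-1 update is shown, and the tensor is assumed dense. For a single decomposition with R components, the product performs roughly 2·I·J·K·R floating-point operations while reading I·J·K tensor entries, so the intensity grows with R; concatenating the Khatri-Rao matrices of several decompositions replaces R with the sum of their component counts.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product of C (K x R) and B (J x R) -> (K*J x R).

    Row k*J + j of the result holds C[k, :] * B[j, :].
    """
    K, R = C.shape
    J, _ = B.shape
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

def unfold_mode1(X):
    """Mode-1 unfolding of X (I x J x K) with column index k*J + j,
    matching the row order produced by khatri_rao(C, B)."""
    I, J, K = X.shape
    return X.transpose(0, 2, 1).reshape(I, K * J)

def mttkrp_mode1(X, B, C):
    """MTTKRP for a single decomposition: an (I x JK) by (JK x R) product."""
    return unfold_mode1(X) @ khatri_rao(C, B)

def fused_mttkrp_mode1(X, Bs, Cs):
    """MTTKRP for several decompositions of the same tensor at once.

    The Khatri-Rao matrices are stacked side by side, so the tensor is
    read once and multiplied by a (JK x sum(R_i)) matrix in one product.
    Returns one (I x R_i) result per decomposition.
    """
    KR = np.hstack([khatri_rao(C, B) for B, C in zip(Bs, Cs)])
    M = unfold_mode1(X) @ KR
    # Split the wide result back into one block per decomposition.
    boundaries = np.cumsum([B.shape[1] for B in Bs])[:-1]
    return np.split(M, boundaries, axis=1)
```

Each block returned by `fused_mttkrp_mode1` equals what `mttkrp_mode1` would produce for that decomposition on its own; the remaining per-decomposition work in an ALS update (solving against the small R_i x R_i Gram matrix) is comparatively cheap and is left out of this sketch.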