### Abstract

This work explores non-negative matrix factorization based on regularizedPoisson models for recommender systems with implicit-feedback data. Theproperties of Poisson likelihood allow a shortcut for very fast computation andoptimization over elements with zero-value when the latent-factor matrices arenon-negative, making it a more suitable approach than squared loss for verysparse inputs such as implicit-feedback data. A simple and embarrassinglyparallel optimization approach based on proximal gradients is presented, whichin large datasets converges 2-3 orders of magnitude faster than its Bayesiancounterpart (Hierarchical Poisson Factorization) fit through variationalinference techniques, and 1 order of magnitude faster than implicit-ALS fitwith the Conjugate Gradient method.

### Quick Read (beta)

# Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations

###### Abstract

This work explores non-negative matrix factorization based on regularized Poisson models for recommender systems with implicit-feedback data. The properties of Poisson likelihood allow a shortcut for very fast computation and optimization over elements with zero-value when the latent-factor matrices are non-negative, making it a more suitable approach than squared loss for very sparse inputs such as implicit-feedback data. A simple and embarrassingly parallel optimization approach based on proximal gradients is presented, which in large datasets converges 2-3 orders of magnitude faster than its Bayesian counterpart (Hierarchical Poisson Factorization) fit through variational inference techniques, and 1 order of magnitude faster than implicit-ALS fit with the Conjugate Gradient method.

## 1 Introduction

In a typical scenario for recommender systems, a lot of data is available about interactions between users and items, such as users purchasing products or listening to songs, where each user interacts with only a handful of the items in the catalog – e.g. no user is going to play every song in a music service or watch every movie in a video service.

Typically, recommendation models based on collaborative filtering try to predict the entries in the user-item interactions matrix – that is, a matrix in which rows index users, columns index items, and the value at each user-item combination is the number of times the user has interacted/consumed the item or the rating she gave to it – based on minimization of some loss function such as squared difference between the predicted and the real value, with the idea that items with higher predicted values are better candidates to recommend to the user (see [10]).

In so-called explicit feedback settings, in which users provide an explicit valuation or rating of each item, such models are usually fit only to the non-missing entries in the interactions matrix, as this data signals both likes and dislikes of the user, which leads to efficient optimization procedures.

However, in so-called implicit-feedback settings, in which there are no explicit ratings but rather only event histories such as songs played by each user, it’s not enough for a good model to be fit only to the non-missing entries, as they don’t tend to signal dislikes and there can be pathological cases in which the non-missing entries all have the same value (e.g. when the matrix is binary - see [9]). In this case, it’s necessary to also consider the non-missing entries (typically with a value of zero), of which there are orders of magnitude more, resulting in a more computationally challenging problem.

Unlike Gaussian likelihood (squared loss) or Bernoulli likelihood (log loss), Poisson likelihood, when using a model that does not exponentiate its parameters, offers a very fast optimization approach for the missing entries filled with zeros, since the log-likelihood for them is given by their predicted value only (multiplied by minus one), and in low-rank matrix factorization, the sum of predicted values for all combinations of users and items can be quickly obtained by first summing the latent factors for all users (one vector) and for all items (another vector), then calculating the dot product between the resulting summed vectors.

This is not the first time that a Poisson model has been proposed for sparse matrix factorization - [3] also developed this idea, but following a Bayesian approach, while [6] improved upon it by adding a hierarchical structure and a faster optimization procedure based on variational inference, with many other works later building upon that base, also relying on variational inference (e.g. [5], [1]). While a Bayesian hierarchical formulation is more expressive, fitting such models is much slower than conventional optimization techniques on regularized models. This paper proposes an optimization-based approach towards matrix factorization that maximizes Poisson likelihood based on proximal gradients .

## 2 Low-rank matrix factorization

Low-rank matrix factorization is one of the most commonly used techniques in collaborative filtering for predicting entries in the user-item interaction matrix based only on observed user-item interactions ([10]). The idea behind it is to assign to each user $u$ and item $i$ a vector of fixed dimensionality $k$ representing arbitrary features (a.k.a. latent factors) ${\mathbf{a}}_{u}\in {\mathbb{R}}^{k},{\mathbf{b}}_{i}\in {\mathbb{R}}^{k}$ (these are the model parameters) in such a way that the value for each entry in the interactions matrix is approximated by the dot product between the features of the user and the item for that entry, i.e. ${x}_{ui}\approx \u27e8{\mathbf{a}}_{u},{\mathbf{b}}_{i}\u27e9$ or by a transformation of it ${x}_{ui}\approx f(\u27e8{\mathbf{a}}_{u},{\mathbf{b}}_{i}\u27e9)$. These features or latent factors are in turn determined through an optimization objective that aims at minimizing the difference between the predicted and the real values, e.g.

$\underset{\mathbf{A},\mathbf{B}}{\mathrm{min}}\parallel {I}_{x}(\mathbf{X}-{\mathrm{\mathbf{A}\mathbf{B}}}^{T})\parallel $ | (1) |

Where ${\mathbf{A}}_{m,k}=\left(\begin{array}{c}\hfill {\mathbf{a}}_{1}^{T},\mathrm{\dots},{\mathbf{a}}_{m}^{T}\hfill \end{array}\right),{\mathbf{B}}_{n,k}=\left(\begin{array}{c}\hfill {\mathbf{b}}_{1}^{T},\mathrm{\dots},{\mathbf{b}}_{n}^{T}\hfill \end{array}\right)$, and ${I}_{x}$ is the indicator function which is one when the entry ${x}_{ui}$ is present in the data and zero when it is missing.

As such model tends to overfit the interactions data, other additional improvements upon it are typically incorporated, such as centering the entries in the matrix, incorporating regularization on the model parameters, and adding user and item biases as additional parameters. The optimization problem is typically solved through the Alternating Least-Squares algorithm ([18] - when one factor matrix is fixed, optimizing for the other latent factor matrix is a convex optimization problem with a closed-form solution – this algorithm alternates between minimizing one or the other until convergence) or Stochastic Gradient Descent ([10]).

In the implicit-feedback case with missing-as-zero and values consisting of counts (e.g. number of times a user clicked something - note that many works instead propose a weighting scheme), the matrices $\mathbf{A}$ and $\mathbf{B}$ are sometimes constrained to have all non-negative entries, as it wouldn’t make sense to predict negative values, biases are left out (as they would not allow to predict values of zero), and entries are not centered, resulting in a minimization problem such as:

$\underset{\mathbf{A}\in {\mathbb{R}}_{+}^{m,k},\mathbf{B}\in {\mathbb{R}}_{+}^{n,k}}{\mathrm{min}}\parallel \mathbf{X}-{\mathrm{\mathbf{A}\mathbf{B}}}^{T}{\parallel}^{2}+\lambda (\parallel \mathbf{A}{\parallel}^{2}+\parallel \mathbf{B}{\parallel}^{2})$ | (2) |

This is a more challenging optimization problem, with a matrix X that usually is too large to even fit in a computer’s memory, but different methods have been devised to solve it or solve variations thereof smartly, such as implicit-ALS ([9]) along with techniques to speed it up ([17]), or BPR (Bayesian Personalized ranking), which tries to sample only some of the missing entries at each update ([15]).

## 3 Sparse Poisson regression

A typical probability distribution used for counts data is the Poisson distribution, parameterized by one variable $z>0$, with probability density function given by $p(y)={z}^{y}\mathrm{exp}(-z)/y!$ . This distribution is limited to non-negative integers and tends to produce asymmetrical and more peaked distributions that are more resemblant of real counts data than others such as the normal distribution. Poisson models in which the $z$ parameter is defined as the sum or dot product of other variables can be fit to observed data by following the maximum-likelihood principle, which translates into maximizing Poisson log-likelihood (the negative of it plus a constant is referred from here on interchangeably as Poisson loss), given by:

$ll(z)=-(z-y\mathrm{log}(z)+y!)$ | (3) |

Generalized linear models for Poisson regression usually follow the form $\mathbf{y}\sim Poisson(\mathrm{exp}(\mathbf{X}\beta ))$, where $\beta $ are the model coefficients (parameters), $\mathbf{X}$ (not to be confused with the matrix in the factorization models) is the matrix of covariates, and $\mathbf{y}$ the observed counts for each observation; but others (e.g. [8]) have also tried to perform Poisson regression for all-non-negative covariates without exponentiation, constraining the coefficients to be non-negative instead, i.e. $\mathbf{y}\sim Poisson(\mathbf{X}\beta )$, which is the approach that will be followed in this work, since exponentiated numbers would not allow for fast calculation of the sum of all entries in the ${\mathbf{X}}_{m,n}$ matrix.

As $y!$ does not depend on the model parameters, it can be left out of the minimization objective. For fitting a Poisson regression with non-negative features of dimensionality $k$ and coefficients without exponentiation to $m$ observations of covariates ${\mathbf{A}}_{m,k}=\left(\begin{array}{c}\hfill {\mathbf{a}}_{1}^{T},\mathrm{\dots},{\mathbf{a}}_{m}^{T}\hfill \end{array}\right)$ and counts ${\mathbf{b}}_{m}=\left(\begin{array}{c}\hfill {b}_{1},\mathrm{\dots},{b}_{m}\hfill \end{array}\right)$, the optimization objective (maximum likelihood estimation problem) would look as follows:

$\underset{\mathbf{x}\in {\mathbb{R}}_{+}^{k}}{\mathrm{min}}{\displaystyle \sum _{i}}{\mathbf{a}}_{i}^{T}\mathbf{x}-{b}_{i}\mathrm{log}({\mathbf{a}}_{i}^{T}\mathbf{x})$ | (4) |

Note that ${\sum}_{i}{\mathbf{a}}_{i}^{T}\mathbf{x}$ can be obtained by first summing $\mathbf{s}={\sum}_{i}{\mathbf{a}}_{i}$ and then taking its inner product with $\mathbf{x}$, i.e. ${\sum}_{i}{\mathbf{a}}_{i}^{T}\mathbf{x}={\mathbf{s}}^{T}\mathbf{x}$, something that could not be achieved with the exponentiated version, and when ${b}_{i}=0$, ${b}_{i}\mathrm{log}({\mathbf{a}}_{i}^{T}\mathbf{x})=0$ too, so for zero-valued entries, maximization of Poisson likelihood translates into minimizing ${\mathbf{a}}_{i}^{T}\mathbf{x}$. As such, the minimization objective can be re-expressed as:

$\underset{\mathbf{x}\in {\mathbb{R}}_{+}^{k}}{\mathrm{min}}{\mathbf{s}}^{T}\mathbf{x}-{\displaystyle \sum _{{b}_{i}>0}}{b}_{i}\mathrm{log}({\mathbf{a}}_{i}^{T}\mathbf{x})$ | (5) |

Adding ${l}_{2}$ regularization on the model parameters would result in an objective like this:

$\underset{\mathbf{x}\in {\mathbb{R}}_{+}^{k}}{\mathrm{min}}{\mathbf{s}}^{T}\mathbf{x}-{\displaystyle \sum _{{b}_{i}>0}}\left({b}_{i}\mathrm{log}({\mathbf{a}}_{i}^{T}\mathbf{x})\right)+\lambda \parallel \mathbf{x}{\parallel}_{2}^{2}$ | (6) |

This is a convex optimization problem, but as other works have found out ([8]), it cannot be solved through typical methods like L-BFGS-B ([2]) that rely on assumptions such as Lipschitz continuity.

Back to the matrix factorization case, if we adopt Poisson loss (negative likelihood plus a constant) and consider one of the matrices to be fixed, the optimal values for each vector of latent factors in matrices $\mathbf{A}$ and $\mathbf{B}$ are the solution of a Poisson regression problem in which ${\mathbf{a}}_{i}$ are the rows of the matrix that was fixed, and ${b}_{i}$ are the entries for the corresponding row (for users) or column (for items) in the interactions matrix $\mathbf{X}$. From the formula above, it can be seen that Poisson loss can be calculated without ever iterating through the non-zero values (their contribution is obtained through $\mathbf{s}$), which is very convenient and efficient in the implicit-feedback case as most of the entries will indeed be zero, and as will be shown later, such loss can be optimized just as efficiently.

## 4 Proximal gradient methods

While this constrained and non-exponentiated approach to Poisson regression cannot be solved through typical gradient-based methods, it still represents a convex minimization problem and there are other techniques that can indeed solve it, such as Proximal Gradient Descent, Accelerated Proximal Gradient Descent, Alternating Direction Method of Multipliers (ADMM, [13]), or Composite Mirror-Prox ([8]).

Given a decomposable convex minimization problem of the form

$\underset{\mathbf{x}\in \mathrm{\mathbf{d}\mathbf{o}\mathbf{m}}f}{\mathrm{min}}f(\mathbf{x})+h(\mathbf{x})$ | (7) |

Where $f(\mathbf{x})$ is a loss function with desirable properties such as differentiability and smoothness, and $h(\mathbf{x})$ is a regularization function which might not have these same properties, the proximal gradient descent method iteratively performs updates as follows:

[H]
Inputs Functions $f(.)$ and $h(.)$, starting point ${\mathbf{x}}_{0}$, number of steps $T$, step size sequence ${\alpha}_{1},\mathrm{\dots},{\alpha}_{T}$

Output Optimal parameters ${\mathbf{x}}^{*}$
{algorithmic}[1]
\For$1..T$
\StateSet ${\mathbf{x}}_{t+\frac{1}{2}}:={\mathbf{x}}_{t}-{\alpha}_{t}\nabla f({\mathbf{x}}_{t})$
\StateSet ${\mathbf{x}}_{t+1}:={\mathrm{\mathbf{P}\mathbf{r}\mathbf{o}\mathbf{x}}}_{\alpha h}({\mathbf{x}}_{t+\frac{1}{2}})$
\Iftermination criteria is satisfied
\Statebreak
\EndIf\EndFor\Return${\mathbf{x}}_{T}$

Where ${\mathrm{\mathbf{P}\mathbf{r}\mathbf{o}\mathbf{x}}}_{\alpha h}$ is the proximal operator, defined as

${\mathrm{\mathbf{P}\mathbf{r}\mathbf{o}\mathbf{x}}}_{\alpha h}(\mathbf{x})=\underset{\mathbf{y}\in \mathrm{\mathbf{d}\mathbf{o}\mathbf{m}\mathbf{f}}}{argmin}(h(\mathbf{y})+{\displaystyle \frac{1}{2\alpha}}\parallel \mathbf{y}-\mathbf{x}{\parallel}_{2}^{2})$ | (8) |

Intuitively, what this algorithm does is first it takes a gradient step, but then finds the nearest point that is in the domain and penalizes it by regularization and distance to the unconstrained point where the gradient step would take it otherwise. For more details and explanations of how and why this method works, see [13]. The Poisson regression variation introduced before can also be expressed in this canonical form by letting:

$f(\mathbf{x})=-{\displaystyle \sum _{{b}_{i}>0}}{b}_{i}\mathrm{log}({\mathbf{a}}_{i}^{T}\mathbf{x}),h(\mathbf{x})={\mathbf{s}}^{T}\mathbf{x}+\lambda \parallel \mathbf{x}{\parallel}_{2}^{2}$ | (9) |

The $h(\mathbf{x})$ function is proximal-friendly. In order to obtain its proximal operator, it’s easy to calculate:

$\frac{\partial}{\partial \mathbf{y}}}({\mathbf{s}}^{T}\mathbf{y}+\lambda \parallel \mathbf{y}{\parallel}_{2}^{2}+{\displaystyle \frac{1}{2\alpha}}\parallel \mathbf{y}-\mathbf{x}{\parallel}_{2}^{2})=2\lambda \mathbf{y}+\mathbf{s}+{\displaystyle \frac{\mathbf{y}-\mathbf{x}}{\alpha}$ | (10) |

By setting it to zero, we obtain:

${\mathrm{\mathbf{P}\mathbf{r}\mathbf{o}\mathbf{x}}}_{\alpha h}(\mathbf{x})=\mathrm{max}\{0,{\displaystyle \frac{\mathbf{x}-\alpha \mathbf{s}}{2\lambda \alpha +1}}\}$ | (11) |

Similarly, for ${l}_{1}$ regularization, we would obtain:

${\mathrm{\mathbf{P}\mathbf{r}\mathbf{o}\mathbf{x}}}_{\alpha {h}_{{l}_{1}}}(\mathbf{x})=\mathrm{max}\{0,\mathbf{x}-\alpha (\lambda +\mathbf{s})\}$ | (12) |

The gradient of $f(.)$ is given by the formula:

$\nabla f(\mathbf{x})=-{\displaystyle \sum _{{b}_{i}>0}}{\displaystyle \frac{{b}_{i}}{{\mathbf{a}}_{i}^{T}\mathbf{x}}}{\mathbf{a}}_{i}$ | (13) |

For sparse Poisson regression with ${l}_{2}$ regularization, this would translate into the following update rules:

[H]
Inputs Functions $f(.)$ and $h(.)$, starting point ${\mathbf{x}}_{0}$, number of steps $T$, step size sequence ${\alpha}_{1},\mathrm{\dots},{\alpha}_{T}$

Output Optimal parameters ${\mathbf{x}}^{*}$
{algorithmic}[1]
\For$1..T$
\StateSet ${\mathbf{x}}_{t+\frac{1}{2}}:={\mathbf{x}}_{t}-{\alpha}_{t}{\sum}_{{b}_{i}>0}\frac{-{b}_{i}}{{\mathbf{a}}_{i}^{T}\mathbf{x}}{\mathbf{a}}_{i}$
\StateSet ${\mathbf{x}}_{t+1}:=\mathrm{max}\{0,\frac{{\mathbf{x}}_{t+\frac{1}{2}}-\alpha \mathbf{s}}{2\lambda \alpha +1}\}$
\Iftermination criteria is satisfied
\Statebreak
\EndIf\EndFor\Return${\mathbf{x}}_{T}$

In this formulation, $f(\mathbf{x})$ is an unbounded problem in which the solution is $\mathbf{x}=\mathrm{\infty}$, but its gradient is still well behaved. It’s also possible to define the functions as $f(\mathbf{x})=-{\sum}_{{b}_{i}>0}{b}_{i}\mathrm{log}({\mathbf{a}}_{i}^{T}\mathbf{x})+\lambda \parallel \mathbf{x}{\parallel}_{2}^{2},h(\mathbf{x})={\mathbf{s}}^{T}\mathbf{x}$ too, since the ${l}_{2}$ norm meets the necessary criteria, and in this case $f(\mathbf{x})$ would reach its minimum at a non-infinite $\mathbf{x}$.

Proximal gradient descent is usually not the preferred method for these types of problems, and tends to require more updates (steps) than other methods, but note for now that, if doing just one iteration, one proximal gradient update is much faster to compute than one update for ADMM or Composite Mirror-Prox.

Without going into much detail, $f(\mathbf{x})$ does not have a closed-form proximal operator, and one iteration of the ADMM updates (primal-dual method) for the same problem would look as follows:

[H] {algorithmic}[1] \State${\mathbf{x}}_{t+1}:=\mathbf{L}\mathtt{\text{-}}\mathrm{\mathbf{B}\mathbf{F}\mathbf{G}\mathbf{S}}\mathtt{\text{-}}\mathbf{B}({\mathbf{z}}_{t}-{\mathbf{u}}_{t},f,\nabla f,\mathbf{x}\ge 0)$ \State${\mathbf{z}}_{t+1}:=\mathrm{max}\{0,\frac{{\mathbf{x}}_{t+1}+{\mathbf{u}}_{t}-\alpha \mathbf{s}}{2\lambda \alpha +1}\}$ \State${\mathbf{u}}_{t+1}:={\mathbf{u}}_{t}+{\mathbf{x}}_{t+1}-{\mathbf{z}}_{t+1}$

Similarly, one Composite Mirror-Prox update, while perhaps making more progress than a proximal gradient update, would also be more expensive to compute, and require keeping additional extra variables in memory.

It’s also possible to add second-order information about $f(\mathbf{x})$ and use Proximal Newton ([14]) instead of Proximal Gradient Descent. When following the approach including the ${l}_{2}$ norm into $f(\mathbf{x})$, the gradient and hessian are given by:

$\nabla f(\mathbf{x})=-{\displaystyle \sum _{{b}_{i}>0}}{\displaystyle \frac{{b}_{i}}{{\mathbf{a}}_{i}^{T}\mathbf{x}}}{\mathbf{a}}_{i}+2\lambda \mathbf{x}$ | (14) | ||

$H(\mathbf{x})={\nabla}^{2}f(\mathbf{x})={\displaystyle \frac{{\mathbf{A}}^{T}\mathbf{A}\odot \mathbf{b}}{{(\mathrm{\mathbf{A}\mathbf{x}})}^{2}}}+2\lambda I$ | (15) |

Where $\mathbf{A}\odot \mathbf{b}$ denotes element-wise multiplication. Newton’s step is not a simple negative gradient step, but is given by ${({\nabla}^{2}f)}^{-1}g$ (with $g$ being the gradient), and the proximal function is in Newton’s case better defined not as the squared difference in ${l}_{2}$ norm between two points, but as the difference by the norm induced by the hessian $H$, i.e.

${\mathrm{\mathbf{P}\mathbf{r}\mathbf{o}\mathbf{x}}}_{h}^{H}(\mathbf{x})=\underset{\mathbf{y}\in \mathrm{\mathbf{d}\mathbf{o}\mathbf{m}}f}{argmin}h(\mathbf{y})+{\displaystyle \frac{1}{2}}\parallel \mathbf{x}-\mathbf{y}{\parallel}_{H}^{2}=\underset{\mathbf{y}\in \mathrm{\mathbf{d}\mathbf{o}\mathbf{m}}f}{argmin}h(\mathbf{y})+{\displaystyle \frac{1}{2}}{(\mathbf{y}-\mathbf{x})}^{T}H(\mathbf{y}-\mathbf{x})$ | (16) |

In the case of $h(\mathbf{x})={\mathbf{s}}^{T}\mathbf{x}$, the Proximal Newton operator is the solution of a quadratic program (QP), which can be expressed in the canonical quadratic form used by most solvers as:

${\mathrm{\mathbf{P}\mathbf{r}\mathbf{o}\mathbf{x}}}_{h}^{H}(\mathbf{x})=\underset{\mathbf{y}}{argmin}{\displaystyle \frac{1}{2}}{\mathbf{y}}^{T}H\mathbf{y}+{({\displaystyle \frac{\mathbf{s}}{2}}-H{\mathbf{x}}^{T})}^{T}\mathbf{y}\mathit{\hspace{1em}}s.t.\mathbf{y}\ge 0$ | (17) |

Further, a line search can then be performed, with the direction being ${\mathbf{x}}_{g}={\mathrm{\mathbf{P}\mathbf{r}\mathbf{o}\mathbf{x}}}_{h}^{H}(\mathbf{x}-{(H(\mathbf{x}))}^{-1}\nabla f(\mathbf{x}))-\mathbf{x}$.

Again, Newton iterations are also much slower than regular gradient iterations, and require the hessian to be positive semi-definite, which will not always be the case if the regularization parameters is not large enough.

## 5 Poisson matrix factorization

Producing a maximum-likelihood estimate of a Poisson model with regularization for the case of matrix factorization would translate into solving:

$\underset{\mathbf{A}\in {\mathbb{R}}_{+}^{m,k},\mathbf{B}\in {\mathbb{R}}_{+}^{n,k}}{\mathrm{max}}{\displaystyle \sum _{u}^{m}}{\displaystyle \sum _{i}^{n}}(-{\mathbf{a}}_{u}^{T}{\mathbf{b}}_{i}+{x}_{ui}\mathrm{log}({\mathbf{a}}_{u}^{T}{\mathbf{b}}_{i})-{x}_{ui}!)-\lambda (\parallel \mathbf{A}{\parallel}_{2}^{2}+\parallel \mathbf{B}{\parallel}_{2}^{2})$ | (18) |

While at first glance this might not look like a computationally-tractable problem, from the previous sections, we know that it can equivalently be re-expressed as follows:

$\underset{\mathbf{A}\in {\mathbb{R}}_{+}^{m,k},\mathbf{B}\in {\mathbb{R}}_{+}^{n,k}}{\mathrm{min}}{\mathbf{s}}_{A}^{T}{\mathbf{s}}_{B}-\left({\displaystyle \sum _{{x}_{ui}>0}}{x}_{ui}\mathrm{log}({\mathbf{a}}_{u}^{T}{\mathbf{b}}_{i})\right)+\lambda (\parallel \mathbf{A}{\parallel}_{2}^{2}+\parallel \mathbf{B}{\parallel}_{2}^{2})$ | (19) |

Where ${\mathbf{s}}_{A}={\sum}_{i}^{m}{\mathbf{a}}_{i}$ and ${\mathbf{s}}_{B}={\sum}_{i}^{n}{\mathbf{b}}_{i}$.

This is a much faster objective to evaluate, but the problem is not convex. Following the same alternating minimization idea as in ALS, note that if one of the $\mathbf{A}$ or $\mathbf{B}$ matrices in the factorization problem is fixed, obtaining the optimal values of other matrix is a convex minimization problem which reduces to performing one Poisson regression for each vector of the other matrix as explained before.

Note however that, in order to make progress in the desired minimization objective, since the matrices to optimize are being alternated, it’s not strictly necessary to run the optimization routine for each vector until convergence each time, but rather just to make progress towards the optimum of each vector in each pass, and it might make more sense to spend the time applying updates to the other matrix than applying multiple updates to the same matrix. With only one update, Accelerated Proximal Gradient Descent would reduce to just regular Proximal Gradient Descent.

Putting everything together, the optimization routine for the Poisson factorization model, with gamma initialization and step sizes decreasing by half at each step, would be as follows:

[H]
Inputs Sparse matrix $\mathbf{X}\in {\mathbb{R}}_{+}^{m,n}$, number of factors $k$, initial step size $\alpha $, regularization parameter $\lambda $, number of iterations $T$, number of updates per iteration $\tau $

Outputs Latent factors ${\mathbf{A}}^{*},{\mathbf{B}}^{*}$
{algorithmic}[1]
\StateSample ${\mathbf{A}}_{m,k}\sim Gamma(1,1),{\mathbf{B}}_{n,k}\sim Gamma(1,1)$
\For$1..T$
\StateCalculate ${\mathbf{s}}_{B}={\sum}_{i}^{n}{\mathbf{b}}_{i}$
\For$u=1..m$
\For$1..\tau $
\State${\mathbf{a}}_{u}:=\mathrm{max}\{0,\left({\mathbf{a}}_{u}+\alpha {\sum}_{{x}_{ui}>0}\frac{{x}_{ui}}{{\mathbf{a}}_{u}^{T}{\mathbf{b}}_{i}}{\mathbf{b}}_{i}-\alpha {\mathbf{s}}_{B}\right)/\left(2\lambda \alpha +1\right)\}$
\EndFor\EndFor

Calculate ${\mathbf{s}}_{A}={\sum}_{i}^{m}{\mathbf{a}}_{i}$ \For$i=1..n$ \For$1..\tau $ \State${\mathbf{b}}_{i}:=\mathrm{max}\{0,\left({\mathbf{b}}_{i}+\alpha {\sum}_{{x}_{ui}>0}\frac{{x}_{ui}}{{\mathbf{a}}_{u}^{T}{\mathbf{b}}_{i}}{\mathbf{a}}_{u}-\alpha {\mathbf{s}}_{A}\right)/\left(2\lambda \alpha +1\right)\}$ \EndFor\EndFor

Update $\alpha :=\alpha /2$ \EndFor\Return$\mathbf{A},\mathbf{B}$

Since the updates for each row in the matrix being optimized are independent of each other, they can be calculated in parallel or in a distributed setting by sharing between nodes the $\mathbf{s}$ vector, the other matrix, and non-zero entries for that row/column in the $\mathbf{X}$ matrix. An efficient implementation of the algorithm requires duplicating the $\mathbf{X}$ matrix in row-sparse and column-sparse formats, but no additional intermediate variables are needed. The implementation developed here is made open-source and freely available^{1}^{1}
1
https://github.com/david-cortes/poismf.

The regularization might also be changed to ${l}_{1}$ without additional hassle by using the formula in (12) instead, which is more likely to result in some factors having a value of exactly zero.

## 6 Numerical stability and convergence

This model was fit to different datasets for collaborative filtering with both implicit and explicit feedback data, and it displayed some interesting behaviors. As expected, there can be issues with numerical precision and instability (note that a large step size would turn the whole vector to zeros), but it was possible to fit the model with the same procedure and hyperparameters to many different datasets, obtaining good results in all of them.

While iterations of HPF are guaranteed to monotonically increase the training set likelihood, proximal gradient iterations are not, and using the wrong step size might result in all the factors becoming zero or very close to zero, after which the updates would be undefined (division by zero and logarithm of zero). As such, the step sizes and regularization parameter need to be chosen carefully - too much or too little in either of them, and the parameters will become undefined or be too large for typical computer floating point representations. A line search can also help in this regard, but it was possible to save the time required for it by a good combination of hyperparameters instead.

While it is beneficial to perform more than one update per iteration, in practice $\tau =1$ is enough to reach convergence when looking at ranking metrics rather than Poisson loss. The procedure might be terminated after running for a fixed number of iterations, or by some termination criteria based on training or validation-set likelihood or other metric.

A perhaps more theoretically sound approach would be to use the alternative with ${l}_{2}$ norm incorporated into $f(.)$ rather than into $h(.)$, and to use an initial step size that is the same at the beginning of each iteration, decreasing instead with further updates to the same user/item factors ($\tau $ in the algorithm). This approach however required more iterations to converge (in terms of ranking metrics), with each iteration being much slower as it contains multiple updates, and did not lead to better final results. A different approach using a Conjugate Gradient method ([11]) with line search and lower regularization values was able to deliver very close results according to ranking metrics, and was less prone to numeric instability, but was also slower.

The number of steps required to reach a local optimum is also something variable – depending on the choice of step size and regularization, for some datasets such as the smaller MovieLens ([7]) ones it can do it in as little as 3 iterations, and oftentimes benefits from early stopping. The point at which to stop it however is hard to determine, and oftentimes later iterations will only decrease training likelihood.

PF and HPF seem to produce wildly different likelihood values, but despite these large differences, the rankings that they induce on the items are not as dissimilar. While HPF produces relatively large variations in the values of latent factors, in PF with ${l}_{2}$ regulatization these can all be very close to each other with some minor variation that still produces a good ranking - the actual predicted values $\widehat{{x}_{ui}}$ still resemble gamma distributions though.

In the datasets experimented with, large step sizes coupled with low regularization result in the parameters quickly becoming undefined. A choice of $\alpha ={10}^{-3}$ or ${10}^{-4}$ can result in reaching an optimum in 2-3 iterations, but more often than not, it results in failed optimization, or in reaching an optimum quickly but then later iterations resulting in undefined parameters. Regularization parameters lower than ${10}^{4}-{10}^{5}$ also seem to result in failure.

A range of relatively safe choices seem to be $\alpha ={10}^{-6}\mathtt{\text{-}}{\mathrm{\hspace{0.25em}10}}^{-8},\lambda ={10}^{7}\mathtt{\text{-}}{\mathrm{\hspace{0.25em}10}}^{11},T=5\mathtt{\text{-}}\mathrm{\hspace{0.25em}20},\tau =1\mathtt{\text{-}}\mathrm{\hspace{0.25em}5}$. Performing more iterations, while still having an effect on likelihood, does not lead to better results when evaluated under ranking metrics.

Using ${l}_{1}$ regularization results in a large drop in the performance metrics (not shown here), and requires completely different hyperparameters - much lower regularization and much larger step sizes, along with more step sizes and perhaps $\tau >1$. As expected, it has the nice property that many of the latent factors are exactly zero, and they look gamma-distributed just like in HPF.

Proximal Newton was not able to find as good local optima, and in the version of the minimization objective with the regularization term incorporated into $h(.)$, the hessians are more often than not non-positive-semi-definite. When using large regularization parameters (the same ones that work for Proximal Gradient Descent), Proximal Newton converges in barely 1-2 iterations, in the sense that further updates are then zero or very close to zero for each factor, but these local optima are worse than the ones found by other methods.

ADMM for this problem is much slower, and also requires more iterations and $\tau >1$, but it does manage to find just as good optima. Composite Mirror-Prox was too prone to numerical instability in this model framework.

## 7 Experiments

The model described here, fit through the Alternating Proximal Gradients procedure, was compared against its Bayesian counterpart, Hierarchical Poisson Factorization (HPF, [6]) fit through coordinate ascent under mean-field approximation, and against implicit-ALS using the Conjugate Gradient method ([17]), by fitting models with the same number of factors to the RetailRocket^{2}^{2}
2
https://www.kaggle.com/retailrocket/ecommerce-dataset, Last.Fm ([4]) and MillionSong TasteProfile ([12]) datasets, comparing implicit-feedback evaluation metrics along with time spent (in seconds).

The RetailRocket dataset contains events ”click”, ”add to basket”, ”purchase” in an e-commerce shop, from different visitors. In order to set a count for each user-item pair, the events were given the following values: click = +1, add to basket = +3, purchase = +3, with the final value for a given user-item pair being the sum of the event values for it. The Last.Fm and MillionSong datasets contain counts of the number of times a song was played by a user in online music listening services.

The models were compared by leaving 20% of the observations at random as test set, then discarding users in the test set who were not in the training set or who had less than 3 entries in the test set. Recommendations were evaluated for each user individually by ranking the items that were not in the training set according to the model, and taking [email protected] (precision at 5) and area under the ROC curve for these predictions on a random sample of $25,000$ users, with the items in the test set as positive class. The numbers were then averaged across all these $25,000$ users. Additionally, Pearson correlation ($\rho $) was calculated between the model outputs and the values in the test set, but on the whole and not by user. For Poisson Factorization (PF) and Hierarchical Poisson Factorization (HPF), log-likelihood plus constant was also evaluated on the test set, defined as $LogLik={\sum}_{{x}_{ui}\in \mathrm{\mathbf{t}\mathbf{e}\mathbf{s}\mathbf{t}}}-{\mathbf{a}}_{u}^{T}{\mathbf{b}}_{i}+{x}_{ui}\mathrm{log}({\mathbf{a}}_{u}^{T}{\mathbf{b}}_{i})$.

The experiments were run in a Google Cloud server with Skylake CPU, using 16 cores. The times tracked include only time spent in the respective optimization routines and does not account for initialization, allocation of intermediate matrices, evaluation of termination criteria, or others.

All hyper-parameters were set to the defaults recommended in their respective implementations - for HPF, these were $a,{a}^{\prime},c,{c}^{\prime}=0.3,{b}^{\prime},{d}^{\prime}=1$, and for implicit-ALS, 15 iterations and regularization parameter ${10}^{-2}$. HPF was run until the percent increase in training likelihood was below a certain threshold instead of running for a fixed number of iterations. For PF (this work), the hyper-parameters were set to $T=10,\tau =1,\alpha ={10}^{-7},\lambda ={10}^{9}$, with only the number of factors $k$ varying - the values tried for each model were 40, 70, 100.

## 8 Conclusions and discussion

This work presented an optimization-based approach towards Poisson matrix factorization which is especially well suited to very sparse inputs. While the model is in principle similar to its Bayesian counterpart HPF, the procedure for fitting it to observed data is very different and is based on proximal gradients rather than on variational inference or MCMC.

The alternating minimization approach presented here has faster iterations than HPF (Hierarchical Poisson Factorization) or implicit-ALS, and requires fewer iterations to reach a local optimum, turning out to be 2-3 orders of magnitude faster than HPF with variational inference, and 1 order of magnite faster than implicit-ALS with the CG method for the dataset sizes evaluated, but is more prone to numerical instability issues.

Ranking metrics were evaluated on 3 implicit-feedback datasets for collaborative filtering. In 2 of these datasets, it managed to achieve a better $\rho $ than the other algorithms, but did not achieve as good [email protected] HPF also seemed to result in far worse [email protected] than implicit-ALS.

Although in some cases it did not fare as well as implicit-ALS under ranking metrics, these metrics are still reasonably good and much better than a random choice (e.g. AUC $>0.85$ in all cases), showing a lot of promise for applications of PF to very large-scale datasets.

While the model presented here was meant to be fit to user-item-count triplets only, it should be easy to expand the model to incorporate other sparse count data about users and items (such as text descriptions as bag-of-words) in the same form as in [16], which is a task for future work.

## References

- [1] Mehmet E Basbug and Barbara E Engelhardt. Coupled compound poisson factorization. arXiv preprint arXiv:1701.02058, 2017.
- [2] Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
- [3] John Canny. Gap: a factor model for discrete data. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 122–129. ACM, 2004.
- [4] O. Celma. Music Recommendation and Discovery in the Long Tail. Springer, 2010.
- [5] Laurent Charlin, Rajesh Ranganath, James McInerney, and David M Blei. Dynamic poisson factorization. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 155–162. ACM, 2015.
- [6] Prem Gopalan, Jake M Hofman, and David M Blei. Scalable recommendation with hierarchical poisson factorization. In UAI, pages 326–335, 2015.
- [7] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):19, 2016.
- [8] Niao He, Zaid Harchaoui, Yichen Wang, and Le Song. Fast and simple optimization for poisson likelihood models. arXiv preprint arXiv:1608.01264, 2016.
- [9] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 263–272. Ieee, 2008.
- [10] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
- [11] Can Li. A conjugate gradient type method for the nonnegative constraints optimization problems. Journal of Applied Mathematics, 2013, 2013.
- [12] Brian McFee, Thierry Bertin-Mahieux, Daniel PW Ellis, and Gert RG Lanckriet. The million song dataset challenge. In Proceedings of the 21st International Conference on World Wide Web, pages 909–916. ACM, 2012.
- [13] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
- [14] Nicholas G Polson, James G Scott, Brandon T Willard, et al. Proximal algorithms in statistics and machine learning. Statistical Science, 30(4):559–581, 2015.
- [15] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 452–461. AUAI Press, 2009.
- [16] Ajit P Singh and Geoffrey J Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 650–658. ACM, 2008.
- [17] Gábor Takács, István Pilászy, and Domonkos Tikk. Applications of the conjugate gradient method for implicit feedback collaborative filtering. In Proceedings of the fifth ACM conference on Recommender systems, pages 297–300. ACM, 2011.
- [18] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. Large-scale parallel collaborative filtering for the netflix prize. In International Conference on Algorithmic Applications in Management, pages 337–348. Springer, 2008.