Mini-batch multiplicative updates for NMF with \(\beta\)-divergence

In [1] we propose mini-batch stochastic algorithms to perform NMF efficiently on large data matrices. Besides the stochastic aspect, the mini-batch approach allows exploiting intensive computing devices such as general-purpose graphics processing units to decrease the processing time further and, in some cases, to outperform coordinate-descent approaches.

NMF with \(\beta\)-divergence and multiplicative updates

Consider the (nonnegative) matrix \(\textbf{V}\in\mathbb{R}_+^{F\times N}\). The goal of NMF [2] is to find a factorisation for \(\textbf{V}\) of the form:

(1)\[\textbf{V} \approx \textbf{W}\textbf{H}\]

where \(\textbf{W}\in\mathbb{R}_+^{F\times K}\), \(\textbf{H}\in\mathbb{R}_+^{K\times N}\) and \(K\) is the number of components in the decomposition. Estimating the NMF model is usually cast as the following optimisation problem:

\[\min_{\textbf{W},\textbf{H}}D(\textbf{V} | \textbf{W}\textbf{H})\quad\mathrm{s.t.}\quad\textbf{W}\geq 0,\ \textbf{H}\geq 0\mathrm{,}\]

where \(D\) is a separable divergence of the form:

\[D(\textbf{V}|\textbf{W}\textbf{H}) = \sum\limits_{f=1}^{F}\sum\limits_{n=1}^{N}d([\textbf{V}]_{fn}|[\textbf{WH}]_{fn})\textrm{,}\]

where \([\cdot]_{fn}\) denotes the element in row \(f\) and column \(n\) of a matrix and \(d\) is a scalar cost function. A common choice for the cost function is the \(\beta\)-divergence [3] [4] [5], defined as follows:

(2)\[\begin{split}d_{\beta}(x|y)\triangleq \begin{cases} \frac{1}{\beta(\beta -1)} (x^{\beta} + (\beta -1)y^{\beta} - \beta xy^{(\beta-1)})&\beta\in\mathbb{R}\backslash\{0,1\}\nonumber\\ x\log\frac{x}{y} - x + y&\beta=1\nonumber\\ \frac{x}{y} -\log\frac{x}{y} - 1&\beta=0\nonumber\textrm{.} \end{cases}\end{split}\]
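For illustration, the definition (2) translates directly into code. Below is a minimal NumPy sketch (the function name `beta_divergence` is ours; strictly positive entries are assumed so as to avoid the indeterminate \(0\log 0\) and divisions by zero):

    import numpy as np

    def beta_divergence(x, y, beta):
        """beta-divergence d_beta(x|y) of Eq. (2), summed over all entries.

        x, y: arrays of the same shape with strictly positive entries.
        beta=2 gives (half) the squared Euclidean distance, beta=1 the
        generalised KL divergence and beta=0 the IS divergence.
        """
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        if beta == 1:
            return np.sum(x * np.log(x / y) - x + y)
        if beta == 0:
            return np.sum(x / y - np.log(x / y) - 1)
        return np.sum((x ** beta + (beta - 1) * y ** beta
                       - beta * x * y ** (beta - 1)) / (beta * (beta - 1)))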

Popular cost functions such as the Euclidean distance, the generalised KL divergence [6] and the IS divergence [7] are all particular cases of the \(\beta\)-divergence (obtained for \(\beta=2\), \(1\) and \(0\), respectively). The use of the \(\beta\)-divergence for NMF has been studied extensively by Févotte and Idier [8]. In most cases the NMF problem is solved with a two-block coordinate descent approach: the factors \(\textbf{W}\) and \(\textbf{H}\) are optimised alternately. The sub-problem in one factor can then be considered as a nonnegative least squares (NNLS) problem [9]. The approach implemented here to solve these NNLS problems relies on the multiplicative update (MU) rules introduced by Lee and Seung [2] for the Euclidean distance and later generalised to the \(\beta\)-divergence [9]:

(3)\[\textbf{H}\gets \textbf{H}\odot\frac{\textbf{W}^T\left[(\textbf{W}\textbf{H})^{\beta-2}\odot\textbf{V}\right]} {\textbf{W}^T(\textbf{W}\textbf{H})^{\beta-1}}\]
(4)\[\textbf{W}\gets \textbf{W}\odot\frac{\left[(\textbf{W}\textbf{H})^{\beta-2}\odot\textbf{V}\right]\textbf{H}^T } {(\textbf{W}\textbf{H})^{\beta-1}\textbf{H}^T}\mathrm{;}\]

where \(\odot\) denotes the element-wise (Hadamard) product, and division and exponentiation are also element-wise. The matrices \(\textbf{H}\) and \(\textbf{W}\) are then updated according to an alternating update scheme.
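As a reference point for the mini-batch variants discussed next, a plain NumPy sketch of this alternating scheme could look as follows (the function names are ours, and the small `eps` added to the denominators is a common numerical safeguard, not part of the updates (3)-(4)):

    import numpy as np

    def mu_update_h(V, W, H, beta, eps=1e-12):
        """One multiplicative update of H, Eq. (3)."""
        WH = W @ H
        return H * (W.T @ (WH ** (beta - 2) * V)) / (W.T @ (WH ** (beta - 1)) + eps)

    def mu_update_w(V, W, H, beta, eps=1e-12):
        """One multiplicative update of W, Eq. (4)."""
        WH = W @ H
        return W * ((WH ** (beta - 2) * V) @ H.T) / ((WH ** (beta - 1)) @ H.T + eps)

    def nmf_mu(V, K, beta=1, n_iter=100, seed=0):
        """Alternating MU scheme: update H and W in turn for n_iter iterations."""
        rng = np.random.default_rng(seed)
        F, N = V.shape
        W = rng.random((F, K)) + 1e-3   # random nonnegative initialisation
        H = rng.random((K, N)) + 1e-3
        for _ in range(n_iter):
            H = mu_update_h(V, W, H, beta)
            W = mu_update_w(V, W, H, beta)
        return W, H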

Mini-batch multiplicative updates for NMF

The standard update scheme requires the complete matrix \(\textbf{V}\). When considering time series with a large number of data points (or time series that keep expanding in time), running this algorithm can become prohibitive. Capitalising on the separability of the divergence (2), it is possible to perform NMF on mini-batches of data to reduce the computational burden or to allow for parallel computations [10].

When considering time series as defined above, each column \(V_n\) of the matrix \(\textbf{V}\) contains all the features for a specific time frame and each row represents a particular feature along time. The number of rows is then a parameter of the low-level representation and only the number of columns grows with the amount of data. Therefore, in contrast to the approach proposed by Simsekli et al. [10], we decide to decompose the matrix \(\textbf{V}\) into \(B\) mini-batches of time frames, each containing all the features for the corresponding frames.

Fig. 1 Mini-batch decomposition of the data matrix \(\textbf{V}\) (figure: _images/batch.png)
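Concretely, this decomposition amounts to slicing \(\textbf{V}\) column-wise into \(B\) groups of consecutive time frames; a minimal sketch (the helper name `column_batches` is ours):

    import numpy as np

    def column_batches(N, B):
        """Indices of the B mini-batches: consecutive groups of time frames
        (columns of V), as illustrated in Fig. 1."""
        return np.array_split(np.arange(N), B)

    # Example: V_b = V[:, batches[b]] with batches = column_batches(V.shape[1], B)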

For each batch \(b\), the update of the activations \(\textbf{H}_b\) corresponding to \(\textbf{V}_b\) can be obtained independently from all the other batches with the standard MU rule (3). The update of the bases \(\textbf{W}\), on the other hand, requires the whole matrix \(\textbf{V}\) to be processed. The positive contribution to the gradient \(\nabla^+\textbf{W}\) and the negative contribution to the gradient \(\nabla^-\textbf{W}\) are accumulated along the mini-batches, and \(\textbf{W}\) is updated once per epoch with the standard MU rule (4), as described in [1]. Note that this is theoretically similar to the standard full-gradient (FG) MU rules.
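One epoch of this scheme could be sketched as follows, reusing the column slices from above; this is only an illustration under our own naming, not the reference implementation from [1]:

    import numpy as np

    def cyclic_minibatch_epoch(V, W, H, batches, beta, eps=1e-12):
        """One epoch of cyclic mini-batch MU: each H_b is updated with Eq. (3),
        the two gradient contributions for W are accumulated over the batches,
        and W is updated once at the end of the epoch with Eq. (4)."""
        grad_neg = np.zeros_like(W)   # numerator of Eq. (4), accumulated
        grad_pos = np.zeros_like(W)   # denominator of Eq. (4), accumulated
        for idx in batches:
            Vb, Hb = V[:, idx], H[:, idx]
            WHb = W @ Hb
            # update the activations of batch b (Eq. 3)
            Hb = Hb * (W.T @ (WHb ** (beta - 2) * Vb)) / (W.T @ (WHb ** (beta - 1)) + eps)
            H[:, idx] = Hb
            # accumulate the contributions to the gradient of W with the new H_b
            WHb = W @ Hb
            grad_neg += (WHb ** (beta - 2) * Vb) @ Hb.T
            grad_pos += (WHb ** (beta - 1)) @ Hb.T
        W = W * grad_neg / (grad_pos + eps)
        return W, H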

Stochastic mini-batch updates

When aiming at minimising a Euclidean distance, it has been shown that drawing samples randomly can greatly improve the convergence speed in dictionary learning and, by extension, in NMF [11] [12]. We propose to apply a similar idea to MU in order to take advantage of the wide variety of divergences covered by the \(\beta\)-divergence. Instead of selecting the mini-batches sequentially from the original data \(\textbf{V}\) as above, we propose to draw mini-batches randomly from a shuffled version of \(\textbf{V}\). The mini-batch update of \(\textbf{H}\) still needs one full pass through the data, but a single mini-batch can be used to update \(\textbf{W}\), in a way analogous to stochastic gradient (SG) methods [13].

Two different strategies can be considered to update \(\textbf{W}\). The first option is to update \(\textbf{W}\) for each mini-batch; this approach is denoted asymmetric SG mini-batch MU (ASG-MU), as \(\textbf{H}\) and \(\textbf{W}\) are updated asymmetrically (the full \(\textbf{H}\) is updated once per epoch while \(\textbf{W}\) is updated for each mini-batch). The second option is to update \(\textbf{W}\) only once per epoch, on a randomly selected mini-batch; this approach will be referred to as greedy SG mini-batch MU (GSG-MU). A detailed description of the algorithms can be found in [1]. Both strategies are sketched below.
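The two strategies only differ in when \(\textbf{W}\) is refreshed; the following sketch illustrates both under our own naming (the shuffling and the per-batch updates follow the description above, and the `eps` safeguard is an implementation choice):

    import numpy as np

    def sg_minibatch_epoch(V, W, H, B, beta, greedy=False, seed=0, eps=1e-12):
        """One epoch of stochastic mini-batch MU on a shuffled version of V.

        greedy=False -> ASG-MU: W is updated after every mini-batch.
        greedy=True  -> GSG-MU: W is updated once, on one random mini-batch.
        """
        rng = np.random.default_rng(seed)
        batches = np.array_split(rng.permutation(V.shape[1]), B)

        def update_w(W, idx):
            Vb, Hb = V[:, idx], H[:, idx]
            WHb = W @ Hb
            return W * ((WHb ** (beta - 2) * Vb) @ Hb.T) / ((WHb ** (beta - 1)) @ Hb.T + eps)

        for idx in batches:
            Vb, Hb = V[:, idx], H[:, idx]
            WHb = W @ Hb
            # update the activations of the current mini-batch (Eq. 3)
            H[:, idx] = Hb * (W.T @ (WHb ** (beta - 2) * Vb)) / (W.T @ (WHb ** (beta - 1)) + eps)
            if not greedy:                      # ASG-MU: per-batch W update
                W = update_w(W, idx)
        if greedy:                              # GSG-MU: one W update per epoch
            W = update_w(W, batches[rng.integers(len(batches))])
        return W, H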

Stochastic average gradient mini-batch updates

Stochastic average gradient (SAG) [14] is a method recently introduced for optimising cost functions that are a sum of convex functions (which is the case here). SAG provides an intermediate between FG-like and SG-like methods: it maintains a convergence rate close to that of FG methods with a complexity comparable to that of SG methods.

We propose to apply a SAG-like method to update the dictionary \(\textbf{W}\) in mini-batch based NMF. Note that, as a full pass through the data is needed to update \(\textbf{H}\), it would not make sense to apply SAG there. The key idea is that for each mini-batch \(b\), instead of using only the gradient computed locally to update \(\textbf{W}\), we propose to use the mini-batch data to update the negative and positive contributions to the full gradient:

\[\begin{split}\begin{align} \nabla^-\textbf{W}&\gets\nabla^-\textbf{W} - \nabla^-_{\mathrm{old}}\textbf{W}_b + \nabla^-_{\mathrm{new}}\textbf{W}_b\\ \nabla^+\textbf{W}&\gets\nabla^+\textbf{W} - \nabla^+_{\mathrm{old}}\textbf{W}_b + \nabla^+_{\mathrm{new}}\textbf{W}_b \end{align}\end{split}\]

where \(\nabla^-_{\mathrm{new}}\textbf{W}_b\) and \(\nabla^+_{\mathrm{new}}\textbf{W}_b\) are the negative and positive contributions to the gradient of \(\textbf{W}\) computed on the mini-batch \(b\), respectively, and \(\nabla^-_{\mathrm{old}}\textbf{W}_b\) and \(\nabla^+_{\mathrm{old}}\textbf{W}_b\) are the previous negative and positive contributions to the gradient of \(\textbf{W}\) for the mini-batch \(b\).
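A minimal sketch of how these accumulators could be maintained for one mini-batch is given below; the variable names, the storage of the per-batch contributions and their zero initialisation are our own illustrative choices, not the reference implementation from [1]:

    import numpy as np

    def sag_update_w(V, W, H, idx, b, grad_neg, grad_pos, old_neg, old_pos, beta, eps=1e-12):
        """Refresh the SAG accumulators with mini-batch b, then update W with Eq. (4).

        grad_neg, grad_pos: running contributions to the full gradient of W.
        old_neg[b], old_pos[b]: contributions previously computed on mini-batch b
        (initialised to zero arrays before the first epoch).
        """
        Vb, Hb = V[:, idx], H[:, idx]
        WHb = W @ Hb
        new_neg = (WHb ** (beta - 2) * Vb) @ Hb.T   # negative contribution on batch b
        new_pos = (WHb ** (beta - 1)) @ Hb.T        # positive contribution on batch b
        grad_neg = grad_neg - old_neg[b] + new_neg
        grad_pos = grad_pos - old_pos[b] + new_pos
        old_neg[b], old_pos[b] = new_neg, new_pos
        W = W * grad_neg / (grad_pos + eps)
        return W, grad_neg, grad_pos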

As above, two different strategies can be considered to update \(\textbf{W}\). The first option is to update \(\textbf{W}\) for each mini-batch; this approach is denoted asymmetric SAG mini-batch MU (ASAG-MU). The second option is to update \(\textbf{W}\) only once per epoch, on a randomly selected mini-batch; this approach is referred to as greedy SAG mini-batch MU (GSAG-MU). A detailed description of the algorithms can be found in [1].

Download

Source code available at https://github.com/rserizel/minibatchNMF

Citation

If you are using this source code, please consider citing the following paper:

Reference

  R. Serizel, S. Essid, and G. Richard, “Mini-batch stochastic approaches for accelerated multiplicative updates in nonnegative matrix factorisation with beta-divergence”, accepted for publication in Proc. of MLSP, p. 5, 2016.

Bibtex

@inproceedings{serizel2016batch,
  title={Mini-batch stochastic approaches for accelerated multiplicative updates in nonnegative matrix factorisation with beta-divergence},
  author={Serizel, Romain and Essid, Slim and Richard, Ga{\"e}l},
  booktitle={IEEE International Workshop on Machine Learning for Signal Processing (MLSP)},
  pages={5},
  year={2016},
  organization={IEEE}
}

References

[1]
  R. Serizel, S. Essid, and G. Richard, “Mini-batch stochastic approaches for accelerated multiplicative updates in nonnegative matrix factorisation with beta-divergence,” accepted for publication in Proc. of MLSP, p. 5, 2016.
[2]
  D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[3]
  A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones, “Robust and efficient estimation by minimising a density power divergence,” Biometrika, vol. 85, no. 3, pp. 549–559, 1998.
[4]
  A. Cichocki and S. Amari, “Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities,” Entropy, vol. 12, no. 6, pp. 1532–1568, 2010.
[5]
  S. Eguchi and Y. Kano, “Robustifing maximum likelihood estimation,” Research Memo 802, Institute of Statistical Mathematics, June 2001.
[6]
  S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, pp. 79–86, 1951.
[7]
  F. Itakura, “Minimum prediction residual principle applied to speech recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 23, no. 1, pp. 67–72, 1975.
[8]
  C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the beta-divergence,” Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
[9]
  N. Gillis, “The why and how of nonnegative matrix factorization,” in Regularization, Optimization, Kernels, and Support Vector Machines, M. Signoretto, J. A. K. Suykens, and A. Argyriou, Eds., Machine Learning and Pattern Recognition Series, pp. 257–291. Chapman & Hall/CRC, 2014.
[10]
  U. Simsekli, H. Koptagel, H. Guldaş, A. Taylan Cemgil, F. Oztoprak, and S. Ilker Birbil, “Parallel Stochastic Gradient Markov Chain Monte Carlo for Matrix Factorisation Models,” ArXiv e-prints, June 2015.
[11]
  J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” The Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
[12]
  A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux, “Dictionary Learning for Massive Matrix Factorization,” in Proc. of ICML, 2016.
[13]
  L. Bottou, “Online learning and stochastic approximations,” On-line Learning in Neural Networks, vol. 17, no. 9, p. 142, 1998.
[14]
  M. Schmidt, N. Le Roux, and F. Bach, “Minimizing Finite Sums with the Stochastic Average Gradient,” ArXiv e-prints, Sept. 2013.