NMF with \(\beta\)-divergence and multiplicative updates

Consider the (nonnegative) matrix \(\textbf{V}\in\mathbb{R}_+^{F\times N}\). The goal of NMF [1] is to find a factorisation for \(\textbf{V}\) of the form:

\[\textbf{V} \approx \textbf{W}\textbf{H}\label{nmf}\]

where \(\textbf{W}\in\mathbb{R}_+^{F\times K}\), \(\textbf{H}\in\mathbb{R}_+^{K\times N}\) and \(K\) is the number of components in the decomposition. Estimating the NMF model is usually cast as solving the following optimisation problem:

\[\min_{\textbf{W},\textbf{H}}D(\textbf{V} | \textbf{W}\textbf{H})\quad\mathrm{s.t.}\quad\textbf{W}\geq 0,\ \textbf{H}\geq 0\mathrm{,}\]

where \(D\) is a separable divergence such that:

\[D(\textbf{V}|\textbf{W}\textbf{H}) = \sum\limits_{f=1}^{F}\sum\limits_{n=1}^{N}d([\textbf{V}]_{fn}|[\textbf{WH}]_{fn})\textrm{,}\]

where \([\cdot]_{fn}\) denotes the element on row \(f\) and column \(n\) of a matrix and \(d\) is a scalar cost function. A common choice for the cost function is the \(\beta\)-divergence [2] [3] [4], defined as follows:

\[\begin{split}d_{\beta}(x|y)\triangleq \begin{cases} \frac{1}{\beta(\beta -1)} (x^{\beta} + (\beta -1)y^{\beta} - \beta xy^{(\beta-1)})&\beta\in\mathbb{R}\backslash\{0,1\}\nonumber\\ x\log\frac{x}{y} - x + y&\beta=1\nonumber\\ \frac{x}{y} -\log\frac{x}{y} - 1&\beta=0\nonumber\textrm{.} \end{cases}\end{split}\]
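As an illustration, the element-wise \(\beta\)-divergence above and the separable cost \(D(\textbf{V}|\textbf{W}\textbf{H})\) can be computed directly with NumPy. The following is only a minimal sketch, not the implementation from the repository; the function names and the small constant eps, used to avoid logarithms of zero and divisions by zero, are assumptions:

    import numpy as np

    def beta_divergence(x, y, beta, eps=1e-12):
        """Element-wise beta-divergence d_beta(x|y) for nonnegative arrays."""
        x = np.asarray(x, dtype=float) + eps  # eps avoids log(0) and division by zero
        y = np.asarray(y, dtype=float) + eps
        if beta == 1:    # generalised Kullback-Leibler divergence
            return x * np.log(x / y) - x + y
        if beta == 0:    # Itakura-Saito divergence
            return x / y - np.log(x / y) - 1
        # general case, beta not in {0, 1}
        return (x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1)) / (beta * (beta - 1))

    def nmf_cost(V, W, H, beta):
        """Separable cost D(V | WH): sum of element-wise beta-divergences."""
        return beta_divergence(V, W @ H, beta).sum()

With beta set to 2, 1 or 0, beta_divergence reduces to the special cases discussed next.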

Popular cost functions such as the Euclidean distance, the generalised KL divergence [5] and the IS divergence [6] are all particular cases of the \(\beta\)-divergence (obtained for \(\beta=2\), \(1\) and \(0\), respectively). The use of the \(\beta\)-divergence for NMF has been studied extensively by Févotte et al. [7]. In most cases, the NMF problem is solved with a two-block coordinate descent approach: the factors \(\textbf{W}\) and \(\textbf{H}\) are optimised alternately, and the sub-problem in one factor can then be considered as a nonnegative least squares (NNLS) problem [8]. The approach implemented here to solve these NNLS problems relies on the multiplicative update rules introduced by Lee et al. [1] for the Euclidean distance and later generalised to the \(\beta\)-divergence [7]:

\[\begin{split}\begin{align} \textbf{H}&\gets \textbf{H}\odot\frac{\textbf{W}^T\left[(\textbf{W}\textbf{H})^{\beta-2}\odot\textbf{V}\right]} {\textbf{W}^T(\textbf{W}\textbf{H})^{\beta-1}}\label{upH}\\ \textbf{W}&\gets \textbf{W}\odot\frac{\left[(\textbf{W}\textbf{H})^{\beta-2}\odot\textbf{V}\right]\textbf{H}^T } {(\textbf{W}\textbf{H})^{\beta-1}\textbf{H}^T}\mathrm{;}\label{upW} \end{align}\end{split}\]

where \(\odot\) is the element-wise (Hadamard) product, and division and exponentiation are also element-wise. The matrices \(\textbf{H}\) and \(\textbf{W}\) are then updated according to an alternating update scheme.
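To make the alternating scheme concrete, here is a minimal NumPy sketch of the multiplicative updates above. It is an illustrative sketch rather than the repository's implementation: the random nonnegative initialisation, the fixed number of iterations and the small eps added to the denominators for numerical stability are all assumptions.

    import numpy as np

    def mu_beta_nmf(V, K, beta=2.0, n_iter=200, eps=1e-12, seed=0):
        """NMF with the beta-divergence via alternating multiplicative updates.

        V : (F, N) nonnegative array, K : number of components.
        Returns nonnegative factors W (F, K) and H (K, N) with V ~ W @ H.
        """
        rng = np.random.default_rng(seed)
        F, N = V.shape
        W = rng.random((F, K)) + eps  # random nonnegative initialisation
        H = rng.random((K, N)) + eps

        for _ in range(n_iter):
            WH = W @ H
            # H <- H * [W^T ((WH)^(beta-2) * V)] / [W^T (WH)^(beta-1)]
            H *= (W.T @ (WH ** (beta - 2) * V)) / (W.T @ WH ** (beta - 1) + eps)
            WH = W @ H
            # W <- W * [((WH)^(beta-2) * V) H^T] / [(WH)^(beta-1) H^T]
            W *= ((WH ** (beta - 2) * V) @ H.T) / (WH ** (beta - 1) @ H.T + eps)

        return W, H

For instance, W, H = mu_beta_nmf(V, K=10, beta=0) would factorise a nonnegative matrix V under the IS divergence. Because each update only rescales the entries of W and H by nonnegative ratios, both factors remain nonnegative throughout the iterations.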

Download

Source code available at https://github.com/rserizel/beta_nmf

Getting Started

A short example is available at https://github.com/rserizel/beta_nmf/blob/master/BetaNMF_howto.ipynb

Citation

If you are using this source code, please consider citing the following paper:

Reference

R. Serizel, S. Essid, and G. Richard, “Mini-batch stochastic approaches for accelerated multiplicative updates in nonnegative matrix factorisation with beta-divergence”, accepted for publication in Proc. of MLSP, p. 5, 2016.

Bibtex

    @inproceedings{serizel2016batch,
      title={Mini-batch stochastic approaches for accelerated multiplicative updates in nonnegative matrix factorisation with beta-divergence},
      author={Serizel, Romain and Essid, Slim and Richard, Ga{\"e}l},
      booktitle={IEEE International Workshop on Machine Learning for Signal Processing (MLSP)},
      pages={5},
      year={2016},
      organization={IEEE}
    }

References

[1]
    D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[2]
    A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones, “Robust and efficient estimation by minimising a density power divergence,” Biometrika, vol. 85, no. 3, pp. 549–559, 1998.
[3]
    A. Cichocki and S. Amari, “Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities,” Entropy, vol. 12, no. 6, pp. 1532–1568, 2010.
[4]
    S. Eguchi and Y. Kano, “Robustifing maximum likelihood estimation,” Research Memo 802, Institute of Statistical Mathematics, June 2001.
[5]
    S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, pp. 79–86, 1951.
[6]
    F. Itakura, “Minimum prediction residual principle applied to speech recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 23, no. 1, pp. 67–72, 1975.
[7]
    C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the beta-divergence,” Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
[8]
    N. Gillis, “The why and how of nonnegative matrix factorization,” in Regularization, Optimization, Kernels, and Support Vector Machines, M. Signoretto, J. A. K. Suykens, and A. Argyriou, Eds., Machine Learning and Pattern Recognition Series, pp. 257–291. Chapman & Hall/CRC, 2014.