Group-NMF with :math:`\beta`-divergence
---------------------------------------------

In [1]_ we propose an approach to speaker identification that relies on group nonnegative matrix factorisation (NMF) and that is inspired by the I-vector training procedure. Given data measured on several subjects, the key idea in group-NMF is to track inter-subject and intra-subject variations by constraining a set of common bases across subjects in the decomposition dictionaries. This idea was originally applied to the analysis of electroencephalograms [2]_. The approach presented here extends it and proposes to capture inter-class and inter-session variabilities by constraining a set of class-dependent bases across sessions and a set of session-dependent bases across classes. This approach is inspired by joint factor analysis as applied to the speaker identification problem [3]_.

In the following, we first describe the general NMF framework with the :math:`\beta`-divergence and multiplicative updates, and then present the group-NMF approach with class and session similarity constraints.

NMF with :math:`\beta`-divergence and multiplicative updates
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Consider the (nonnegative) matrix :math:`\textbf{V}\in\mathbb{R}_+^{F\times N}`. The goal of NMF [4]_ is to find a factorisation of :math:`\textbf{V}` of the form:

.. math::
   :label: nmf

   \textbf{V} \approx \textbf{W}\textbf{H}

where :math:`\textbf{W}\in\mathbb{R}_+^{F\times K}`, :math:`\textbf{H}\in\mathbb{R}_+^{K\times N}` and :math:`K` is the number of components in the decomposition. Estimating the NMF model is usually cast as the following optimisation problem:

.. math::

   \min_{\textbf{W},\textbf{H}}D(\textbf{V} | \textbf{W}\textbf{H})\quad\mathrm{s.t.}\quad\textbf{W}\geq 0,\ \textbf{H}\geq 0\mathrm{,}

where :math:`D` is a separable divergence of the form:

.. math::

   D(\textbf{V}|\textbf{W}\textbf{H}) = \sum\limits_{f=1}^{F}\sum\limits_{n=1}^{N}d([\textbf{V}]_{fn}|[\textbf{WH}]_{fn})\textrm{,}

where :math:`[.]_{fn}` denotes the element in row :math:`f` and column :math:`n` of a matrix and :math:`d` is a scalar cost function. A common choice for the cost function is the :math:`\beta`-divergence [5]_ [6]_ [7]_, defined as follows:

.. math::

   d_{\beta}(x|y)\triangleq
   \begin{cases}
   \frac{1}{\beta(\beta -1)} \left(x^{\beta} + (\beta -1)y^{\beta} - \beta xy^{\beta-1}\right)&\beta\in\mathbb{R}\backslash\{0,1\}\\
   x\log\frac{x}{y} - x + y&\beta=1\\
   \frac{x}{y} -\log\frac{x}{y} - 1&\beta=0\textrm{.}
   \end{cases}

Popular cost functions such as the Euclidean distance, the generalised Kullback-Leibler (KL) divergence [8]_ and the Itakura-Saito (IS) divergence [9]_ are all particular cases of the :math:`\beta`-divergence (obtained for :math:`\beta=2`, :math:`1` and :math:`0`, respectively). The use of the :math:`\beta`-divergence for NMF has been studied extensively by Févotte and Idier [10]_.

In most cases the NMF problem is solved with a two-block coordinate descent approach in which the factors :math:`\textbf{W}` and :math:`\textbf{H}` are optimised alternately. The sub-problem in one factor can then be considered as a nonnegative least squares (NNLS) problem [11]_. The approach implemented here to solve these NNLS problems relies on the multiplicative update rules introduced by Lee and Seung [4]_ for the Euclidean distance and later generalised to the :math:`\beta`-divergence [10]_:

.. math::
   :label: upH

   \textbf{H}\gets \textbf{H}\odot\frac{\textbf{W}^T\left[(\textbf{W}\textbf{H})^{\beta-2}\odot\textbf{V}\right]} {\textbf{W}^T(\textbf{W}\textbf{H})^{\beta-1}}

.. math::
   :label: upW

   \textbf{W}\gets \textbf{W}\odot\frac{\left[(\textbf{W}\textbf{H})^{\beta-2}\odot\textbf{V}\right]\textbf{H}^T } {(\textbf{W}\textbf{H})^{\beta-1}\textbf{H}^T}\mathrm{,}

where :math:`\odot` is the element-wise (Hadamard) product, and the division and power operations are element-wise. The matrices :math:`\textbf{H}` and :math:`\textbf{W}` are then updated according to an alternating scheme.
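These alternating updates translate almost directly into code. Below is a minimal NumPy sketch of :eq:`upH` and :eq:`upW`; the function name ``beta_nmf``, its arguments and the random initialisation are illustrative assumptions made for this example and do not refer to the API of the groupNMF package.

.. code-block:: python

   # Minimal sketch of beta-divergence NMF with the multiplicative updates
   # above (upH, upW). All names are illustrative, not the package API.
   import numpy as np


   def beta_nmf(V, n_components, beta=1.0, n_iter=100, eps=1e-12, seed=0):
       """Factorise a nonnegative V (F x N) as W (F x K) times H (K x N)."""
       rng = np.random.default_rng(seed)
       F, N = V.shape
       W = rng.random((F, n_components)) + eps
       H = rng.random((n_components, N)) + eps
       for _ in range(n_iter):
           # Update H with W fixed; powers, products and divisions are element-wise.
           WH = W @ H + eps
           H *= (W.T @ (WH ** (beta - 2) * V)) / (W.T @ WH ** (beta - 1) + eps)
           # Update W with H fixed.
           WH = W @ H + eps
           W *= ((WH ** (beta - 2) * V) @ H.T) / (WH ** (beta - 1) @ H.T + eps)
       return W, H


   # Example: KL divergence (beta = 1) on a random nonnegative matrix.
   V = np.abs(np.random.default_rng(1).random((64, 200)))
   W, H = beta_nmf(V, n_components=10, beta=1.0, n_iter=200)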
Group NMF with class and session similarity constraints
++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In the approach presented above, the matrix factorisation is totally unsupervised and does not account for class variability or session variability. The approach introduced by Serizel *et al.* [1]_ aims to take these variabilities into account. It derives from group-NMF [2]_ and is inspired by exemplar-based approaches to the speaker identification problem [12]_.

NMF on speaker utterances for speaker identification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to better model speaker identity, we now consider the portion of :math:`\textbf{V}` recorded in a session :math:`s` in which only the speaker :math:`c` is active. This portion is denoted by :math:`\textbf{V}^{(cs)}`, its length is :math:`N^{(cs)}` and it can be decomposed according to :eq:`nmf`:

.. math::

   \textbf{V}^{(cs)} \approx \textbf{W}^{(cs)} \textbf{H}^{(cs)}\quad \forall\ (c,s) \in\mathcal{C}\times\mathcal{S}_c

under nonnegativity constraints.

.. figure:: NMFutt.png
   :align: center
   :scale: 60 %

   **Fig. 1** NMF on speaker utterances

We define a global cost function as the sum of all the local divergences:

.. math::
   :label: beta_glob

   J_{\mathrm{global}} = \sum\limits_{c = 1}^{C}\sum\limits_{s\in\mathcal{S}_c} D_{KL}(\textbf{V}^{(cs)} | \textbf{W}^{(cs)} \textbf{H}^{(cs)})\mathrm{.}

Each :math:`\textbf{V}^{(cs)}` can be decomposed independently with the standard multiplicative rules (:eq:`upH`, :eq:`upW`). The bases learnt on the training set are then concatenated to form a global basis, which is in turn used to produce features on the test sets.
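As a concrete illustration of this training procedure, the sketch below factorises each block :math:`\textbf{V}^{(cs)}` independently and concatenates the learnt bases into a global dictionary. It reuses the ``beta_nmf`` sketch above; the data layout (a Python dict keyed by speaker and session indices) is a hypothetical choice made for the example, not the groupNMF package API.

.. code-block:: python

   # Illustrative per-utterance training loop for the global cost above,
   # reusing the beta_nmf sketch. The (speaker, session) dict layout is
   # hypothetical and not the groupNMF package API.
   import numpy as np

   rng = np.random.default_rng(0)
   # Hypothetical training data: one nonnegative matrix per (speaker, session) pair.
   V_blocks = {(c, s): np.abs(rng.random((64, 120)))
               for c in range(3) for s in range(2)}

   bases = {}
   for (c, s), V_cs in V_blocks.items():
       # Each block is decomposed independently with the standard updates.
       W_cs, _ = beta_nmf(V_cs, n_components=10, beta=1.0, n_iter=200)
       bases[(c, s)] = W_cs

   # Concatenate the per-utterance bases into a global dictionary, later used
   # to extract activation features on the test data.
   W_global = np.concatenate(list(bases.values()), axis=1)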
Class and session similarity constraints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to take the session and speaker variabilities into account, we propose to further decompose the dictionaries :math:`\textbf{W}`, similarly to what was proposed by Lee et al. [4]_. The matrix :math:`\textbf{W}^{(cs)}` can indeed be arbitrarily decomposed as follows:

.. math::

   \textbf{W}^{(cs)} = \left[\right.\underset{\leftarrow K_{\mathrm{SPK}}\rightarrow} {\textbf{W}^{(cs)}_{\mathrm{SPK}}}|\underset{\leftarrow K_{\mathrm{SES}}\rightarrow}{\textbf{W}^{(cs)}_{\mathrm{SES}}}|\underset{\leftarrow K_{\mathrm{RES}}\rightarrow}{\textbf{W}^{(cs)}_{\mathrm{RES}}}\left.\right]

with :math:`K_{\mathrm{SPK}} + K_{\mathrm{SES}} + K_{\mathrm{RES}} = K` and where :math:`K_{\mathrm{SPK}}`, :math:`K_{\mathrm{SES}}` and :math:`K_{\mathrm{RES}}` are the numbers of components in the speaker-dependent bases, the session-dependent bases and the residual bases, respectively.

.. figure:: clssesres.png
   :align: center
   :scale: 60 %

   **Fig. 2** Basis decomposition into :math:`\textbf{W}^{(cs)}_{\mathrm{SPK}}`, :math:`\textbf{W}^{(cs)}_{\mathrm{SES}}` and :math:`\textbf{W}^{(cs)}_{\mathrm{RES}}`

The first target is to capture speaker variability. This amounts to finding speaker bases (:math:`\textbf{W}^{(cs)}_{\mathrm{SPK}}`) for each speaker :math:`c` that are as close as possible across all the sessions in which that speaker is present, leading to the constraint:

.. math::
   :label: cst_spk

   J_{\mathrm{SPK}} = \frac{1}{2}\sum\limits_{c=1}^{C}\sum\limits_{s\in\mathcal{S}_c}\sum\limits_{\substack{s_1\in\mathcal{S}_c \\s_1 \neq s}} \| \textbf{W}^{(cs)}_{\mathrm{SPK}} - \textbf{W}^{(cs_1)}_{\mathrm{SPK}}\|^2 < \alpha_1

where :math:`\|.\|` is the Euclidean (Frobenius) norm and :math:`\alpha_1` is the similarity constraint on the speaker-dependent bases.

.. figure:: clscst.png
   :align: center
   :scale: 60 %

   **Fig. 3** Speaker similarity constraint

The second target is to capture session variability. This can be accounted for by finding session bases (:math:`\textbf{W}^{(cs)}_{\mathrm{SES}}`) for each session :math:`s` that are as close as possible across all the speakers active in the session, leading to the constraint:

.. math::
   :label: cst_ses

   J_{\mathrm{SES}} = \frac{1}{2}\sum\limits_{s=1}^{S}\sum\limits_{c\in\mathcal{C}_s}\sum\limits_{\substack{c_1\in\mathcal{C}_s \\c_1 \neq c}} \| \textbf{W}^{(cs)}_{\mathrm{SES}} - \textbf{W}^{(c_1s)}_{\mathrm{SES}}\|^2 < \alpha_2

where :math:`\alpha_2` is the similarity constraint on the session-dependent bases.

.. figure:: sescst.png
   :align: center
   :scale: 60 %

   **Fig. 4** Session similarity constraint

The vectors composing the residual bases :math:`\textbf{W}^{(cs)}_{\mathrm{RES}}` are left unconstrained to represent characteristics that depend neither on the speaker nor on the session. Minimising the global divergence :eq:`beta_glob` subject to the constraints :eq:`cst_spk` and :eq:`cst_ses` is equivalent to the following problem:

.. math::
   :label: cost

   \min\limits_{\textbf{W},\textbf{H}} J_{\mathrm{global}} + \lambda_1 J_{\mathrm{SPK}} + \lambda_2 J_{\mathrm{SES}}\quad\mathrm{s.t.} \quad\textbf{W}\geq 0,\ \textbf{H}\geq 0\mathrm{,}

which in turn leads to the following multiplicative update rules for the dictionaries :math:`\textbf{W}^{(cs)}_{\mathrm{SPK}}` and :math:`\textbf{W}^{(cs)}_{\mathrm{SES}}`:

.. math::

   \textbf{W}_{\mathrm{SPK}}^{(cs)}&\leftarrow \textbf{W}_{\mathrm{SPK}}^{(cs)}\odot\frac{\left[(\textbf{W}^{(cs)}\textbf{H}^{(cs)})^{-1}\odot\textbf{V}^{(cs)}\right]{\textbf{H}_{\mathrm{SPK}}^{(cs)}}^T + \frac{\lambda_1}{2}\sum\limits_{\substack{s_1\in\mathcal{S}_c\\s_1\neq s}}\textbf{W}_{\mathrm{SPK}}^{(cs_1)}} {\textbf{1}{\textbf{H}_{\mathrm{SPK}}^{(cs)}}^T + \frac{\lambda_1}{2} \left(\mathrm{Card}(\mathcal{S}_c) - 1\right)\textbf{W}_{\mathrm{SPK}}^{(cs)}}\\
   \textbf{W}_{\mathrm{SES}}^{(cs)}&\leftarrow \textbf{W}_{\mathrm{SES}}^{(cs)}\odot\frac{\left[(\textbf{W}^{(cs)}\textbf{H}^{(cs)})^{-1}\odot\textbf{V}^{(cs)}\right]{\textbf{H}_{\mathrm{SES}}^{(cs)}}^T + \frac{\lambda_2}{2}\sum\limits_{\substack{c_1\in\mathcal{C}_s\\c_1\neq c}}\textbf{W}_{\mathrm{SES}}^{(c_1s)}} {\textbf{1}{\textbf{H}_{\mathrm{SES}}^{(cs)}}^T + \frac{\lambda_2}{2} \left(\mathrm{Card}(\mathcal{C}_s) - 1\right)\textbf{W}_{\mathrm{SES}}^{(cs)}}

where :math:`\textbf{1}` is an :math:`F\times N` matrix of ones. We obtained these update rules using the well-known heuristic that consists in expressing the gradient of the cost function :eq:`cost` as the difference between a positive contribution and a negative contribution; the multiplicative update then takes the form of the quotient of the negative contribution by the positive contribution. The update rule for :math:`\textbf{W}^{(cs)}_{\mathrm{RES}}` is similar to the standard rule:

.. math::

   \textbf{W}_{\mathrm{RES}}^{(cs)}\leftarrow \textbf{W}_{\mathrm{RES}}^{(cs)}\odot\frac{\left[(\textbf{W}^{(cs)}\textbf{H}^{(cs)})^{-1}\odot\textbf{V}^{(cs)}\right]{\textbf{H}_{\mathrm{RES}}^{(cs)}}^T }{\textbf{1}{\textbf{H}_{\mathrm{RES}}^{(cs)}}^T}\mathrm{.}

Note that the update rules for the activations (:math:`\textbf{H}^{(cs)}`) are left unchanged.
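As an illustration, the sketch below performs one such constrained update of the speaker-dependent bases for a single pair :math:`(c, s)` in the KL case (:math:`\beta = 1`). The function name ``update_w_spk``, the slice-based layout of :math:`\textbf{W}^{(cs)}` and :math:`\textbf{H}^{(cs)}`, and the list of speaker bases from the other sessions are assumptions made for the example and do not correspond to the groupNMF package API.

.. code-block:: python

   # Illustrative single update of the speaker-dependent bases W_SPK^{(cs)}
   # with the speaker similarity constraint (KL case, beta = 1).
   # Names and data layout are hypothetical, not the package API.
   import numpy as np


   def update_w_spk(V_cs, W_cs, H_cs, spk_slice, other_spk_bases, lambda1, eps=1e-12):
       """One constrained multiplicative update of the speaker-dependent bases.

       ``other_spk_bases`` holds W_SPK^{(c s1)} for every other session s1 of
       speaker c, i.e. Card(S_c) - 1 matrices of shape (F, K_SPK).
       """
       F, N = V_cs.shape
       W_spk = W_cs[:, spk_slice]
       H_spk = H_cs[spk_slice, :]
       WH = W_cs @ H_cs + eps
       # Negative (numerator) and positive (denominator) parts of the gradient.
       num = (V_cs / WH) @ H_spk.T + 0.5 * lambda1 * sum(other_spk_bases)
       den = (np.ones((F, N)) @ H_spk.T
              + 0.5 * lambda1 * len(other_spk_bases) * W_spk + eps)
       W_cs[:, spk_slice] = W_spk * num / den
       return W_cs


   # Example call on random data (the speaker bases occupy the first K_spk columns).
   rng = np.random.default_rng(0)
   F, N, K_spk, K = 64, 120, 10, 30
   V_cs = np.abs(rng.random((F, N)))
   W_cs = np.abs(rng.random((F, K))) + 1e-3
   H_cs = np.abs(rng.random((K, N))) + 1e-3
   others = [np.abs(rng.random((F, K_spk))) for _ in range(2)]
   W_cs = update_w_spk(V_cs, W_cs, H_cs, slice(0, K_spk), others, lambda1=0.1)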
Download
++++++++

Source code is available at https://github.com/rserizel/groupNMF

Citation
++++++++

If you are using this source code please consider citing the following paper:

.. topic:: Reference

   R. Serizel, S. Essid, and G. Richard, “Group nonnegative matrix factorisation with speaker and session variability compensation for speaker identification,” in *Proc. of ICASSP*, pp. 5470-5474, 2016.

.. topic:: Bibtex

   ::

      @inproceedings{serizel2016group,
        title={Group nonnegative matrix factorisation with speaker and session variability compensation for speaker identification},
        author={Serizel, Romain and Essid, Slim and Richard, Ga{\"e}l},
        booktitle={2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
        pages={5470--5474},
        year={2016},
        organization={IEEE}
      }

References
++++++++++

.. [#] R. Serizel, S. Essid, and G. Richard, “Group nonnegative matrix factorisation with speaker and session variability compensation for speaker identification,” in *Proc. of ICASSP*, pp. 5470-5474, 2016.
.. [#] H. Lee and S. Choi, “Group nonnegative matrix factorization for EEG classification,” in *Proc. of AISTATS*, pp. 320–327, 2009.
.. [#] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” *IEEE Transactions on Audio, Speech and Language Processing*, vol. 15, pp. 1435–1447, 2007.
.. [#] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” *Nature*, vol. 401, no. 6755, pp. 788–791, 1999.
.. [#] A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones, “Robust and efficient estimation by minimising a density power divergence,” *Biometrika*, vol. 85, no. 3, pp. 549–559, 1998.
.. [#] A. Cichocki and S. Amari, “Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities,” *Entropy*, vol. 12, no. 6, pp. 1532–1568, 2010.
.. [#] S. Eguchi and Y. Kano, “Robustifing maximum likelihood estimation,” Research Memo 802, Institute of Statistical Mathematics, June 2001.
.. [#] S. Kullback and R. A. Leibler, “On information and sufficiency,” *The Annals of Mathematical Statistics*, pp. 79–86, 1951.
.. [#] F. Itakura, “Minimum prediction residual principle applied to speech recognition,” *IEEE Transactions on Acoustics, Speech and Signal Processing*, vol. 23, no. 1, pp. 67–72, 1975.
.. [#] C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the beta-divergence,” *Neural Computation*, vol. 23, no. 9, pp. 2421–2456, 2011.
.. [#] N. Gillis, “The why and how of nonnegative matrix factorization,” in *Regularization, Optimization, Kernels, and Support Vector Machines*, M. Signoretto, J.A.K. Suykens, and A. Argyriou, Eds., Machine Learning and Pattern Recognition Series, pp. 257–291. Chapman & Hall/CRC, 2014.
.. [#] A. Hurmalainen, R. Saeidi, and T. Virtanen, “Noise robust speaker recognition with convolutive sparse coding,” in *Proc. of Interspeech*, 2015.