Supervised group NMF

In [1], we adapted a supervised matrix factorization model known as Task-driven Dictionary Learning (TDL) [4] to suit the ASC task. We introduced a variant of the TDL model in its nonnegative formulation [5], including a modification of the original algorithm, in which a nonnegative dictionary is jointly learned with a multi-class classifier. In [2], we proposed a new formulation of the Group nonnegative matrix factorisation (GNMF) method. Using the Euclidean distance as the divergence for the GNMF problem, the dictionary learning based on GNMF is integrated into a supervised framework inspired by TDL [4].

GNMF with speaker and session similarity

Details and source code for GNMF can be found in [3].

Task-driven NMF based dictionary learning

TDL [4] has recently been applied with nonnegativity constraints to perform speech enhancement [5] or acoustic scene classification, where temporally integrated projections are classified with multinomial logistic regression [1]. In [2], we extended the latter approach to the GNMF case.

Task-driven NMF

The general idea of nonnegative TDL, or task-driven NMF (TNMF), is to unify the dictionary learning with NMF and the training of the classifier in a joint optimization problem [1], [5]. Influenced by the classifier, the basis vectors are encouraged to explain the discriminative information in the data while keeping a low reconstruction cost. The TNMF model first considers the optimal projections \(\textbf{h}^{\star}(\textbf{v},\textbf{W})\) of the data points \(\textbf{v}\) on the dictionary \(\textbf{W}\), which are defined as solutions of the nonnegative elastic-net problem [6], expressed as:

(1)\[\textbf{h}^{\star}(\textbf{v},\textbf{W}) = \mathop{\mathrm{arg\,min}}_{\textbf{h} \in \mathbb{R}_+^{K}}\ \frac{1}{2}\|\textbf{v}-\textbf{W}\textbf{h}\|_{2}^{2}+\lambda_{1}\|\textbf{h}\|_{1}+\frac{\lambda_{2}}{2}\|\textbf{h}\|_{2}^{2},\]

where \(\lambda_{1}\) and \(\lambda_{2}\) are nonnegative regularization parameters. Given a data segment \(\textbf{V}^{(l)}\) of \(M\) frames, associated with a label \(y\) in a fixed set of labels \(\mathcal{Y}\), we want to classify the mean of the projections of the data points \(\textbf{v}_{m}^{(l)}\) belonging to the segment \(l\), such that \(\textbf{V}^{(l)}=[\textbf{v}_{0}^{(l)},...,\textbf{v}_{M-1}^{(l)}]\). We define \(\hat{\textbf{h}}^{(l)}\) as the averaged projection of \(\textbf{V}^{(l)}\) on the dictionary, where \(\hat{\textbf{h}}^{(l)}=\frac{1}{M}\sum_{m=0}^{M-1}\textbf{h}^{\star}(\textbf{v}_{m}^{(l)},\textbf{W})\). The corresponding classification loss (here using multinomial logistic regression) is written \(l_{s}(y,\textbf{A},\hat{\textbf{h}}^{(l)})\), where \(\textbf{A}\in \mathcal{A}\) are the parameters of the classifier. A toy sketch of the projection step is given below.
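
As an illustration only, the following NumPy sketch approximates \(\textbf{h}^{\star}(\textbf{v},\textbf{W})\) with NMF-style multiplicative updates, assuming a nonnegative input vector and dictionary. The function name, regularization values, and iteration count are hypothetical; the solver used in the repository may differ.

```python
import numpy as np

def project(v, W, lambda1=0.1, lambda2=0.1, n_iter=200, eps=1e-12):
    """Approximate the nonnegative elastic-net projection h*(v, W) of Eq. (1).

    Minimizes 0.5*||v - W h||^2 + lambda1*||h||_1 + 0.5*lambda2*||h||_2^2
    over h >= 0, using NMF-style multiplicative updates.
    Assumes v >= 0 and W >= 0, so all update factors stay nonnegative.
    """
    K = W.shape[1]
    h = np.full(K, 1.0 / K)   # nonnegative initialization
    WtV = W.T @ v             # constant numerator term
    WtW = W.T @ W             # precomputed Gram matrix
    for _ in range(n_iter):
        # Multiplicative update: h stays nonnegative by construction.
        h *= WtV / (WtW @ h + lambda1 + lambda2 * h + eps)
    return h
```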

The TNMF problem is then expressed as a joint minimization of the expected classification loss over \(\textbf{W}\) and \(\textbf{A}\):

\[\min_{\textbf{W} \in \mathcal{W}, \textbf{A} \in \mathcal{A}} f(\textbf{W},\textbf{A})+\frac{\nu}{2}\|\textbf{A}\|_{2}^{2},\]

with

(2)\[f(\textbf{W},\textbf{A}) = \mathbb{E}_{y,\textbf{V}^{(l)}}\left[l_{s}(y,\textbf{A},\hat{\textbf{h}}^{(l)}(\textbf{V}^{(l)},\textbf{W}))\right].\]

Here, \(\mathcal{W}\) is defined as the set of nonnegative dictionaries with unit \(l_{2}\)-norm basis vectors, and \(\nu\) is a regularization parameter on the classifier parameters, meant to prevent over-fitting. The problem in equation (2) is optimized with mini-batch stochastic gradient descent, as described in Bisot et al. [1].
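
To make the optimization concrete, below is a simplified sketch of one such gradient step on a single labelled segment, reusing the `project` function sketched above. The dictionary gradient follows the implicit differentiation of the elastic-net solution derived in [4]. The function name, step size `rho`, the omission of mini-batching and of a classifier bias term are all simplifications for illustration, not the reference implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def tnmf_sgd_step(V_seg, y, W, A, lambda2=0.1, rho=1e-3, nu=1e-4):
    """One (hypothetical) SGD step of TNMF on a labelled segment V_seg (F x M).

    W is the F x K nonnegative dictionary with unit-norm columns,
    A the K x C multinomial logistic regression weights, y the class index.
    """
    F, M = V_seg.shape
    # Projections of every frame on the current dictionary (Eq. (1)).
    H = np.stack([project(V_seg[:, m], W) for m in range(M)], axis=1)
    h_bar = H.mean(axis=1)                    # averaged projection of the segment
    p = softmax(A.T @ h_bar)                  # class posteriors
    p[y] -= 1.0                               # p now holds d l_s / d(A^T h_bar)
    grad_A = np.outer(h_bar, p) + nu * A      # classifier gradient + l2 penalty
    grad_h = (A @ p) / M                      # per-frame gradient through the mean
    # Dictionary gradient via implicit differentiation of Eq. (1), cf. [4].
    grad_W = np.zeros_like(W)
    for m in range(M):
        h = H[:, m]
        active = h > 0                        # active set of the elastic-net solution
        if not active.any():
            continue
        beta = np.zeros_like(h)
        WL = W[:, active]
        beta[active] = np.linalg.solve(
            WL.T @ WL + lambda2 * np.eye(active.sum()), grad_h[active])
        grad_W += np.outer(V_seg[:, m] - W @ h, beta) - W @ np.outer(beta, h)
    # Gradient steps, then project W back onto the constraint set.
    A = A - rho * grad_A
    W = np.maximum(W - rho * grad_W, 0)       # nonnegativity
    W = W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-12)  # unit l2 columns
    return W, A
```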

Task-driven GNMF

In task-driven GNMF (TGNMF), we propose to jointly perform the dictionary learning based on GNMF [2] and the training of a multinomial logistic regression classifier. The dictionary \(\textbf{W}\) is then the concatenation of all the sub-dictionaries \(\textbf{W}^{(cs)}\), where \(\textbf{W}^{(cs)}\) is the sub-dictionary associated with speaker \(c\) and session \(s\) [2], and the optimal projections \(\textbf{h}^{\star}(\textbf{v},\textbf{W})\) are the solutions of (1).

Including the similarity constraints, the TGNMF problem is thus expressed as the following minimization:

\[\min_{\textbf{W}\in \mathcal{W}, \textbf{A} \in \mathcal{A}} f(\textbf{W},\textbf{A})+\frac{\mu}{2}\|\textbf{A}\|_{2}^{2} + \nu_1 J_{\mathrm{SPK}} + \nu_2 J_{\mathrm{SES}},\]

with \(f(\textbf{W},\textbf{A})\) as defined above and \(J_{\mathrm{SPK}}\), \(J_{\mathrm{SES}}\) the speaker and session similarity constraints introduced in [2]. The problem is again optimized with mini-batch stochastic gradient descent. However, as opposed to the previous algorithm, for each data point \(\textbf{v}\) belonging to a particular \(\textbf{V}^{(cs)}\), only the corresponding sub-dictionary \(\textbf{W}^{(cs)}\) is updated, while the other sub-dictionaries are left unchanged in order to match the GNMF adaptation scheme [2].
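
For illustration, here is a minimal sketch of this selective update, assuming (hypothetically) that each sub-dictionary \(\textbf{W}^{(cs)}\) occupies a known set of columns of \(\textbf{W}\); the gradients of the similarity terms \(J_{\mathrm{SPK}}\) and \(J_{\mathrm{SES}}\) are omitted for brevity.

```python
import numpy as np

def update_subdictionary(W, grad_W, cols_cs, rho=1e-3):
    """Gradient step restricted to the sub-dictionary W^(cs).

    W:       F x K concatenation of all sub-dictionaries.
    cols_cs: indices of the columns of W belonging to W^(cs) (assumed layout).
    """
    W = W.copy()
    W[:, cols_cs] -= rho * grad_W[:, cols_cs]      # update W^(cs) only
    W[:, cols_cs] = np.maximum(W[:, cols_cs], 0)   # keep the dictionary nonnegative
    norms = np.linalg.norm(W[:, cols_cs], axis=0, keepdims=True)
    W[:, cols_cs] /= np.maximum(norms, 1e-12)      # renormalize to unit l2 norm
    return W
```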

Download

Source code available at https://github.com/rserizel/TGNMF

Getting Started

A short example is available at https://github.com/rserizel/TGNMF/blob/master/TGNMF_howto.ipynb

Citation

If you are using this source code, please consider citing one of the following papers:


[1]
  V. Bisot, R. Serizel, S. Essid, and G. Richard. “Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification”. Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing, 2017.
[2]
  R. Serizel, V. Bisot, S. Essid, and G. Richard. “Supervised group nonnegative matrix factorisation with similarity constraints and applications to speaker identification”. In Proc. of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
@article{bisot2017TASLP,
title={Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification},
author={Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Ga{\"e}l},
journal={IEEE Transactions on Audio, Speech and Language Processing},
pages={14},
year={2017},
publisher={IEEE}
}
@inproceedings{serizel2017ICASSP,
title={Supervised group nonnegative matrix factorisation with similarity constraints and applications to speaker identification},
author={Serizel, Romain and Bisot, Victor and Essid, Slim and Richard, Ga{\"e}l},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={5},
year={2017},
organization={IEEE}
}

References

[3]
  R. Serizel, S. Essid, and G. Richard. “Group nonnegative matrix factorisation with speaker and session variability compensation for speaker identification”. In Proc. of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5470-5474, 2016.
[4]
  J. Mairal, F. Bach, and J. Ponce. “Task-driven dictionary learning”. In IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 34(4), pp. 791–804, 2012.
[5]
  P. Sprechmann, A. M. Bronstein, and G. Sapiro. “Supervised non-Euclidean sparse NMF via bilevel optimization with applications to speech enhancement”. In Proc. Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), pp. 11-15, 2014.
[6]
  H. Zou and T. Hastie. “Regularization and variable selection via the elastic net”. In Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol 67(2), pp. 301–320, 2005.