Saturday, December 10, 2011

Learning content similarity for music recommendation

Learning content similarity for music recommendation, TASLP 2011
by Brian McFee, Luke Barrington, and Gert Lanckriet

  • Metric learning to rank (MLR): minimize tr(W) + (C/n) Σ_q ξ_q, subject to ⟨W, ψ(q, y*_q) − ψ(q, y)⟩ ≥ Δ(y*_q, y) − ξ_q for every query q and ranking y; the objective is solved by cutting-plane optimization (i.e., a structured SVM). The full problem is written out after this list.
  • Use the top-τ codeword histogram over MFCCs as the song-level feature (a minimal sketch of this feature follows this list)
    • Motivation for top-τ (a form of soft assignment): counteract quantization errors
    • Experimental results show this reduces the number of codewords needed
  • Represent each histogram in a probability product kernel (PPK) space to better exploit the geometry of codeword histograms.
    • Leads to better accuracy
  • Visualize result using t-SNE (http://homepage.tudelft.nl/19j49/t-SNE.html)
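For reference, here is the MLR optimization problem written out in full, matching the trace-regularized structured-SVM form summarized above; the ξ_q are the per-query slack variables (the "err_q" shorthand), ψ(q, y) is the partial-order feature map for query q under ranking y, and Δ is a ranking loss such as 1 − AUC or 1 − MAP.

```latex
% MLR: learn a PSD metric W so that, for each query q, the true ranking y_q^*
% beats every other ranking y by a margin given by the ranking loss \Delta.
\begin{aligned}
\min_{W \succeq 0,\; \xi \ge 0} \quad
  & \operatorname{tr}(W) + \frac{C}{n} \sum_{q=1}^{n} \xi_q \\
\text{s.t.} \quad
  & \bigl\langle W,\ \psi(q, y_q^*) - \psi(q, y) \bigr\rangle_F
    \;\ge\; \Delta(y_q^*, y) - \xi_q
    \qquad \forall q,\ \forall y \in \mathcal{Y}.
\end{aligned}
```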
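And a minimal sketch of the song-level feature, assuming an already-trained MFCC codebook: each frame votes uniformly for its τ nearest codewords (the soft assignment that counteracts quantization errors), the votes are pooled into a normalized histogram, and the PPK mapping (with exponent 1/2) is simply the element-wise square root, after which the probability product kernel is a plain dot product. Function names, the uniform 1/τ voting, and the exponent choice are my assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def top_tau_histogram(mfcc_frames, codebook, tau=5):
    """Soft-assignment codeword histogram: each frame votes for its tau
    nearest codewords, which counteracts hard quantization errors.

    mfcc_frames: (n_frames, d) array of MFCC vectors
    codebook:    (n_codewords, d) array of codeword centroids
    """
    # Squared Euclidean distances between every frame and every codeword.
    dists = ((mfcc_frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    hist = np.zeros(len(codebook))
    for row in dists:
        nearest = np.argsort(row)[:tau]   # indices of the tau closest codewords
        hist[nearest] += 1.0 / tau        # uniform vote split among them (assumed)
    return hist / hist.sum()              # normalize to a probability vector

def ppk_map(hist):
    """Map a codeword histogram into PPK space (exponent 1/2): the
    probability product kernel then becomes a plain dot product."""
    return np.sqrt(hist)

# Usage: the metric learned by MLR is applied to ppk_map(hist) vectors.
```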

Approximate nearest-subspace representations for sound mixtures

Approximate nearest-subspace representations for sound mixtures, ICASSP 2011
by Paris Smaragdis
  • Unlike the single-source case, where we search for a nearest point, we now search for two points, one from each source dictionary, that form a subspace passing closest to our input.
  • To address this problem efficiently, we recast this search as a sparse coding problem (a rough sketch of the search follows this list).
  • To consider mixtures, we make the assumption that when sounds mix, their magnitude spectra superimpose linearly. Although this is not exactly true, it is an assumption used frequently by the source separation community and is generally accepted as approximately true.
  • The use of Euclidean distance inside the spectral composition simplex implies that we are making a Gaussian distribution assumption for the spectral composition frames.
  • A more appropriate distribution in this space is the Dirichlet distribution [2], which is explicitly defined on a simplex and is used to describe compositional data like the ones we have. 
  • Examining the log-likelihood of this model shows that it reduces to a cross-entropy between the two composition vectors, which information theory tells us is an appropriate measure for comparing two probability vectors (the standard formula is recalled after this list).
  • It is in principle invariant to the number of sources since any mixture problem can be seen as a binary segmentation between a target and an interference (the only complication of having many sources being the increased probability of the target and the interference overlapping in the frequency composition simplex). Other factors such as reverberation and propagation effects are also not an issue as long as they don’t color the sources enough to significantly change their spectral composition (not an observed problem in general).
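A rough sketch of the two-source search described above, as I read it rather than as the paper implements it: given dictionaries of normalized magnitude spectra for the two sources, the naive version exhaustively tries one atom from each dictionary and keeps the pair whose non-negative combination best explains the input frame; the paper's point is that this search can be recast and approximated efficiently as a sparse coding problem over the concatenated dictionaries. All names, and the use of Euclidean/NNLS fitting instead of the Dirichlet/cross-entropy measure discussed above, are simplifications.

```python
import numpy as np
from scipy.optimize import nnls

def nearest_two_source_subspace(x, dict_a, dict_b):
    """Exhaustive version of the two-source nearest-subspace search.

    x:      (d,) input magnitude-spectrum frame
    dict_a: (n_a, d) dictionary of source-A spectra
    dict_b: (n_b, d) dictionary of source-B spectra

    Returns the indices of one atom per dictionary whose non-negative
    combination passes closest (in Euclidean terms) to the input.
    """
    best_i, best_j, best_err = None, None, np.inf
    for i, a in enumerate(dict_a):
        for j, b in enumerate(dict_b):
            basis = np.stack([a, b], axis=1)  # (d, 2) basis of the candidate subspace
            _, residual = nnls(basis, x)      # non-negative projection of x onto it
            if residual < best_err:
                best_i, best_j, best_err = i, j, residual
    return best_i, best_j, best_err
```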
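For the Dirichlet point above, the cross-entropy it reduces to is the standard information-theoretic quantity below; which of the two composition vectors plays the role of p (observation) and q (model reconstruction) follows the paper's Dirichlet parameterization, so treat the assignment here as illustrative.

```latex
% Cross-entropy between two probability vectors p and q on the simplex;
% up to constants, the Dirichlet log-likelihood of one composition under
% a model centered on the other takes this form.
H(p, q) \;=\; -\sum_{i} p_i \log q_i
```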

Friday, December 9, 2011

Contextual tag inference

Contextual tag inference, TOMCCAP 2011
by M. I. Mandel et al.
  • We show that users agree more on tags applied to clips temporally “closer” to one another; that conditional restricted Boltzmann machine models of tags can more accurately predict related tags when they take context into account
  • and that when training data is “smoothed” using context, support vector machines can better rank these clips according to the original, unsmoothed tags, and do so more accurately than three standard multi-label classifiers (a rough sketch of the smoothing idea follows this list)
  • This article discusses and tests two different kinds of tag language models, one based on an information-theoretic formulation of this inference [Schifanella et al. 2010], and the second based on restricted Boltzmann machines (RBMs) [Mandel et al. 2010; 2011].
  • Assuming that tags applied to an artist apply equally well to all of the clips of music that the artist has released (as is commonly done [Bertin-Mahieux et al. 2008]) implies that up to 50% noise is being introduced into those tags
  • A visual scene might be analogous to a musical genre, as the priors over instruments, moods, etc. found in a song should depend on the genre of the song.
  • Spatial context in images could correspond to temporal context in music
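
A rough sketch of the context-based “smoothing” idea, as I interpret the summary above and not as the paper's exact procedure: a clip's tag vector is blended with the tag vectors of temporally nearby clips from the same track before training per-tag rankers, reflecting the observation that users agree more on tags for temporally close clips. The Gaussian weighting, the mixing parameter, and all names are assumptions.

```python
import numpy as np

def smooth_tags_with_context(tag_matrix, clip_times, width=2.0, alpha=0.5):
    """Blend each clip's binary tag vector with those of temporally nearby
    clips from the same track, so tags applied to one moment inform its
    neighbors.

    tag_matrix: (n_clips, n_tags) 0/1 matrix of tags applied to each clip
    clip_times: (n_clips,) start time (seconds) of each clip within the track
    width:      bandwidth of the temporal weighting, in seconds (assumed)
    alpha:      how much neighbor evidence to mix in (0 = no smoothing)
    """
    # Gaussian weights on the time gaps between clips; a clip gets no self-weight.
    gaps = clip_times[:, None] - clip_times[None, :]
    w = np.exp(-0.5 * (gaps / width) ** 2)
    np.fill_diagonal(w, 0.0)
    w /= w.sum(axis=1, keepdims=True) + 1e-12   # normalize neighbor weights per clip
    context = w @ tag_matrix                    # weighted average of neighbors' tags
    return (1 - alpha) * tag_matrix + alpha * context
```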