Saturday, December 10, 2011

Learning content similarity for music recommendation

Learning content similarity for music recommendation, TASLP 2011
by Brian McFee, Luke Barrington, and Gert Lanckriet

  • Metric learning to rank (MLR): min_W tr(W) + (C/n) \sum_q err_q, s.t. <W, \psi(q, y_q) - \psi(q, y)> >= \Delta(y_q, y) - err_q for every query q and ranking y; solve this objective by cutting-plane optimization (i.e., as a structured SVM)
  • Use top-\tau codeword histograms over MFCCs as features 
    • Motivation for top-\tau (a form of soft assignment): counteract quantization errors 
    • Experiments show this reduces the number of codewords needed
  • Represent each histogram in a probability product kernel (PPK) space to better exploit the geometry of codeword histograms.
    • Leads to better accuracy
  • Visualize result using t-SNE (http://homepage.tudelft.nl/19j49/t-SNE.html)
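A minimal numpy sketch of the two representation steps above, the top-\tau soft-assignment histogram and the PPK (p = 1/2) map; the frame/codebook shapes and \tau here are illustrative, not the paper's setup:

```python
import numpy as np

def top_tau_histogram(frames, codebook, tau=3):
    """Soft-assignment histogram: each frame votes 1/tau for its tau
    nearest codewords, counteracting hard-VQ quantization errors."""
    # pairwise squared distances: frames (n, d) vs. codebook (k, d)
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :tau]          # (n, tau)
    hist = np.zeros(len(codebook))
    for row in nearest:
        hist[row] += 1.0 / tau
    return hist / len(frames)                          # sums to 1

def ppk_map(hist):
    """PPK (p = 1/2) feature map: sqrt of the histogram, so a plain
    inner product equals the probability product (Bhattacharyya) kernel."""
    return np.sqrt(hist)
```

With p = 1/2 the PPK inner product of two histograms is their Bhattacharyya coefficient, which is why a linear metric learned by MLR in this space respects the simplex geometry of the histograms.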

Approximate nearest-subspace representations for sound mixtures

Approximate nearest-subspace representations for sound mixtures, ICASSP 2011
by Paris Smaragdis
  • Unlike the single-source case, where we search for a nearest point, we now search for two points, one from each source dictionary, that form the subspace passing closest to our input. 
  • To address this problem efficiently, the search is recast as a sparse coding problem.
  • To handle mixtures, we make the assumption that when sounds mix, their magnitude spectra superimpose linearly. Although this is not exactly true, the assumption is used frequently by the source separation community and is generally accepted as approximately true.
  • The use of Euclidean distance inside the spectral composition simplex implies that we are making a Gaussian distribution assumption for the spectral composition frames.
  • A more appropriate distribution in this space is the Dirichlet distribution [2], which is explicitly defined on a simplex and is used to describe compositional data like the ones we have. 
  • Examining the log likelihood of this model, it resolves to the formula for cross-entropy, which information theory tells us is an appropriate measure for comparing two probability vectors.
  • It is in principle invariant to the number of sources since any mixture problem can be seen as a binary segmentation between a target and an interference (the only complication of having many sources being the increased probability of the target and the interference overlapping in the frequency composition simplex). Other factors such as reverberation and propagation effects are also not an issue as long as they don’t color the sources enough to significantly change their spectral composition (not an observed problem in general).
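The cross-entropy view on the spectral simplex suggests a simple way to estimate activations over the concatenated source dictionaries. A hedged sketch using the standard multiplicative update for the generalized-KL cost with a fixed dictionary; this is a stand-in illustration, not the paper's solver, and the shapes are assumptions:

```python
import numpy as np

def activations_kl(v, W, n_iter=200, eps=1e-12):
    """Estimate nonnegative activations h with W @ h ~ v under the
    generalized KL / cross-entropy cost, via multiplicative updates.
    W concatenates the dictionaries of both sources column-wise."""
    v = v / v.sum()                        # work on the spectral simplex
    h = np.full(W.shape[1], 1.0 / W.shape[1])
    col_sums = W.sum(axis=0)
    for _ in range(n_iter):
        ratio = v / (W @ h + eps)          # elementwise v / (Wh)
        h *= (W.T @ ratio) / (col_sums + eps)
    return h
```

The largest activations in each dictionary's half of h then pick the atom per source whose subspace passes closest to the input, in the cross-entropy sense.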

Friday, December 9, 2011

Contextual tag inference

Contextual tag inference, tomccap 2011
by M. I. Mandel et al.
  • We show that users agree more on tags applied to clips temporally “closer” to one another, and that conditional restricted Boltzmann machine models of tags can more accurately predict related tags when they take context into account.
  • When training data is “smoothed” using context, support vector machines can better rank these clips according to the original, unsmoothed tags, and do so more accurately than three standard multi-label classifiers.
  • This article discusses and tests two different kinds of tag language models, one based on an information-theoretic formulation of this inference [Schifanella et al. 2010], and the second based on restricted Boltzmann machines (RBMs) [Mandel et al. 2010; 2011].
  • Assuming that tags applied to an artist apply equally well to all of the clips of music that the artist has released (as is done commonly [Bertin-Mahieux et al. 2008]) implies that up to 50% noise is being introduced in those tags
  • A visual scene might be analogous to a musical genre, as the priors over instruments, moods, etc. found in a song should depend on the genre of the song.
  • Spatial context in images could correspond to temporal context in music

Friday, November 25, 2011

A Connotative Space for Supporting Movie Affective Recommendation

A Connotative Space for Supporting Movie Affective Recommendation, Sergio Benini, Luca Canini, and Riccardo Leonardi, IEEE TMM 2011

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5962360&tag=1

"there are at least three possible levels of description for a given object, a video in our case: the denotative meaning (what is the described concept), the connotative one (by which terms the concept is described), and the affective response (how the concept is perceived by a person)."

"Connotation is essential in cinematography, as in any other art discipline. It is given by the set of conventions (such as editing, music, mise-en-scene elements, color, sound, lighting, etc.) ..."

"using connotation properties can be more reliable than exploiting emotional annotations by other users."

"A set of conventions, known as film grammar [22], governs the relationships between these elements and influences how the meanings conveyed by the director are inferred by the audience."

"borrowing the theoretical approach from art and design, ... the affective meaning of a movie varies along three axes which account for the natural (warm/cold), temporal (dynamic/slow), and energetic (energetic/minimal) dimension"

"For self-assessment, the emotation wheel is preferred to other models, such as PAD, since it is simpler for the users to provide a unique emotional label than to express their emotional state by a combination of values of pleasure, arousal, and dominance."

"Exploiting distances between emotions, for each scene, we then turn emotations into a 1-to-5 bipolar scale by unfolding the wheel only on the five most voted contiguous emotions, as shown in Fig. 8."

"Moreover, the choice of discarding, separately for each scene, the three least voted contiguous emotions is supported by Osgood,..."
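The unfolding step quoted above amounts to a maximum-sum contiguous window on the wheel: keep the five most voted contiguous emotions (equivalently, discard the three least voted contiguous ones on an eight-emotion wheel) and read them off as positions 1 to 5 of a bipolar scale. An illustrative reconstruction; the window scan and tie-breaking here are my assumptions:

```python
def unfold_wheel(votes, keep=5):
    """votes: per-scene vote counts over the wheel's emotions, in wheel
    order. Returns (start, window): the starting index and vote counts
    of the `keep` contiguous emotions with the most votes, which are
    then read as positions 1..keep of a bipolar scale."""
    n = len(votes)
    window_sum = lambda s: sum(votes[(s + i) % n] for i in range(keep))
    start = max(range(n), key=window_sum)   # ties -> smallest index
    return start, [votes[(start + i) % n] for i in range(keep)]
```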

Tuesday, November 15, 2011

Law of two-and-a-half

http://jhimusic.com/blog/?p=121



Emotion representation, analysis, and synthesis in continuous space: a survey

"Emotions are complex constructs with fuzzy boundaries and with substantial individual variations in expression and experience."

"To guarantee a more complete description of affective colouring, some researchers include expectation (the degree of anticipating or being taken unaware) as the fourth dimension, and intensity (how far a person is away from a state of pure, cool rationality) as the fifth dimension."

"Despite the existence of diverse affect models, search for optimal low-dimensional representation of affect, for analysis and synthesis, and for each modality or cue, remains open."

"While visual signals appear to be better for interpreting valence, audio signals seem to be better for interpreting arousal."

"There are also spin-off companies emerging out of collaborative research at well-known universities (e.g., Affectiva, established by R. Picard and colleagues of the MIT Media Lab)."

Monday, November 7, 2011

Music Discovery with Social Networks

by Cédric S. Mesnage et al., WOMRAD 2011

  • Study "social shuffle" (or flooding, diffusion) over Facebook by using the Starnet app on FB
  • Definition of a successful music discovery: "it occurs when the user of the application likes a track that s/he has never heard before."
  • Conclusion: social recom > non-social recom > random recom
  • Prototype system: apps.facebook.com/music_valley

Friday, November 4, 2011

Exploring Automatic Music Annotation with Acoustically-Objective Tags

Exploring Automatic Music Annotation with Acoustically-Objective Tags
by Derek Tingle, Youngmoo E. Kim, and Douglas Turnbull, MIR 2010

http://cosmal.ucsd.edu/cal/projects/CAL10K/

  • consists of 10,870 songs annotated using a vocabulary of 475 acoustic tags and 153 genre tags from Pandora’s Music Genome Project
  • use Echo Nest API for feature extraction
  • train on cal10k and test on cal500
the 55 overlapping tags between the vocabularies of cal10k and cal500:

  1. acoustic
  2. acoustic guitar
  3. aggressive
  4. alternative
  5. ambient sounds
  6. bebop
  7. bluegrass
  8. blues
  9. breathy
  10. call and response
  11. catchy
  12. classic rock
  13. cool jazz
  14. country
  15. dance pop
  16. danceable
  17. distorted electric guitar
  18. drum set
  19. duet
  20. electric
  21. electric blues
  22. electronica
  23. emotional
  24. female lead vocals
  25. folk
  26. funk
  27. gospel
  28. gravelly
  29. hand drums
  30. harmonica
  31. heavy beat
  32. hip hop
  33. jazz
  34. light beat
  35. low pitched
  36. major
  37. male lead vocals
  38. mellow
  39. minor
  40. organ
  41. piano
  42. pop
  43. punk
  44. r&b
  45. rock
  46. saxophone
  47. slow
  48. soul
  49. string ensemble
  50. studio recording
  51. swing
  52. synthesized
  53. synthesizer
  54. trumpet
  55. vocal harmonies

Unifying Low-Level and High-Level Music Similarity Measures

This paper proposes three distance measures based on audio content:
  1. A low-level measure based on tempo-related description
  2. A high-level semantic measure based on the inference of different musical dimensions by support vector machines. These dimensions include genre, culture, moods, instruments, rhythm, and tempo annotations
  3. A hybrid measure which combines the above two
Evaluation:
  1. Objective evaluation: By using classification benchmark as ground truth: "For each collection, we considered songs from the same class to be similar and songs from different classes to be dissimilar, and assessed the relevance of the songs’ rankings returned by each approach."
  2. Subjective evaluation: Listeners were presented with 5 different playlists (one for each measure) generated from the same seed song. Independently for each playlist, listeners provided 1) a playlist similarity rating (six-point scale) and 2) a boolean playlist-inconsistency answer (bipolar).
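The objective protocol is just ranking the collection by each distance measure and scoring with same-class songs as relevant; average precision is a natural score for such rankings (a sketch of the protocol as described, not necessarily the paper's exact metric):

```python
def average_precision(ranked_labels, query_label):
    """AP of a ranked list where items sharing the query's class count
    as relevant (the 'same class = similar' ground truth above)."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)   # precision at this hit
    return sum(precisions) / hits if hits else 0.0
```

Averaging this over all query songs and collections gives one number per distance measure, which is what a comparison of the three measures needs.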

Tuesday, October 25, 2011

Shotgun: Parallel Lasso and Sparse Logistic Regression

http://select.cs.cmu.edu/code/index.html

Shotgun outperforms other published solvers on a range of large problems, proving to be one of the most scalable algorithms for L1-regularized loss minimization.

Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. "Parallel Coordinate Descent for L1-Regularized Loss Minimization." International Conference on Machine Learning (ICML 2011).
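The inner step Shotgun parallelizes is the standard soft-thresholding coordinate update for the lasso; a minimal sequential sketch of that update (illustrative, not the released code):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(A, b, lam, n_iter=100):
    """Sequential coordinate descent for 0.5*||Ax - b||^2 + lam*||x||_1.
    Shotgun runs updates like this on many coordinates in parallel."""
    n, d = A.shape
    x = np.zeros(d)
    col_sq = (A ** 2).sum(axis=0)          # per-column squared norms
    r = b.copy()                           # residual b - A @ x
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0:
                continue
            rho = A[:, j] @ r + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            r += A[:, j] * (x[j] - new_xj)  # keep residual in sync
            x[j] = new_xj
    return x
```

The Bradley et al. paper's contribution is showing how many such coordinate updates can safely run concurrently as a function of the data's correlation structure.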

Thursday, October 20, 2011

Advanced chroma features

http://www.mpi-inf.mpg.de/resources/MIR/chromatoolbox/

Implementation of novel chroma features proposed in the following articles:

  • Audio matching via chroma-based statistical features.  ISMIR 2005
  • Making chroma features more robust to timbre changes.   ICASSP 2009
  • Towards timbre-invariant audio features for harmony-based music. TASLP 2010
An article describing this toolbox appears in this year's ISMIR proceedings:
  • Chroma Toolbox: MATLAB Implementations for Extracting Variants of Chroma-Based Audio Features.  ISMIR 2011
The author of this toolbox (Meinard Müller) will give a tutorial on "Audio Content-based Music Retrieval," with an MTG researcher, Joan Serrà.
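The basic chroma idea underlying all of these variants is folding spectral energy into 12 pitch classes. A minimal sketch (assuming 440 Hz tuning and A = pitch class 0; the toolbox itself uses pitch filterbanks plus tuning and timbre compensation on top of this):

```python
import numpy as np

def chroma_from_spectrum(mag, sr, fmin=55.0):
    """Fold an FFT magnitude spectrum into 12 pitch classes.
    mag: magnitudes of the rfft bins of one frame; sr: sample rate.
    Minimal sketch only -- no filterbank, no tuning estimation."""
    n_fft = 2 * (len(mag) - 1)
    freqs = np.arange(len(mag)) * sr / n_fft
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], mag[1:]):      # skip the DC bin
        if f < fmin:
            continue
        # semitone distance from A440, wrapped to a pitch class (A = 0)
        pitch_class = int(round(12 * np.log2(f / 440.0))) % 12
        chroma[pitch_class] += m
    s = chroma.sum()
    return chroma / s if s > 0 else chroma
```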

Monday, October 17, 2011

K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation, IEEE TSP 2006

The sparse representation problem can be viewed as a generalization of the VQ objective, in which we allow each input signal to be represented by a linear combination of codewords, which we now call dictionary elements. Therefore the coefficients vector is now allowed more than one nonzero entry, and these can have arbitrary values.
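The generalization is easy to see in code: VQ assigns exactly one codeword, while sparse coding allows several atoms with arbitrary coefficients. A minimal orthogonal matching pursuit sketch (OMP is the pursuit used in the K-SVD paper's sparse coding stage; the dictionary below is illustrative):

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Orthogonal matching pursuit: approximate x with at most
    n_nonzero columns of dictionary D (unit-norm columns, n_nonzero >= 1).
    With n_nonzero = 1 and the coefficient fixed to 1 this reduces to VQ."""
    residual = x.copy()
    support = []
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        j = int(np.argmax(np.abs(D.T @ residual)))  # best-matching atom
        if j not in support:
            support.append(j)
        # re-fit all selected atoms jointly (the "orthogonal" step)
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coeffs[support] = sol
    return coeffs
```

K-SVD alternates this sparse coding stage with an SVD-based update of one dictionary atom (and its coefficients) at a time.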

Music & Emotion in ISMIR 2011

http://ismir2011.ismir.net/program.html

9 out of 129 papers (7%) are related to affective analysis in music:

  • Modeling Dynamic Patterns for Emotional Content in Music
  • Identifying Emotion Segments in Music by Discovering Motifs in Physiological Data
  • Music Emotion Classification of Chinese Songs based on Lyrics Using TFIDF and Rhyme
  • Modeling Musical Emotion Dynamics with Conditional Random Fields
  • Mining the Correlation between Lyrical and Audio Features and the Emergence of Mood
  • Exploring The Relationship Between Mood and Creativity in Rock Lyrics
  • A Comparative Study of Collaborative vs. Traditional Musical Mood Annotation
  • Music Mood Classification of Television Theme Tunes
  • Musical Moods- A Mass Participation Experiment for Affective Classification of Music

Learned dictionaries for sparse image representation: Properties and results, SPIE 2011

  • Compare MOD, K-SVD, LS-DLA, ODL, RLS-DLA
  • Propose some dictionary properties: mutual coherence, distribution ratio, gap, sparse representation capabilities (SRC), dictionary distance
  • However, not able to find a clear correlation between any property and the performance of the dictionary in an image compression application.
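Of the listed properties, mutual coherence is the simplest to compute: the largest absolute inner product between distinct normalized atoms (a sketch; the column-major dictionary layout is an assumption):

```python
import numpy as np

def mutual_coherence(D):
    """Max absolute inner product between distinct l2-normalized
    dictionary atoms (columns of D). Lower = atoms closer to orthogonal."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)          # Gram matrix of normalized atoms
    np.fill_diagonal(G, 0.0)       # ignore self-similarity
    return float(G.max())
```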

Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence-Arousal Space, IEEE TAC 2011

The work introduced here converges with this recent shift in affect recognition, from recognizing posed expressions in terms of discrete and basic emotion categories, to the recognition of spontaneous expressions in terms of dimensional and continuous descriptions.

Contributions:
  • Fuse facial expression, shoulder gesture and speech cues in analysis of human affect.
  • Propose an output-associative fusion framework that incorporates correlations and covariances between the emotion dimensions.
  • Demonstrate that capturing temporal correlations and remembering the temporally distant events (or storing them in memory) is of utmost importance for continuous affect prediction.
Challenges mentioned: reliability of ground truth, baseline problem, unbalanced data.

Interesting observation:
  • Arousal can be much better predicted than valence using audio cues. 
  • For valence dimension instead, visual cues (facial expressions and shoulder movements) appear to perform better.

Sunday, October 16, 2011

Online resources for sparse encoding of signals

A nice review:
"A Review of Fast l1-Minimization Algorithms for Robust Face Recognition," by Allen Yang, Arvind Ganesh, Zihan Zhou, Shankar Sastry, and Yi Ma.


SPAMS: a SPArse Modeling Software
http://www.di.ens.fr/willow/SPAMS/doc/html/doc_spams.html 

SparseLab 
http://sparselab.stanford.edu/

l1-ls
http://www.stanford.edu/~boyd/l1_ls/

l1-benchmark
http://www.eecs.berkeley.edu/~yang/software/l1benchmark/

First post

Want to have a place to put down some random thoughts, and obviously FB is not a good one for archival purposes. So I blog.