Saturday, March 3, 2012

Context-Aware Recommendation Systems

Paradigms for using contextual information

  • Contextual prefiltering
  • Contextual postfiltering (contrasted with prefiltering in the sketch below)
  • Contextual modeling [karatzoglou2010]
Types of context [fling2009]
  • Physical context: time, position, activity, weather, light, temperature
  • Social context
  • Interaction media context: media and device
  • Modal context: mind, goal, mood, experience, cognitive capabilities
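
A minimal sketch contrasting the first two paradigms, assuming a toy ratings table with one categorical context attribute; all data, names, and the penalty heuristic are hypothetical:

```python
import numpy as np

# Toy ratings: (user, item, rating, context).
ratings = [
    (0, 0, 5.0, "weekend"), (0, 1, 2.0, "weekday"),
    (1, 0, 4.0, "weekend"), (1, 2, 3.0, "weekend"),
    (2, 1, 4.0, "weekday"), (2, 2, 1.0, "weekend"),
]

def item_means(data):
    """Context-free recommender: score items by mean rating."""
    scores = {}
    for _, item, r, _ in data:
        scores.setdefault(item, []).append(r)
    return {i: float(np.mean(rs)) for i, rs in scores.items()}

def prefilter(data, context):
    """Contextual prefiltering: keep only the ratings matching the
    target context, then apply the context-free recommender."""
    return item_means([t for t in data if t[3] == context])

def postfilter(data, context, penalty=0.5):
    """Contextual postfiltering: recommend context-free, then down-weight
    items never consumed in the target context."""
    base = item_means(data)
    in_ctx = {t[1] for t in data if t[3] == context}
    return {i: s if i in in_ctx else s * penalty for i, s in base.items()}

print(prefilter(ratings, "weekend"))   # model built from weekend data only
print(postfilter(ratings, "weekend"))  # context applied after the fact
```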

Saturday, December 10, 2011

Learning content similarity for music recommendation

Learning content similarity for music recommendation, TASLP 2011
by Brian McFee, Luke Barrington, and Gert Lanckriet

  • Metric learning to rank (MLR): \min_{W \succeq 0} \mathrm{tr}(W) + \frac{C}{n} \sum_q \xi_q, s.t. \langle W, \psi(q, y_q) - \psi(q, y) \rangle \geq \Delta(y_q, y) - \xi_q for every query q and ranking y; solve this objective by cutting-plane optimization (i.e., a structured SVM)
  • Uses the top-\tau codeword histogram over MFCCs as the feature (see the sketch after this list)
    • Motivation for top-\tau (a kind of soft assignment): counteracting quantization errors
    • Experimental results show this reduces the number of codewords needed
  • Represent each histogram in a probability product kernel (PPK) space to better exploit the geometry of codeword histograms.
    • Leads to better accuracy
  • Visualize result using t-SNE (http://homepage.tudelft.nl/19j49/t-SNE.html)
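
A minimal sketch of the top-\tau codeword histogram and the PPK map referenced above; the random codebook and frames are hypothetical stand-ins for a codebook learned from MFCC data:

```python
import numpy as np

def top_tau_histogram(mfccs, codebook, tau=3):
    """Top-tau codeword histogram: each MFCC frame votes (uniformly here)
    for its tau nearest codewords instead of only the single nearest one,
    softening hard vector-quantization errors."""
    # Pairwise squared distances between frames (n, d) and codewords (k, d).
    d2 = ((mfccs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.zeros(len(codebook))
    for row in d2:
        for idx in np.argsort(row)[:tau]:
            hist[idx] += 1.0 / tau
    return hist / hist.sum()

def ppk_map(hist):
    """Map a histogram into probability product kernel (PPK) space; with
    rho = 1/2 the PPK inner product <sqrt(p), sqrt(q)> is the
    Bhattacharyya coefficient between the two histograms."""
    return np.sqrt(hist)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 13))    # k codewords, 13-dim MFCCs
song = rng.normal(size=(200, 13))       # 200 MFCC frames for one track
h = top_tau_histogram(song, codebook, tau=3)
print(ppk_map(h) @ ppk_map(h))          # self-similarity = 1.0
```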

Approximate nearest-subspace representations for sound mixtures

Approximate nearest-subspace representations for sound mixtures, ICASSP 2011
by Paris Smaragdis
  • Unlike the single-source case, where we search for a nearest point, we now search for two points, one from each source dictionary, that form a subspace passing closest to our input.
  • In order to address this problem in an efficient manner, we recast this search as a sparse coding problem (see the sketch after this list)
  • In order to consider mixtures, we will make the assumption that when we have sounds that mix, their magnitude spectra superimpose linearly. Although this is not exactly true, it is an assumption that has been used frequently by the source separation community and is generally accepted as being approximately true.
  • The use of Euclidean distance inside the spectral composition simplex implies that we are making a Gaussian distribution assumption for the spectral composition frames.
  • A more appropriate distribution in this space is the Dirichlet distribution [2], which is explicitly defined on a simplex and is used to describe compositional data like the ones we have. 
  • If we examine the log likelihood of this model, we can see that it resolves to the cross-entropy between the two spectral-composition vectors, which from information theory we know to be an appropriate measure for comparing two probability vectors.
  • It is in principle invariant to the number of sources, since any mixture problem can be seen as a binary segmentation between a target and an interference; the only complication of having many sources is the increased probability of the target and the interference overlapping in the frequency-composition simplex. Other factors such as reverberation and propagation effects are also not an issue, as long as they do not color the sources enough to significantly change their spectral composition (not an observed problem in general).
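
A simplified sketch of the sparse-coding recast referenced above, using nonnegative least squares over concatenated per-source dictionaries as a stand-in for the paper's solver; the dictionaries, mixture, and weights are synthetic:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
F = 64                                  # frequency bins
D1 = np.abs(rng.normal(size=(F, 20)))   # magnitude-spectrum dictionary, source 1
D2 = np.abs(rng.normal(size=(F, 20)))   # magnitude-spectrum dictionary, source 2

# Linear-mixing assumption: a mixture frame is approximately a
# nonnegative combination of entries from the two source dictionaries.
x = 0.6 * D1[:, 3] + 0.4 * D2[:, 7]

# Recast the nearest-subspace search as a (nonnegative) coding problem
# over the concatenated dictionaries; the dominant atom per source
# identifies the two points spanning the closest subspace.
w, _ = nnls(np.hstack([D1, D2]), x)
w1, w2 = w[:20], w[20:]
print("source-1 atom:", w1.argmax(), "source-2 atom:", w2.argmax())

# Cross-entropy comparison of two spectral-composition vectors
# (cf. the Dirichlet argument above): -sum(p * log(q)).
p = x / x.sum()
q = D1 @ w1 + D2 @ w2
q = q / q.sum()
print("cross-entropy:", -(p * np.log(q + 1e-12)).sum())
```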

Friday, December 9, 2011

Contextual tag inference

Contextual tag inference, TOMCCAP 2011
by M. I. Mandel et al.
  • We show that users agree more on tags applied to clips that are temporally "closer" to one another, and that conditional restricted Boltzmann machine (RBM) models of tags can more accurately predict related tags when they take context into account
  • We also show that when training data is "smoothed" using context (see the sketch after this list), support vector machines can better rank these clips according to the original, unsmoothed tags, and do so more accurately than three standard multi-label classifiers
  • This article discusses and tests two different kinds of tag language models, one based on an information-theoretic formulation of this inference [Schifanella et al. 2010], and the second based on restricted Boltzmann machines (RBMs) [Mandel et al. 2010; 2011].
  • Assuming that tags applied to an artist apply equally well to all of the clips of music that the artist has released (as is done commonly [Bertin-Mahieux et al. 2008]) implies that up to 50% noise is being introduced in those tags
  • A visual scene might be analogous to a musical genre, as the priors over instruments, moods, etc. found in a song should depend on the genre of the song.
  • Spatial context in images could correspond to temporal context in music
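
A toy sketch of the context "smoothing" idea referenced above, assuming consecutive clips of a track share tags with their temporal neighbors; this simple neighbor-averaging is a crude stand-in for the paper's RBM-based tag models:

```python
import numpy as np

# Binary tag matrix: rows are consecutive clips of one track, columns
# are tags; temporally close clips tend to share tags.
tags = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
], dtype=float)

def smooth_tags(T, weight=0.5):
    """Blend each clip's tags with those of its temporal neighbors,
    then rescale each row to a maximum of 1."""
    S = T.copy()
    S[1:] += weight * T[:-1]   # previous clip
    S[:-1] += weight * T[1:]   # next clip
    return S / S.max(axis=1, keepdims=True)

print(smooth_tags(tags))
```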

Friday, November 25, 2011

A Connotative Space for Supporting Movie Affective Recommendation

A Connotative Space for Supporting Movie Affective Recommendation, Sergio Benini, Luca Canini, and Riccardo Leonardi, IEEE TMM 2011

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5962360&tag=1

"there are at least three possible levels of description for a given object, a video in our case: the denotative meaning (what is the described concept), the connotative one (by which terms the concept is described), and the affective response (how the concept is perceived by a person)."

"Connotation is essential in cinematography, as in any other art discipline. It is given by the set of conventions (such as editing, music, mise-en-scene elements, color, sound, lighting, etc.) ..."

"using connotation properties can be more reliable than exploiting emotional annotations by other users."

"A set of conventions, known as film grammar [22], governs the relationships between these elements and influences how the meanings conveyed by the director are inferred by the audience."

"borrowing the theoretical approach from art and design, ... the affectivemeaning of a movie varies along three axes which account for the natural (warm/cold), temporal (dynamic/slow), and energetic (energetic/minimal) dimension"

"For self-assessment, the emotation wheel is preferred to other models, such as PAD, since it is simpler for the users to provide a unique emotional label than to express their emotional state by a combination of values of pleasure, arousal, and dominance."

"Exploiting distances between emotions, for each scene, we then turn emotations into a 1-to-5 bipolar scale by unfolding the wheel only on the five most voted contiguous emotions, as shown in Fig. 8."

"Moreover, the choice of discarding, separately for each scene, the three least voted contiguous emotions is supported by Osgood,..."

Tuesday, November 15, 2011

Law of two-and-a-half

http://jhimusic.com/blog/?p=121



Emotion representation, analysis, and synthesis in continuous space: a survey

"Emotions are complex constructs with fuzzy boundaries and with substantial individual variations in expression and experience."

"To guarantee a more complete description of affective colouring, some resedarchers include expectation (the degree of anticipating or being taken unaware) as the fourth dimension, and intensity (how far a person is away from a state of pure, cool rationality) as the fifth dimension."

"Despite the existence of diverse affect models, search for optimal low-dimensional representation of affect, for analysis and synthesis, and for each modality or cue, remains open."

"While visual signals appear to be better for interpreting valence, audio signals seem to be better for interpreting arousal."

"There are also spin off companies emerging out of collaborative research at well-known universities (e.g., Affectiva established R. Picard and colleagues of MIT Media Lab)."

Monday, November 7, 2011

Music Discovery with Social Networks

by Cédric S. Mesnage et al., WOMRAD 2011

  • Study "social shuffle" (or flooding, diffusion) over Facebook by using Starnet Ap on FB
  • Definition of a successful music discovery: "it occurs when the user of the application likes a track that s/he has never heard before."
  • Conclusion: social recommendation > non-social recommendation > random recommendation
  • Prototype system: apps.facebook.com/music_valley

Friday, November 4, 2011

Exploring Automatic Music Annotation with Acoustically-Objective Tags

Exploring Automatic Music Annotation with Acoustically-Objective Tags
by Derek Tingle, Youngmoo E. Kim, and Douglas Turnbull, MIR 2010

http://cosmal.ucsd.edu/cal/projects/CAL10K/

  • Consists of 10,870 songs annotated using a vocabulary of 475 acoustic tags and 153 genre tags from Pandora’s Music Genome Project
  • Uses the Echo Nest API for feature extraction
  • Trains on CAL10k and tests on CAL500 (see the sketch after the tag list)
The 55 overlapping tags between the vocabularies of CAL10k and CAL500:

  1. acoustic
  2. acoustic guitar
  3. aggressive
  4. alternative
  5. ambient sounds
  6. bebop
  7. bluegrass
  8. blues
  9. breathy
  10. call and response
  11. catchy
  12. classic rock
  13. cool jazz
  14. country
  15. dance pop
  16. danceable
  17. distorted electric guitar
  18. drum set
  19. duet
  20. electric
  21. electric blues
  22. electronica
  23. emotional
  24. female lead vocals
  25. folk
  26. funk
  27. gospel
  28. gravelly
  29. hand drums
  30. harmonica
  31. heavy beat
  32. hip hop
  33. jazz
  34. light beat
  35. low pitched
  36. major
  37. male lead vocals
  38. mellow
  39. minor
  40. organ
  41. piano
  42. pop
  43. punk
  44. r&b
  45. rock
  46. saxophone
  47. slow
  48. soul
  49. string ensemble
  50. studio recording
  51. swing
  52. synthesized
  53. synthesizer
  54. trumpet
  55. vocal harmonies
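
A minimal sketch of the train-on-CAL10k, test-on-CAL500 protocol for a single overlapping tag; the features and labels are random stand-ins for the Echo Nest features and dataset annotations, and logistic regression is used purely for illustration, not as the paper's classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Random stand-ins for per-song audio features and binary labels for a
# single tag (e.g., "acoustic"); in the real protocol the features come
# from the Echo Nest API and the labels from CAL10k/CAL500 annotations.
rng = np.random.default_rng(2)
X_cal10k, y_cal10k = rng.normal(size=(500, 24)), rng.integers(0, 2, 500)
X_cal500, y_cal500 = rng.normal(size=(100, 24)), rng.integers(0, 2, 100)

# Train on CAL10k, evaluate tag retrieval on CAL500.
clf = LogisticRegression(max_iter=1000).fit(X_cal10k, y_cal10k)
scores = clf.predict_proba(X_cal500)[:, 1]
print("per-tag AUC:", roc_auc_score(y_cal500, scores))  # ~0.5: data is random
```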

Unifying Low-Level and High-Level Music Similarity Measures

This paper proposes three distance measures based on audio content:
  1. A low-level measure based on tempo-related description
  2. A high-level semantic measure based on the inference of different musical dimensions by support vector machines. These dimensions include genre, culture, moods, instruments, rhythm, and tempo annotations
  3. A hybrid measure that combines the above two (see the sketch below)
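
A minimal sketch of the hybrid combination; the convex mixture and the z-normalization here are illustrative assumptions, not necessarily the paper's exact scheme:

```python
import numpy as np

def hybrid_distance(d_low, d_high, alpha=0.5):
    """Hybrid measure: convex combination of the low-level and the
    high-level distance; alpha is an illustrative mixing weight."""
    return alpha * d_low + (1 - alpha) * d_high

# Distances from one seed song to three candidate songs under each
# measure; z-normalizing first keeps the two scales comparable.
d_low = np.array([0.2, 0.9, 0.5])
d_high = np.array([0.4, 0.1, 0.8])
norm = lambda d: (d - d.mean()) / d.std()
print(hybrid_distance(norm(d_low), norm(d_high)))
```
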
Evaluation:
  1. Objective evaluation: using classification benchmarks as ground truth (see the MAP sketch after this list): "For each collection, we considered songs from the same class to be similar and songs from different classes to be dissimilar, and assessed the relevance of the songs’ rankings returned by each approach."
  2. Subjective evaluation: listeners were presented with 5 different playlists (one for each measure) generated from the same seed song. Independently for each playlist, we asked the listeners to provide 1) a playlist similarity rating (six-point scale) and 2) a boolean playlist-inconsistency answer (bipolar).
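
A minimal sketch of the objective evaluation: mean average precision over rankings in which songs from the seed's class are marked relevant (the rankings here are hypothetical):

```python
import numpy as np

def average_precision(relevant):
    """AP for one ranked list given binary relevance per position."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevant) / np.arange(1, len(relevant) + 1)
    return (precision_at_k * relevant).sum() / relevant.sum()

# Each list ranks candidates for one seed; 1 marks a same-class song.
rankings = [[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
print("MAP:", np.mean([average_precision(r) for r in rankings]))
```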