What makes a feature?

I’m getting deeper into my coding, and now I’m trying to answer the question, “What is a feature?” Specifically, what features can I glean from the audio that draw a meaningful distinction between one song and the next, and what is the general description of this thing I call a “feature”?

My project is not about devising or implementing these features; it is about bringing them together into a navigable representation. But in designing the Feature class, I find myself wondering:

  • Do features operate at the song level or at the section level? I should be able to handle either, since I am sometimes mapping song sections and sometimes whole songs. But what do I show the user in the interface if I’m only mapping a section of a song? (I sketch one possible Feature class just after this list.)
  • Should I try to choose one section to be representative of the piece as a whole, and just do my analysis on that section?
  • What are the must-have features that people have already written code for, and that can be easily adapted and plugged into my engine?
  • What kind of rhythm-based features can I pull out? (I mention this because I am sorely lacking in the rhythm arena.)
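
To make the song-vs.-section question concrete, here is a rough sketch of what my Feature class might look like. All names here are hypothetical; this is a design sketch in Python, not a committed implementation:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Scope(Enum):
        """Whether a feature describes a whole song or one section of it."""
        SONG = "song"
        SECTION = "section"

    @dataclass
    class Feature:
        name: str                             # e.g. "tempo", "mean_loudness"
        value: float                          # the quantifiable measure
        scope: Scope                          # song-level or section-level
        section_index: Optional[int] = None   # which section, when scope is SECTION

        def display_label(self) -> str:
            """What the interface shows: flag section-level features so the
            user knows only part of the song was mapped."""
            if self.scope is Scope.SECTION:
                return f"{self.name} (section {self.section_index}): {self.value:.2f}"
            return f"{self.name}: {self.value:.2f}"

A single track’s representation would then just be a list of Feature objects, mixing both scopes.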

I will start with features like this (for each track):

  • number of sections
  • number of types of sections (counted by timbre type)
  • number of types of sections (counted by pitch pattern type)
  • mode of timbres
  • mean level of specific timbre coefficients (coefficients shown visually at the bottom of this page)
  • tempo
  • mean loudness (or max, maybe)
  • confidence level of autotag assignment with tag1, tag2, tag3, etc… (multiple features here)
  • frequency of appearance of tag assignment with tagA, tagB, tagC, etc… (multiple features here)
  • time signature
  • time signature stability
  • track duration

(Note that, right now, I am not talking about similarity measures for pairs of songs, but rather quantifiable measures for one song at a time. I’ll deal with similarity later.)
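
For the simpler acoustic measures in the list above, the extraction might look something like the sketch below. I’m assuming the open-source librosa library purely for illustration; in practice these values could just as easily come from the Echo Nest analysis, and the tag- and section-based features are omitted here:

    import librosa
    import numpy as np

    def extract_basic_features(path):
        """Per-track acoustic features; a sketch, not the full engine."""
        y, sr = librosa.load(path)

        # Global tempo estimate (may not be stable across the whole track).
        tempo, _beats = librosa.beat.beat_track(y=y, sr=sr)

        # Loudness proxy: RMS energy per frame.
        rms = librosa.feature.rms(y=y)[0]

        # Timbre proxy: mean level of each MFCC coefficient.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

        features = {
            "tempo": float(tempo),
            "mean_loudness": float(np.mean(rms)),
            "max_loudness": float(np.max(rms)),
            "duration": float(librosa.get_duration(y=y, sr=sr)),
        }
        for i, coeff_mean in enumerate(mfcc.mean(axis=1)):
            features[f"mfcc_{i}_mean"] = float(coeff_mean)
        return features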

4 Comments

  1. Comment by Paul Lamere

    Posted on February 4, 2008 at 12:38 pm

    I’m sure you’ve taken a look at the new Echo Nest analyzer API: http://analyze.echonest.com/AudioAnalysis.html

    They have a pretty detailed set of features per song:

    http://analyze.echonest.com/XMLDescription.html

    This looks like it might be a good place for you to start.

  2. Comment by Anita

    Posted on February 5, 2008 at 3:06 pm

    Yes, I’ve certainly looked at EN’s analyzer — that’s where most of the features in the list come from, actually. The major thing I’m lacking here is musical features, as opposed to merely acoustical ones. EN’s musical features are limited mostly to section identification… what about chords, melody, rhythm (hard to get from EN because of the sample window for the tatums), and form?

  3. Comment by Tristan Jehan

    Posted on February 8, 2008 at 2:28 pm

    I believe this is a fundamental question! What you typically find in the literature are complex systems that deal with very low-level features (LLDs, for Low-Level Descriptors): the so-called bag-of-frames approach. There are dozens of these features (many are described in the MPEG-7 standard). People usually apply statistical models to them (GMM, ANN, SOM, HMM, etc.), hoping for relevant discrimination to come out.
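
    To make that pipeline concrete, here is a minimal sketch of the bag-of-frames idea: model a song as a statistical distribution over per-frame low-level descriptors (MFCCs here), fit with a Gaussian mixture model. The librosa and scikit-learn libraries are just one possible tool choice; any frame features and density model could stand in:

        import librosa
        from sklearn.mixture import GaussianMixture

        def bag_of_frames_model(path, n_components=8):
            """Fit a GMM over per-frame MFCCs: one statistical 'signature' per song."""
            y, sr = librosa.load(path)
            frames = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (n_frames, 13)
            gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
            gmm.fit(frames)
            return gmm

        # Comparing two songs then reduces to model likelihoods, e.g.
        # model_a.score(frames_b) is the average log-likelihood of song B's
        # frames under song A's model.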

    There are at least three main issues with that approach:

    1) It’s unclear whether LLDs are at all relevant to music description, especially when combined statistically. I think they describe sound at an instant better than they describe music.

    2) It’s unclear whether applying statistical modeling tools to the whole song is actually helpful, or whether it adds even more noise to an already complex problem. A simple example: listeners can easily judge similarity from only a few seconds of audio.

    3) This whole business is generally about modeling overall “timbre,” and that’s about it. There’s much less work done on rhythm or pitch structures.

    But anyway, it’s also unclear what actually matters when it comes to music similarity, and I believe it depends a lot on the person. An extreme case is a friend of mine who cares almost exclusively about lyrics.

    For reviews of these approaches, read François Pachet’s recent papers. He shows that all these attempts give very similar results. He also tried to generalize the approach, and showed that although he was arguably getting slightly better results, there’s a glass ceiling inherent in the concept.

    I think we need more relevant features for this problem, and they are probably not low-level features but rather musical features. A good example of that kind of feature is “tempo.” Yet it is not necessarily consistent throughout a song, it is hard to extract properly, and people even disagree on the right answer…

  4. Comment by Anita

    Posted on February 8, 2008 at 3:03 pm

    Thank you, Tristan! This is very thoughtful, and I appreciate it. I will take a look at François Pachet’s papers again… I had looked at them last year, but I think I should be able to make more sense of them now, specifically in relation to my thesis topic.

    I definitely agree with the need for more high-level, musical features. Who are the key people doing work on this kind of thing? (i.e. deciding which musical features help distinguish pieces of music, and how to implement extraction of those features)
