09 October 2008

Pandora's DNA

Pandora is an application that sits above yet another ginormous database of music information. (Are you noticing a pattern here?) It uses this database to generate lists of songs to play as stations for users.

So, what about this database? It is from the Music Genome Project, which brought together geeky music experts to find a way to describe music in numbers, so that tracks that sound similar to each other have close-together numerical descriptions. (My non-Amber mathematician hat keeps trying to get technical, but I am beating it down. So far.)

To do this, they first developed lists of genes, things or qualities that a track might have to some degree. For example, in the category of genes "Strings", they have the genes Background String Section, Bowed Strings, Melodic String Accompaniment, Melodic String Section, Solo Strings, String Ensemble, String Section, String Section Beds, Subtle Use of Strings, Use of a String Ensemble, and Use of Strings. Human expert listeners give each track a score for each gene that applies to its genre. Rock and pop have 150 genes, rap has 350, jazz has 400, etc. That list of values, the score for each gene, is the track's genotype. That goes into the ginormous database with the usual track information like artist and title.

To give you an idea, here are some of those genotypes presented in words, just hitting the highlights. These are taken from the Pandora "Why was this song selected?" feature.

  1. Features electric rock instrumentation, a subtle use of vocal harmony, repetitive melodic phrasing, extensive vamping, and major key tonality. (U2, "I Still Haven't Found What I'm Looking For")

  2. Features hard rock roots, groove based composition, minor key tonality, dirty electric guitar riffs. (Tool, "Sober")

  3. Features hard rock roots, folk influences, mild rhythmic syncopation, repetitive melodic phrasing, and extensive vamping. (Jerry Cantrell, "Anger Rising")

Within the gene system, the distance between two tracks can be calculated mathematically. If they set their system up right, that distance would track how different the two sound to a listener: the smaller the distance, the more alike they sound. (Or to a group of listeners, averaged out.) You can also use math to describe different subgenres or do all sorts of other cool fun stuff, for values of "cool" and "fun" that only really hardcore geeks would understand.
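For the geeks who want to see what "distance between two tracks" could look like, here is a minimal sketch in Python. Everything in it is my assumption, not Pandora's method: the gene names, the 0-to-5 scoring scale, the made-up scores, and the choice of plain Euclidean distance. I don't know what metric or weighting the Music Genome Project actually uses.

```python
import math

# Hypothetical gene list; the real gene sets run to hundreds of entries.
GENES = [
    "electric_rock_instrumentation",
    "vocal_harmony",
    "repetitive_melodic_phrasing",
    "minor_key_tonality",
]

def genotype_distance(a, b):
    """Euclidean distance between two genotypes (lists of gene scores).

    Assumes both tracks were scored on the same gene set, in the same order.
    """
    if len(a) != len(b):
        raise ValueError("genotypes must use the same gene set")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up scores on an assumed 0-5 scale, one per entry in GENES.
track_a = [4.0, 1.5, 4.0, 0.0]
track_b = [4.5, 0.5, 3.0, 5.0]  # differs mostly on minor key tonality
track_c = [4.0, 2.0, 3.5, 0.5]  # close to track_a on every gene

print(genotype_distance(track_a, track_b))
print(genotype_distance(track_a, track_c))
```

With these invented numbers, track_c comes out much closer to track_a than track_b does, which is the behavior you would want if small distances are supposed to mean "sounds similar." Note that the function refuses to compare genotypes of different lengths, which is exactly the problem you would hit comparing a 350-gene rap track with a 150-gene rock track.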

This certainly has my geek propeller hat spinning. I keep wondering things like which gene set did they apply to Linkin Park and Body Count? Does their system let you measure the distance between a rap track and a rock track? Have they explored the data for clusters that cross standard genre labels? How do different recordings of the same song compare? I would love to get my hands on their data and start playing with it.

I can't, directly. But I can play with Pandora. And I have been. Reports on the cool fun will follow.

