Data

Data preprocessing was the most time-consuming part of our project. Because of licensing restrictions, it is hard to obtain a large amount of full lyrics data labeled with genre, so we combined several sources to build our dataset.

We first used the Million Song Dataset (MSD) website to get the track information and the corresponding genre labels. MSD is a freely available collection of data for a million popular music tracks. From the website we obtained the table “Genre”, which contains 133283 trackIDs and their corresponding genre labels.

MSD also partners with Musixmatch and provides song lyrics in bag-of-words format. However, since we use the TF-IDF method in the sklearn package to extract lyrics features, the bag-of-words data is not suitable as input. The input should be either the full lyrics text or a text that contains every word of the lyrics with its multiplicity: if a word w appears n times in the lyrics, the text should contain n copies of w. To use the bag-of-words data, we would need to expand the word counts of each song into such a text. This could be quite time-consuming given the size of our training dataset.
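To make the expansion step concrete, here is a minimal sketch of how a bag-of-words entry could be turned into a text with exact word counts. The function name `expand_bag_of_words` and the sample counts are hypothetical and chosen only for illustration; they are not taken from the MSD or Musixmatch data.

```python
from collections import Counter


def expand_bag_of_words(word_counts):
    """Turn a {word: count} mapping into a text where each word appears count times."""
    tokens = []
    for word, count in word_counts.items():
        # Repeat the word as many times as it occurs in the lyrics.
        tokens.extend([word] * count)
    return " ".join(tokens)


# Hypothetical bag-of-words entry for one song.
song_counts = {"love": 3, "night": 1, "road": 2}
text = expand_bag_of_words(song_counts)
print(text)  # love love love night road road

# Round-trip check: counting the words of the text recovers the original counts.
assert Counter(text.split()) == Counter(song_counts)
```

The expanded text loses the original word order, which does not matter for TF-IDF features, but the step still has to be repeated for every song in the dataset.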

Therefore, we decided to find an API that would let us obtain the song lyrics directly. The MSD website provides a table called MXM that contains the trackID, artist, title and corresponding MusixmatchID of each track, which allows us to get the lyrics and other information about the track via the Musixmatch API. Given a MusixmatchID, the Musixmatch API returns track information such as the language. Since we only want to consider English songs, we use this information to filter out non-English songs. As for the lyrics, due to licensing restrictions the Musixmatch API provides only 30% of the full lyrics of each song. Therefore we turned to another API, “Lyricwikia”, which returns the full lyrics of a song given its artist and title.
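The two-stage lookup described above (language check first, then full lyrics) can be sketched as follows. This is only an outline under stated assumptions: `get_language` and `get_lyrics` are hypothetical stand-ins for the Musixmatch and Lyricwikia calls, whose real signatures are not shown here, and the dictionary keys and the language code "en" are likewise assumed.

```python
def collect_english_lyrics(tracks, get_language, get_lyrics):
    """Keep English tracks and attach full lyrics where they are available.

    tracks:       iterable of dicts with 'mxm_id', 'artist', 'title' (assumed layout)
    get_language: callable mxm_id -> language code (stand-in for a Musixmatch lookup)
    get_lyrics:   callable (artist, title) -> lyrics or None (stand-in for Lyricwikia)
    """
    results = []
    for track in tracks:
        # Step 1: drop songs that are not in English.
        if get_language(track["mxm_id"]) != "en":
            continue
        # Step 2: drop songs for which no full lyrics can be found.
        lyrics = get_lyrics(track["artist"], track["title"])
        if not lyrics:
            continue
        results.append({**track, "lyrics": lyrics})
    return results


# Fake lookups standing in for the two APIs, for illustration only.
tracks = [
    {"mxm_id": 1, "artist": "Artist A", "title": "Song A"},
    {"mxm_id": 2, "artist": "Artist B", "title": "Song B"},
    {"mxm_id": 3, "artist": "Artist C", "title": "Song C"},
]
fake_languages = {1: "en", 2: "es", 3: "en"}
fake_lyrics = {("Artist A", "Song A"): "some full lyrics text"}

kept = collect_english_lyrics(
    tracks,
    fake_languages.get,
    lambda artist, title: fake_lyrics.get((artist, title)),
)
print(len(kept))  # 1
```

Passing the lookups in as functions keeps the filtering logic independent of either service, so it can be checked with fake data before any real requests are made.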

In the end, we used two tables from the MSD website, Genre and MXM, in combination with the Musixmatch API and the Lyricwikia API to get the data we wanted. Suppose we want to develop a classifier that classifies songs into three genres: Country, Rap and Jazz. The detailed process is as follows (all table operations were done in SQLite):

We finally obtained the table “Lyrics”, which contains the title, artist, genre and lyrics of each song; this is our desired dataset. Since many songs in the original dataset are not in English or have no lyrics in Lyricwikia, the final number of songs for each genre is far smaller than in the original dataset (<10%). We processed 6000 tracks for each genre and obtained 400 songs per genre in the “Lyrics” table.
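As an illustration of the kind of table operation involved, the following sketch joins the Genre and MXM tables on trackID and keeps only the three target genres, using Python's built-in sqlite3 module. The column names, the exact spelling of the genre labels and the sample rows are assumptions made for this example; the actual schema of the MSD tables may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Genre (trackID TEXT PRIMARY KEY, genre TEXT);
CREATE TABLE MXM (trackID TEXT PRIMARY KEY, artist TEXT, title TEXT, MusixmatchID INTEGER);
""")

# Hypothetical sample rows, for illustration only.
conn.executemany("INSERT INTO Genre VALUES (?, ?)", [
    ("TR0001", "Country"),
    ("TR0002", "Rock"),
    ("TR0003", "Jazz"),
    ("TR0004", "Rap"),  # has no MXM entry, so the join drops it
])
conn.executemany("INSERT INTO MXM VALUES (?, ?, ?, ?)", [
    ("TR0001", "Artist A", "Song A", 101),
    ("TR0002", "Artist B", "Song B", 102),
    ("TR0003", "Artist C", "Song C", 103),
])

# Keep tracks that have both a target genre label and a MusixmatchID.
rows = conn.execute("""
SELECT m.trackID, m.artist, m.title, m.MusixmatchID, g.genre
FROM MXM AS m
JOIN Genre AS g ON g.trackID = m.trackID
WHERE g.genre IN ('Country', 'Rap', 'Jazz')
ORDER BY m.trackID
""").fetchall()

for row in rows:
    print(row)
```

The resulting rows carry the artist, title and MusixmatchID needed for the language check and the lyrics lookup, together with the genre label used for training.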