Week 4 (19.10.-23.10.)
- Topic: Text Classification
- Lessons: Web Scraping, Regular Expressions, HTML Parsing, Language Models, Class Imbalance, Bag-of-Words, Naive Bayes, Python Functions, Command Line Interface
- Dataset: Self web-scraped song lyrics
- Project: Predict the artist from song lyrics.
- Code: GitHub
I was super excited about this week, because it was about language models and first steps into NLP, my favorite ML topic! The challenge was to create a Python program that scrapes lyrics from a website, preprocesses them, and predicts the artist from the text.
The first step in this project was to build a web scraper in Python with BeautifulSoup. I found this to be the most difficult and frustrating part: finding a good lyrics website that doesn’t contain dozens of duplicate lyrics, implementing the scraper with BeautifulSoup, and moving the code from a Jupyter notebook into a Python file.
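A minimal sketch of the HTML parsing step, assuming a page layout that I made up for illustration. The tag and class names (`song-title`, `lyrics`, `ads`) are hypothetical, and the HTML is an inline string rather than a downloaded page, so nothing here reflects the actual website or scraper:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: real lyrics sites use their own tags and class names.
SAMPLE_HTML = """
<html><body>
  <h1 class="song-title">Example Song</h1>
  <div class="lyrics">First line of the song<br>Second line of the song</div>
  <div class="ads">Buy tickets now!</div>
</body></html>
"""


def extract_lyrics(html):
    """Return the song title and lyrics text found in one page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="song-title")
    lyrics = soup.find("div", class_="lyrics")
    return {
        "title": title.get_text(strip=True) if title else None,
        # The separator joins the text fragments around <br> tags with newlines.
        "lyrics": lyrics.get_text(separator="\n", strip=True) if lyrics else None,
    }


song = extract_lyrics(SAMPLE_HTML)
print(song["title"])
print(song["lyrics"])
```

Selecting only the lyrics container keeps page clutter such as ads out of the dataset, which matters later because every stray word becomes a feature for the classifier.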
Next, I had to clean and preprocess the scraped lyrics so that I could feed them to the classification model. I used Texthero and spaCy to tokenize the text, lowercase the words, and remove punctuation, numbers, and stop words (i.e. filler words that carry little meaning on their own, like the, a, to). This step was quite easy, mainly because the lyrics were in English. But remember:
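To make the cleaning steps concrete, here is a rough stand-in written with the standard library only. It is not the Texthero/spaCy pipeline from the project: the function name `preprocess` and the tiny stop-word list are my own assumptions, and the regular expression only handles unaccented Latin letters, which is exactly the kind of English-only shortcut the reminder is about.

```python
import re

# Tiny illustrative stop-word list; spaCy and Texthero ship far larger ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}

# Matches runs of lowercase ASCII letters, optionally with one inner
# apostrophe ("don't"). Digits and punctuation never match, so they are dropped.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def preprocess(text):
    """Lowercase the text, tokenize it, and remove stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("Run to the hills, run for your lives! 1982"))
```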
NLP != English NLP
Another important part of text processing was to check for class imbalance, i.e. whether one artist had far more observations than another. Class imbalance can skew the predictions, because the model tends to favor the majority class. There are four main ways to deal with class imbalance:
- over-sampling: increase the number of samples in the minority class, e.g. by randomly duplicating existing ones.
- under-sampling: reduce the number of samples in the majority class.
- SMOTE: generate new synthetic minority-class samples by interpolating between existing ones.
- weighting: assign higher weights to observations from the minority class during training.
In my case, I had more lyrics by Metallica than by Iron Maiden. I went with under-sampling and, after dropping the duplicates, I randomly removed 15 more songs by Metallica. This way, I ended up with 86 songs by each band.
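Random under-sampling is simple enough to sketch in a few lines. The version below is an illustration, not the code from the project: the helper name `undersample`, the fixed seed, and the placeholder song texts are assumptions. Only the counts (101 Metallica songs reduced to 86, matching 86 Iron Maiden songs) come from the numbers above.

```python
import random
from collections import Counter, defaultdict


def undersample(samples, seed=42):
    """Randomly drop samples until every class is as small as the smallest one.

    `samples` is a list of (text, label) pairs.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    n_min = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    balanced = []
    for group in by_label.values():
        # sample() draws without replacement, so no song is picked twice
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Placeholder texts; only the class sizes mirror the project.
songs = [(f"metallica song {i}", "Metallica") for i in range(101)]
songs += [(f"iron maiden song {i}", "Iron Maiden") for i in range(86)]

balanced = undersample(songs)
counts = Counter(label for _, label in balanced)
print(counts)
```

The trade-off is that the dropped songs are simply lost as training data, which hurts more the smaller the dataset is.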
Naive Bayes Classifier
Now that the dataset consisted of clean lyrics, I could train a classification model. I used the Multinomial Naive Bayes Classifier (MNBC), a probabilistic model based on Bayes’ Theorem, which assumes that the features are independent of each other given the class and models them with a multinomial distribution. In text classification, MNBC is typically used to estimate how likely each word is to occur in texts of a given class. For this, the texts are vectorized, i.e. transformed into numbers such as word counts. My model achieved an accuracy of 96.7% on the training set and 68.4% on the test set. This large gap means that the model overfit.
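A minimal sketch of how vectorizing and classifying can fit together in scikit-learn. I am assuming a plain bag-of-words `CountVectorizer` with default settings, which may differ from the configuration used in the project. The training lines are made up, and the labels `band_a` and `band_b` are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy lines, not real lyrics; the labels are placeholders.
train_texts = [
    "fire thunder battle ride",
    "battle thunder sword ride",
    "dark night fear alone",
    "alone fear night cold",
]
train_labels = ["band_a", "band_a", "band_b", "band_b"]

# CountVectorizer turns each text into word counts (bag-of-words);
# MultinomialNB then learns per-class word probabilities from those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_text = ["thunder and battle"]
prediction = model.predict(new_text)[0]
probabilities = model.predict_proba(new_text)[0]

print(prediction)
print(dict(zip(model.classes_, probabilities)))
```

With real data, calling `model.score(...)` once on the training set and once on a held-out test set gives the two accuracies to compare. A gap as large as the one above is the usual sign of overfitting.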
Text analysis of Metallica and Iron Maiden lyrics
Just for fun, I also used NLTK, spaCy, and VADER to explore the lyrics. Specifically, I looked for similarities and differences between the two bands. It turned out that Iron Maiden have a richer vocabulary and use more long words than Metallica. The latter are slightly more negative, but overall the similarity score of the two bands was 0.99.
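One simple way to put numbers on "richer vocabulary" and "more long words" is the type-token ratio (unique words divided by total words) and the share of words above some length. This is only an illustration of the idea: the function name `lexical_stats`, the cut-off of 7 letters, and the sample tokens are assumptions, and the measures used in the actual analysis may have been different.

```python
def lexical_stats(tokens, long_word_min_len=7):
    """Return the type-token ratio and the share of long words for a token list."""
    total = len(tokens)
    if total == 0:
        return {"type_token_ratio": 0.0, "long_word_share": 0.0}
    return {
        # unique words / all words: higher means a more varied vocabulary
        "type_token_ratio": len(set(tokens)) / total,
        # fraction of tokens with at least `long_word_min_len` letters
        "long_word_share": sum(len(t) >= long_word_min_len for t in tokens) / total,
    }


stats = lexical_stats(["run", "hills", "run", "lives", "darkness", "run"])
print(stats)
```

Note that the type-token ratio shrinks as texts get longer, so it is only fair to compare corpora of similar size, such as the two balanced sets of 86 songs.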
The most fun part was identifying named entities in the text with spaCy. Though spaCy recognized dates and time adverbs pretty well, it mistagged some words in hilarious/creepy ways.
Friday Lightning Talk
At the end of this week, I presented my text analysis of the lyrics to give an overview of the NLTK and spaCy functions for linguistic analysis. As usual, I also got inspiration from my colleagues’ projects on how to improve the command line interface for my web scraper/artist predictor!