Project completed in week 12 (14.12.-18.12.20) of the Data Science Bootcamp at Spiced Academy in Berlin.

I did it! I graduated from the Data Science Bootcamp! On Friday I presented my final project, which was about detecting emotions from speech with neural networks. It was one of the most challenging project I’ve worked on, because I had to learn something new (how to process audio data and make live voice predictions) and prepare everything nicely in only 7 days. Here’s how it went…

Planning

First and foremost, I designed a project plan, after having a brief look at the dataset. From my work experience and the assignments completed in the past three months, I’ve learned that this step is crucial for the success of a coding project. Planning helps me (and the team) organize my ideas, break down the big project into smaller tasks, identify issues, and track the progress – and not despair at the amount of work to be done in a short time. For this purpose, I created a simple Kanban board directly in the GitHub repository of my project, so that I have the code and tasks in one place.

Data

I used the RAVDESS dataset, which contains 1440 audio files. These are voice recordings of 24 actors (12 male, 12 female) who say two sentences in two different intensities (normal and strong) with eight intonations that express different emotions: calm, happy, sad, angry, fearful, surprised, disgusted, and neutral. There are 192 recordings for each emotion, except for neutral, which doesn’t have recordings in strong intensity. For this reason, the dataset was imbalanced, so I used the RandomOversample method to create new features for the neutral class. Oversampling added 96 new datapoints, so in the end I had 1536 audio files to work with. There were also slightly more recordings by males and in normal intensity, but I didn’t deal with this imbalance because it wasn’t significant and it wouldn’t have affected my analysis, since I wanted to predict the emotion.

Features

There are many features that can be extracted from audio files. I decided to work with the Mel Frequency Cepstral Coefficient (MFCC), which is typically used in automatic speech and speaker recognition.

Models

I trained three different neural networks models on the MFCC and emotion labels:

  • Multi-Layer Perceptron (MLP) with default settings
  • Convolutional Neural Network (CNN) with 2 Conv1D layers, 2 Dense layers, and 1 Dropout 0.1 layer
  • Long Short-Term Memory (LSTM) with 2 LSTM layers and 1 Dense layer

After several iterations of tweaking the hyperparameters, I found that generally the models performed better with low learning rates (0.001), adam optimizer, and less layers. Here is the accuracy achieved by the models on the train and test set:

All models overfit, which means they couldn’t generalize on unseen data, but this is a common issue in neural networks and on audio data. As expected, MLP had the lowest accuracy, since it’s a very basic model (a simple feed-forward artificial neural network). CNN and LSTM had similar train accuracy (80%), but CNN performed better on test data (60%) than LSTM (51%). To give you some context, state-of-the-art models for speech classification have an accuracy of 70-80%, so I was quite happy with my CNN model accuracy.

It was particularly interesting to look at the actual vs. predicted emotions, to see what emotions were misclassified. From the correlations matrices of CNN and LSTM, I noticed that both models misclassified emotions that sound similar or are ambiguous (even for humans), like sad-calm or angry-happy.

Results

The exciting part was to make predictions on new data. First, I tried the CNN and LSTM models on a couple of movies sound clips and both models identified the plausible emotion. Next, I used the sounddevice library to record my voice and classify it in real-time, and again both models recognized my expressed emotion!

Next steps

Cool as it may be, there are several limitations and aspects of this project that I’d like to improve:

  • Try other models (not necessarily neural networks).
  • Extract other audio features to see if they are better predictors than the MFCC.
  • Train on larger datasets, since 1500 files and only 200 samples per emotion is not enough.
  • Train on natural data, i.e. on recordings of people speaking in unstaged situations, so that the emotional speech sounds more realistic.
  • Train on more diverse data, i.e. on recordings of people of different cultures and languages. This is important because the expression of emotions varies across cultures and is influenced also by individual experiences.
  • Combine speech with facial expressions and text (speech-to-text) for multimodal sentiment analysis.