Week 12/12 #DataScienceBootcamp

Week 12 (14.12.-18.12.)

  • Topic: Final Project & Graduation
  • Project: Speech Emotion Recognition
  • Dataset: RAVDESS
  • Code: GitHub

I did it! I graduated from the Data Science Bootcamp! On Friday I presented my final project, which was about detecting emotions from speech with neural networks. It was one of the most challenging projects I’ve worked on, because I had to learn how to process audio data, make live voice predictions, and prepare everything nicely – in 7 days! Here’s how it went:


First and foremost, I designed a project plan after a brief look at the dataset. From my work experience and the assignments completed in the past three months, I’ve learned that this step is crucial for the successful development of a coding project. It helps me (and the team) organize ideas, break down the big project into smaller tasks, identify issues, and track progress – and not despair at the amount of work to be done in a short time. For this, I created a simple Kanban board directly in the GitHub repository of my project, so that I had the code and tasks in one place.


I used the RAVDESS dataset, which contains 1440 audio files. These are voice recordings of 24 actors (12 male, 12 female) who say two sentences at two intensities (normal and strong) with eight intonations that express different emotions: calm, happy, sad, angry, fearful, surprised, disgusted, and neutral. There are 192 recordings for each emotion, except for neutral, which has only 96 because it has no recordings in strong intensity. This made the dataset imbalanced, so I used random oversampling (the RandomOversample method) to create additional samples for the neutral class by duplicating existing ones. This added 96 new datapoints, so in the end I had 1536 samples to work with. There were also slightly more recordings by males and in normal intensity, but I didn’t deal with this imbalance because it wasn’t significant and it wouldn’t have affected my analysis, since the target I wanted to predict was the emotion.
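
As a rough illustration of the balancing step (not the project's actual code), random oversampling can be sketched with the standard library alone. The function name `random_oversample` and the integer placeholders standing in for feature vectors are assumptions made for this example; only the class counts come from the description above.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class has as many samples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # All existing samples of this class are candidates for duplication.
        pool = [s for s, l in zip(samples, labels) if l == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels


# Class counts as described above: 7 emotions x 192 recordings, neutral x 96.
emotions = ["calm", "happy", "sad", "angry", "fearful", "surprised", "disgusted"]
labels = [e for e in emotions for _ in range(192)] + ["neutral"] * 96
samples = list(range(len(labels)))  # placeholder ids instead of real features

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(len(labels), "->", len(balanced_labels))  # 1440 -> 1536
```

The duplicated samples carry no new information. If oversampling happens before the train/test split, copies of the same recording can end up in both sets, so it is usually safer to oversample the training portion only.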


From each audio file, I extracted the Mel-Frequency Cepstral Coefficients (MFCCs), which are typically used in automatic speech and speaker recognition.
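
A minimal sketch of this extraction step, assuming the librosa library is used (the post does not say which library was used). The function name `extract_mfcc`, the choice of 40 coefficients, and averaging over time to get one fixed-length vector per clip are all assumptions for illustration.

```python
def extract_mfcc(path, n_mfcc=40):
    """Return one fixed-length MFCC vector for the audio file at `path`."""
    import librosa  # third-party; imported here so the module loads without it

    # sr=None keeps the file's native sampling rate instead of resampling.
    signal, sample_rate = librosa.load(path, sr=None)
    # Result has shape (n_mfcc, number_of_frames).
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    # Average over time so clips of different lengths yield n_mfcc values each.
    return mfcc.mean(axis=1)
```

Calling `extract_mfcc("clip.wav")` (a hypothetical file name) would give a vector of 40 numbers that can be stacked with the vectors of the other clips into a feature matrix.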


I trained three different neural network models on the MFCCs and emotion labels:

  • Multi-Layer Perceptron (MLP)
  • Convolutional Neural Network (CNN) with 2 Conv1D layers, 2 Dense layers, and 1 Dropout layer (rate 0.1).
  • Long Short-Term Memory (LSTM) with 2 LSTM layers and 1 Dense layer.
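
To make the CNN description more concrete, here is a hedged sketch in Keras. Only the layer types and counts (2 Conv1D, 2 Dense, 1 Dropout at 0.1), the Adam optimizer, and the 0.001 learning rate come from this post. The filter counts, kernel size, layer order, input shape, and the function name `build_cnn` are assumptions, not the project's actual architecture.

```python
def build_cnn(n_features=40, n_classes=8, learning_rate=0.001):
    """Build a small 1D CNN over a vector of MFCC features (illustrative only)."""
    import tensorflow as tf  # third-party; imported here so the module loads without it

    layers = tf.keras.layers
    model = tf.keras.Sequential([
        # Each clip is assumed to be a vector of n_features MFCC values.
        tf.keras.Input(shape=(n_features, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.Dropout(0.1),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        # One output probability per emotion.
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",  # expects integer emotion labels
        metrics=["accuracy"],
    )
    return model
```

With this setup, the feature matrix would need the shape (number of clips, 40, 1) before being passed to `model.fit`.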

After several iterations of tweaking the hyperparameters, I found that the models generally performed better with a low learning rate (0.001), the Adam optimizer, and fewer layers. Here is how the models did on the train and test sets.

All models overfit, so they didn’t generalize well to unseen data, but this is a common issue with neural networks and audio data. As expected, MLP had the lowest accuracy, since it’s the basic model (a simple feed-forward artificial neural network). CNN and LSTM had similar train accuracy (80%), but CNN performed better on test data (60%) than LSTM (51%). To give you some context, state-of-the-art models for speech classification have an accuracy of 70-80%, so I was quite happy with my CNN model accuracy.

The interesting part was to look at the actual vs. predicted emotions, to see which emotions were misclassified. From the confusion matrices of CNN and LSTM, I noticed that both models misclassified emotions that sound similar or are ambiguous (even for humans), like sad-calm or angry-happy.
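
For readers who have not built one before, a matrix of actual vs. predicted labels can be computed in a few lines. The sketch below uses made-up toy labels, not the project's results, and the helper names `confusion_matrix` and `most_confused` are chosen for this example.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual emotions, columns are predicted emotions."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def most_confused(actual, predicted, labels):
    """Return (count, actual_label, predicted_label) for the largest
    off-diagonal cell, i.e. the most frequent misclassification."""
    matrix = confusion_matrix(actual, predicted, labels)
    pairs = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j
    ]
    return max(pairs)


# Toy data for illustration only.
labels = ["calm", "sad", "happy", "angry"]
actual = ["sad", "sad", "sad", "calm", "calm", "happy", "angry", "angry"]
predicted = ["calm", "calm", "sad", "calm", "sad", "happy", "happy", "angry"]

for name, row in zip(labels, confusion_matrix(actual, predicted, labels)):
    print(f"{name:>6}: {row}")
print(most_confused(actual, predicted, labels))  # (2, 'sad', 'calm')
```

Large off-diagonal cells, such as sad predicted as calm in this toy example, are the pairs worth listening to again.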


The exciting part was to make predictions on new data. First, I tried the CNN and LSTM models on a couple of movie sound clips, and both models identified a plausible emotion. Next, I used the sounddevice library to record my voice and classify it in real time, and again both models recognized the emotion I expressed!
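
A hedged sketch of how a live prediction could be wired up with sounddevice. The clip length, sampling rate, label order, and the helper names `record_clip` and `top_emotion` are assumptions for illustration. The feature extraction and model call are only outlined in comments, because they depend on how the model was trained.

```python
def record_clip(seconds=3.0, sample_rate=22050):
    """Record a mono clip from the default microphone."""
    import sounddevice as sd  # third-party; imported here so the module loads without it

    frames = int(seconds * sample_rate)
    audio = sd.rec(frames, samplerate=sample_rate, channels=1)
    sd.wait()  # block until the recording has finished
    return audio[:, 0], sample_rate  # drop the channel axis


def top_emotion(probabilities, labels):
    """Pick the label with the highest predicted probability."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]


# Outline of a live prediction (assumes a trained `model` and the same
# MFCC features that were used for training):
#   signal, sr = record_clip()
#   features = ...  # extract MFCCs from `signal` exactly as for the training data
#   probabilities = model.predict(features)[0]
#   print(top_emotion(probabilities, EMOTION_LABELS))  # EMOTION_LABELS is hypothetical
```

The label list passed to `top_emotion` has to be in the same order as the class indices used during training, otherwise the printed emotion will be wrong even when the model is right.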


Cool as it may be, there are several limitations and aspects of this project that I’d like to improve:

  • Try other models (not necessarily neural networks).
  • Extract other audio features to see if they are better predictors than the MFCC.
  • Train on larger datasets, since about 1,500 files with only about 200 samples per emotion are not enough.
  • Train on natural data, that is, on recordings of people speaking in unstaged situations, so that the emotional speech sounds more realistic.
  • Train on more diverse data, that is, on recordings of people from different cultures who speak different languages. This is because the expression of emotions varies across cultures and is also influenced by individual experiences.
  • Combine speech with facial expressions and text (speech-to-text) for multimodal sentiment analysis.
