Week 7 (09.11.-13.11.)
- Topic: Data Engineering Pipeline
- Lessons: APIs, Docker, MongoDB, ETL, Airflow, Logging, Sentiment Analysis, Slackbot
- Project: Build a pipeline that collects tweets, stores them in a Mongo database, applies sentiment analysis, loads the tweets into a Postgres database, then a Slackbot sends them in a slack channel.
- Dataset: Tweets collected with the Twitter API
- Code: GitHub
This week’s project was the most complex and difficult I’ve worked on. As you can tell from the description, it involved new tools and many steps. Here’s the pipeline of the whole process:
Creating a Docker-Compose pipeline
The image above represents the pipeline I had to build with Docker-Compose. Each rectangle represents a container, i.e. a part of the whole project. I had a pipeline with five containers and I’ll explain each of them below.
The first step was to collect tweets using the Twitter API and to process them with the tweepy library. It took me only a couple of minutes to register an app on Twitter and get my credentials (API key and access token). Then, I wrote a function in Python to stream live tweets with the hashtag #OnThisDay and collect the tweet text, the user’s handle, their location, and description.
Storing tweets in MongoDB
This step builds the second Docker container in my pipeline. The collected tweet information is then stored in MongoDB using the pymongo library. MongoDB is a non-relational database (no-SQL) that stores data in JSON-like documents. Since the tweet data is collected as key-value pairs, MongoDB is a good way to store this information.
ETL-ing tweets from MongoDB to Postgres
ETL stands for Extract, Transform, Load – basically copying data from one or more sources into a destination system which represents the data differently from the source. This process represents the third container in my pipeline. I extracted the tweet texts from MongoDB with pymongo, then evaluated the sentiment score of each tweet with VADER and TextBlob, and finally loaded this new data (tweet and sentiment score) into a Postgres database using SQLAlchemy.
Creating a Slackbot
This was the coronation of a week’s work: making a shareable product out of the data engineering process. I followed the helpful instructions in this article and, after some time of tweaking the connection to the Postgres tweets database, my bot sent a message in our #channel about a historical event that happened on that day.
But don’t get excited too fast! I still have some things to figure out, like how to make the bot include the sentiment score along with the tweet, how to schedule it to post in a given time interval, and maybe how to pick only the last or most positive tweet. I’ll try to solve these issues using Airflow and dive deeper into data engineering. Though it’s a complicated process, I realize I really enjoyed building data pipelines!
Friday Lightning Talk
This week I focused on the Transform part of the project, which was about Sentiment Analysis. When we learned about the VADER library, some colleagues asked how the training data is collected and who rates the training texts. To answer these questions, I decided to present one of my dearest personal projects in which I did exactly that: I created a list of emotion verbs and asked native speakers to rate their valence, arousal, and duration. You can read more about this project (and data analysis) on GitHub and in this blog post from a student conference.