Week 7 (09.11.-13.11.)
- Topic: Data Engineering Pipeline
- Lessons: APIs, Docker(-Compose), MongoDB, ETL, Airflow, Logging, Sentiment Analysis, Slackbot
- Project: Build a pipeline that collects tweets, stores them in a Mongo database, applies sentiment analysis, loads them into a Postgres database, and has a Slackbot post them in a Slack channel.
- Dataset: Tweets collected with the Twitter API
- Code: GitHub
This week’s project was the most complex and difficult I’ve worked on! Here’s the pipeline of the whole process (of course I could’ve created a nicer image, but it’s more authentic to see my messy hand-written notes):
Creating a Docker-Compose pipeline
The image above represents the pipeline I had to build with Docker-Compose. Each rectangle represents a container, so I had a pipeline with five containers. Below I’ll explain what I put in each of them.
The first step was to collect tweets, using the Twitter API and the tweepy library for processing the data with Python. It took me only a couple of minutes to register an app on Twitter and get my credentials (API key and access token). Then in Python I wrote a function that streams live tweets that contain the hashtag #OnThisDay and collects the tweet text, the user’s handle, their location, and description.
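To make the "collect" step concrete, here is a minimal sketch of the field extraction only. It is not the project's actual code: it assumes each streamed tweet arrives as a Twitter API v1.1 style JSON payload (with `text` and a nested `user` object), and the function names `extract_tweet_fields` and `has_hashtag` are made up for illustration. The streaming connection itself is left out.

```python
def extract_tweet_fields(status: dict) -> dict:
    """Keep only the four fields the pipeline stores.

    Assumes a v1.1-style payload: top-level "text" plus a "user" object
    with "screen_name", "location", and "description".
    """
    user = status.get("user") or {}
    return {
        "text": status.get("text", ""),
        "handle": user.get("screen_name"),
        "location": user.get("location"),
        "description": user.get("description"),
    }


def has_hashtag(status: dict, hashtag: str = "OnThisDay") -> bool:
    """Case-insensitive check that the tweet text contains #hashtag."""
    return f"#{hashtag}".lower() in status.get("text", "").lower()


# Hypothetical sample payload, trimmed to the fields used above.
sample = {
    "text": "#OnThisDay in 1989 the Berlin Wall fell.",
    "user": {
        "screen_name": "history_fan",
        "location": "Berlin",
        "description": "I tweet about history.",
    },
}

if has_hashtag(sample):
    record = extract_tweet_fields(sample)
    print(record)
```

Keeping the extraction separate from the streaming code makes it easy to test without Twitter credentials.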
Storing tweets in MongoDB
This step is the second Docker container in my pipeline. The collected tweet information is stored in MongoDB, using the pymongo library. MongoDB is a non-relational (NoSQL) database that stores data in JSON-like documents. Since the tweet data arrives as key-value pairs, MongoDB is a natural fit for it.
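The storage step could be sketched roughly like this. The host name `mongodb` (a guess at the Docker-Compose service name), the database name `twitter`, and the collection name `tweets` are all assumptions, as are the helper names; only `MongoClient` and `insert_one` are real pymongo calls.

```python
def connect_collection(host: str = "mongodb", port: int = 27017):
    """Return a handle to the tweets collection.

    Host, database, and collection names are assumptions for this sketch.
    pymongo is imported here so the rest of the module works without it.
    """
    import pymongo

    client = pymongo.MongoClient(host=host, port=port)
    return client["twitter"]["tweets"]


def store_tweet(collection, tweet: dict):
    """Insert one tweet as a document and return its generated id.

    A copy is inserted because insert_one adds an "_id" key to the
    dict it receives.
    """
    result = collection.insert_one(dict(tweet))
    return result.inserted_id


# Usage, with a MongoDB container running:
#   tweets = connect_collection()
#   store_tweet(tweets, {"text": "#OnThisDay ...", "handle": "history_fan"})
```

Because a tweet is already a dictionary of key-value pairs, it goes into the collection as-is, with no schema to define first.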
ETL-ing tweets from MongoDB to Postgres
ETL stands for Extract, Transform, Load – basically copying data from one or more sources into a destination system which represents the data differently from the source. This process represents the third container in my pipeline. I extracted the tweet texts from MongoDB with pymongo, then evaluated the sentiment score of each tweet with VADER and TextBlob, then loaded this new data (tweet and sentiment score) into a Postgres database using SQLAlchemy.
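As a rough sketch of the Transform and Load steps (not the project's actual code): the table name `tweets`, its two columns, and the helper names are assumptions, and only VADER's compound score is used here, leaving TextBlob out for brevity.

```python
def vader_scorer():
    """Build a function mapping text to VADER's compound score (-1 to 1)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def transform(docs, score):
    """Turn MongoDB documents into rows of tweet text plus sentiment score."""
    rows = []
    for doc in docs:
        text = doc.get("text")
        if not text:
            continue  # skip documents without tweet text
        rows.append({"text": text, "sentiment": score(text)})
    return rows


def load(rows, engine):
    """Write rows to Postgres through a SQLAlchemy engine.

    Table and column names are assumptions for this sketch.
    """
    from sqlalchemy import text

    with engine.begin() as conn:  # commits on success, rolls back on error
        conn.execute(text(
            "CREATE TABLE IF NOT EXISTS tweets (text TEXT, sentiment REAL)"
        ))
        if rows:
            conn.execute(
                text("INSERT INTO tweets (text, sentiment) "
                     "VALUES (:text, :sentiment)"),
                rows,
            )


# Usage, with both database containers running (names are placeholders):
#   rows = transform(mongo_collection.find(), vader_scorer())
#   load(rows, sqlalchemy.create_engine("postgresql://user:password@postgres/tweets"))
```

Passing the scoring function into `transform` keeps that step independent of any one sentiment library, so swapping VADER for TextBlob only changes the scorer.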
Creating a Slackbot
This was the crowning achievement of a week’s work: making a shareable product out of the data engineering process. I followed the helpful instructions in this article and, after some time spent tweaking the connection to the Postgres tweets database, my bot posted a message in our #channel about a historical event that happened on that day.
But don’t get excited too fast! I still have some things to figure out, like how to make the bot include the sentiment score along with the tweet, how to schedule it to post at a regular interval, and maybe how to pick only the latest or the most positive tweet. I’ll try to solve these issues using Airflow and dive deeper into data engineering. Though it’s a complicated process, I realized I really enjoyed building data pipelines and taking care of data storage!
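Two of those open points, picking the most positive tweet and including its score in the message, could be approached along these lines. This is only a sketch under assumptions: it posts through a Slack incoming webhook, which may not be the method the article used, and the row layout (`text`, `sentiment`) and function names are hypothetical. Scheduling is not covered.

```python
import json
import urllib.request


def most_positive(rows):
    """Return the row with the highest sentiment score, or None if empty."""
    return max(rows, key=lambda row: row["sentiment"]) if rows else None


def build_payload(row):
    """Format one tweet and its score as a Slack message payload."""
    return {"text": f'{row["text"]} (sentiment: {row["sentiment"]:+.2f})'}


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Usage (rows would come from the Postgres tweets table; the webhook URL
# is a secret created in the Slack app settings):
#   best = most_positive(rows)
#   if best is not None:
#       post_to_slack(webhook_url, build_payload(best))
```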
Friday Lightning Talk
This week I focused on the Transform part of the project, which was about Sentiment Analysis. When we learned about the VADER library, some colleagues asked how the training data is collected and who rates the training texts. So I decided to present one of my dearest personal projects, in which I did exactly that: I created a list of emotion verbs and asked native speakers to rate their valence, arousal, and duration. You can read more about this project (and data analysis) on GitHub and in this blog post from a student conference.