Web Scraping IMDb with R

Web scraping is a method of automatically gathering data from websites in a structured manner and saving it into a local database or spreadsheet. Why would you do this? Maybe you want to compare the prices of similar products from different companies, or generate leads for your sales team, or find trends in a certain area, or, like me, create a data frame of popular movies and do some data analysis for fun. There are many tools for web scraping: browser plug-ins (like Webscraper) and desktop applications (like ParseHub) are easy to use and don’t require coding, but if you need to go more in depth, use the Python libraries Beautiful Soup and Selenium, or the R package rvest. I used rvest for scraping IMDb, and you can find the commented code on my GitHub. Before I proceed to the fun part, note that the legality of web scraping is not clearly defined around the world, so you should check the website’s terms of use before scraping it!

So let’s dive in. My goal was to see which movies released in 2018 were the most successful, what genres they belonged to, and how long they ran. I took IMDb as a reference: on the website I selected the titles released between 1 January and 31 December 2018, sorted them by popularity, and kept only the first page of results, so the top 50 titles. The five most popular movies of 2018 were:

  1. Aquaman
  2. Green Book
  3. Bohemian Rhapsody
  4. Spider-Man: Into the Spider-Verse
  5. Avengers: Infinity War
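
If you want to try the scraping step yourself, here is a minimal sketch with rvest. It is not the code from the GitHub repository: the HTML snippet, the class names (movie, title, runtime, genre) and the example titles are all made up for illustration, and IMDb's real markup is different and changes over time, so you would need to inspect the live page to find the right selectors.

```r
library(rvest)

# Stand-in for a downloaded results page. The markup and class names are
# hypothetical; on a real site you would call read_html() on the page URL.
page <- minimal_html('
  <div class="movie">
    <h3 class="title">Example Movie One</h3>
    <span class="runtime">120 min</span>
    <span class="genre">Action, Adventure</span>
  </div>
  <div class="movie">
    <h3 class="title">Example Movie Two</h3>
    <span class="runtime">95 min</span>
    <span class="genre">Drama</span>
  </div>
')

# One node per title, then one field per node. html_element() returns a
# missing node when a field is absent, so the columns stay aligned.
items <- html_elements(page, "div.movie")

movies <- data.frame(
  title   = html_text2(html_element(items, ".title")),
  runtime = html_text2(html_element(items, ".runtime")),
  genre   = html_text2(html_element(items, ".genre")),
  stringsAsFactors = FALSE
)

print(movies)
```

Selecting the container nodes first and then pulling each field out of them is a little more verbose than selecting all titles, all runtimes and all genres separately, but it avoids misaligned rows when one title lacks a field.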

At a glance, I noticed that 3 out of 5 are action-hero movies, so I looked closer at the genre distribution.

[Figure: genres_count]

The initial observation was confirmed: Action and Drama are the most popular genres, followed by Biography. I guess most people enjoy, on the one hand, movies that transport them into wild worlds and simulate experiences out of the ordinary, and on the other hand, movies that depict dramatic life stories and relate to some extent to their own lives.
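
A genre tally like this can be sketched in a few lines of base R. The genre strings below are toy values, and I am assuming that each title carries a comma-separated list of genres and is counted once for every genre it lists.

```r
# Toy data: one comma-separated genre string per title (made-up values).
genre <- c("Action, Adventure", "Drama", "Action, Drama", "Biography, Drama")

# Split on commas, trim the whitespace, and flatten into one vector of labels.
genre_labels <- trimws(unlist(strsplit(genre, ",")))

# Count each genre and sort from most to least frequent.
genres_count <- sort(table(genre_labels), decreasing = TRUE)
print(genres_count)

# Bar chart of the counts; las = 2 turns the labels so they do not overlap.
barplot(genres_count, las = 2, ylab = "Number of titles")
```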

Next, I looked at the distribution of runtimes and found that the top 50 titles ran 104 minutes on average (median 117 minutes). The longest movie was Avengers: Infinity War (149 minutes) and the shortest movie (excluding TV shows) was A.I. Rising (85 minutes). The bars on the left of the histogram represent the TV shows (under 60 minutes), which is likely why the mean falls below the median.

[Figure: hist_runtime]
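
Runtimes usually arrive as text, so they have to be converted to numbers before you can summarise them. Here is a rough sketch; I am assuming strings of the form "149 min", and the four values are toy data rather than the scraped set.

```r
# Toy runtime strings (made-up sample, not the scraped data).
runtime_raw <- c("149 min", "85 min", "45 min", "117 min")

# Drop everything that is not a digit, then convert to numeric minutes.
runtime <- as.numeric(gsub("[^0-9]", "", runtime_raw))

# Summary statistics for the sample.
mean_runtime   <- mean(runtime)
median_runtime <- median(runtime)
print(c(mean = mean_runtime, median = median_runtime))

# Histogram with 15-minute bins; short entries show up on the left.
hist(runtime, breaks = seq(0, 180, by = 15),
     main = "Runtime distribution", xlab = "Runtime (minutes)")
```

Comparing the mean with the median is a quick way to spot a skewed distribution: a few very short entries pull the mean down much more than the median.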

I also broke down the runtime distribution by genre and found that Biographies were the longest (on average 127 minutes) and Crime movies the shortest (on average 85 minutes). This was not entirely surprising, since I think that, first, it is quite a challenge to pack a lifetime into a biographical movie, and second, there’s only so much nerve-racking tension a person can take while following a crime. However, I was expecting the average duration of Animations to be shorter than 110 minutes, because they are produced mainly for children, who have a short attention span and little patience to sit through a two-hour movie. But then again, we are talking about the most popular movies of last year on IMDb, which means that adults made up a large part of the audience.

[Figure: mean_genre_runtime]
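
For a per-genre breakdown, one possible approach is to give every title one row per genre and then average within each genre. This is a sketch on toy data; counting a multi-genre title in each of its genres is my assumption, and other choices (such as using only the first listed genre) would give different averages.

```r
# Toy data: runtime in minutes and a comma-separated genre string per title.
movies <- data.frame(
  runtime = c(130, 90, 110, 120),
  genre   = c("Biography, Drama", "Crime", "Action, Crime", "Drama"),
  stringsAsFactors = FALSE
)

# Split the genre strings and repeat each runtime once per genre.
genre_list <- lapply(strsplit(movies$genre, ","), trimws)
long <- data.frame(
  genre   = unlist(genre_list),
  runtime = rep(movies$runtime, lengths(genre_list)),
  stringsAsFactors = FALSE
)

# Average runtime per genre, sorted from longest to shortest.
mean_genre_runtime <- aggregate(runtime ~ genre, data = long, FUN = mean)
mean_genre_runtime <- mean_genre_runtime[order(-mean_genre_runtime$runtime), ]
print(mean_genre_runtime)
```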

It would be interesting to also look at the total gross and see which movies and genres sold best in 2018. You could try to scrape and analyze this information with your preferred tool and let me know what you find 🙂

With all this in mind, I’m heading to watch the 36th movie and only documentary in IMDb’s Top 2018: Free Solo. Some realistic action and drama, for once.
