What are the best years of the IMDB Top 250 movies?

I used to print out IMDb’s list of the top 250 movies and cross out the movies I’d already seen. I’d then pick four or five of those I hadn’t seen and try to find them at my local video rental store. My brother humored me on more than one occasion, sitting through some extremely long and boring movies. I’m looking at you, Lawrence of Arabia.

I’ve found the older movies harder to get into. I suspect this is due to the technological abilities of the time or a necessary knowledge of historical significance or just my inability to connect with characters from another period. This got me interested in how the top 250 is distributed over time, specifically which years have the most movies in the list. Using R, we can scrape the information from the IMDb website and analyze the data to answer the question.

In R, we first load the packages that we’ll need.

library(tidyverse)
library(rvest)
library(janitor)

theme_set(theme_minimal())

The first step is to use the rvest package to scrape the information from IMDb. Then, we convert the data into an R data frame.

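# Reconstructed; the URL, `movies`, and the read_html()/html_node() steps are assumptions.
url <- "https://www.imdb.com/chart/top"
movies <- read_html(url) %>%
  html_node("table") %>%
  html_table() %>%
  as.data.frame()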
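
To see what html_table() hands back without hitting the live site, here is a minimal sketch on a hand-written HTML snippet. The snippet and the names toy_page and toy_table are hypothetical; the two rows reuse titles and ratings from the 1995 list later in this post, and the markup only imitates a two-column chart table, not IMDb's actual page.

```r
library(rvest)

# Hypothetical HTML imitating a two-column chart table (not IMDb's real markup)
toy_page <- minimal_html('
  <table>
    <tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr>
    <tr><td>21. Se7en (1995)</td><td>8.6</td></tr>
    <tr><td>26. The Usual Suspects (1995)</td><td>8.5</td></tr>
  </table>
')

# Select the first <table> node, parse it into a tibble,
# then convert it to a base data frame
toy_table <- toy_page %>%
  html_element("table") %>%
  html_table() %>%
  as.data.frame()

# Inspect the column names and types that came out of the parse
str(toy_table)
```

The header cells become the column names, which is why the rank, title and year all end up packed into a single column that has to be split apart afterwards.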

Once the data is in a data frame, we see that we need to do some data cleaning. Of all the available information, I only care about the movie title, the rank in the top 250, the year of release and the overall rating. These four variables are actually stored in two columns, so we need to separate them.

First we use the clean_names() function from the janitor package to convert all variable names to snake case. Then we have to separate the rank_title variable into two variables, title and year.

Each variable needs a little work. Year needs to be parsed into a number, which drops the closing parenthesis. Then, for graphing purposes, it needs to be treated as a factor. We need to drop the rank, the period and the newline characters (“\n”) from the title variable. Lastly, we need to make a new variable called rank that is based on the row number.

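# `movies` is assumed to be the scraped data frame from the previous step
movies_processed <- movies %>%
  clean_names() %>%
  separate(rank_title, c("title", "year"), sep = "\\(") %>%
  mutate(year = as.factor(parse_number(year)),        # "1995)" -> 1995 -> factor
         title = str_replace(title, "^\\d+\\.", ""),  # drop the leading rank and its period
         title = str_replace_all(title, "\\n", ""),   # drop newline characters
         rank = row_number()) %>%
  select(rank, title, year, im_db_rating)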
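
The cleaning steps are easier to follow on a tiny input. The sketch below runs them on a hypothetical two-row data frame, toy_raw, that imitates the scraped layout; the titles and ratings come from the 1995 list later in this post, and the whitespace inside the strings is made up. It swaps in str_remove() and str_squish() for the title cleanup, a variation that also trims the leftover spaces, and it leaves year as a plain number.

```r
library(tidyverse)
library(janitor)

# Hypothetical stand-in for the scraped table (layout and whitespace are assumptions)
toy_raw <- data.frame(
  `Rank & Title` = c("21.\n      Se7en\n        (1995)",
                     "26.\n      The Usual Suspects\n        (1995)"),
  `IMDb Rating` = c(8.6, 8.5),
  check.names = FALSE
)

toy_clean <- toy_raw %>%
  clean_names() %>%                                  # snake-case the column names
  separate(rank_title, c("title", "year"), sep = "\\(") %>%
  mutate(year = parse_number(year),                  # "1995)" becomes 1995
         title = str_remove(title, "^\\d+\\."),      # drop the leading rank and period
         title = str_squish(title))                  # collapse newlines and extra spaces

toy_clean
```

Splitting on the opening parenthesis works here because each title is followed by exactly one parenthesized year; a title that itself contains a parenthesis would need a stricter pattern.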

Figure: Bar chart of the 10 most represented years in the IMDb Top 250.

Code for the bar graph.

movies_processed %>%
  group_by(year) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(10) %>%
  ggplot(aes(fct_reorder(year, count), count)) +
    geom_col(fill = "darkgreen") +
    coord_flip() +
    scale_y_continuous(expand = c(0, 0)) +
    labs(title = "Top 10 years with most movies in the IMDb Top 250",
         x = "",
         y = "Number of movies") +
    theme(text = element_text(size=20))
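
The counting part of that pipeline can also be written more compactly. The sketch below compares the group_by() and summarise() form used above with dplyr's count() shorthand on a hypothetical toy_years table; the years in it are made up purely to illustrate the step and are not taken from the IMDb data.

```r
library(dplyr)

# Hypothetical release years, made up purely to illustrate the counting step
toy_years <- tibble(year = factor(c(1995, 1995, 1995, 1957, 1957, 2003)))

# Long form: group, count rows per group, sort descending
long_form <- toy_years %>%
  group_by(year) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Shorthand: count() groups and tallies in one call;
# `name` sets the count column's name and `sort` orders it descending
short_form <- toy_years %>%
  count(year, name = "count", sort = TRUE)

# Check that both forms give the same table
all.equal(long_form, short_form)
```

Either form feeds into head(10) and ggplot() the same way, so the choice is mostly a matter of taste.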

There looks to be some bias towards more recent years, as there are only two years in the top 10 that come before the mid-90s. I’m guessing this is because no internet movie site was available before that time. I’m assuming voters tend to skew younger and prefer films that were released more recently.

Here are the movies released in the year with the most movies, 1995.

movies_processed %>%
  filter(year == 1995)

Rank  Title               Year  Rating
21    Se7en               1995  8.6
26    The Usual Suspects  1995  8.5
73    Braveheart          1995  8.3
91    Toy Story           1995  8.3
122   Heat                1995  8.2
144   Casino              1995  8.2
206   Before Sunrise      1995  8.1
229   La Haine            1995  8.0

The post What are the best years of the IMDB Top 250 movies? appeared first on Justin Shotwell.


