I used to print out IMDb’s list of the top 250 movies and cross out the movies I’d already seen. I’d then pick four or five of those I hadn’t seen and try to find them at my local video rental store. My brother humored me on more than one occasion, sitting through some extremely long and boring movies. I’m looking at you, Lawrence of Arabia.
I’ve found the older movies harder to get into. I suspect this is due to the technological limits of the time, the historical knowledge some of them assume, or just my inability to connect with characters from another period. This got me interested in how the top 250 is distributed over time. Specifically, which years have the most movies in the top 250? Using R, we can scrape the information from the IMDb website and analyze the data to answer the question.
In R, we first load the packages that we’ll need.
library(tidyverse)
library(rvest)
library(janitor)

theme_set(theme_minimal())
The first step is to use the rvest package to scrape the information from IMDb. Then, we convert the data into an R data frame.
url % html_table() %>% as.data.frame()
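As a minimal sketch of the scrape-and-convert step, the snippet below parses a small inline HTML table instead of the live IMDb page, so it runs offline. The markup, the column headers, the placeholder film, and the name `movies_raw` are assumptions for illustration, not IMDb's actual page structure or the names used in this post. Only the Usual Suspects row reuses values shown later in the post.

```r
library(rvest)

# A tiny stand-in for the chart page. The markup and headers are assumed
# for illustration; the "Example Film A" row is a made-up placeholder.
page <- read_html('
  <table>
    <tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr>
    <tr><td>1. Example Film A (1994)</td><td>9.0</td></tr>
    <tr><td>26. The Usual Suspects (1995)</td><td>8.5</td></tr>
  </table>
')

# Pick the table node, parse it into a tibble, then convert to a data frame.
movies_raw <- page %>%
  html_element("table") %>%
  html_table() %>%
  as.data.frame()

print(movies_raw)
```

Against the real site, the same pattern would start from `read_html()` on the chart URL, with a selector that matches whatever table the page actually serves.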
Once the data is in a data frame, we see that we need to do some data cleaning. Of all the available information, I only care about the movie title, the rank in the top 250, the year of release and the overall rating. These four variables are actually stored in two columns, so we need to separate them.
First we use the clean_names() function from the janitor package to convert all variable names to snake case. Then we have to separate the rank_title variable into two variables, title and year.
Each variable needs a little work. Year needs to be parsed into a number, which drops the closing parenthesis. Then, for graphing purposes, it needs to be treated as a factor. We need to drop the rank, the period and the newline characters (“\n”) from the title variable. Lastly, we need to make a new variable called rank that is based on the row number.
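# movies_raw: assumed name for the scraped data frame from the previous step
movies_processed <- movies_raw %>%
  clean_names() %>%
  # split "rank. title (year)" at the opening parenthesis
  separate(rank_title, c("title", "year"), sep = "\\(") %>%
  mutate(
    year = as.factor(parse_number(year)),        # "1995)" becomes 1995, stored as a factor
    title = str_replace(title, "\\d+\\.", ""),   # drop the leading rank and its period
    title = str_replace_all(title, "\\n", ""),   # drop newline characters
    rank = row_number()                          # rank follows the row order
  ) %>%
  select(rank, title, year, im_db_rating)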
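To see what each cleaning step does, here is a small sketch on a single string. The shape of `raw` (rank, newline, padded title, year in parentheses) is an assumption about how the scraped cell looks, and `str_squish()` is an extra step added here to trim leftover whitespace. It is not part of the pipeline above.

```r
library(readr)
library(stringr)

# One raw cell shaped like the scraped rank_title column (assumed format).
raw <- "26.\n      The Usual Suspects\n        (1995)"

# Split once at the opening parenthesis: title part, then year part.
parts <- str_split_fixed(raw, "\\(", 2)

# parse_number() keeps the number and ignores the trailing ")".
year <- parse_number(parts[, 2])

# Remove the leading rank and period, then collapse the leftover whitespace.
title <- str_squish(str_replace(parts[, 1], "^\\d+\\.", ""))

# An unescaped "." matches any character, so on a bare title that starts
# with digits it would eat part of the name; the escaped form leaves it alone.
loose  <- str_replace("12 Angry Men", "\\d+.", "")
strict <- str_replace("12 Angry Men", "\\d+\\.", "")

print(list(year = year, title = title, loose = loose, strict = strict))
```

Because `str_replace()` only touches the first match, the unescaped pattern still works when every title is preceded by its rank. Escaping the period just makes the intent explicit.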
Bar chart of top 10 years
Code for the bar graph.
movies_processed %>%
  group_by(year) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(10) %>%
  ggplot(aes(fct_reorder(year, count), count)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(title = "Top 10 years with most movies in the IMDb Top 250",
       x = "",
       y = "Number of movies") +
  theme(text = element_text(size = 20))
There looks to be some bias toward more recent years, as there are only two years in the top 10 that come before the mid-1990s. I’m guessing this is because there was no internet movie site to vote on before that time. I also assume voters skew younger and prefer films that were released more recently.
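One way to probe that hunch would be to count movies per decade rather than per year. The sketch below uses a handful of made-up years purely to show the mechanics. The `toy` data frame and the `by_decade` name are hypothetical, and the real check would use the year column of movies_processed. Because year is stored as a factor, it has to go through `as.character()` before becoming a number.

```r
library(dplyr)

# Made-up years for illustration only; swap in the year column from the
# processed data to run the real check.
toy <- tibble(year = factor(c(1957, 1994, 1995, 1995, 2003, 2014)))

by_decade <- toy %>%
  mutate(
    # as.numeric() on a factor returns level codes, so convert via character
    year_num = as.numeric(as.character(year)),
    decade = floor(year_num / 10) * 10
  ) %>%
  count(decade, name = "movies")

print(by_decade)
```

A strong tilt toward the later decades in the real data would be consistent with the recency bias suggested above, though it would not by itself separate voter age from the site's launch date.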
Here are the movies released in the year with the most movies, 1995.
movies_processed %>% filter(year == 1995) %>%
| rank | title | year | im_db_rating |
|------|-------|------|--------------|
| 26 | The Usual Suspects | 1995 | 8.5 |
The post What are the best years of the IMDB Top 250 movies? appeared first on Justin Shotwell.