Tidy Tuesday: Christmas films

Authors

Tom Fowler

Nick Christofides

Andrew Saul

Jon Minton

Published

December 21, 2023

A shorter and even tardier Tidy Tuesday this week: we gave ourselves only half an hour, rather than the usual hour, to look over the most recent dataset.

The dataset was about Christmas films.

Our first question: is Die Hard a Christmas film?

Not according to the methods used to produce the dataset. If a film doesn’t have Christmas or equivalent in its title, it’s not coming in!
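As a rough illustration of a title-based inclusion rule like this, here is a minimal sketch. The titles are hypothetical examples typed in by hand, not rows taken from the dataset, and the keyword list is an assumption; the actual rule used to build the dataset may use different terms.

```r
library(tibble)
library(dplyr)
library(stringr)

# Hand-typed example titles, for illustration only (not read from the dataset)
films <- tibble(
  title = c("Die Hard", "A Christmas Carol", "Holiday Inn", "Home Alone")
)

# Assumed keyword list; the real inclusion rule may use other terms
title_pattern <- regex("christmas|holiday", ignore_case = TRUE)

# Flag the titles that would pass a keyword-in-title filter
films <- films %>%
  mutate(in_dataset = str_detect(title, title_pattern))

films
```

Under this kind of rule a film gets in on its title alone, however Christmassy the plot is.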

library(tidyverse)

tt <- tidytuesdayR::tt_load('2023-12-12')
--- Compiling #TidyTuesday Information for 2023-12-12 ----
--- There are 2 files available ---
--- Starting Download ---
    Downloading file 1 of 2: `holiday_movies.csv`
    Downloading file 2 of 2: `holiday_movie_genres.csv`
--- Download complete ---
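# The two downloaded files, in download order
df1 <- tt[[1]]  # holiday_movies.csv
df2 <- tt[[2]]  # holiday_movie_genres.csv (not used below)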
df1 %>%
  count(year, sort = TRUE)
# A tibble: 91 × 2
    year     n
   <dbl> <int>
 1  2021   183
 2  2022   173
 3  2020   172
 4  2019   143
 5  2018   129
 6  2023   107
 7  2017   102
 8  2015    76
 9  2016    75
10  2012    68
# ℹ 81 more rows
# Films per year, with counts on a log scale
df1 %>%
  count(year) %>%
  ggplot(aes(x = year, y = n)) +
  geom_point() +
  scale_y_log10()

# The same plot, restricted to 1960 onwards
df1 %>%
  filter(year >= 1960) %>%
  count(year) %>%
  ggplot(aes(x = year, y = n)) +
  geom_point() +
  scale_y_log10()

# Add a linear fit; the fit is made on the log10-scaled counts
df1 %>%
  filter(year >= 1960) %>%
  count(year) %>%
  ggplot(aes(x = year, y = n)) +
  geom_point() +
  stat_smooth(method = "lm") +
  scale_y_log10()
`geom_smooth()` using formula = 'y ~ x'
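A straight line on a log10 y axis corresponds to exponential growth, because ggplot2 applies the scale transformation before fitting the model. A minimal sketch of reading a growth rate off that kind of fit is below. The counts are synthetic (constructed to double every 10 years), not the real film counts, and the names `toy`, `annual_growth` and `doubling_years` are ours.

```r
# Synthetic yearly counts that double every 10 years (NOT the real data)
toy <- data.frame(year = 1960:2020)
toy$n <- 10 * 2^((toy$year - 1960) / 10)

# The same kind of model stat_smooth(method = "lm") fits once the
# y axis is log10-scaled: log10(count) as a linear function of year
fit <- lm(log10(n) ~ year, data = toy)
slope <- unname(coef(fit)["year"])

# Convert the slope into more readable quantities
annual_growth <- 10^slope - 1          # proportional growth per year
doubling_years <- log10(2) / slope     # years for the count to double

c(annual_growth = annual_growth, doubling_years = doubling_years)
```

Swapping `toy` for the yearly counts from `df1` would give the equivalent numbers for the Christmas films, with the caveat that 2023 may be incomplete.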

Questions

  • How are they released: cinema or streaming?
  • Is it on IMDb?
  • Is 2023 fully included?
  • Are more recent films rubbish?
# Mean rating of the films released each year
df1 %>%
  group_by(year) %>%
  summarise(avg_rating = mean(average_rating)) %>%
  ggplot(aes(x = year, y = avg_rating)) +
  geom_point()

  • number of films vs average rating
  • years with fewer films may produce more extreme average ratings

# Mean rating against the number of films released that year
df1 %>%
  group_by(year) %>%
  summarise(
    avg_rating = mean(average_rating),
    n_films = n()
  ) %>%
  ggplot(aes(x = n_films, y = avg_rating)) +
  geom_point()
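The idea that years with fewer films may show more extreme averages can be checked with a small simulation. This is only a sketch under stated assumptions: ratings are drawn from a normal distribution with mean 6 and standard deviation 1 (made-up values, not estimated from the dataset), and `simulate_yearly_means` is a hypothetical helper of ours.

```r
set.seed(2023)

# Simulate the mean rating for many hypothetical "years", each with n_films
# films whose ratings come from the same assumed distribution
simulate_yearly_means <- function(n_films, n_years = 1000) {
  replicate(n_years, mean(rnorm(n_films, mean = 6, sd = 1)))
}

# Spread of yearly averages when a year has few films vs many films
spread_small <- sd(simulate_yearly_means(3))
spread_large <- sd(simulate_yearly_means(150))

c(few_films = spread_small, many_films = spread_large)
```

Even though every simulated film comes from the same distribution, the yearly averages vary far more when each year has only a few films, roughly in line with sd / sqrt(n). A funnel shape in the plot above would therefore not by itself mean that older films are better or worse.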