Tardy Tuesday: American Idol

tidy tuesday
American Idol










July 24, 2024

This session looked at data on American Idol. Abram had already made a head start on the analysis, so (with some encouragement) he led the session:


Loading the package
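The package-loading code is not shown in this write-up. As a sketch, the set below is an assumption inferred from the functions used later in the post (for example `%<>%`, `mdy()` and `ggplotly()`), not necessarily the list used in the session:

```r
# Assumed package list, inferred from the functions called in this post
library(dplyr)      # mutate(), group_by(), summarise(), full_join(), %>%
library(ggplot2)    # ggplot(), geom_point(), facet_wrap(), ...
library(readr)      # read_csv() (the post calls it as readr::read_csv())
library(magrittr)   # %<>% assignment pipe
library(lubridate)  # mdy() for parsing month-day-year strings
library(plotly)     # ggplotly() for interactive plots
```

Note that dplyr re-exports `%>%` but not `%<>%`, so magrittr has to be attached explicitly.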

Loading the data

auditions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/auditions.csv')
eliminations <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/eliminations.csv')
finalists <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/finalists.csv')
ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/ratings.csv')
seasons <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/seasons.csv')
songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/songs.csv')

Abram made use of the slightly exotic %<>% assignment pipe from magrittr, which pipes the object on its left-hand side into the expression and then assigns the result back to that same object.
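As a minimal illustration on a throwaway vector (the name `x` is made up for this example), the assignment pipe is shorthand for piping and then re-assigning:

```r
library(magrittr)

x <- c(4, 9, 16)

# Equivalent to: x <- x %>% sqrt()
x %<>% sqrt()

print(x)
```

The trade-off is that `%<>%` overwrites the object in place, so re-running a chunk applies the transformation again to the already-modified data.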

Some data tidying and basic exploration:

songs %<>% mutate(artist = if_else(artist == "*NSYNC", "NSYNC", artist))
songs_n <- songs %>% group_by(artist, song) %>% summarise(n = n()) %>% arrange(-n)
artists_n <- songs %>% group_by(artist) %>% summarise(n = n()) %>% arrange(-n)
winning_songs <- songs %>% group_by(artist, song, result) %>% summarise(n = n())
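
The group-then-count pattern used above can also be written with dplyr's `count()`. The sketch below uses a small made-up table (`songs_toy`, with placeholder artist and song labels) purely to show that the two forms give the same counts:

```r
library(dplyr)

# Toy stand-in for the songs data; values are placeholders, not real entries
songs_toy <- tibble(
  artist = c("A", "A", "B", "A"),
  song   = c("x", "x", "y", "z")
)

# The pattern used in the session (with .groups = "drop" to silence the message)
by_hand <- songs_toy %>%
  group_by(artist, song) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

# Shorthand: count() groups, tallies and (with sort = TRUE) orders by n
with_count <- songs_toy %>% count(artist, song, sort = TRUE)
```
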

First we looked at viewing figures by show number, with one panel per season:

ratings %>% filter(!is.na(viewers_in_millions)) %>%
  ggplot(aes(x = show_number, y = viewers_in_millions)) + geom_point() + geom_line() + facet_wrap(vars(season), scales = "free_y")

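Then some preparation of the airdate column to get it into date format. Season 13 airdates have ", 2014" appended before parsing, presumably because they lack the year: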

ratings %<>% mutate(airdate = if_else(season == 13, paste0(airdate, ", 2014"), airdate),
                    proper_airdate = mdy(airdate))
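To see what this step does in isolation, here is a small sketch with made-up airdate strings. The assumption (not confirmed by the post) is that season 13 dates look like "January 15" with no year:

```r
library(lubridate)

# Hypothetical examples: one complete date, one missing its year
season  <- c(1, 13)
airdate <- c("June 11, 2002", "January 15")

# Append the year only where it is missing, then parse month-day-year
fixed  <- ifelse(season == 13, paste0(airdate, ", 2014"), airdate)
parsed <- mdy(fixed)
```

Without the appended year, `mdy()` has no year to parse for those rows, which is why the fix is applied before the conversion.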

Then a visualisation of viewers over time:

ratings %>% ggplot(aes(x = proper_airdate, y = viewers_in_millions)) + geom_point() +
  expand_limits(y = 0) + stat_smooth()
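
Next, a count of the rows in ratings for each season, with each season's share of the total (season 14 does not appear):

 ratings$season  n    percent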
              1 25 0.04215852
              2 41 0.06913997
              3 44 0.07419899
              4 43 0.07251265
              5 41 0.06913997
              6 41 0.06913997
              7 42 0.07082631
              8 40 0.06745363
              9 43 0.07251265
             10 39 0.06576728
             11 40 0.06745363
             12 37 0.06239460
             13 39 0.06576728
             15 24 0.04047218
             16 19 0.03204047
             17 19 0.03204047
             18 16 0.02698145
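
The call that produced this table is not shown; the layout resembles the output of a frequency-table helper such as janitor's `tabyl()`. As a hedged alternative that needs only dplyr, the same counts and proportions can be sketched like this (on a made-up `ratings_toy` table):

```r
library(dplyr)

# Toy stand-in for the ratings data: one row per rated episode
ratings_toy <- tibble(season = c(1, 1, 1, 2, 2, 3))

# Rows per season, plus each season's share of all rows
season_counts <- ratings_toy %>%
  count(season) %>%
  mutate(percent = n / sum(n))
```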

Now the average viewers per season:

average_views <- ratings %>% group_by(season) %>% summarise(avg_views = mean(viewers_in_millions, na.rm = TRUE))

We saw a jump at the very end of most seasons, so we decided to look at how big this jump was in proportional terms (final episode relative to the one before it):

rel_views <- ratings %>% group_by(season) %>% slice_tail(n=2) %>%
  summarise(relative_views = viewers_in_millions[2]/viewers_in_millions[1])
rel_views %>% ggplot(aes(x = season, y = relative_views)) + geom_point()
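
A small worked example may help make the ratio concrete. The data below are invented, and the sketch sorts by show number explicitly (an extra step, not in the session code) so the result does not depend on row order:

```r
library(dplyr)

# Made-up viewing figures for two short seasons
toy <- tibble(
  season              = c(1, 1, 1, 2, 2, 2),
  show_number         = c(1, 2, 3, 1, 2, 3),
  viewers_in_millions = c(10, 8, 12, 20, 15, 18)
)

rel_toy <- toy %>%
  group_by(season) %>%
  arrange(show_number, .by_group = TRUE) %>%  # make "last two" well defined
  slice_tail(n = 2) %>%                       # keep the final two episodes
  summarise(relative_views = last(viewers_in_millions) / first(viewers_in_millions))
```

A value above 1 means the final episode drew more viewers than the one before it; a missing value in either of the last two rows gives `NA` for that season.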

Then a plot of average viewers by season:

average_views %>% ggplot(aes(season, avg_views)) + geom_line() + expand_limits(y = 0)

Next we joined the average views (over the whole season) to the jump at the end (rel_views) to see whether there was any obvious relationship:

# Name the join key explicitly rather than relying on the automatic "Joining with `by = join_by(season)`" message
full_join(average_views, rel_views, by = "season") %>% ggplot(aes(x = avg_views, y = relative_views)) + geom_point()
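
Alongside the scatter plot, a quick numeric check is possible. The sketch below uses invented per-season summaries (the `_toy` tables are hypothetical) to show how `full_join()` keeps seasons present in only one table, and how a correlation can skip the resulting `NA` rows:

```r
library(dplyr)

# Made-up summaries; season 4 and season 5 each appear in only one table
average_views_toy <- tibble(season = c(1, 2, 3, 4), avg_views = c(12, 25, 20, 10))
rel_views_toy     <- tibble(season = c(1, 2, 3, 5), relative_views = c(1.5, 1.2, 1.3, 1.1))

joined <- full_join(average_views_toy, rel_views_toy, by = "season")

# Correlation over seasons that have both values
r <- cor(joined$avg_views, joined$relative_views, use = "complete.obs")
```

With so few seasons in the real data, a correlation like this is only a rough guide and should be read alongside the plot.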


Another pattern we saw was that the first episode of a season tended to be close to the most watched, with a drop-off over the rest of the season:

ratings %<>% group_by(season) %>% arrange(show_number) %>%
  mutate(share_of_first = viewers_in_millions / viewers_in_millions[1])
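
To illustrate the calculation, here is a sketch on invented figures (the `toy` table is hypothetical). It uses `first()` in place of `[1]` and ungroups at the end, which are small stylistic variations on the session code:

```r
library(dplyr)

# Made-up data, deliberately out of order within season 1
toy <- tibble(
  season              = c(1, 1, 1, 2, 2),
  show_number         = c(2, 1, 3, 1, 2),
  viewers_in_millions = c(8, 10, 12, 20, 15)
)

shares <- toy %>%
  group_by(season) %>%
  arrange(show_number, .by_group = TRUE) %>%   # first row per season = first show
  mutate(share_of_first = viewers_in_millions / first(viewers_in_millions)) %>%
  ungroup()
```

If the first episode of a season has no viewing figure, every share for that season becomes `NA`, which is worth checking before plotting.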

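ratings %>%
  ggplot(aes(show_number, share_of_first, group = season, color = as.factor(season))) +
  geom_point()  # assumed layer: the original code is cut off after the "+"; geom_point() mirrors the final plot below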

Finally, we looked at an interactive visualisation using the ggplotly() convenience function from the plotly package:

gg <- ratings %>% filter(season >= 3) %>%
  ggplot(aes(show_number, share_of_first, group = season, color = as.factor(season))) +
  geom_point() + scale_y_log10()

# Convert the static ggplot object into an interactive plotly widget
ggplotly(gg)
