Tardy Tuesday: American Idol

tidy tuesday
R
American Idol
Authors

Abram

Kate

Andrew

Imran

Nic

Kennedy

Aoife

Jon

Published

July 24, 2024

This session looked at data on American Idol. Abram had already made a head start with the analysis, so (with some encouragement) he led the session:

Preparation

Loading the packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(magrittr)

Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(lubridate)
# install.packages("styler")
# library(styler)

Loading the data

auditions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/auditions.csv')
Rows: 142 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): audition_city, audition_venue, episodes, episode_air_date, callbac...
dbl  (2): season, tickets_to_hollywood
date (4): audition_date_start, audition_date_end, callback_date_start, callb...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eliminations <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/eliminations.csv')
Rows: 456 Columns: 46
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (44): place, gender, contestant, top_36, top_36_2, top_36_3, top_36_4, t...
dbl  (1): season
lgl  (1): comeback

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
finalists <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/finalists.csv')
Rows: 190 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Contestant, Birthday, Birthplace, Hometown, Description
dbl (1): Season

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/ratings.csv')
Rows: 593 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): episode, airdate, 18_49_rating_share, timeslot_et, dvr_18_49, dvr_...
dbl  (4): season, show_number, viewers_in_millions, nightlyrank
lgl  (1): ref

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
seasons <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/seasons.csv')
Rows: 18 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): winner, runner_up, original_release, original_network, hosted_by, j...
dbl (2): season, no_of_episodes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/songs.csv')
Rows: 2429 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): season, week, contestant, song, artist, song_theme, result
dbl (1): order

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
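
As the readr messages suggest, the column-specification output can be quieted with show_col_types = FALSE. A sketch of what that would look like for one of the files (we left the defaults in the session):

auditions <- readr::read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/auditions.csv',
  show_col_types = FALSE  # suppresses the column specification message
)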

Abram made use of the slightly exotic %<>% pipe from magrittr, which pipes an object into the expression on its right and then assigns the result back to that object.
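For example, songs %<>% mutate(...) is shorthand for songs <- songs %>% mutate(...). A minimal illustration, not part of the session code:

x <- c(1, 4, 9)
x %<>% sqrt()   # equivalent to x <- x %>% sqrt(); x is now c(1, 2, 3)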

Some data tidying and basic exploration:

songs %<>% mutate(artist = if_else(artist == "*NSYNC", "NSYNC", artist))
songs_n <- songs %>% group_by(artist, song) %>% summarise(n = n()) %>% arrange(-n)
`summarise()` has grouped output by 'artist'. You can override using the
`.groups` argument.
artists_n <- songs %>% group_by(artist) %>% summarise(n = n()) %>% arrange(-n)
winning_songs <- songs %>% group_by(artist, song, result) %>% summarise(n = n())
`summarise()` has grouped output by 'artist', 'song'. You can override using
the `.groups` argument.
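
The summarise() messages are only informational; setting the .groups argument explicitly silences them, for example (a small variant we did not run):

winning_songs <- songs %>%
  group_by(artist, song, result) %>%
  summarise(n = n(), .groups = "drop")  # drop all grouping after summarising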

Exploration

First we looked at viewing figures by show number and by season:

ratings %>% filter(!is.na(viewers_in_millions)) %>%
  ggplot(aes(x = show_number, y = viewers_in_millions)) + geom_point() + geom_line() + facet_wrap(vars(season), scales = "free_y")

Then some preparation of the date column to get it in date format:

ratings %<>% mutate(airdate = if_else(season == 13, paste0(airdate, ", 2014"), airdate),
                    proper_airdate = mdy(airdate))
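
A quick sanity check on the conversion, relying on mdy() returning NA for anything it cannot parse (not run in the session):

# Any airdates that still failed to parse?
ratings %>% filter(is.na(proper_airdate)) %>% select(season, airdate)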

Then a visualisation of viewers over time:

ratings %>% ggplot(aes(x = proper_airdate, y = viewers_in_millions)) + geom_point() +
  expand_limits(y = 0) + stat_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).

Tabulation:

tabyl(ratings$season)
 ratings$season  n    percent
              1 25 0.04215852
              2 41 0.06913997
              3 44 0.07419899
              4 43 0.07251265
              5 41 0.06913997
              6 41 0.06913997
              7 42 0.07082631
              8 40 0.06745363
              9 43 0.07251265
             10 39 0.06576728
             11 40 0.06745363
             12 37 0.06239460
             13 39 0.06576728
             15 24 0.04047218
             16 19 0.03204047
             17 19 0.03204047
             18 16 0.02698145

Now average views per season:

average_views <- ratings %>% group_by(season) %>% summarise(avg_views = mean(viewers_in_millions, na.rm = TRUE))

We saw a jump at the very end of most seasons, so we decided to look at how big this jump was proportionally:

rel_views <- ratings %>% group_by(season) %>% slice_tail(n=2) %>%
  summarise(relative_views = viewers_in_millions[2]/viewers_in_millions[1])
rel_views %>% ggplot(aes(x = season, y = relative_views)) + geom_point()
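
Note that slice_tail(n = 2) assumes the rows within each season are already in broadcast order; a slightly more defensive version would sort by show_number first (a variant, not what we ran):

rel_views <- ratings %>%
  group_by(season) %>%
  arrange(show_number, .by_group = TRUE) %>%  # make sure episodes are in order
  slice_tail(n = 2) %>%
  summarise(relative_views = viewers_in_millions[2] / viewers_in_millions[1])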

Then a plot of the average views by season:

average_views %>% ggplot(aes(season, avg_views)) + geom_line() + expand_limits(y = 0)

Now we join the average views (over the whole season) with the jump at the end (rel_views) to see if there is any obvious relationship:

full_join(average_views, rel_views) %>% ggplot(aes(x = avg_views, y = relative_views)) + geom_point()
Joining with `by = join_by(season)`
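
(As an aside, the join message can be silenced by naming the key explicitly, e.g. full_join(average_views, rel_views, by = "season"); the result is identical.)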

Nope, nothing obvious.

Another pattern we saw was that the first episode of a season tended to be the most popular, with viewership dropping off over the rest of the season:

ratings %<>% group_by(season) %>% arrange(show_number) %>%
  mutate(share_of_first = viewers_in_millions / viewers_in_millions[1])

ratings %>%
  ggplot(aes(show_number, share_of_first, group = season, color = as.factor(season))) +
  geom_line()

Finally, we looked at an interactive visualisation made with the ggplotly() convenience function from the plotly package:

gg<- ratings %>% filter(season >= 3) %>%
  ggplot(aes(show_number, share_of_first, group = season, color = as.factor(season))) +
  geom_point() + scale_y_log10()

ggplotly(gg)