Double Tardy Tuesday

Mea Culpa

Tidy Tuesday
Tardy
Data Science
Author

Jon Minton

Published

April 2, 2024

Introduction

I’ve been exceptionally tardy updating the Tardy Tuesday blog the last couple of weeks. So there are a couple of weeks worth of code to update with at once. Unlike previous times I’m going to list only myself as the author of this blog, as none of the contributors deserve any blame for my tardiness!

Additionally, the scripts will be presented more ‘as-is’ than on previous occasions, without as much additional discussion or amendments.

Common package dependencies

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Mutant Moneyball

This TidyTuesday dataset involved understanding the relationship between the value of old Mavel comics and the appearance of particular characters in those comics. I’m not sure it was a good or bad thing that we didn’t know the names of most of the characters…

The session was led by Brendan, who wrote (with our support) the script below:

tidytuesdayR::tt_load('2024-03-19') 
--- Compiling #TidyTuesday Information for 2024-03-19 ----
--- There is 1 file available ---
--- Starting Download ---

    Downloading file 1 of 1: `mutant_moneyball.csv`
--- Download complete ---
Available datasets:
    mutant_moneyball 
    
mutant_moneyball <- tidytuesdayR::tt_load('2024-03-19') |>
  pluck(1)
--- Compiling #TidyTuesday Information for 2024-03-19 ----
--- There is 1 file available ---
--- Starting Download ---

    Downloading file 1 of 1: `mutant_moneyball.csv`
--- Download complete ---
str(mutant_moneyball)
spc_tbl_ [26 × 45] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Member                : chr [1:26] "warrenWorthington" "hankMcCoy" "scottSummers" "bobbyDrake" ...
 $ TotalIssues           : num [1:26] 139 119 197 123 164 68 48 190 120 167 ...
 $ TotalIssues60s        : num [1:26] 61 62 63 62 63 8 9 0 0 0 ...
 $ TotalIssues70s        : num [1:26] 35 38 69 35 58 13 13 36 36 36 ...
 $ TotalIssues80s        : num [1:26] 20 9 56 6 14 43 19 121 84 115 ...
 $ TotalIssues90s        : num [1:26] 23 10 9 20 29 4 7 33 0 16 ...
 $ totalIssueCheck       : num [1:26] 139 119 197 123 164 68 48 190 120 167 ...
 $ TotalValue_heritage   : num [1:26] 1108558 957993 1152230 1086749 1132091 ...
 $ TotalValue60s_heritage: num [1:26] 929056 929776 933616 929776 933616 ...
 $ TotalValue70s_heritage: num [1:26] 154585 20705 188635 154585 179899 ...
 $ TotalValue80s_heritage: num [1:26] 23957 6631 29240 1514 16868 ...
 $ TotalValue90s_heritage: num [1:26] 960 881 739 874 1708 ...
 $ TotalValue_ebay       : num [1:26] 27377 24972 29964 26902 29158 ...
 $ TotalValue60s_ebay    : num [1:26] 23335 23377 23420 23377 23420 ...
 $ TotalValue70s_ebay    : num [1:26] 3362 1224 5431 3362 4903 ...
 $ TotalValue80s_ebay    : num [1:26] 583 289 1031 70 665 ...
 $ TotalValue90s_ebay    : num [1:26] 97 82 82 93 170 21 41 334 0 100 ...
 $ 60s_Appearance_Percent: chr [1:26] "96.83%" "98.41%" "100.00%" "98.41%" ...
 $ 70s_Appearance_Percent: chr [1:26] "50.72%" "55.07%" "100.00%" "50.72%" ...
 $ 80s_Appearance_Percent: chr [1:26] "14.60%" "6.57%" "40.88%" "4.38%" ...
 $ 90s_Appearance_Percent: chr [1:26] "51.11%" "22.22%" "20.00%" "44.44%" ...
 $ PPI60s_heritage       : chr [1:26] "$15,230.43 " "$14,996.39 " "$14,819.30 " "$14,996.39 " ...
 $ PPI70s_heritage       : chr [1:26] "$4,416.71 " "$544.87 " "$2,733.84 " "$4,416.71 " ...
 $ PPI80s_heritage       : chr [1:26] "$1,197.85 " "$736.78 " "$522.14 " "$252.33 " ...
 $ PPI90s_heritage       : chr [1:26] "$41.74 " "$88.10 " "$82.11 " "$43.70 " ...
 $ PPI60s_ebay           : chr [1:26] "$382.54 " "$377.05 " "$371.75 " "$377.05 " ...
 $ PPI70s_ebay           : chr [1:26] "$96.06 " "$32.21 " "$78.71 " "$96.06 " ...
 $ PPI80s_ebay           : chr [1:26] "$29.15 " "$32.11 " "$18.41 " "$11.67 " ...
 $ PPI90s_ebay           : chr [1:26] "$4.22 " "$8.20 " "$9.11 " "$4.65 " ...
 $ TotalValue60s_wiz     : chr [1:26] "$7,913.00 " "$7,953.00 " "$7,993.00 " "$7,953.00 " ...
 $ TotalValue70s_wiz     : chr [1:26] "$1,105.00 " "$851.00 " "$1,979.00 " "$1,105.00 " ...
 $ TotalValue80s_wiz     : chr [1:26] "$226.00 " "$89.00 " "$438.00 " "$48.00 " ...
 $ TotalValue90s_wiz     : chr [1:26] "$65.75 " "$38.50 " "$39.25 " "$62.00 " ...
 $ TotalValue60s_oStreet : chr [1:26] "$68,160.00 " "$68,390.00 " "$68,590.00 " "$68,390.00 " ...
 $ TotalValue70s_oStreet : chr [1:26] "$7,360.00 " "$5,260.00 " "$11,675.00 " "$7,360.00 " ...
 $ TotalValue80s_oStreet : chr [1:26] "$975.00 " "$431.00 " "$1,427.00 " "$137.00 " ...
 $ TotalValue90s_oStreet : chr [1:26] "$123.00 " "$81.00 " "$74.00 " "$108.00 " ...
 $ PPI60s_wiz            : chr [1:26] "$129.72 " "$128.27 " "$126.87 " "$128.27 " ...
 $ PPI70s_wiz            : chr [1:26] "$31.57 " "$22.39 " "$28.68 " "$31.57 " ...
 $ PPI80s_wiz            : chr [1:26] "$11.30 " "$9.89 " "$7.82 " "$8.00 " ...
 $ PPI90s_wiz            : chr [1:26] "$2.86 " "$3.85 " "$4.36 " "$3.10 " ...
 $ PPI60s_oStreet        : chr [1:26] "$1,117.38 " "$1,103.06 " "$1,088.73 " "$1,103.06 " ...
 $ PPI70s_oStreet        : chr [1:26] "$210.29 " "$138.42 " "$169.20 " "$210.29 " ...
 $ PPI80s_oStreet        : chr [1:26] "$48.75 " "$47.89 " "$25.48 " "$22.83 " ...
 $ PPI90s_oStreet        : chr [1:26] "$5.35 " "$8.10 " "$8.22 " "$5.40 " ...
 - attr(*, "spec")=
  .. cols(
  ..   Member = col_character(),
  ..   TotalIssues = col_double(),
  ..   TotalIssues60s = col_double(),
  ..   TotalIssues70s = col_double(),
  ..   TotalIssues80s = col_double(),
  ..   TotalIssues90s = col_double(),
  ..   totalIssueCheck = col_double(),
  ..   TotalValue_heritage = col_double(),
  ..   TotalValue60s_heritage = col_double(),
  ..   TotalValue70s_heritage = col_double(),
  ..   TotalValue80s_heritage = col_double(),
  ..   TotalValue90s_heritage = col_double(),
  ..   TotalValue_ebay = col_double(),
  ..   TotalValue60s_ebay = col_double(),
  ..   TotalValue70s_ebay = col_double(),
  ..   TotalValue80s_ebay = col_double(),
  ..   TotalValue90s_ebay = col_double(),
  ..   `60s_Appearance_Percent` = col_character(),
  ..   `70s_Appearance_Percent` = col_character(),
  ..   `80s_Appearance_Percent` = col_character(),
  ..   `90s_Appearance_Percent` = col_character(),
  ..   PPI60s_heritage = col_character(),
  ..   PPI70s_heritage = col_character(),
  ..   PPI80s_heritage = col_character(),
  ..   PPI90s_heritage = col_character(),
  ..   PPI60s_ebay = col_character(),
  ..   PPI70s_ebay = col_character(),
  ..   PPI80s_ebay = col_character(),
  ..   PPI90s_ebay = col_character(),
  ..   TotalValue60s_wiz = col_character(),
  ..   TotalValue70s_wiz = col_character(),
  ..   TotalValue80s_wiz = col_character(),
  ..   TotalValue90s_wiz = col_character(),
  ..   TotalValue60s_oStreet = col_character(),
  ..   TotalValue70s_oStreet = col_character(),
  ..   TotalValue80s_oStreet = col_character(),
  ..   TotalValue90s_oStreet = col_character(),
  ..   PPI60s_wiz = col_character(),
  ..   PPI70s_wiz = col_character(),
  ..   PPI80s_wiz = col_character(),
  ..   PPI90s_wiz = col_character(),
  ..   PPI60s_oStreet = col_character(),
  ..   PPI70s_oStreet = col_character(),
  ..   PPI80s_oStreet = col_character(),
  ..   PPI90s_oStreet = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
# filter by totalIssueCheck == TotalIssues for missing issues, then remove both

mutant_moneyball |>
  filter(totalIssueCheck != TotalIssues) |>
  nrow()
[1] 0
# split TotalIssues and Member into issues data

appearances <- mutant_moneyball |>
  select(Member, contains("ssues")) |>
  pivot_longer(!Member) |>
  mutate(first = str_to_title(str_extract(Member, "^[[:lower:]]*"))) |>
  mutate(last = str_remove(Member, "^[[:lower:]]*" )) |>
  mutate(Member = paste(first, last), .keep="unused") 

appearances |>
  filter(name == "TotalIssues") |>
  arrange(desc(value))
# A tibble: 26 × 3
   Member             name        value
   <chr>              <chr>       <dbl>
 1 Scott Summers      TotalIssues   197
 2 Ororo Munroe       TotalIssues   190
 3 Peter Rasputin     TotalIssues   169
 4 Charles Xavier     TotalIssues   169
 5 Logan Howlett      TotalIssues   167
 6 Jean Grey          TotalIssues   164
 7 Warren Worthington TotalIssues   139
 8 Bobby Drake        TotalIssues   123
 9 Kurt Wagner        TotalIssues   120
10 Hank McCoy         TotalIssues   119
# ℹ 16 more rows
best_xm <- appearances |>
  slice_max(value, n=10) |>
  pull(Member)

appearances |>
  mutate(first = str_to_title(str_extract(Member, "^[[:lower:]]*"))) |>
  mutate(last = str_remove(Member, "^[[:lower:]]*" )) |>
  mutate(Member = paste(first, last), .keep="unused") 
# A tibble: 130 × 3
   Member                name           value
   <chr>                 <chr>          <dbl>
 1 " Warren Worthington" TotalIssues      139
 2 " Warren Worthington" TotalIssues60s    61
 3 " Warren Worthington" TotalIssues70s    35
 4 " Warren Worthington" TotalIssues80s    20
 5 " Warren Worthington" TotalIssues90s    23
 6 " Hank McCoy"         TotalIssues      119
 7 " Hank McCoy"         TotalIssues60s    62
 8 " Hank McCoy"         TotalIssues70s    38
 9 " Hank McCoy"         TotalIssues80s     9
10 " Hank McCoy"         TotalIssues90s    10
# ℹ 120 more rows
dates <- tribble(
  ~start, ~end, ~decade,
  1963, 1969, 60,
  1970, 1979, 70,
  1980, 1989, 80,
  1990, 1992, 90
)

plot <- appearances |>
  mutate(decade = as.numeric(str_extract(name, "\\d{2}"))) |>
  filter(!is.na(decade)) |>
  left_join(dates) |>
  mutate(year_range = 1 + end - start) |>
  mutate(rate = value / year_range) |>
  relocate(last_col()) |>
  filter(Member %in% best_xm) |>
  ggplot(aes(x = start, y = rate, colour = Member)) +
  geom_line() +
  geom_point() +
  theme(legend.position = "bottom")
Joining with `by = join_by(decade)`
plotly::ggplotly(plot)

Recollections

For this session we focused more on regex than perhaps we thought we would, with a particular focus on how to produce nicely readable names for different characters. We also focused on producing metrics like appearances per year, given that some of the time periods were full decades, but others were just of a few years within the decade.

NCAA Men’s March Madness

This session was based around data from a basketball competition. It had information both on how well different teams performed against each other, and also how well various members of the public thought they would perform, meaning it can be used to assess how well expectations match with reality.

This particular session was led by Nic, who supplied the code below.

if (is.na(utils::packageVersion("pacman"))){
  install.packages("pacman")
}
library(pacman)
p_load(tidytuesdayR)
 
### Download last tuesday's data
 
tt_data <- tt_load('2024-03-26')
--- Compiling #TidyTuesday Information for 2024-03-26 ----
--- There are 2 files available ---
--- Starting Download ---

    Downloading file 1 of 2: `team-results.csv`
    Downloading file 2 of 2: `public-picks.csv`
--- Download complete ---
tt_data
Available datasets:
    team-results 
    public-picks 
    
### Assign the datasets to our global environment
 
list2env(tt_data, globalenv())
<environment: R_GlobalEnv>
# Load packages -----------------------------------------------------------
 
p_load(tidyverse, # The tidyverse
       cheapr, # Cheap (fast & efficient) functions
       cppdoubles, # Floating-point comparisons
       timeplyr, # Date-Time manipulation
       tidytext, # Text manipulation
       phsmethods, phsstyles) # PHS functions
 
 
# Exploratory -------------------------------------------------------------
 
 
overview(`team-results`)
obs: 236 
cols: 20 

----- Numeric -----
          col   class n_missing p_complete n_unique   mean   p0   p25   p50
1      TEAMID numeric         0          1      236 124.33    1 63.75 123.5
2        PAKE numeric         0          1       75  -0.01 -6.7  -0.8  -0.2
3    PAKERANK numeric         0          1       75 114.53    1    56   118
4        PASE numeric         0          1       79      0 -8.5  -0.9 -0.15
5    PASERANK numeric         0          1       79 114.78    1    59   111
6       GAMES numeric         0          1       38   8.01    1     2     4
7           W numeric         0          1       31      4    0     0     1
8           L numeric         0          1       15      4    1     1     3
9  WINPERCENT numeric         0          1       55   0.26    0     0  0.25
10        R64 numeric         0          1       15   4.07    1     1     3
11        R32 numeric         0          1       15   2.03    0     0     1
12        S16 numeric         0          1       10   1.02    0     0     0
13         E8 numeric         0          1        8   0.51    0     0     0
14         F4 numeric         0          1        6   0.25    0     0     0
15         F2 numeric         0          1        5   0.13    0     0     0
16      CHAMP numeric         0          1        4   0.06    0     0     0
17       TOP2 numeric         0          1       11   0.51    0     0     0
      p75 p100   iqr    sd
1  185.25  245 121.5 70.66
2     0.5   12   1.3  2.21
3     175  236   119 68.46
4     0.5 10.6   1.4  2.35
5     175  236   116 68.58
6       9   53     7 10.26
7       4   40     4  7.38
8       6   15     5  3.31
9     0.5  0.8   0.5  0.26
10      6   15     5  3.44
11      3   15     3  3.01
12      1    9     1  2.06
13      0    7     0  1.28
14      0    5     0  0.78
15      0    4     0  0.51
16      0    3     0  0.35
17      0   12     0  1.56

----- Categorical -----
           col     class n_missing p_complete n_unique n_levels
1         TEAM character         0          1      236       NA
2    F4PERCENT character         0          1      119       NA
3 CHAMPPERCENT character         0          1       72       NA
                min    max
1 Abilene Christian   Yale
2             0.00% 98.80%
3             0.00%  9.10%
overview(`public-picks`)
obs: 64 
cols: 9 

----- Numeric -----
     col   class n_missing p_complete n_unique    mean   p0     p25    p50
1   YEAR numeric         0          1        1    2024 2024    2024   2024
2 TEAMNO numeric         0          1       64 1045.39 1012 1028.75 1045.5
      p75 p100  iqr    sd
1    2024 2024    0     0
2 1062.25 1079 33.5 19.71

----- Categorical -----
     col     class n_missing p_complete n_unique n_levels   min    max
1   TEAM character         0          1       64       NA Akron   Yale
2    R64 character         0          1       64       NA 1.50% 98.41%
3    R32 character         0          1       62       NA 0.67% 93.59%
4    S16 character         0          1       60       NA 0.33% 80.22%
5     E8 character         0          1       54       NA 0.17%  8.37%
6     F4 character         0          1       42       NA 0.07%  9.99%
7 FINALS character         0          1       37       NA 0.03%  9.27%
`public-picks` |>
  pivot_longer(R64:FINALS) |> 
  group_by(YEAR, TEAMNO) |> 
  mutate(stage = row_number()) %>%
  mutate(perc = as.numeric(str_remove(value, "%")),
         perc = perc / 100) %>%
  ggplot(aes(x = stage, y = perc)) + 
  geom_line(aes(group = paste(YEAR, TEAM)))

  # geom_col()
top_predicted_teams <- `public-picks` |>
  pivot_longer(R64:FINALS) |> 
  group_by(YEAR, TEAMNO) |> 
  mutate(stage = row_number()) %>%
  mutate(perc = as.numeric(str_remove(value, "%")),
         perc = perc / 100) %>%
  filter(stage == 6) %>%
  arrange(desc(perc)) %>%
  ungroup() %>%
  slice(1:5)

`team-results` %>%
  count(CHAMP)
# A tibble: 4 × 2
  CHAMP     n
  <dbl> <int>
1     0   227
2     1     4
3     2     4
4     3     1
`team-results` %>%
  count(CHAMP)
# A tibble: 4 × 2
  CHAMP     n
  <dbl> <int>
1     0   227
2     1     4
3     2     4
4     3     1
`team-results` %>%
  filter(F2 >= 2)
# A tibble: 9 × 20
  TEAMID TEAM    PAKE PAKERANK  PASE PASERANK GAMES     W     L WINPERCENT   R64
   <dbl> <chr>  <dbl>    <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl>
1     24 Butler   7          4   8.7        4    26    17     9      0.654     9
2     40 Conne…   8.6        2  10.6        1    29    23     6      0.793     9
3     50 Duke     2         24   0         78    46    34    12      0.739    14
4     68 Gonza…   3.1       18   3.8       14    47    32    15      0.681    15
5     86 Kansas   4.2       13  -1.8      205    53    40    13      0.755    15
6     90 Kentu…   6.9        6   8.5        5    43    32    11      0.744    12
7    114 Michi…   7          4   7.6        6    35    24    11      0.686    11
8    135 North…  12          1   9.8        2    50    39    11      0.78     13
9    228 Villa…   4.8        9   4         11    40    29    11      0.725    13
# ℹ 9 more variables: R32 <dbl>, S16 <dbl>, E8 <dbl>, F4 <dbl>, F2 <dbl>,
#   CHAMP <dbl>, TOP2 <dbl>, F4PERCENT <chr>, CHAMPPERCENT <chr>
top_predicted_teams %>%
  inner_join(`team-results`, by = "TEAM")
# A tibble: 5 × 26
   YEAR TEAMNO TEAM         name  value stage   perc TEAMID  PAKE PAKERANK  PASE
  <dbl>  <dbl> <chr>        <chr> <chr> <int>  <dbl>  <dbl> <dbl>    <dbl> <dbl>
1  2024   1067 Connecticut  FINA… 34.9…     6 0.349      40   8.6        2  10.6
2  2024   1038 North Carol… FINA… 12.1…     6 0.121     135  12          1   9.8
3  2024   1033 Purdue       FINA… 10.2…     6 0.102     167  -4.4      232  -4.4
4  2024   1056 Houston      FINA… 9.27%     6 0.0927     76   0         76   2.1
5  2024   1053 Iowa St.     FINA… 4.78%     6 0.0478     83  -2        209  -1.5
# ℹ 15 more variables: PASERANK <dbl>, GAMES <dbl>, W <dbl>, L <dbl>,
#   WINPERCENT <dbl>, R64 <dbl>, R32 <dbl>, S16 <dbl>, E8 <dbl>, F4 <dbl>,
#   F2 <dbl>, CHAMP <dbl>, TOP2 <dbl>, F4PERCENT <chr>, CHAMPPERCENT <chr>

Shoutouts

Please check out Nic’s timeplyr package, which is now on CRAN and was masterfully presented at a previous EdinbR R users’ group meeting.