Double Tardy Tuesday

Mea Culpa

Tidy Tuesday
Tardy
Data Science
Author

Jon Minton

Published

April 2, 2024

Introduction

I’ve been exceptionally tardy updating the Tardy Tuesday blog the last couple of weeks. So there are a couple of weeks worth of code to update with at once. Unlike previous times I’m going to list only myself as the author of this blog, as none of the contributors deserve any blame for my tardiness!

Additionally, the scripts will be presented more ‘as-is’ than on previous occasions, without as much additional discussion or amendments.

Common package dependencies

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Mutant Moneyball

This TidyTuesday dataset involved understanding the relationship between the value of old Mavel comics and the appearance of particular characters in those comics. I’m not sure it was a good or bad thing that we didn’t know the names of most of the characters…

The session was led by Brendan, who wrote (with our support) the script below:

tidytuesdayR::tt_load('2024-03-19') 
---- Compiling #TidyTuesday Information for 2024-03-19 ----
--- There is 1 file available ---


── Downloading files ───────────────────────────────────────────────────────────

  1 of 1: "mutant_moneyball.csv"
Available datasets:
    mutant_moneyball 
    
mutant_moneyball <- tidytuesdayR::tt_load('2024-03-19') |>
  pluck(1)
---- Compiling #TidyTuesday Information for 2024-03-19 ----
--- There is 1 file available ---


── Downloading files ───────────────────────────────────────────────────────────

  1 of 1: "mutant_moneyball.csv"
str(mutant_moneyball)
spc_tbl_ [26 × 45] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Member                : chr [1:26] "warrenWorthington" "hankMcCoy" "scottSummers" "bobbyDrake" ...
 $ TotalIssues           : num [1:26] 139 119 197 123 164 68 48 190 120 167 ...
 $ TotalIssues60s        : num [1:26] 61 62 63 62 63 8 9 0 0 0 ...
 $ TotalIssues70s        : num [1:26] 35 38 69 35 58 13 13 36 36 36 ...
 $ TotalIssues80s        : num [1:26] 20 9 56 6 14 43 19 121 84 115 ...
 $ TotalIssues90s        : num [1:26] 23 10 9 20 29 4 7 33 0 16 ...
 $ totalIssueCheck       : num [1:26] 139 119 197 123 164 68 48 190 120 167 ...
 $ TotalValue_heritage   : num [1:26] 1108558 957993 1152230 1086749 1132091 ...
 $ TotalValue60s_heritage: num [1:26] 929056 929776 933616 929776 933616 ...
 $ TotalValue70s_heritage: num [1:26] 154585 20705 188635 154585 179899 ...
 $ TotalValue80s_heritage: num [1:26] 23957 6631 29240 1514 16868 ...
 $ TotalValue90s_heritage: num [1:26] 960 881 739 874 1708 ...
 $ TotalValue_ebay       : num [1:26] 27377 24972 29964 26902 29158 ...
 $ TotalValue60s_ebay    : num [1:26] 23335 23377 23420 23377 23420 ...
 $ TotalValue70s_ebay    : num [1:26] 3362 1224 5431 3362 4903 ...
 $ TotalValue80s_ebay    : num [1:26] 583 289 1031 70 665 ...
 $ TotalValue90s_ebay    : num [1:26] 97 82 82 93 170 21 41 334 0 100 ...
 $ 60s_Appearance_Percent: chr [1:26] "96.83%" "98.41%" "100.00%" "98.41%" ...
 $ 70s_Appearance_Percent: chr [1:26] "50.72%" "55.07%" "100.00%" "50.72%" ...
 $ 80s_Appearance_Percent: chr [1:26] "14.60%" "6.57%" "40.88%" "4.38%" ...
 $ 90s_Appearance_Percent: chr [1:26] "51.11%" "22.22%" "20.00%" "44.44%" ...
 $ PPI60s_heritage       : chr [1:26] "$15,230.43 " "$14,996.39 " "$14,819.30 " "$14,996.39 " ...
 $ PPI70s_heritage       : chr [1:26] "$4,416.71 " "$544.87 " "$2,733.84 " "$4,416.71 " ...
 $ PPI80s_heritage       : chr [1:26] "$1,197.85 " "$736.78 " "$522.14 " "$252.33 " ...
 $ PPI90s_heritage       : chr [1:26] "$41.74 " "$88.10 " "$82.11 " "$43.70 " ...
 $ PPI60s_ebay           : chr [1:26] "$382.54 " "$377.05 " "$371.75 " "$377.05 " ...
 $ PPI70s_ebay           : chr [1:26] "$96.06 " "$32.21 " "$78.71 " "$96.06 " ...
 $ PPI80s_ebay           : chr [1:26] "$29.15 " "$32.11 " "$18.41 " "$11.67 " ...
 $ PPI90s_ebay           : chr [1:26] "$4.22 " "$8.20 " "$9.11 " "$4.65 " ...
 $ TotalValue60s_wiz     : chr [1:26] "$7,913.00 " "$7,953.00 " "$7,993.00 " "$7,953.00 " ...
 $ TotalValue70s_wiz     : chr [1:26] "$1,105.00 " "$851.00 " "$1,979.00 " "$1,105.00 " ...
 $ TotalValue80s_wiz     : chr [1:26] "$226.00 " "$89.00 " "$438.00 " "$48.00 " ...
 $ TotalValue90s_wiz     : chr [1:26] "$65.75 " "$38.50 " "$39.25 " "$62.00 " ...
 $ TotalValue60s_oStreet : chr [1:26] "$68,160.00 " "$68,390.00 " "$68,590.00 " "$68,390.00 " ...
 $ TotalValue70s_oStreet : chr [1:26] "$7,360.00 " "$5,260.00 " "$11,675.00 " "$7,360.00 " ...
 $ TotalValue80s_oStreet : chr [1:26] "$975.00 " "$431.00 " "$1,427.00 " "$137.00 " ...
 $ TotalValue90s_oStreet : chr [1:26] "$123.00 " "$81.00 " "$74.00 " "$108.00 " ...
 $ PPI60s_wiz            : chr [1:26] "$129.72 " "$128.27 " "$126.87 " "$128.27 " ...
 $ PPI70s_wiz            : chr [1:26] "$31.57 " "$22.39 " "$28.68 " "$31.57 " ...
 $ PPI80s_wiz            : chr [1:26] "$11.30 " "$9.89 " "$7.82 " "$8.00 " ...
 $ PPI90s_wiz            : chr [1:26] "$2.86 " "$3.85 " "$4.36 " "$3.10 " ...
 $ PPI60s_oStreet        : chr [1:26] "$1,117.38 " "$1,103.06 " "$1,088.73 " "$1,103.06 " ...
 $ PPI70s_oStreet        : chr [1:26] "$210.29 " "$138.42 " "$169.20 " "$210.29 " ...
 $ PPI80s_oStreet        : chr [1:26] "$48.75 " "$47.89 " "$25.48 " "$22.83 " ...
 $ PPI90s_oStreet        : chr [1:26] "$5.35 " "$8.10 " "$8.22 " "$5.40 " ...
 - attr(*, "spec")=
  .. cols(
  ..   Member = col_character(),
  ..   TotalIssues = col_double(),
  ..   TotalIssues60s = col_double(),
  ..   TotalIssues70s = col_double(),
  ..   TotalIssues80s = col_double(),
  ..   TotalIssues90s = col_double(),
  ..   totalIssueCheck = col_double(),
  ..   TotalValue_heritage = col_double(),
  ..   TotalValue60s_heritage = col_double(),
  ..   TotalValue70s_heritage = col_double(),
  ..   TotalValue80s_heritage = col_double(),
  ..   TotalValue90s_heritage = col_double(),
  ..   TotalValue_ebay = col_double(),
  ..   TotalValue60s_ebay = col_double(),
  ..   TotalValue70s_ebay = col_double(),
  ..   TotalValue80s_ebay = col_double(),
  ..   TotalValue90s_ebay = col_double(),
  ..   `60s_Appearance_Percent` = col_character(),
  ..   `70s_Appearance_Percent` = col_character(),
  ..   `80s_Appearance_Percent` = col_character(),
  ..   `90s_Appearance_Percent` = col_character(),
  ..   PPI60s_heritage = col_character(),
  ..   PPI70s_heritage = col_character(),
  ..   PPI80s_heritage = col_character(),
  ..   PPI90s_heritage = col_character(),
  ..   PPI60s_ebay = col_character(),
  ..   PPI70s_ebay = col_character(),
  ..   PPI80s_ebay = col_character(),
  ..   PPI90s_ebay = col_character(),
  ..   TotalValue60s_wiz = col_character(),
  ..   TotalValue70s_wiz = col_character(),
  ..   TotalValue80s_wiz = col_character(),
  ..   TotalValue90s_wiz = col_character(),
  ..   TotalValue60s_oStreet = col_character(),
  ..   TotalValue70s_oStreet = col_character(),
  ..   TotalValue80s_oStreet = col_character(),
  ..   TotalValue90s_oStreet = col_character(),
  ..   PPI60s_wiz = col_character(),
  ..   PPI70s_wiz = col_character(),
  ..   PPI80s_wiz = col_character(),
  ..   PPI90s_wiz = col_character(),
  ..   PPI60s_oStreet = col_character(),
  ..   PPI70s_oStreet = col_character(),
  ..   PPI80s_oStreet = col_character(),
  ..   PPI90s_oStreet = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
# filter by totalIssueCheck == TotalIssues for missing issues, then remove both

mutant_moneyball |>
  filter(totalIssueCheck != TotalIssues) |>
  nrow()
[1] 0
# split TotalIssues and Member into issues data

appearances <- mutant_moneyball |>
  select(Member, contains("ssues")) |>
  pivot_longer(!Member) |>
  mutate(first = str_to_title(str_extract(Member, "^[[:lower:]]*"))) |>
  mutate(last = str_remove(Member, "^[[:lower:]]*" )) |>
  mutate(Member = paste(first, last), .keep="unused") 

appearances |>
  filter(name == "TotalIssues") |>
  arrange(desc(value))
# A tibble: 26 × 3
   Member             name        value
   <chr>              <chr>       <dbl>
 1 Scott Summers      TotalIssues   197
 2 Ororo Munroe       TotalIssues   190
 3 Peter Rasputin     TotalIssues   169
 4 Charles Xavier     TotalIssues   169
 5 Logan Howlett      TotalIssues   167
 6 Jean Grey          TotalIssues   164
 7 Warren Worthington TotalIssues   139
 8 Bobby Drake        TotalIssues   123
 9 Kurt Wagner        TotalIssues   120
10 Hank McCoy         TotalIssues   119
# ℹ 16 more rows
best_xm <- appearances |>
  slice_max(value, n=10) |>
  pull(Member)

appearances |>
  mutate(first = str_to_title(str_extract(Member, "^[[:lower:]]*"))) |>
  mutate(last = str_remove(Member, "^[[:lower:]]*" )) |>
  mutate(Member = paste(first, last), .keep="unused") 
# A tibble: 130 × 3
   Member                name           value
   <chr>                 <chr>          <dbl>
 1 " Warren Worthington" TotalIssues      139
 2 " Warren Worthington" TotalIssues60s    61
 3 " Warren Worthington" TotalIssues70s    35
 4 " Warren Worthington" TotalIssues80s    20
 5 " Warren Worthington" TotalIssues90s    23
 6 " Hank McCoy"         TotalIssues      119
 7 " Hank McCoy"         TotalIssues60s    62
 8 " Hank McCoy"         TotalIssues70s    38
 9 " Hank McCoy"         TotalIssues80s     9
10 " Hank McCoy"         TotalIssues90s    10
# ℹ 120 more rows
dates <- tribble(
  ~start, ~end, ~decade,
  1963, 1969, 60,
  1970, 1979, 70,
  1980, 1989, 80,
  1990, 1992, 90
)

plot <- appearances |>
  mutate(decade = as.numeric(str_extract(name, "\\d{2}"))) |>
  filter(!is.na(decade)) |>
  left_join(dates) |>
  mutate(year_range = 1 + end - start) |>
  mutate(rate = value / year_range) |>
  relocate(last_col()) |>
  filter(Member %in% best_xm) |>
  ggplot(aes(x = start, y = rate, colour = Member)) +
  geom_line() +
  geom_point() +
  theme(legend.position = "bottom")
Joining with `by = join_by(decade)`
plotly::ggplotly(plot)

Recollections

For this session we focused more on regex than perhaps we thought we would, with a particular focus on how to produce nicely readable names for different characters. We also focused on producing metrics like appearances per year, given that some of the time periods were full decades, but others were just of a few years within the decade.

NCAA Men’s March Madness

This session was based around data from a basketball competition. It had information both on how well different teams performed against each other, and also how well various members of the public thought they would perform, meaning it can be used to assess how well expectations match with reality.

This particular session was led by Nic, who supplied the code below.

if (is.na(utils::packageVersion("pacman"))){
  install.packages("pacman")
}
library(pacman)
p_load(tidytuesdayR)
 
### Download last tuesday's data
 
tt_data <- tt_load('2024-03-26')
---- Compiling #TidyTuesday Information for 2024-03-26 ----
--- There are 2 files available ---


── Downloading files ───────────────────────────────────────────────────────────

  1 of 2: "team-results.csv"
  2 of 2: "public-picks.csv"
tt_data
Available datasets:
    team-results 
    public-picks 
    
### Assign the datasets to our global environment
 
list2env(tt_data, globalenv())
<environment: R_GlobalEnv>
# Load packages -----------------------------------------------------------
 
p_load(tidyverse, # The tidyverse
       cheapr, # Cheap (fast & efficient) functions
       cppdoubles, # Floating-point comparisons
       timeplyr, # Date-Time manipulation
       tidytext, # Text manipulation
       phsmethods, phsstyles) # PHS functions
Warning: package 'phsstyles' is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Warning: 'BiocManager' not available.  Could not check Bioconductor.

Please use `install.packages('BiocManager')` and then retry.
Warning in p_install(package, character.only = TRUE, ...):
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'phsstyles'
Warning in p_load(tidyverse, cheapr, cppdoubles, timeplyr, tidytext, phsmethods, : Failed to install/load:
phsstyles
# Exploratory -------------------------------------------------------------
 
 
overview(`team-results`)
obs: 236 
cols: 20 

----- Numeric -----
          col n_missng p_complt n_unique      mean   p0   p25    p50    p75
1      TEAMID        0        1      236    124.33    1 63.75  123.5 185.25
2        PAKE        0        1       75   -0.0093 -6.7  -0.8   -0.2    0.5
3    PAKERANK        0        1       75    114.53    1    56    118    175
4        PASE        0        1       79  -0.00042 -8.5  -0.9  -0.15    0.5
5    PASERANK        0        1       79    114.78    1    59    111    175
6       GAMES        0        1       38      8.01    1     2      4      9
7           W        0        1       31         4    0     0      1      4
8           L        0        1       15         4    1     1      3      6
9  WINPERCENT        0        1       55      0.26    0     0   0.25    0.5
10        R64        0        1       15      4.07    1     1      3      6
11        R32        0        1       15      2.03    0     0      1      3
12        S16        0        1       10      1.02    0     0      0      1
13         E8        0        1        8      0.51    0     0      0      0
14         F4        0        1        6      0.25    0     0      0      0
15         F2        0        1        5      0.13    0     0      0      0
16      CHAMP        0        1        4     0.064    0     0      0      0
17       TOP2        0        1       11      0.51    0     0      0      0
    p100   iqr    sd  hist
1    245 121.5 70.66 ▇▇▇▇▇
2     12   1.3  2.21 ▁▇▂▁▁
3    236   119 68.46 ▆▇▃▆▆
4   10.6   1.4  2.35 ▁▃▇▁▁
5    236   116 68.58 ▇▇▇▆▇
6     53     7 10.26 ▇▁▁▁▁
7     40     4  7.38 ▇▁▁▁▁
8     15     5  3.31 ▇▂▂▁▁
9    0.8   0.5  0.26 ▇▂▃▃▂
10    15     5  3.44 ▇▂▂▁▁
11    15     3  3.01 ▇▁▁▁▁
12     9     1  2.06 ▇▁▁▁▁
13     7     0  1.28 ▇▁▁▁▁
14     5     0  0.78 ▇▁▁▁▁
15     4     0  0.51 ▇▁▁▁▁
16     3     0  0.35 ▇▁▁▁▁
17    12     0  1.56 ▇▁▁▁▁

----- Categorical -----
           col n_missng p_complt n_unique n_levels               min    max
1         TEAM        0        1      236       NA Abilene Christian   Yale
2    F4PERCENT        0        1      119       NA             0.00% 98.80%
3 CHAMPPERCENT        0        1       72       NA             0.00%  9.10%
overview(`public-picks`)
obs: 64 
cols: 9 

----- Numeric -----
     col n_missng p_complt n_unique    mean   p0     p25    p50     p75 p100
1   YEAR        0        1        1    2024 2024    2024   2024    2024 2024
2 TEAMNO        0        1       64 1045.39 1012 1028.75 1045.5 1062.25 1079
   iqr    sd  hist
1    0     0 ▁▁▇▁▁
2 33.5 19.71 ▇▇▇▇▇

----- Categorical -----
     col n_missng p_complt n_unique n_levels   min    max
1   TEAM        0        1       64       NA Akron   Yale
2    R64        0        1       64       NA 1.50% 98.41%
3    R32        0        1       62       NA 0.67% 93.59%
4    S16        0        1       60       NA 0.33% 80.22%
5     E8        0        1       54       NA 0.17%  8.37%
6     F4        0        1       42       NA 0.07%  9.99%
7 FINALS        0        1       37       NA 0.03%  9.27%
`public-picks` |>
  pivot_longer(R64:FINALS) |> 
  group_by(YEAR, TEAMNO) |> 
  mutate(stage = row_number()) %>%
  mutate(perc = as.numeric(str_remove(value, "%")),
         perc = perc / 100) %>%
  ggplot(aes(x = stage, y = perc)) + 
  geom_line(aes(group = paste(YEAR, TEAM)))

  # geom_col()
top_predicted_teams <- `public-picks` |>
  pivot_longer(R64:FINALS) |> 
  group_by(YEAR, TEAMNO) |> 
  mutate(stage = row_number()) %>%
  mutate(perc = as.numeric(str_remove(value, "%")),
         perc = perc / 100) %>%
  filter(stage == 6) %>%
  arrange(desc(perc)) %>%
  ungroup() %>%
  slice(1:5)

`team-results` %>%
  count(CHAMP)
# A tibble: 4 × 2
  CHAMP     n
  <dbl> <int>
1     0   227
2     1     4
3     2     4
4     3     1
`team-results` %>%
  count(CHAMP)
# A tibble: 4 × 2
  CHAMP     n
  <dbl> <int>
1     0   227
2     1     4
3     2     4
4     3     1
`team-results` %>%
  filter(F2 >= 2)
# A tibble: 9 × 20
  TEAMID TEAM    PAKE PAKERANK  PASE PASERANK GAMES     W     L WINPERCENT   R64
   <dbl> <chr>  <dbl>    <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl>
1     24 Butler   7          4   8.7        4    26    17     9      0.654     9
2     40 Conne…   8.6        2  10.6        1    29    23     6      0.793     9
3     50 Duke     2         24   0         78    46    34    12      0.739    14
4     68 Gonza…   3.1       18   3.8       14    47    32    15      0.681    15
5     86 Kansas   4.2       13  -1.8      205    53    40    13      0.755    15
6     90 Kentu…   6.9        6   8.5        5    43    32    11      0.744    12
7    114 Michi…   7          4   7.6        6    35    24    11      0.686    11
8    135 North…  12          1   9.8        2    50    39    11      0.78     13
9    228 Villa…   4.8        9   4         11    40    29    11      0.725    13
# ℹ 9 more variables: R32 <dbl>, S16 <dbl>, E8 <dbl>, F4 <dbl>, F2 <dbl>,
#   CHAMP <dbl>, TOP2 <dbl>, F4PERCENT <chr>, CHAMPPERCENT <chr>
top_predicted_teams %>%
  inner_join(`team-results`, by = "TEAM")
# A tibble: 5 × 26
   YEAR TEAMNO TEAM         name  value stage   perc TEAMID  PAKE PAKERANK  PASE
  <dbl>  <dbl> <chr>        <chr> <chr> <int>  <dbl>  <dbl> <dbl>    <dbl> <dbl>
1  2024   1067 Connecticut  FINA… 34.9…     6 0.349      40   8.6        2  10.6
2  2024   1038 North Carol… FINA… 12.1…     6 0.121     135  12          1   9.8
3  2024   1033 Purdue       FINA… 10.2…     6 0.102     167  -4.4      232  -4.4
4  2024   1056 Houston      FINA… 9.27%     6 0.0927     76   0         76   2.1
5  2024   1053 Iowa St.     FINA… 4.78%     6 0.0478     83  -2        209  -1.5
# ℹ 15 more variables: PASERANK <dbl>, GAMES <dbl>, W <dbl>, L <dbl>,
#   WINPERCENT <dbl>, R64 <dbl>, R32 <dbl>, S16 <dbl>, E8 <dbl>, F4 <dbl>,
#   F2 <dbl>, CHAMP <dbl>, TOP2 <dbl>, F4PERCENT <chr>, CHAMPPERCENT <chr>

Shoutouts

Please check out Nic’s timeplyr package, which is now on CRAN and was masterfully presented at a previous EdinbR R users’ group meeting.