Jon Minton’s Blog - Tidy Tuesday 30 Jan 2024: Groundhogs

The latest TidyTuesday dataset is on Groundhog Days, a North American tradition in which the behaviours of specific groundhogs are used to make predictions about the weather over the next six weeks, as immortalised in the eponymous sci-fi rom-com featuring Bill Murray.

Oddly, the data provided does not include meteorological information on whether the groundhogs’ predictions are accurate. (Who knows? Maybe they are!) But the data do include latitude, longitude, and other geographic information. So, we decided to see if we could plot these Groundhog Day events on an interactive map.

First we load the data, using the bespoke tidytuesdayR package:

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

# The number of queries made via tidyTuesdayR is limited. The commented code below shows how to extract the TidyTuesday data using the tidytuesdayR package. Instead I'll link directly:
# dat <- tidytuesdayR::tt_load('2024-01-30')

Code

# predictions <- dat |>
#   pluck(1) # that's the predictions

Code

# dat |>
#   pluck(2) # that's the groundhogs

Code

# groundhogs <- dat |>
#   pluck(2) 

# Direct approach 

groundhogs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/groundhogs.csv')

Rows: 75 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): slug, shortname, name, city, region, country, source, current_pred...
dbl  (4): id, latitude, longitude, predictions_count
lgl  (2): is_groundhog, active

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

predictions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/predictions.csv')

Rows: 1462 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): details
dbl (2): id, year
lgl (1): shadow

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We now have the two datasets in separate objects. Let’s look at some of the information in the description field

Code

groundhogs |>
  filter(!is_groundhog) |>
  select(description)

# A tibble: 43 × 1
   description                                                                  
   <chr>                                                                        
 1 Octoraro Orphie, from Quarryville Pennsylvania, is a taxidermied world-renow…
 2 Concord Charlie is a presumed groundhog from Athens, West Virginia. In a tra…
 3 Lander Lil, a bronze statue of a prairie dog, has been predicting the future…
 4 Groundhog puppet Manitoba Merv has been predicting the arrival of spring at …
 5 Schnogadahl Sammi is a literally unpronounceable taxidermied groundhog mount…
 6 Poor Richard is a taxidermied groundhog who prognosticates for the Slumberin…
 7 Balzac Billy is the “Prairie Prognosticator”, a man-sized gopher mascot who …
 8 Every year on February 2nd, Myerstown’s favorite groundhog “Uni” is placed o…
 9 Grover the Groundhog and Sweet Arrow Sue are a taxidermied groundhog couple …
10 Stormy Marmot is a plush yellow-bellied marmot from Aurora, Colorado. He is …
# ℹ 33 more rows

Code

predictions |>
  count(shadow, sort=T)

# A tibble: 3 × 2
  shadow     n
  <lgl>  <int>
1 TRUE     665
2 FALSE    652
3 NA       145

Code

# dat |>
#   pluck(1) |>
#   left_join(groundhogs) 
predictions |> 
    left_join(groundhogs)

Joining with `by = join_by(id)`

# A tibble: 1,462 × 20
      id  year shadow details         slug  shortname name  city  region country
   <dbl> <dbl> <lgl>  <chr>           <chr> <chr>     <chr> <chr> <chr>  <chr>  
 1     1  1886 NA     Groundhog Day … punx… Phil      Punx… Punx… Penns… USA    
 2     1  1887 TRUE   First Official… punx… Phil      Punx… Punx… Penns… USA    
 3     1  1888 TRUE   Saw Shadow.     punx… Phil      Punx… Punx… Penns… USA    
 4     1  1889 NA     No Record.      punx… Phil      Punx… Punx… Penns… USA    
 5     1  1890 FALSE  No Shadow.      punx… Phil      Punx… Punx… Penns… USA    
 6     1  1891 NA     No Record.      punx… Phil      Punx… Punx… Penns… USA    
 7     1  1892 NA     No Record.      punx… Phil      Punx… Punx… Penns… USA    
 8     1  1893 NA     No Record.      punx… Phil      Punx… Punx… Penns… USA    
 9     1  1894 NA     No Record.      punx… Phil      Punx… Punx… Penns… USA    
10     1  1895 NA     No Record.      punx… Phil      Punx… Punx… Penns… USA    
# ℹ 1,452 more rows
# ℹ 10 more variables: latitude <dbl>, longitude <dbl>, source <chr>,
#   current_prediction <chr>, is_groundhog <lgl>, type <chr>, active <lgl>,
#   description <chr>, image <chr>, predictions_count <dbl>

Mapping

We decided to try using Leaflet to plot where the Groundhog day events occurred.

Code

#install.packages('leaflet')
library(leaflet)

We start by following one of the first examples in the Leaflet intro page above, adding markers to geolocate the sightings/events, and a popup with the name (assigned to) the Groundhog:

Code

groundhogs |>
  select(lat = latitude, lng = longitude, popup = name) |>
  leaflet() |>
  addTiles()|>
  addMarkers(~lng, ~lat, popup = ~popup)

We next wanted to colour these markers based on whether the predictions are classed as ‘active’ or not. This was slightly more tricky, but the example given in the Awesome Icons section of the markers part of the Leaflet documentation gave the following code pattern we could work with:

```{r}
# first 20 quakes
df.20 <- quakes[1:20,]

getColor <- function(quakes) {
  sapply(quakes$mag, function(mag) {
  if(mag <= 4) {
    "green"
  } else if(mag <= 5) {
    "orange"
  } else {
    "red"
  } })
}

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = getColor(df.20)
)

leaflet(df.20) %>% addTiles() %>%
  addAwesomeMarkers(~long, ~lat, icon=icons, label=~as.character(mag))

```

So, in the above addAwesomeMarkers() is used in place of addMarkers(), and takes an argument icon. A vector of icons is created of the same length as the number of rows of the dataframe, whose colour is determined through the getColor function.

In our case we are interested in the active column, which has just two mutually exclusive categories: TRUE and FALSE. So we just need two colours to be specified

Code

# We saw if we could implement the code pattern above using list columns, but were not successful

# colouring markers
# groundhogs_icons <- groundhogs |>
#   mutate(markerColor = ifelse(active, "green", "red")) |>
#   rowwise() |>
#   mutate(icon = list(awesomeIcons(
#     icon = 'ios-close',
#     iconColor = 'black',
#     library = 'ion',
#     markerColor = markerColor
#   )))

getColor <- function(groundhogs) {
  sapply(groundhogs$active, function(active) {
    if(active) {
      "green"  } else {
          "red"  } })
}

# create vector of matching vectors
icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = getColor(groundhogs)
)

# now with active/inactive icons

groundhogs |>
  select(lat = latitude, lng = longitude, popup = name) |>
  leaflet() |>
  addTiles()|>
  addAwesomeMarkers(~lng, ~lat, popup = ~popup, icon = icons)

Challenge complete! As we would expect, most predictions are not currently active.

Other possibilities

Some other things we could have explored include:

Attempting to link to appropriate meteorological data to see if the predictions came true at more than chance rates (likely a challenge)
Natural Language Programming to identify patterns and key terms in the free text fields like description
Additional customisation of the leaflet maps, such as including additional popup fields, further customising the icons based on multiple variables, and adding date sliders to give a third dimension (latitude, longitude, and date) to the user display

Additional

Andrew presented the following code solution for how to use plotly to produce multiple traces based on summary stats:

Code

library(plotly)

df <- 
  mpg %>% 
  summarise(avg_city = mean(cty), .by = c(manufacturer, year))

manfs <- df %>% distinct(manufacturer) %>% pull()

p <- plot_ly()

for(manf in manfs){
  df_manf <- df %>% 
    filter(manufacturer == manf)
  p <- add_trace(p,
                 mode = "lines+markers",
                 x = ~year,
                 y = ~avg_city,
                 data = df_manf) # must include new df as data for plolty layer
}

p