The latest TidyTuesday dataset is on Groundhog Days, a North American tradition in which the behaviours of specific groundhogs are used to make predictions about the weather over the next six weeks, as immortalised in the eponymous sci-fi rom-com featuring Bill Murray.
Oddly, the data provided does not include meteorological information on whether the groundhogs’ predictions are accurate. (Who knows? Maybe they are!) But the data do include latitude, longitude, and other geographic information. So, we decided to see if we could plot these Groundhog Day events on an interactive map.
First we load the data, using the bespoke tidytuesdayR package:
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
# The number of queries made via tidyTuesdayR is limited. The commented code below shows how to extract the TidyTuesday data using the tidytuesdayR package. Instead I'll link directly:# dat <- tidytuesdayR::tt_load('2024-01-30')
Code
# predictions <- dat |># pluck(1) # that's the predictions
Code
# dat |># pluck(2) # that's the groundhogs
Code
# groundhogs <- dat |># pluck(2) # Direct approach groundhogs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/groundhogs.csv')
Rows: 75 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): slug, shortname, name, city, region, country, source, current_pred...
dbl (4): id, latitude, longitude, predictions_count
lgl (2): is_groundhog, active
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1462 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): details
dbl (2): id, year
lgl (1): shadow
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We now have the two datasets in separate objects. Let’s look at some of the information in the description field
# A tibble: 43 × 1
description
<chr>
1 Octoraro Orphie, from Quarryville Pennsylvania, is a taxidermied world-renow…
2 Concord Charlie is a presumed groundhog from Athens, West Virginia. In a tra…
3 Lander Lil, a bronze statue of a prairie dog, has been predicting the future…
4 Groundhog puppet Manitoba Merv has been predicting the arrival of spring at …
5 Schnogadahl Sammi is a literally unpronounceable taxidermied groundhog mount…
6 Poor Richard is a taxidermied groundhog who prognosticates for the Slumberin…
7 Balzac Billy is the “Prairie Prognosticator”, a man-sized gopher mascot who …
8 Every year on February 2nd, Myerstown’s favorite groundhog “Uni” is placed o…
9 Grover the Groundhog and Sweet Arrow Sue are a taxidermied groundhog couple …
10 Stormy Marmot is a plush yellow-bellied marmot from Aurora, Colorado. He is …
# ℹ 33 more rows
Code
predictions |>count(shadow, sort=T)
# A tibble: 3 × 2
shadow n
<lgl> <int>
1 TRUE 665
2 FALSE 652
3 NA 145
Code
# dat |># pluck(1) |># left_join(groundhogs) predictions |>left_join(groundhogs)
Joining with `by = join_by(id)`
# A tibble: 1,462 × 20
id year shadow details slug shortname name city region country
<dbl> <dbl> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1886 NA Groundhog Day … punx… Phil Punx… Punx… Penns… USA
2 1 1887 TRUE First Official… punx… Phil Punx… Punx… Penns… USA
3 1 1888 TRUE Saw Shadow. punx… Phil Punx… Punx… Penns… USA
4 1 1889 NA No Record. punx… Phil Punx… Punx… Penns… USA
5 1 1890 FALSE No Shadow. punx… Phil Punx… Punx… Penns… USA
6 1 1891 NA No Record. punx… Phil Punx… Punx… Penns… USA
7 1 1892 NA No Record. punx… Phil Punx… Punx… Penns… USA
8 1 1893 NA No Record. punx… Phil Punx… Punx… Penns… USA
9 1 1894 NA No Record. punx… Phil Punx… Punx… Penns… USA
10 1 1895 NA No Record. punx… Phil Punx… Punx… Penns… USA
# ℹ 1,452 more rows
# ℹ 10 more variables: latitude <dbl>, longitude <dbl>, source <chr>,
# current_prediction <chr>, is_groundhog <lgl>, type <chr>, active <lgl>,
# description <chr>, image <chr>, predictions_count <dbl>
Mapping
We decided to try using Leaflet to plot where the Groundhog day events occurred.
Code
#install.packages('leaflet')library(leaflet)
We start by following one of the first examples in the Leaflet intro page above, adding markers to geolocate the sightings/events, and a popup with the name (assigned to) the Groundhog:
We next wanted to colour these markers based on whether the predictions are classed as ‘active’ or not. This was slightly more tricky, but the example given in the Awesome Icons section of the markers part of the Leaflet documentation gave the following code pattern we could work with:
So, in the above addAwesomeMarkers() is used in place of addMarkers(), and takes an argument icon. A vector of icons is created of the same length as the number of rows of the dataframe, whose colour is determined through the getColor function.
In our case we are interested in the active column, which has just two mutually exclusive categories: TRUE and FALSE. So we just need two colours to be specified
Code
# We saw if we could implement the code pattern above using list columns, but were not successful# colouring markers# groundhogs_icons <- groundhogs |># mutate(markerColor = ifelse(active, "green", "red")) |># rowwise() |># mutate(icon = list(awesomeIcons(# icon = 'ios-close',# iconColor = 'black',# library = 'ion',# markerColor = markerColor# )))getColor <-function(groundhogs) {sapply(groundhogs$active, function(active) {if(active) {"green" } else {"red" } })}# create vector of matching vectorsicons <-awesomeIcons(icon ='ios-close',iconColor ='black',library ='ion',markerColor =getColor(groundhogs))# now with active/inactive iconsgroundhogs |>select(lat = latitude, lng = longitude, popup = name) |>leaflet() |>addTiles()|>addAwesomeMarkers(~lng, ~lat, popup =~popup, icon = icons)
Challenge complete! As we would expect, most predictions are not currently active.
Other possibilities
Some other things we could have explored include:
Attempting to link to appropriate meteorological data to see if the predictions came true at more than chance rates (likely a challenge)
Natural Language Programming to identify patterns and key terms in the free text fields like description
Additional customisation of the leaflet maps, such as including additional popup fields, further customising the icons based on multiple variables, and adding date sliders to give a third dimension (latitude, longitude, and date) to the user display
Additional
Andrew presented the following code solution for how to use plotly to produce multiple traces based on summary stats:
Code
library(plotly)df <- mpg %>%summarise(avg_city =mean(cty), .by =c(manufacturer, year))manfs <- df %>%distinct(manufacturer) %>%pull()p <-plot_ly()for(manf in manfs){ df_manf <- df %>%filter(manufacturer == manf) p <-add_trace(p,mode ="lines+markers",x =~year,y =~avg_city,data = df_manf) # must include new df as data for plolty layer}p