Tidy Tuesday: 20 Feb 2024 - R Grants

R
tidy tuesday
funding
Authors

Kennedy Owusu-Afriyie

Antony Clark

Brendan Clarke

Jon Minton

Nick Christofides

Steph Curtis

Gats Osorio

Andrew Saul

Myrian Scansetti

Published

February 22, 2024

Introduction

This TidyTuesday session investigated the funding of Infrastructure Steering Committee (ISC) grants from the R Consortium over time, and was led by Kennedy Owusu-Afriyie.

Data loading

We looked at two options for loading the dataset: one using the tidytuesdayR package; the other reading directly from the URL.

Code
# Option 1: tidytuesdayR package 
## install.packages("tidytuesdayR")
 
library(tidyverse)
library(tidytuesdayR)
 
 
# tuesdata <- tidytuesdayR::tt_load('2024-02-20')
# ## OR
# tuesdata <- tidytuesdayR::tt_load(2024, week = 8)
 
# isc_grants <- tuesdata$isc_grants
 
# Option 2: Read directly from GitHub
 
isc_grants <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-02-20/isc_grants.csv')
 
isc_grants
# A tibble: 85 × 7
    year group title                          funded proposed_by summary website
   <dbl> <dbl> <chr>                           <dbl> <chr>       <chr>   <chr>  
 1  2023     1 The future of DBI (extension …  10000 "Kirill Mü… "This … <NA>   
 2  2023     1 Secure TLS Communications for…  10000 "Charlie G… "The p… <NA>   
 3  2023     1 volcalc: Calculate predicted …  12265 "Kristina … "This … <NA>   
 4  2023     1 autotest: Automated testing o…   3000 "Mark Padg… "The p… <NA>   
 5  2023     1 api2r: An R Package for Auto-…  15750 "Jon Harmo… "This … <NA>   
 6  2022     2 D3po: R Package for Easy Inte…   8000 "Mauricio … "The D… <NA>   
 7  2022     2 Tooling and Guidance for Tran…   8000 "Maëlle Sa… "Tooli… <NA>   
 8  2022     2 Online Submission and Review …  22000 "Simon Urb… "The O… <NA>   
 9  2022     2 Upgrading SatRdays Website Te…   6000 "Ben Ubah"  "The U… <NA>   
10  2022     2 Building the “Spatial Data Sc…  25000 "Orhun Ayd… "The B… <NA>   
# ℹ 75 more rows

Some questions we initially thought about asking:

  • Are there any keywords that stand out in the titles or summaries of awarded grants?
  • Have the funded amounts changed over time?

As a fairly new R user, Kennedy focused on the second question, creating a bar plot of funding over time using ggplot2. Meanwhile, Clarke and Clark investigated and proposed some approaches for addressing the first question.

Graph of funding over time

Code
funding_by_year <- isc_grants %>% 
  group_by(year) %>% 
  summarise(total_funded = sum(funded)) %>% 
  ungroup()
 
funding_by_year %>% 
  ggplot(aes(x = year, y = total_funded)) + 
  geom_col() + 
  labs(
    x = "Year", 
    y = "Total funded (US dollars)",
    title = "Total funding by year",
    caption = "Source: TidyTuesday",
    subtitle = "2018 is a bit weird" 
  )

We discussed piping with the %>% operator, and the value it has for developing code step by step in a way that reads like natural language.

  • We said that when we see <- or ->, this should be read as ‘is assigned to’.
  • And when we see the %>% (or |>) operator in a script, this should be read as ‘and then’ (see the short sketch after this list).
  • We noted how R can tell when it encounters an incomplete expression, and so doesn’t evaluate it, just as when someone hears a sentence that ends ‘and then’, they know it isn’t really the end of the sentence.
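As a quick illustration (a sketch, not code from the session), the nested and piped versions below do the same thing; the piped one reads as ‘take isc_grants, and then keep the 2023 grants, and then count the rows’:

Code
# Nested version: read inside out
nrow(filter(isc_grants, year == 2023))

# Piped version: read left to right as "and then"
isc_grants %>% 
  filter(year == 2023) %>% 
  nrow()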

We also discussed how, when making a graph, we should consider how objective or subjective to be when presenting the image to the viewer. This will depend on the audience. In our example, the x axis, y axis, title and caption labels all convey objective information. The subtitle, however, is more subjective: it is our opinion, rather than something no one could reasonably disagree with.

Using tidytext to extract important keywords

Brendan offered the following code chunk to explore the content of the free text summary field in the dataset:

Code
#install.packages("tidytext")
#install.packages("SnowballC")
library(tidytext)
library(SnowballC) # for wordStem
 
isc_grants |>
  unnest_tokens(word, summary) |>            # one row per word in each summary
  anti_join(get_stopwords(), by = "word") |> # drop standard stop words
  mutate(stem = wordStem(word))              # reduce each word to its stem
# A tibble: 6,242 × 8
    year group title                      funded proposed_by website word  stem 
   <dbl> <dbl> <chr>                       <dbl> <chr>       <chr>   <chr> <chr>
 1  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    prop… prop…
 2  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    most… most…
 3  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    focu… focus
 4  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    main… main…
 5  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    supp… supp…
 6  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    dbi   dbi  
 7  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    dbit… dbit…
 8  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    test  test 
 9  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    suite suit 
10  2023     1 The future of DBI (extens…  10000 Kirill Mül… <NA>    three three
# ℹ 6,232 more rows

This pulled out words (other than stopwords1) from the summary field, and identified the stem of each word. This potentially means the number of unique stems can be compared, rather than the number of unique words.
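To get a feel for how much stemming collapses the vocabulary, we can count distinct words against distinct stems. This is a sketch rather than code from the session; summary_tokens is just a name we give to the result of the pipeline above:

Code
# Re-run the pipeline above, keeping the result this time
summary_tokens <- isc_grants |>
  unnest_tokens(word, summary) |>
  anti_join(get_stopwords(), by = "word") |>
  mutate(stem = wordStem(word))

# Stems collapse variants such as "test"/"testing", so there should be
# fewer unique stems than unique words
summary_tokens |>
  summarise(
    unique_words = n_distinct(word),
    unique_stems = n_distinct(stem)
  )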

Antony suggested that, as the summaries are all about supporting a technical programming language, some additional words are so common that they should also be treated as stop words. He also produced a wordcloud visualisation showing the most common non-stopwords in the corpus of summary text:

Code
# tidytext with SnowballC ----

# Combine the standard stop words with some domain-specific ones
my_stop_words <- 
  bind_rows(
    get_stopwords(),
    tibble(
      word = c("r","package","data","users","project","cran","community","use",
               "development","documentation","can","also","system","new","code",
               "available","existing","support","make","two","build"),
      lexicon = "tony's custom stop words"
    )
  ) %>% 
  mutate(stem = wordStem(word))

# Tokenize the summaries, stem each token, and drop stop words by stem
tokens <- 
  isc_grants %>%
  unnest_tokens(word, summary) %>% 
  mutate(stem = wordStem(word)) %>% 
  anti_join(my_stop_words, by = "stem")

# Count how often each remaining word appears, most frequent first
token_frequency <- tokens %>% count(word) %>% arrange(-n)

# Plot the 20 most frequent words as a wordcloud
wordcloud::wordcloud(words = token_frequency$word, 
          freq = token_frequency$n, min.freq = 1,
          max.words = 20, random.order = FALSE, rot.per = 0.35, 
          colors = RColorBrewer::brewer.pal(8, "Dark2"))
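A possible follow-up, not something we built in the session: since token_frequency is an ordinary tibble, the same counts could be shown as a bar chart with ggplot2, which makes exact frequencies easier to compare than a wordcloud:

Code
# A sketch: the ten most frequent tokens as a horizontal bar chart
token_frequency %>% 
  slice_head(n = 10) %>% 
  ggplot(aes(x = n, y = reorder(word, n))) + 
  geom_col() + 
  labs(
    x = "Occurrences in grant summaries",
    y = NULL,
    title = "Most common non-stopword tokens",
    caption = "Source: TidyTuesday"
  )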

Footnotes

  1. Stop words are terms that are so common within sentences they don’t really add much unique information. They’re words like ‘and’, ‘the’, ‘an’, and so on.↩︎