Tidy Tuesday: Self-generated data challenge

R
tidy tuesday
data wrangling
Authors

Jon Minton

Kate Pyper

Kennedy Owusu-Afriyie

Brendan Clarke

Imran Chowdhury

Andrew Saul

Published

May 29, 2024

Introduction

For this session we chose to focus on a data wrangling challenge that Andrew brought to the table, rather than the most recent dataset. The challenge involved converting slightly messy data structure from the first format seen below, to the second (tidy) structure:

library(tidyverse)

df <-   tribble(~`...1`, ~`...2`, ~`...3`, ~`...4`, ~`...5`, ~`...6`, ~`...7`, ~`...8`, ~`...9`,
          NA,NA, NA, "House Stats in Nov", NA, "Flat Stats in Nov", NA, "All Stats in Nov", NA,
          "Region", "LA", "LACode", "Count", "Sold", "Count", "Sold", "Count", "Sold",
          "Scotland",  "Minyip", "M394932", "1000", "900", "600", "300", "1600", "1200")

df
# A tibble: 3 × 9
  ...1     ...2   ...3    ...4               ...5  ...6        ...7  ...8  ...9 
  <chr>    <chr>  <chr>   <chr>              <chr> <chr>       <chr> <chr> <chr>
1 <NA>     <NA>   <NA>    House Stats in Nov <NA>  Flat Stats… <NA>  All … <NA> 
2 Region   LA     LACode  Count              Sold  Count       Sold  Count Sold 
3 Scotland Minyip M394932 1000               900   600         300   1600  1200 
# desired output
tribble(~"region",  ~"la", ~"la_code", ~"house_count", ~"house_sold", ~"flat_count", ~"flat_sold", ~"all_count", ~"all_sold",
        "Scotland",  "Minyip", "M394932", "1000", "900", "600", "300", "1600", "1200") 
# A tibble: 1 × 9
  region   la     la_code house_count house_sold flat_count flat_sold all_count
  <chr>    <chr>  <chr>   <chr>       <chr>      <chr>      <chr>     <chr>    
1 Scotland Minyip M394932 1000        900        600        300       1600     
# ℹ 1 more variable: all_sold <chr>

Andrew had a prepared solution. But Kate led the session by developing another solution from scratch

Katie’s solution

library(tidyverse)
library(janitor)
library(fillr)
 
df <- 
  tribble(~`...1`, ~`...2`, ~`...3`, ~`...4`, ~`...5`, ~`...6`, ~`...7`, ~`...8`, ~`...9`,
          NA,NA, NA, "House Stats in Nov", NA, "Flat Stats in Nov", NA, "All Stats in Nov", NA,
          "Region", "LA", "LACode", "Count", "Sold", "Count", "Sold", "Count", "Sold",
          "Scotland",  "Minyip", "M394932", "1000", "900", "600", "300", "1600", "1200")
 
nm1 <- unlist(df[1,])
nm2 <- unlist(df[2,])
nm1
                ...1                 ...2                 ...3 
                  NA                   NA                   NA 
                ...4                 ...5                 ...6 
"House Stats in Nov"                   NA  "Flat Stats in Nov" 
                ...7                 ...8                 ...9 
                  NA   "All Stats in Nov"                   NA 
nm2
    ...1     ...2     ...3     ...4     ...5     ...6     ...7     ...8 
"Region"     "LA" "LACode"  "Count"   "Sold"  "Count"   "Sold"  "Count" 
    ...9 
  "Sold" 
nm1 <- str_extract(nm1, "\\w*(?=\\s)")
nm1 <- fill_missing_previous(nm1)
nm1[is.na(nm1)] <- ""
nm1
[1] ""      ""      ""      "House" "House" "Flat"  "Flat"  "All"   "All"  
nms <- paste(nm1, nm2)
 
names(df) <- nms
 
df <- clean_names(df[-(1:2),])
df
# A tibble: 1 × 9
  region   la     la_code house_count house_sold flat_count flat_sold all_count
  <chr>    <chr>  <chr>   <chr>       <chr>      <chr>      <chr>     <chr>    
1 Scotland Minyip M394932 1000        900        600        300       1600     
# ℹ 1 more variable: all_sold <chr>

This solution involved a mixture of tidyverse and base R expressions. Brendan made the point that tidyverse is great at doing 80% of the work needed quickly, but sometimes base R is needed to complete the remaining 20%.

Other solutions

Andrew proposed the following function for doing the data cleaning.

merge_rows <- function(df){ # eg. lst[[1]]

  # convert rows1&2 into columns
  row1 <- df[1, ] |> t() 
  row2 <- df[2, ] |> t() 
  # remove selected text then fill down NAs
  row1 <- str_remove(row1, " Stats in Nov") |> 
    as_tibble() |> 
    fill(value) |> 
    pull() |> replace_na("")

  row3 <- str_c(row1, " ", row2) |> 
    str_trim()
  # create same header vector as original df
  header_vec <- character()
  for (i in seq_along(df)){
    header_vec[i] <- str_c("...", i)
  }
  # create tibble with header and 1st row of df
  tib <-
    t(row3) |> 
    as_tibble() |> 
    set_names(header_vec) 
  return(tib)
}
 
create_header_df <- function(df) {
  merge_rows(df) |> 
    bind_rows(df) |> 
    janitor::row_to_names(1) |> 
    clean_names()
}
 
create_header_df(df) |> 
  slice(-c(1:2))
# A tibble: 0 × 18
# ℹ 18 variables: na <chr>, na_2 <chr>, na_3 <chr>, na_4 <chr>, na_5 <chr>,
#   na_6 <chr>, na_7 <chr>, na_8 <chr>, na_9 <chr>, na_10 <chr>, na_11 <chr>,
#   na_12 <chr>, na_13 <chr>, na_14 <chr>, na_15 <chr>, na_16 <chr>,
#   na_17 <chr>, na_18 <chr>

And Brendan suggested the following tidyverse solution:

nice_names <- tibble(one = unlist(df[1,]), 
       two = unlist(df[2,])) |>
  mutate(one = str_replace_all(one,  "Stats in Nov", "")) |>
  fill(one) |>
  mutate(three = str_replace_all(paste(one, two), "NA ", "")) |>
  pull(three)

names(df) <- nice_names
df |>
  janitor::clean_names()
# A tibble: 1 × 9
  scotland_na minyip_na m394932_na x1000_na x900_na x600_na x300_na x1600_na
  <chr>       <chr>     <chr>      <chr>    <chr>   <chr>   <chr>   <chr>   
1 Scotland    Minyip    M394932    1000     900     600     300     1600    
# ℹ 1 more variable: x1200_na <chr>

Discussion

After Kate solved Andrew’s problem with lots of time to spare, she then ran through a real-world example that makes use of some similar approaches. The use of unlist() to extract vectors to work on was a big part of both solutions.

We all agreed Kate was excellent with these kinds of data wrangling challenges, despite coding with no prior warning of the challenge and wet hair.