Intro

This is a blog post where we’ve covered a pair programming exercise completed as part of the EdinbR group.

This session [was]… led by Dr Brittany Blankinship and Dr Kasia Banas, two academics from the Data Driven Innovation for Health and Social Care Talent Team, based at the Usher Institute. Brittany and Kasia are avid R programmers and data science educators.

The link to the dataset and exercise description is here

There are two datasets. We are looking at one related to Madrid (and maybe why we shouldn’t go there as it’s polluted?!)

About the data

The data for this tutorial were collected under the instructions from Madrid’s City Council and are publicly available on their website. In recent years, high levels of pollution during certain dry periods has forced the authorities to take measures against the use of cars and act as a reasoning to propose certain regulations. These data include daily and hourly measurements of air quality from 2001 to 2008. Pollutants are categorized based on their chemical properties.

There are a number of stations set up around Madrid and each station’s data frame contains all particle measurements that such station has registered from 01/2001 - 04/2008. Not every station has the same equipment, therefore each station can measure only a certain subset of particles. The complete list of possible measurements and their explanations are given by the website:

SO_2: sulphur dioxide level measured in μg/m³. High levels can produce irritation in the skin and membranes, and worsen asthma or heart diseases in sensitive groups.
CO: carbon monoxide level measured in mg/m³. Carbon monoxide poisoning involves headaches, dizziness and confusion in short exposures and can result in loss of consciousness, arrhythmias, seizures or even death.
NO_2: nitrogen dioxide level measured in μg/m³. Long-term exposure is a cause of chronic lung diseases, and are harmful for the vegetation.
PM10: particles smaller than 10 μm. Even though they cannot penetrate the alveolus, they can still penetrate through the lungs and affect other organs. Long term exposure can result in lung cancer and cardiovascular complications.
NOx: nitrous oxides level measured in μg/m³. Affect the human respiratory system worsening asthma or other diseases, and are responsible of the yellowish-brown color of photochemical smog.
O_3: ozone level measured in μg/m³. High levels can produce asthma, bronchytis or other chronic pulmonary diseases in sensitive groups or outdoor workers.
TOL: toluene (methylbenzene) level measured in μg/m³. Long-term exposure to this substance (present in tobacco smoke as well) can result in kidney complications or permanent brain damage.
BEN: benzene level measured in μg/m³. Benzene is a eye and skin irritant, and long exposures may result in several types of cancer, leukaemia and anaemias. Benzene is considered a group 1 carcinogenic to humans.
EBE: ethylbenzene level measured in μg/m³. Long term exposure can cause hearing or kidney problems and the IARC has concluded that long-term exposure can produce cancer.
MXY: m-xylene level measured in μg/m³. Xylenes can affect not only air but also water and soil, and a long exposure to high levels of xylenes can result in diseases affecting the liver, kidney and nervous system.
PXY: p-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
OXY: o-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
TCH: total hydrocarbons level measured in mg/m³. This group of substances can be responsible of different blood, immune system, liver, spleen, kidneys or lung diseases.
NMHC: non-methane hydrocarbons (volatile organic compounds) level measured in mg/m³. Long exposure to some of these substances can result in damage to the liver, kidney, and central nervous system. Some of them are suspected to cause cancer in humans.

Tutorial goals

The goal of this tutorial is to see if pollutants are decreasing (is air quality improving) and also compare which pollutant has decreased the most over the span of 5 years (2001 - 2006).

High-level summary of tasks

First do a plot of one of the pollutants (EBE).
Next, group it by month and year; calculate the maximum value and plot it (to see the trend through time).
Now we will look at which pollutant decreased the most. Repeat the same thing for every column - to speed up the process, use the map() function. First we will look at pollution in 2001 (get the maximum value for each of the pollutants). And then do the same for 2006.

Packages

We will be using the tidyverse set of packages:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Exercises

Remember to switch the driver after each task.

Task 1

To begin with, load the madrid_pollution.csv data set into your R environment. Assign the data to an object called madrid.

data_url <- "https://raw.githubusercontent.com/bblankinship/EdinbRTalk-2024-02-23/main/madrid_pollution.csv"

dta <- read_tsv(data_url)

Rows: 51864 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr   (1): mnth
dbl  (15): BEN, CO, EBE, MXY, NMHC, NO_2, NOx, OXY, O_3, PM10, PXY, SO_2, TC...
dttm  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dta

# A tibble: 51,864 × 17
   date                  BEN     CO   EBE   MXY   NMHC  NO_2   NOx   OXY   O_3
   <dttm>              <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2001-08-01 01:00:00 1.5   0.340  1.49   4.10 0.07    56.2  75.2 2.11  42.2 
 2 2001-08-01 02:00:00 0.870 0.0600 0.880  2.41 0.01    29.7  31.4 1.20  56.5 
 3 2001-08-01 03:00:00 0.660 0.02   0.610  1.60 0.01    22.8  22.5 0.800 64.1 
 4 2001-08-01 04:00:00 0.470 0.0400 0.410  1    0.02    31.6  34.8 0.470 60.8 
 5 2001-08-01 05:00:00 0.600 0.0400 0.670  1.68 0.01    30.9  32.5 0.740 65.6 
 6 2001-08-01 06:00:00 0.520 0.0900 0.460  1.27 0.01    66.7  78.0 0.590 41.7 
 7 2001-08-01 07:00:00 0.540 0.120  0.510  1.33 0.01    69.7  85.3 0.630 32.6 
 8 2001-08-01 08:00:00 0.910 0.430  0.730  1.91 0.0600  97.8 139.  0.970 17.0 
 9 2001-08-01 09:00:00 1.62  0.75   1.29   3.40 0.120  108   177   1.65   8.36
10 2001-08-01 10:00:00 2.85  1.65   2.60   6.89 0.210  110.  202.  3.27   8.38
# ℹ 51,854 more rows
# ℹ 7 more variables: PM10 <dbl>, PXY <dbl>, SO_2 <dbl>, TCH <dbl>, TOL <dbl>,
#   year <dbl>, mnth <chr>

First thing we realised, it’s not really a CSV.

Task 2

Now that the data is loaded in R, create a scatter plot that compares ethylbenzene (EBE) values against the date they were recorded. This graph will showcase the concentration of ethylbenzene in Madrid over time. As usual, label your axes:

x = Date
y = Ethylbenzene (μg/m³)

Assign your answer to an object called EBE_pollution.

EBE_pollution <- dta |>
  ggplot(aes(date, EBE)) +
  geom_point(alpha = 0.4) +
  scale_x_datetime() +
  scale_y_log10() +
  labs(x = "Date", y = "Ethylbenzene (μg/m³)")

What is your conclusion about the level of EBE over time?

Looks like it has a seasonal pattern.

Task 3

The question above asks you to write out code that allows visualization of all EBE recordings in the dataset - which are taken every single hour of every day. Consequently the graph consists of many points and appears densely plotted. In this question, we are going to clean up the graph and focus on max EBE readings from each month. Create a new data set with maximum EBE reading from each month in each year. Save your new data set as madrid_pollution.

madrid_pollution <-
    dta |>
        mutate(
            mnth = month(date, label = TRUE),
            yr   = year(date)
        ) |>
        group_by(
            yr, mnth
        ) |>
        summarise(
            max_ebe = max(EBE, na.rm = TRUE)
        ) |>
        ungroup()

`summarise()` has grouped output by 'yr'. You can override using the `.groups`
argument.

madrid_pollution

# A tibble: 72 × 3
      yr mnth  max_ebe
   <dbl> <ord>   <dbl>
 1  2001 Jan     11.7 
 2  2001 Feb     18.9 
 3  2001 Mar     15.6 
 4  2001 Apr     12.5 
 5  2001 May     14.7 
 6  2001 Jun     12.0 
 7  2001 Jul     73.0 
 8  2001 Aug      8.39
 9  2001 Sep     10.1 
10  2001 Oct     39.8 
# ℹ 62 more rows

Task 4

Plot the new maximum EBE values versus the month they were recorded, split into side-by-side plots for each year.

Assign your answer to an object called madrid_plot.

madrid_plot <- madrid_pollution |>
  ggplot(aes(mnth, max_ebe)) +
  geom_col() +
  scale_y_log10() +
  labs(x = "Month", y = "Maximum Ethylbenzene (μg/m³)") +
  facet_wrap(~ yr) + 
  coord_flip()

madrid_plot

Task 5

Now we want to see which of the pollutants has decreased the most in 2001.

Assign your answer to an object called pollution_2001.

dta |>
    filter(year(date) == 2001) |>
    select(-year, -mnth) |>
    pivot_longer(-date, names_to = "pollutant", values_to = "value") |>
    group_by(pollutant) |>
    summarise(
        max_val = max(value, na.rm = TRUE),
        min_val = min(value, na.rm = TRUE)
    ) |>
    mutate(diff = max_val - min_val) |> 
    arrange(desc(diff))

# A tibble: 14 × 4
   pollutant max_val min_val    diff
   <chr>       <dbl>   <dbl>   <dbl>
 1 NOx       1416      5.13  1411.  
 2 NO_2       271.     5.03   266.  
 3 PM10       266.     0.820  265.  
 4 TOL        243.     0.360  243.  
 5 O_3        173.     4.98   168.  
 6 SO_2       137.     0.190  137.  
 7 PXY        103      0.150  103.  
 8 OXY        103      0.190  103.  
 9 MXY         93.1    0.230   92.9 
10 EBE         77.3    0.140   77.1 
11 BEN         49.9    0.100   49.8 
12 CO          10.4    0.01    10.4 
13 TCH          4.77   0.760    4.01
14 NMHC         2.42   0        2.42

Thoughts:

This is an ambiguous question. We’ve chosen to answer it in a way that doesn’t make sense but is quick to solve!

Task 6

Now repeat what you did for Task 5, but filter for 2006 instead. Assign your answer to an object called pollution_2006.

pollution_2006 <- dta |>
    filter(year(date) == 2006) |>
    select(-year, -mnth) |>
    pivot_longer(-date, names_to = "pollutant", values_to = "value") |>
    group_by(pollutant) |>
    summarise(
        max_val = max(value, na.rm = TRUE),
        min_val = min(value, na.rm = TRUE)
    ) |>
    mutate(diff = max_val - min_val) |> 
    arrange(desc(diff))

pollution_2006

# A tibble: 14 × 4
   pollutant  max_val min_val     diff
   <chr>        <dbl>   <dbl>    <dbl>
 1 NOx       1274      6.68   1267.   
 2 NO_2       287.     5.45    282.   
 3 PM10       269.     0.660   268.   
 4 O_3        132      3.20    129.   
 5 TOL         64.8    0.160    64.7  
 6 SO_2        66.2    5.59     60.6  
 7 MXY         54.9    0.260    54.6  
 8 OXY         31.4    0.310    31.1  
 9 PXY         27.0    0.300    26.7  
10 EBE         20.0    0.210    19.8  
11 BEN         16.9    0.150    16.7  
12 CO           3.48   0.140     3.34 
13 TCH          2.84   1.08      1.76 
14 NMHC         0.970  0.0800    0.890

Task 7

Which pollutant decreased by the greatest magnitude between 2001 and 2006? Come up with a programmatic solution

We’ll interpret this as the average for the year for each pollutant in both years.

dta |> 
    filter(year %in% c(2001, 2006)) |>
    select(-date, -mnth) |>
    pivot_longer(-year, names_to = "pollutant", values_to = "value") |>
    group_by(year, pollutant) |>
    summarise(mean_value = mean(value, na.rm = TRUE)) |>
    ungroup() |> 
    group_by(pollutant) |>
    summarise(
        change = mean_value[year == 2006] - mean_value[year == 2001]
    ) |>
    ungroup() |>
    arrange(change)

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

# A tibble: 14 × 2
   pollutant   change
   <chr>        <dbl>
 1 NOx       -52.4   
 2 NO_2      -11.2   
 3 TOL        -9.53  
 4 SO_2       -4.18  
 5 MXY        -3.70  
 6 BEN        -1.76  
 7 PXY        -1.63  
 8 EBE        -1.59  
 9 OXY        -1.55  
10 CO         -0.445 
11 TCH        -0.103 
12 NMHC        0.0363
13 O_3         1.66  
14 PM10        3.14

NOTE: This worksheet has been adapted from Data Science: First Introduction Worksheets, by Tiffany Timbers, Trevor Campbell and Melissa Lee, available at https://worksheets.datasciencebook.ca/