Tidy Tuesday Trash

Into the pythonverse!

Authors

Gatz Osario

Jon Minton

Antony Clark

Brendan Clarke

Kennedy Owusu-Afriyie

Kate Pyper

Andrew Saul

Myriam Scansetti

Published

March 6, 2024

Introduction

The latest TidyTuesday dataset is on trash collected as part of the Mr Trash Wheel Baltimore Healthy Harbor Initiative.

This session was led by Gatz Osario, and differed from previous TardyTuesday sessions in that both Gatz and Jon had looked at the dataset and prepared materials ahead of the session.

Gatz provided an expert introduction to using Python for data science and data visualisation, using the Plotly libraries for interactive visualisation. Gatz used Google Colab for the session itself, which allows Jupyter notebooks to be created and run online. In this post the same Python chunks are run within Quarto.

Gatz used a subset of the data containing two factor regression scores that Jon generated in R. The R code for generating this derived dataset is shown below, but was not presented at the (already packed) session.

Factor Analysis in R

Jon started by loading the tidyverse and the most recent dataset:

library(tidyverse)

df <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-05/trashwheel.csv')

glimpse(df)

The dataset contains the numbers of different types of item extracted each time the barge went to collect trash. These counts aren't in comparable units (i.e. they aren't recorded by weight or volume, which could be compared directly).

Jon looked at one- and two-factor solutions to see if there are relationships between the types of items that tend to be collected together. First, the one-factor solution:

f_1 <- factanal(~ PlasticBottles + Polystyrene + CigaretteButts + GlassBottles + PlasticBags + Wrappers + SportsBalls, data = df, factors = 1)

f_1

The single factor has a sum of squared (SS) loadings of 3.2, meaning (roughly) that it captures about three of the seven variables' worth of information (3.2/7 ≈ 46% of the total variance).

Polystyrene, Plastic Bags and Wrappers all had strong factor loadings. The most unique item (i.e. the one least well captured by the factor) was SportsBalls.

Now the two-factor solution:

f_2 <- factanal(~ PlasticBottles + Polystyrene + CigaretteButts + GlassBottles + PlasticBags + Wrappers + SportsBalls, data = df, factors = 2, scores = "regression")

f_2

The first factor has strong loadings on Polystyrene, Plastic Bags, and Wrappers. The second factor has strong loadings for glass bottles and cigarette butts. (So, smoking- and drinking-related trash?)

The argument scores = "regression" was added to allow the scores of each factor to be returned and attached to all rows in the original dataframe where they could be calculated.

df2 <- df %>%
    filter(complete.cases(.)) %>%
    mutate(
        plastic_dumping_score = f_2$scores[,1],
        drinking_smoking_score = f_2$scores[,2]
    )

The following shows how the contents returned by the trash barge varied in terms of these two factor scores, by year:

df2 |>
    mutate(density = Weight / Volume) |>
    ggplot(aes(x = plastic_dumping_score, y = drinking_smoking_score)) + 
    geom_point(aes(alpha = density)) + 
    facet_wrap(~ Year) +
    geom_vline(xintercept = 0) + 
    geom_hline(yintercept = 0)

In the earliest years there seemed to be more variation in the types of item returned by the barge, and more glass bottles and cigarette butts. Over the first few years the amount of plastic waste returned seemed to increase, but it declined after peaking in 2017.

To make it easier for Python to read the file with factor scores we generated, I (Jon) will save it as a CSV file:

write_csv(df2,  here::here("posts", "tardy-tuesday", "tidy-tuesday-trash", "df_with_factor_scores.csv"))

Data manipulation and visualisation in Python

First Gatz imported the relevant libraries

import pandas as pd
import datetime
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
pd.options.display.max_colwidth = 300

Just checking Python works in the Quarto document:

1 + 1
2

Load the data

df = pd.read_csv("df_with_factor_scores.csv")
df['Date'] = pd.to_datetime(df['Date'], format = "%m/%d/%Y", errors = 'coerce')
print(df.shape)
df.head(3)
(629, 18)
ID Name Dumpster Month Year Date Weight Volume PlasticBottles Polystyrene CigaretteButts GlassBottles PlasticBags Wrappers SportsBalls HomesPowered plastic_dumping_score drinking_smoking_score
0 mister Mister Trash Wheel 1 May 2014 2014-05-16 4.31 18 1450 1820 126000 72 584 1162 7 0 -1.261913 3.589004
1 mister Mister Trash Wheel 2 May 2014 2014-05-16 2.74 13 1120 1030 91000 42 496 874 5 0 -0.927719 1.558504
2 mister Mister Trash Wheel 3 May 2014 2014-05-16 3.45 15 2450 3100 105000 50 1080 2032 6 0 -0.022253 1.875345
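
The errors = 'coerce' argument is worth a note: any value that doesn't match the supplied format becomes NaT (not-a-time) instead of raising an error, which is why some Date values end up missing further down. A minimal sketch with made-up strings:

s = pd.Series(["05/16/2014", "2014-05-16", "not a date"])
pd.to_datetime(s, format="%m/%d/%Y", errors="coerce")
# 0   2014-05-16   <- matches the format, parsed
# 1          NaT   <- valid date, but wrong format, coerced
# 2          NaT   <- not a date at all, coerced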

More information

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 629 entries, 0 to 628
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   ID                      629 non-null    object        
 1   Name                    629 non-null    object        
 2   Dumpster                629 non-null    int64         
 3   Month                   629 non-null    object        
 4   Year                    629 non-null    int64         
 5   Date                    592 non-null    datetime64[ns]
 6   Weight                  629 non-null    float64       
 7   Volume                  629 non-null    int64         
 8   PlasticBottles          629 non-null    int64         
 9   Polystyrene             629 non-null    int64         
 10  CigaretteButts          629 non-null    int64         
 11  GlassBottles            629 non-null    int64         
 12  PlasticBags             629 non-null    int64         
 13  Wrappers                629 non-null    int64         
 14  SportsBalls             629 non-null    int64         
 15  HomesPowered            629 non-null    int64         
 16  plastic_dumping_score   629 non-null    float64       
 17  drinking_smoking_score  629 non-null    float64       
dtypes: datetime64[ns](1), float64(3), int64(11), object(3)
memory usage: 88.6+ KB

Convert Year to integer (df.info() above shows it is already int64, so this just makes the intent explicit)

df['Year'] = df['Year'].astype(int)

Check for missing observations

df.isna().sum()
ID                         0
Name                       0
Dumpster                   0
Month                      0
Year                       0
Date                      37
Weight                     0
Volume                     0
PlasticBottles             0
Polystyrene                0
CigaretteButts             0
GlassBottles               0
PlasticBags                0
Wrappers                   0
SportsBalls                0
HomesPowered               0
plastic_dumping_score      0
drinking_smoking_score     0
dtype: int64

Drop rows with missing values (the 37 rows whose Date failed to parse)

df = df.dropna()
print(df.shape)
df.isna().sum()
(592, 18)
ID                        0
Name                      0
Dumpster                  0
Month                     0
Year                      0
Date                      0
Weight                    0
Volume                    0
PlasticBottles            0
Polystyrene               0
CigaretteButts            0
GlassBottles              0
PlasticBags               0
Wrappers                  0
SportsBalls               0
HomesPowered              0
plastic_dumping_score     0
drinking_smoking_score    0
dtype: int64
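
Since Date is the only column with missing values, an equivalent but more explicit alternative (a sketch, not what was run in the session) is to target just that column:

# Equivalent here: drop only rows whose Date failed to parse
df = df.dropna(subset=["Date"])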

Sort by date

df.sort_values(by=['Date'], inplace=True)

Visualisation

Produce list of theme options and select the third

options = ["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white"]
template = options[2]
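
As an aside (not part of the session code), Plotly can also register a default template once via plotly.io, which saves passing template= to every figure:

import plotly.io as pio

# Make options[2] ("plotly_dark") the default for all later figures
pio.templates.default = template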

Look at the query syntax

df.query("Year < 2017")
ID Name Dumpster Month Year Date Weight Volume PlasticBottles Polystyrene CigaretteButts GlassBottles PlasticBags Wrappers SportsBalls HomesPowered plastic_dumping_score drinking_smoking_score
0 mister Mister Trash Wheel 1 May 2014 2014-05-16 4.31 18 1450 1820 126000 72 584 1162 7 0 -1.261913 3.589004
1 mister Mister Trash Wheel 2 May 2014 2014-05-16 2.74 13 1120 1030 91000 42 496 874 5 0 -0.927719 1.558504
2 mister Mister Trash Wheel 3 May 2014 2014-05-16 3.45 15 2450 3100 105000 50 1080 2032 6 0 -0.022253 1.875345
3 mister Mister Trash Wheel 4 May 2014 2014-05-17 3.10 15 2380 2730 100000 52 896 1971 6 0 -0.285576 2.064766
4 mister Mister Trash Wheel 5 May 2014 2014-05-17 4.06 18 980 870 120000 72 368 753 7 0 -1.664942 3.678861
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
161 mister Mister Trash Wheel 162 November 2016 2016-11-30 2.75 18 3460 5840 16000 42 3260 3430 34 46 2.801646 0.685263
162 mister Mister Trash Wheel 163 December 2016 2016-12-01 3.41 15 1840 4760 23000 43 3470 3800 6 57 2.925675 0.713372
163 mister Mister Trash Wheel 164 December 2016 2016-12-06 2.55 15 1360 3850 34000 39 2340 4220 24 43 1.972863 0.685088
164 mister Mister Trash Wheel 165 December 2016 2016-12-16 1.74 18 1880 2890 26000 59 2100 4040 20 29 1.233737 2.149005
165 mister Mister Trash Wheel 166 December 2016 2017-01-02 2.13 15 2460 2740 32000 48 3250 4430 15 36 2.624921 1.110940

162 rows × 18 columns
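
query() can also reference Python variables with an @ prefix, which avoids hard-coding values into the query string. A small sketch:

cutoff = 2017
df.query("Year < @cutoff")                   # same result as above
df.query("Year < @cutoff and Volume >= 15")  # conditions combine with and/or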

Produce the first plot

dftemp = df.query("Year < 2017").copy()
fig = px.box(dftemp, x='Year',y='Volume',color = 'ID')
fig.update_layout(
    title = "<b>Plot 1: Volume per ID box plot</b>",
    xaxis = dict(title='Years available'),
    yaxis = dict(title='Volume (cubic yards)'),
    template=template
)
fig.show()

Second figure

dftemp = df[['Date','ID','PlasticBottles']].copy()
dftemp['Yearmonth'] = dftemp['Date'].apply(lambda x: x.strftime('%Y-%m'))
del dftemp['Date']
dftemp=dftemp.groupby(['Yearmonth','ID']).sum()
dftemp.reset_index(inplace=True)
fig_line = px.line(dftemp, x = 'Yearmonth',y = 'PlasticBottles',color = 'ID',
  labels = {'PlasticBottles': 'N of bottles', 'ID': 'Identifier', 'Yearmonth': 'Year and month'},template=template
)
fig_line.update_layout(
    title = "<b>Plot 2: N of bottles per year and month</b>",
    xaxis = dict(title='Time series'),
    yaxis = dict(title='Amount (units)')
)
fig_line.show()
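
An alternative sketch (not session code): pd.Grouper can do the monthly bucketing directly, keeping Date as a real datetime rather than a 'YYYY-MM' string, so Plotly treats the x-axis as a proper date axis:

# Group to calendar months without converting dates to strings
monthly = (
    df.groupby([pd.Grouper(key="Date", freq="MS"), "ID"])["PlasticBottles"]
    .sum()
    .reset_index()
)
fig_line2 = px.line(monthly, x="Date", y="PlasticBottles", color="ID", template=template)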

Check which unique years we have

df["Year"].unique()
array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023])

Produce a subplot for different years

fig = make_subplots(rows=1, cols=2)
c = 1
for year in df["Year"].unique():
  if year > 2014 and year < 2017:
    dftemp = df.query("Year == {}".format(year)).copy()
    dftemp["Month"] = dftemp["Date"].apply(lambda x: x.strftime('%m'))
    dftemp = dftemp[['Year','Month','PlasticBottles']].copy()
    dftemp = dftemp.groupby(['Year','Month']).sum()
    dftemp.reset_index(inplace=True)
    fig.add_trace(go.Scatter(x=dftemp.Month, y=dftemp.PlasticBottles, name=str(year)), row=1, col=c)
    c = c + 1
fig.update_layout(title_text="<b>Plot 3: Side By Side Subplots</b>", template=template)
fig.show()
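
Two small refinements worth knowing about (a sketch, not session code): make_subplots accepts a subplot_titles argument to label each panel, and query() can take the loop variable directly via the @ prefix:

# Same layout, with each panel labelled by year
fig = make_subplots(rows=1, cols=2, subplot_titles=("2015", "2016"))
# ...and inside the loop:
# dftemp = df.query("Year == @year").copy()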

Stacked subplots. (Note that append_trace is deprecated in recent Plotly releases; add_trace with row and col arguments does the same job.)

fig = make_subplots(rows=2, cols=1)
r = 1
for year in df["Year"].unique():
  if year > 2014 and year < 2017:
    dftemp = df.query("Year == {}".format(year)).copy()
    dftemp["Month"] = dftemp["Date"].apply(lambda x: x.strftime('%m'))
    dftemp = dftemp[['Year','Month','PlasticBottles']].copy()
    dftemp = dftemp.groupby(['Year','Month']).sum()
    dftemp.reset_index(inplace=True)
    fig.append_trace(go.Scatter(x=dftemp.Month, y=dftemp.PlasticBottles, name=str(year)), row=r, col=1)
    r = r + 1
fig.update_layout(title_text="<b>Plot 4: Stacked Subplots</b>", template=template)
fig.show()

Gridded subplots with made-up data:

fig = make_subplots(rows=2, cols=2)
fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6]), row=1, col=1)
fig.add_trace(go.Scatter(x=[20, 30, 40], y=[50, 60, 70]), row=1, col=2)
fig.add_trace(go.Scatter(x=[300, 400, 500], y=[600, 700, 800]), row=2, col=1)
fig.add_trace(go.Scatter(x=[4000, 5000, 6000], y=[7000, 8000, 9000]), row=2, col=2)
fig.update_layout(title_text="Grid Subplots", template=template)
fig.show()

There's only one barge in this filtered dataset; the complete.cases() filter in R will have dropped the other trash wheels, presumably because they don't record every item type.

df["Name"].unique()
array(['Mister Trash Wheel'], dtype=object)
dftemp = df[['Date','plastic_dumping_score','Name']].copy()
dftemp['Yearmonth'] = dftemp['Date'].apply(lambda x: x.strftime('%Y-%m'))
del dftemp['Date']
dftemp=dftemp.groupby(['Yearmonth','Name']).sum()
dftemp.reset_index(inplace=True)
fig_area = px.area(dftemp, x = 'Yearmonth',y = 'plastic_dumping_score',color = 'Name', template=template)
fig_area.update_layout(
    title = "<b>Plot 6: Dumping score per year and name</b>",
    xaxis = dict(title='Year and Month'),
    yaxis = dict(title='Total dumping score')
)
fig_area.show()

An interactive treemap

dftemp = df[['Month','Year','PlasticBottles']].copy()
dftemp=dftemp.groupby(['Month','Year']).sum()
dftemp.reset_index(inplace=True)
fig_tree_maps = px.treemap(dftemp, path= ['Year','Month'],values ='PlasticBottles',color_continuous_scale='RdBu', template=template)
fig_tree_maps.update_layout(
    title = "<b>Plot 7: Tree map of bottles per year and month</b>"
)
fig_tree_maps.show()
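
One optional refinement (a sketch using Plotly's px.Constant helper, not shown in the session): adding a constant root node groups all years under a single parent tile, which makes it easier to zoom back out:

# Add a single root node above the Year level
fig_tree_root = px.treemap(
    dftemp,
    path=[px.Constant("All years"), "Year", "Month"],
    values="PlasticBottles",
    template=template,
)
fig_tree_root.show()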

And a 3D plot!

dftemp = df[['Year','drinking_smoking_score','plastic_dumping_score','ID']].copy()
dftemp=dftemp.groupby(['Year','ID']).mean()
dftemp.reset_index(inplace=True)
fig_scatter3D = px.scatter_3d(dftemp,x = 'Year',y='drinking_smoking_score', z = 'plastic_dumping_score', color = 'ID',opacity=0.7, template=template)
fig_scatter3D.update_layout(title = "<b>Plot 8: Year and plastic and drinking scores</b>")
fig_scatter3D.show()

And a pie chart:

dftemp = df[['Year','PlasticBags']].copy()
dftemp=dftemp.groupby(['Year']).sum()
dftemp.reset_index(inplace=True)
fig = go.Figure(
    data=[go.Pie(
        labels=dftemp['Year'],
        values=dftemp['PlasticBags'],
        sort=False)
    ])
fig.update_layout(title = "<b>Plot 9: Plastic bags per year</b>", template=template)
fig.show()
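
For what it's worth, go.Pie also accepts a hole argument (a standard Plotly parameter), which turns the pie into a donut chart:

# The same data, rendered as a donut chart
fig_donut = go.Figure(
    data=[go.Pie(labels=dftemp["Year"], values=dftemp["PlasticBags"],
                 sort=False, hole=0.4)]
)
fig_donut.update_layout(title="<b>Plastic bags per year (donut)</b>", template=template)
fig_donut.show()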

Reflections

  • Google Colab appears a good way of getting a Jupyter notebook up and running, and accessible on many devices without installing Python and dependencies first.

  • There were more issues than expected (related to date formatting and package versions) in running both the R and Python code in this Quarto markdown document. Definitely a learning experience!

  • Kate Pyper had questions about rules of thumb/conventions for defining and handling outliers (as shown in the box plots) in regressions etc. An important separate topic!

  • The same Colab/Python training will hopefully be of interest to a broader NHS audience.