import pandas as pd
import datetime
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
pd.options.display.max_colwidth = 300
Introduction
The latest TidyTuesday dataset is on trash collected as part of the Mr Trash Wheel Baltimore Healthy Harbor Initiative.
This session was led by Gatz Osario, and differed from previous TardyTuesday sessions in that both Gatz and Jon looked at the dataset and prepared some materials ahead of the session.
Gatz provided an expert introduction to using Python for data science and data visualisation, using the Plotly libraries for interactive visualisation. Gatz used Google Colab for the session itself, which allows Jupyter notebooks to be created and run online. In this post the same Python chunks are run within Quarto.
Gatz used a subset of the data containing two factor regression scores Jon generated in R. The R code for generating this derived dataset is shown below but was not presented at the (already packed) session.
Factor Analysis in R
Jon started by loading the tidyverse and the most recent dataset:
library(tidyverse)
df <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-05/trashwheel.csv')
glimpse(df)
The dataset contains the numbers of different types of item extracted each time the barge went to collect trash. These item counts aren’t in comparable units (i.e. the items weren’t measured by weight or volume, which could be compared directly).
Jon looked at one- and two-factor solutions to see if there are relationships between the types of items that tend to be collected together. First, the one-factor solution:
f_1 <- factanal(~ PlasticBottles + Polystyrene + CigaretteButts + GlassBottles + PlasticBags + Wrappers + SportsBalls, data = df, factors = 1)
f_1
The single factor has a sum of squared loadings of 3.2, meaning (roughly) that it contains about three variables’ worth of information.
Polystyrene, Plastic Bags and Wrappers all had strong factor loadings. The most unique item (i.e. the one least well captured by the factor) was SportsBalls.
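As a back-of-envelope check on that interpretation: the 3.2 is the sum of squared loadings across the items, and dividing it by the number of items gives the proportion of total variance the factor explains. A sketch with made-up loadings (the real values are in the factanal output above):

```python
# Illustrative loadings for the seven items -- made-up values,
# not the ones fitted by factanal()
loadings = [0.55, 0.90, 0.40, 0.30, 0.88, 0.85, 0.15]

ss_loadings = sum(l ** 2 for l in loadings)  # "SS loadings" in the factanal output
prop_var = ss_loadings / len(loadings)       # proportion of total variance explained
```

A sum of squared loadings near 3 therefore corresponds to the factor capturing roughly 3/7 of the total variance across the seven items.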
Now the two-factor solution:
f_2 <- factanal(~ PlasticBottles + Polystyrene + CigaretteButts + GlassBottles + PlasticBags + Wrappers + SportsBalls, data = df, factors = 2, scores = "regression")
f_2
The first factor has strong loadings on Polystyrene, PlasticBags, and Wrappers. The second factor has strong loadings on GlassBottles and CigaretteButts. (So, smoking- and drinking-related trash?)
The argument scores = "regression" asks factanal() to return factor scores, which can then be attached to every row of the original dataframe for which they could be calculated (i.e. the complete cases):
df2 <- df %>%
  filter(complete.cases(.)) %>%
  mutate(
    plastic_dumping_score = f_2$scores[, 1],
    drinking_smoking_score = f_2$scores[, 2]
  )
The following shows how the contents returned by the trash barge varied in terms of these two factor scores by year:
df2 |>
  mutate(density = Weight / Volume) |>
  ggplot(aes(x = plastic_dumping_score, y = drinking_smoking_score)) +
  geom_point(aes(alpha = density)) +
  facet_wrap(~ Year) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0)
Originally there seemed to be more variation in the types of item returned by the barge, with more glass bottles and cigarette butts. Over the first few years the amount of plastic waste returned seemed to increase, but it declined after peaking in 2017.
To make it easier for Python to read the file with the factor scores we generated, I (Jon) saved it as a CSV file:
write_csv(df2, here::here("posts", "tardy-tuesday", "tidy-tuesday-trash", "df_with_factor_scores.csv"))
Data manipulation and visualisation in Python
First, Gatz imported the relevant libraries (shown at the top of this post).
Just checking Python works in the Quarto document:
1 + 12
Load the data
df = pd.read_csv("df_with_factor_scores.csv")
df['Date'] = pd.to_datetime(df['Date'], format = "%m/%d/%Y", errors = 'coerce')
print(df.shape)
df.head(3)
(629, 18)

|   | ID | Name | Dumpster | Month | Year | Date | Weight | Volume | PlasticBottles | Polystyrene | CigaretteButts | GlassBottles | PlasticBags | Wrappers | SportsBalls | HomesPowered | plastic_dumping_score | drinking_smoking_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mister | Mister Trash Wheel | 1 | May | 2014 | 2014-05-16 | 4.31 | 18 | 1450 | 1820 | 126000 | 72 | 584 | 1162 | 7 | 0 | -1.261913 | 3.589004 |
| 1 | mister | Mister Trash Wheel | 2 | May | 2014 | 2014-05-16 | 2.74 | 13 | 1120 | 1030 | 91000 | 42 | 496 | 874 | 5 | 0 | -0.927719 | 1.558504 |
| 2 | mister | Mister Trash Wheel | 3 | May | 2014 | 2014-05-16 | 3.45 | 15 | 2450 | 3100 | 105000 | 50 | 1080 | 2032 | 6 | 0 | -0.022253 | 1.875345 |
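Note the errors = 'coerce' argument in the parsing step above: any date that doesn’t match the supplied format becomes NaT rather than raising an error. A minimal sketch:

```python
import pandas as pd

raw = pd.Series(["05/16/2014", "not a date"])
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
# the unparseable entry is now NaT (pandas' missing-datetime marker)
```

This is why the Date column ends up with some missing values even though the raw file has an entry in every row.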
More information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 629 entries, 0 to 628
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 629 non-null object
1 Name 629 non-null object
2 Dumpster 629 non-null int64
3 Month 629 non-null object
4 Year 629 non-null int64
5 Date 592 non-null datetime64[ns]
6 Weight 629 non-null float64
7 Volume 629 non-null int64
8 PlasticBottles 629 non-null int64
9 Polystyrene 629 non-null int64
10 CigaretteButts 629 non-null int64
11 GlassBottles 629 non-null int64
12 PlasticBags 629 non-null int64
13 Wrappers 629 non-null int64
14 SportsBalls 629 non-null int64
15 HomesPowered 629 non-null int64
16 plastic_dumping_score 629 non-null float64
17 drinking_smoking_score 629 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(11), object(3)
memory usage: 88.6+ KB
Convert year to integer
df['Year'] = df['Year'].astype(int)
Check for missing observations
df.isna().sum()
ID 0
Name 0
Dumpster 0
Month 0
Year 0
Date 37
Weight 0
Volume 0
PlasticBottles 0
Polystyrene 0
CigaretteButts 0
GlassBottles 0
PlasticBags 0
Wrappers 0
SportsBalls 0
HomesPowered 0
plastic_dumping_score 0
drinking_smoking_score 0
dtype: int64
Drop rows with missing values (the 37 rows whose Date could not be parsed)
df = df.dropna()
print(df.shape)
df.isna().sum()
(592, 18)
ID 0
Name 0
Dumpster 0
Month 0
Year 0
Date 0
Weight 0
Volume 0
PlasticBottles 0
Polystyrene 0
CigaretteButts 0
GlassBottles 0
PlasticBags 0
Wrappers 0
SportsBalls 0
HomesPowered 0
plastic_dumping_score 0
drinking_smoking_score 0
dtype: int64
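A toy illustration of what dropna() did here (column names mirror the dataset, values are made up): any row containing a missing value, including a NaT date, is removed.

```python
import pandas as pd

small = pd.DataFrame({"Date": pd.to_datetime(["2014-05-16", None]),
                      "Weight": [4.31, 2.74]})
cleaned = small.dropna()  # drops the row whose Date is NaT
# dropna(subset=["Date"]) would restrict the check to the Date column
```

Since Date was the only column with missing values, dropping all NAs is here equivalent to dropping rows with unparseable dates.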
Sort by date
df.sort_values(by=['Date'], inplace=True)
Visualisation
Produce list of theme options and select the third
options = ["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white"]
template = options[2]
Look at the new query syntax
df.query("Year < 2017")

|   | ID | Name | Dumpster | Month | Year | Date | Weight | Volume | PlasticBottles | Polystyrene | CigaretteButts | GlassBottles | PlasticBags | Wrappers | SportsBalls | HomesPowered | plastic_dumping_score | drinking_smoking_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mister | Mister Trash Wheel | 1 | May | 2014 | 2014-05-16 | 4.31 | 18 | 1450 | 1820 | 126000 | 72 | 584 | 1162 | 7 | 0 | -1.261913 | 3.589004 |
| 1 | mister | Mister Trash Wheel | 2 | May | 2014 | 2014-05-16 | 2.74 | 13 | 1120 | 1030 | 91000 | 42 | 496 | 874 | 5 | 0 | -0.927719 | 1.558504 |
| 2 | mister | Mister Trash Wheel | 3 | May | 2014 | 2014-05-16 | 3.45 | 15 | 2450 | 3100 | 105000 | 50 | 1080 | 2032 | 6 | 0 | -0.022253 | 1.875345 |
| 3 | mister | Mister Trash Wheel | 4 | May | 2014 | 2014-05-17 | 3.10 | 15 | 2380 | 2730 | 100000 | 52 | 896 | 1971 | 6 | 0 | -0.285576 | 2.064766 |
| 4 | mister | Mister Trash Wheel | 5 | May | 2014 | 2014-05-17 | 4.06 | 18 | 980 | 870 | 120000 | 72 | 368 | 753 | 7 | 0 | -1.664942 | 3.678861 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 161 | mister | Mister Trash Wheel | 162 | November | 2016 | 2016-11-30 | 2.75 | 18 | 3460 | 5840 | 16000 | 42 | 3260 | 3430 | 34 | 46 | 2.801646 | 0.685263 |
| 162 | mister | Mister Trash Wheel | 163 | December | 2016 | 2016-12-01 | 3.41 | 15 | 1840 | 4760 | 23000 | 43 | 3470 | 3800 | 6 | 57 | 2.925675 | 0.713372 |
| 163 | mister | Mister Trash Wheel | 164 | December | 2016 | 2016-12-06 | 2.55 | 15 | 1360 | 3850 | 34000 | 39 | 2340 | 4220 | 24 | 43 | 1.972863 | 0.685088 |
| 164 | mister | Mister Trash Wheel | 165 | December | 2016 | 2016-12-16 | 1.74 | 18 | 1880 | 2890 | 26000 | 59 | 2100 | 4040 | 20 | 29 | 1.233737 | 2.149005 |
| 165 | mister | Mister Trash Wheel | 166 | December | 2016 | 2017-01-02 | 2.13 | 15 | 2460 | 2740 | 32000 | 48 | 3250 | 4430 | 15 | 36 | 2.624921 | 1.110940 |
162 rows × 18 columns
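query() also accepts expressions referencing Python variables via the @ prefix, which is handy when the cutoff isn’t hard-coded. A sketch with a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [2014, 2016, 2018], "Volume": [18, 15, 20]})
cutoff = 2017
early = toy.query("Year < @cutoff")  # @cutoff refers to the Python variable
```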
Produce the first plot
dftemp = df.query("Year < 2017").copy()
fig = px.box(dftemp, x='Year',y='Volume',color = 'ID')
fig.update_layout(
title = "<b>Plot 1: Volume per ID box plot</b>",
xaxis = dict(title='Years available'),
yaxis = dict(title='Volume (m3)'),
template=template
)
fig.show()
Second figure
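The chunk below builds the Yearmonth key with .apply(lambda x: x.strftime(...)); pandas’ .dt accessor gives a vectorised equivalent, sketched here with made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2014-05-16", "2014-05-17", "2014-06-01"]))
ym = dates.dt.strftime("%Y-%m")  # vectorised; same result as the lambda version
```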
dftemp = df[['Date','ID','PlasticBottles']].copy()
dftemp['Yearmonth'] = dftemp['Date'].apply(lambda x: x.strftime('%Y-%m'))
del dftemp['Date']
dftemp=dftemp.groupby(['Yearmonth','ID']).sum()
dftemp.reset_index(inplace=True)
fig_line = px.line(dftemp, x = 'Yearmonth',y = 'PlasticBottles',color = 'ID',
labels = {'PlasticBottles': 'N of bottles', 'ID': 'Identifier', 'Yearmonth': 'Year and month'},template=template
)
fig_line.update_layout(
title = "<b>Plot 2: N of bottles per year and month</b>",
xaxis = dict(title='Time series'),
yaxis = dict(title='Amount (units)')
)
fig_line.show()
Check which unique years we have
df["Year"].unique()
array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023])
Produce a subplot for different years
fig = make_subplots(rows=1, cols=2)
c = 1
for year in df["Year"].unique():
if year > 2014 and year < 2017:
dftemp = df.query("Year == {}".format(year)).copy()
dftemp["Month"] = dftemp["Date"].apply(lambda x: x.strftime('%m'))
dftemp = dftemp[['Year','Month','PlasticBottles']].copy()
dftemp = dftemp.groupby(['Year','Month']).sum()
dftemp.reset_index(inplace=True)
fig.add_trace(go.Scatter(x=dftemp.Month, y=dftemp.PlasticBottles, name=str(year)), row=1, col=c)
c = c + 1
fig.update_layout(title_text="<b>Plot 3: Side By Side Subplots</b>", template=template)
fig.show()
Stacked subplots
fig = make_subplots(rows=2, cols=1)
r = 1
for year in df["Year"].unique():
if year > 2014 and year < 2017:
dftemp = df.query("Year == {}".format(year)).copy()
dftemp["Month"] = dftemp["Date"].apply(lambda x: x.strftime('%m'))
dftemp = dftemp[['Year','Month','PlasticBottles']].copy()
dftemp = dftemp.groupby(['Year','Month']).sum()
dftemp.reset_index(inplace=True)
fig.add_trace(go.Scatter(x=dftemp.Month, y=dftemp.PlasticBottles, name=str(year)), row=r, col=1)
r = r + 1
fig.update_layout(title_text="<b>Plot 4: Stacked Subplots</b>", template=template)
fig.show()
Gridded subplots with made-up data:
fig = make_subplots(rows=2, cols=2)
fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6]), row=1, col=1)
fig.add_trace(go.Scatter(x=[20, 30, 40], y=[50, 60, 70]), row=1, col=2)
fig.add_trace(go.Scatter(x=[300, 400, 500], y=[600, 700, 800]), row=2, col=1)
fig.add_trace(go.Scatter(x=[4000, 5000, 6000], y=[7000, 8000, 9000]), row=2, col=2)
fig.update_layout(title_text="Grid Subplots", template=template)
fig.show()
There’s only one barge at the moment. I guess they’re hoping to get more?
df["Name"].unique()
array(['Mister Trash Wheel'], dtype=object)
dftemp = df[['Date','plastic_dumping_score','Name']].copy()
dftemp['Yearmonth'] = dftemp['Date'].apply(lambda x: x.strftime('%Y-%m'))
del dftemp['Date']
dftemp=dftemp.groupby(['Yearmonth','Name']).sum()
dftemp.reset_index(inplace=True)
fig_area = px.area(dftemp, x = 'Yearmonth',y = 'plastic_dumping_score',color = 'Name', template=template)
fig_area.update_layout(
title = "<b>Plot 5: Dumping score per year and name</b>",
xaxis = dict(title='Year and Month'),
yaxis = dict(title='Total dumping score')
)
fig_area.show()
An interactive treemap
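The treemap code below reuses the groupby(...).sum() then reset_index() pattern seen throughout this post; a toy sketch (made-up counts) of what that pair of calls does:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [2014, 2014, 2015],
                    "Month": ["May", "May", "June"],
                    "PlasticBottles": [1450, 1120, 2450]})

agg = toy.groupby(["Year", "Month"]).sum()  # one row per (Year, Month) group
agg = agg.reset_index()                     # group keys become ordinary columns again
```

reset_index() matters here because Plotly expects Year and Month as columns, not as a MultiIndex.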
dftemp = df[['Month','Year','PlasticBottles']].copy()
dftemp=dftemp.groupby(['Month','Year']).sum()
dftemp.reset_index(inplace=True)
fig_tree_maps = px.treemap(dftemp, path= ['Year','Month'],values ='PlasticBottles',color_continuous_scale='RdBu', template=template)
fig_tree_maps.update_layout(
title = "<b>Plot 6: Tree map of bottles per year and month</b>"
)
fig_tree_maps.show()
And a 3D plot!
dftemp = df[['Year','drinking_smoking_score','plastic_dumping_score','ID']].copy()
dftemp=dftemp.groupby(['Year','ID']).mean()
dftemp.reset_index(inplace=True)
fig_scatter3D = px.scatter_3d(dftemp,x = 'Year',y='drinking_smoking_score', z = 'plastic_dumping_score', color = 'ID',opacity=0.7, template=template)
fig_scatter3D.update_layout(title = "<b>Plot 7: Year and plastic and drinking scores</b>")
fig_scatter3D.show()
And a pie chart:
dftemp = df[['Year','PlasticBags']].copy()
dftemp=dftemp.groupby(['Year']).sum()
dftemp.reset_index(inplace=True)
fig = go.Figure(
data=[go.Pie(
labels=dftemp['Year'],
values=dftemp['PlasticBags'],
sort=False)
])
fig.update_layout(title = "<b>Plot 8: Plastic bags per year</b>", template=template)
fig.show()
Reflections
Google Colab appears to be a good way of getting a Jupyter notebook up and running, accessible on many devices without installing Python and dependencies first.
There were more issues than expected (related to date formatting and package versions) in running both the R and Python code within this Quarto markdown document. Definitely a learning experience!
Katie Pyper had questions about rules of thumb and conventions for defining and handling outliers (as seen in the box plots) in regressions etc. An important topic for a separate session!
The same Colab/Python training will hopefully be of interest to a broader NHS audience.