import pandas as pd
import datetime
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
pd.options.display.max_colwidth = 300
Introduction
The latest TidyTuesday dataset is on trash collected as part of the Mr Trash Wheel Baltimore Healthy Harbor Initiative.
This session was led by Gatz Osario, and differed from previous TardyTuesday sessions in that both Gatz and Jon looked at the dataset and prepared some materials ahead of the session.
Gatz provided an expert introduction to using Python for data science and data visualisation, using the Plotly libraries for interactive visualisation. Gatz used Google Colab for the session itself, which allows Jupyter notebooks to be created and run online. In this post the same Python chunks are run within Quarto.
Gatz used a subset of the data containing two factor regression scores Jon generated in R. The R code for generating this derived dataset is shown below but was not presented at the (already packed) session.
Factor Analysis in R
Jon started by loading tidyverse and the most recent dataset
library(tidyverse)
df <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-05/trashwheel.csv')
glimpse(df)
The dataset contains the numbers of different types of item extracted each time the barge went to collect trash. These items aren’t in comparable units (i.e. they weren’t by weight or volume, which could be compared).
Jon looked at one- and two-factor solutions to see if there are relationships between the types of items that tend to be collected together. First, the one-factor solution:
f_1 <- factanal(~ PlasticBottles + Polystyrene + CigaretteButts + GlassBottles + PlasticBags + Wrappers + SportsBalls, data = df, factors = 1)
f_1
A single factor has a loading of 3.2, meaning (roughly) that it contains about three variables’ worth of information.
Polystyrene, Plastic Bags and Wrappers all had strong factor loadings. The most unique item (i.e. the one least well captured by the factor) was SportsBalls.
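The 3.2 figure is the sum of squared item loadings (the “SS loadings” row in factanal output). A minimal sketch of the arithmetic, using made-up loadings rather than the actual factanal estimates:

```python
# Hypothetical loadings, one per item -- NOT the real factanal output.
# The "SS loading" for a factor is the sum of squared item loadings,
# read roughly as how many variables' worth of variance the factor captures.
loadings = [0.9, 0.8, 0.5, 0.3, 0.85, 0.8, 0.2]
ss_loading = sum(l ** 2 for l in loadings)
print(round(ss_loading, 2))  # about 3.19, i.e. ~3 variables' worth
```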
Now the two-factor solution:
f_2 <- factanal(~ PlasticBottles + Polystyrene + CigaretteButts + GlassBottles + PlasticBags + Wrappers + SportsBalls, data = df, factors = 2, scores = "regression")
f_2
The first factor has strong loadings on Polystyrene, Plastic Bags, and Wrappers. The second factor has strong loadings for glass bottles and cigarette butts. (So, smoking- and drinking-related trash?)
The argument scores = "regression" was added so that the scores for each factor are returned and can be attached to every row of the original dataframe where they could be calculated.
df2 <- df %>%
  filter(complete.cases(.)) %>%
  mutate(
    plastic_dumping_score = f_2$scores[, 1],
    drinking_smoking_score = f_2$scores[, 2]
  )
The following shows how the contents returned by the trash barge varied on these two factor scores by year:
df2 |>
mutate(density = Weight / Volume) |>
ggplot(aes(x = plastic_dumping_score, y = drinking_smoking_score)) +
geom_point(aes(alpha = density)) +
facet_wrap(~ Year) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0)
Originally there seemed to be more variation in the types of item returned by the barge, and more glass bottles and cigarettes. Over the first few years the amount of plastic waste returned seemed to increase, but it declined after peaking in 2017.
To make it easier for Python to read the file with the factor scores we generated, I (Jon) will save it as a CSV file:
write_csv(df2, here::here("posts", "tardy-tuesday", "tidy-tuesday-trash", "df_with_factor_scores.csv"))
Data manipulation and visualisation in Python
First Gatz imported the relevant libraries
Just checking Python works in the Quarto document:
1 + 1
2
Load the data
df = pd.read_csv("df_with_factor_scores.csv")
df['Date'] = pd.to_datetime(df['Date'], format="%m/%d/%Y", errors='coerce')
print(df.shape)
df.head(3)
(629, 18)
ID | Name | Dumpster | Month | Year | Date | Weight | Volume | PlasticBottles | Polystyrene | CigaretteButts | GlassBottles | PlasticBags | Wrappers | SportsBalls | HomesPowered | plastic_dumping_score | drinking_smoking_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mister | Mister Trash Wheel | 1 | May | 2014 | 2014-05-16 | 4.31 | 18 | 1450 | 1820 | 126000 | 72 | 584 | 1162 | 7 | 0 | -1.261913 | 3.589004 |
1 | mister | Mister Trash Wheel | 2 | May | 2014 | 2014-05-16 | 2.74 | 13 | 1120 | 1030 | 91000 | 42 | 496 | 874 | 5 | 0 | -0.927719 | 1.558504 |
2 | mister | Mister Trash Wheel | 3 | May | 2014 | 2014-05-16 | 3.45 | 15 | 2450 | 3100 | 105000 | 50 | 1080 | 2032 | 6 | 0 | -0.022253 | 1.875345 |
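The errors='coerce' argument matters here: any Date string that doesn’t match %m/%d/%Y becomes NaT instead of raising an error, which is where the 37 missing Date values reported further down come from. A small sketch with a deliberately bad value:

```python
import pandas as pd

# With errors='coerce', unparseable strings become NaT rather than raising.
s = pd.Series(["05/16/2014", "not a date"])
parsed = pd.to_datetime(s, format="%m/%d/%Y", errors="coerce")
print(parsed.isna().tolist())  # [False, True]
```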
More information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 629 entries, 0 to 628
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 629 non-null object
1 Name 629 non-null object
2 Dumpster 629 non-null int64
3 Month 629 non-null object
4 Year 629 non-null int64
5 Date 592 non-null datetime64[ns]
6 Weight 629 non-null float64
7 Volume 629 non-null int64
8 PlasticBottles 629 non-null int64
9 Polystyrene 629 non-null int64
10 CigaretteButts 629 non-null int64
11 GlassBottles 629 non-null int64
12 PlasticBags 629 non-null int64
13 Wrappers 629 non-null int64
14 SportsBalls 629 non-null int64
15 HomesPowered 629 non-null int64
16 plastic_dumping_score 629 non-null float64
17 drinking_smoking_score 629 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(11), object(3)
memory usage: 88.6+ KB
Convert year to integer
df['Year'] = df['Year'].astype(int)
Check for missing observations
df.isna().sum()
ID 0
Name 0
Dumpster 0
Month 0
Year 0
Date 37
Weight 0
Volume 0
PlasticBottles 0
Polystyrene 0
CigaretteButts 0
GlassBottles 0
PlasticBags 0
Wrappers 0
SportsBalls 0
HomesPowered 0
plastic_dumping_score 0
drinking_smoking_score 0
dtype: int64
Drop rows with NAs (the 37 rows whose Date could not be parsed)
df = df.dropna()
print(df.shape)
df.isna().sum()
(592, 18)
ID 0
Name 0
Dumpster 0
Month 0
Year 0
Date 0
Weight 0
Volume 0
PlasticBottles 0
Polystyrene 0
CigaretteButts 0
GlassBottles 0
PlasticBags 0
Wrappers 0
SportsBalls 0
HomesPowered 0
plastic_dumping_score 0
drinking_smoking_score 0
dtype: int64
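dropna() drops every row containing at least one missing value; since Date is the only column with NaNs here, the row count falls from 629 to 592 (629 minus the 37 NaT dates). A toy illustration with made-up values:

```python
import pandas as pd

# One of the three date strings fails to parse and becomes NaT,
# so dropna() removes that whole row.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-05-16", "oops", "2014-05-17"], errors="coerce"),
    "Weight": [4.31, 2.74, 3.45],
})
print(toy.dropna().shape)  # (2, 2)
```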
Sort by date
df.sort_values(by=['Date'], inplace=True)
Visualisation
Produce list of theme options and select the third
options = ["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white"]
template = options[2]
Look at the new query syntax
df.query("Year < 2017")
ID | Name | Dumpster | Month | Year | Date | Weight | Volume | PlasticBottles | Polystyrene | CigaretteButts | GlassBottles | PlasticBags | Wrappers | SportsBalls | HomesPowered | plastic_dumping_score | drinking_smoking_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mister | Mister Trash Wheel | 1 | May | 2014 | 2014-05-16 | 4.31 | 18 | 1450 | 1820 | 126000 | 72 | 584 | 1162 | 7 | 0 | -1.261913 | 3.589004 |
1 | mister | Mister Trash Wheel | 2 | May | 2014 | 2014-05-16 | 2.74 | 13 | 1120 | 1030 | 91000 | 42 | 496 | 874 | 5 | 0 | -0.927719 | 1.558504 |
2 | mister | Mister Trash Wheel | 3 | May | 2014 | 2014-05-16 | 3.45 | 15 | 2450 | 3100 | 105000 | 50 | 1080 | 2032 | 6 | 0 | -0.022253 | 1.875345 |
3 | mister | Mister Trash Wheel | 4 | May | 2014 | 2014-05-17 | 3.10 | 15 | 2380 | 2730 | 100000 | 52 | 896 | 1971 | 6 | 0 | -0.285576 | 2.064766 |
4 | mister | Mister Trash Wheel | 5 | May | 2014 | 2014-05-17 | 4.06 | 18 | 980 | 870 | 120000 | 72 | 368 | 753 | 7 | 0 | -1.664942 | 3.678861 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
161 | mister | Mister Trash Wheel | 162 | November | 2016 | 2016-11-30 | 2.75 | 18 | 3460 | 5840 | 16000 | 42 | 3260 | 3430 | 34 | 46 | 2.801646 | 0.685263 |
162 | mister | Mister Trash Wheel | 163 | December | 2016 | 2016-12-01 | 3.41 | 15 | 1840 | 4760 | 23000 | 43 | 3470 | 3800 | 6 | 57 | 2.925675 | 0.713372 |
163 | mister | Mister Trash Wheel | 164 | December | 2016 | 2016-12-06 | 2.55 | 15 | 1360 | 3850 | 34000 | 39 | 2340 | 4220 | 24 | 43 | 1.972863 | 0.685088 |
164 | mister | Mister Trash Wheel | 165 | December | 2016 | 2016-12-16 | 1.74 | 18 | 1880 | 2890 | 26000 | 59 | 2100 | 4040 | 20 | 29 | 1.233737 | 2.149005 |
165 | mister | Mister Trash Wheel | 166 | December | 2016 | 2017-01-02 | 2.13 | 15 | 2460 | 2740 | 32000 | 48 | 3250 | 4430 | 15 | 36 | 2.624921 | 1.110940 |
162 rows × 18 columns
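query() is a readable shorthand for boolean-mask indexing; on a toy frame the two forms return identical rows:

```python
import pandas as pd

# df.query("Year < 2017") is equivalent to indexing with a boolean mask.
toy = pd.DataFrame({"Year": [2014, 2016, 2020], "Volume": [18, 15, 12]})
a = toy.query("Year < 2017")
b = toy[toy["Year"] < 2017]
print(a.equals(b))  # True
```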
Produce the first plot
dftemp = df.query("Year < 2017").copy()
fig = px.box(dftemp, x='Year', y='Volume', color='ID')

fig.update_layout(
    title="<b>Plot 1: Volume per Id box plot</b>",
    xaxis=dict(title='Years available'),
    yaxis=dict(title='Volume (m3)'),
    template=template
)
fig.show()
Second figure
dftemp = df[['Date', 'ID', 'PlasticBottles']].copy()
dftemp['Yearmonth'] = dftemp['Date'].apply(lambda x: x.strftime('%Y-%m'))
del dftemp['Date']
dftemp = dftemp.groupby(['Yearmonth', 'ID']).sum()
dftemp.reset_index(inplace=True)

fig_line = px.line(
    dftemp, x='Yearmonth', y='PlasticBottles', color='ID',
    labels={'PlasticBottles': 'N of bottles', 'ID': 'Identifier', 'Yearmonth': 'Year and month'},
    template=template
)
fig_line.update_layout(
    title="<b>Plot 2: N of bottles per year and month</b>",
    xaxis=dict(title='Time series'),
    yaxis=dict(title='Amount (units)')
)
fig_line.show()
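The core of that chunk is the Yearmonth aggregation: format each Date as 'YYYY-MM', then sum the bottle counts within each (Yearmonth, ID) group. A self-contained sketch on made-up numbers (using the vectorised .dt accessor rather than apply, which gives the same strings):

```python
import pandas as pd

# Two collections in May and one in June; the May rows collapse to one total.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-05-16", "2014-05-17", "2014-06-01"]),
    "ID": ["mister"] * 3,
    "PlasticBottles": [1450, 1120, 2450],
})
toy["Yearmonth"] = toy["Date"].dt.strftime("%Y-%m")
out = toy.groupby(["Yearmonth", "ID"], as_index=False)["PlasticBottles"].sum()
print(out["PlasticBottles"].tolist())  # [2570, 2450]
```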
Check which unique years we have
df["Year"].unique()
array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023])
Produce a subplot for different years
fig = make_subplots(rows=1, cols=2)
c = 1
for year in df["Year"].unique():
    if year > 2014 and year < 2017:
        dftemp = df.query("Year == {}".format(year)).copy()
        dftemp["Month"] = dftemp["Date"].apply(lambda x: x.strftime('%m'))
        dftemp = dftemp[['Year', 'Month', 'PlasticBottles']].copy()
        dftemp = dftemp.groupby(['Year', 'Month']).sum()
        dftemp.reset_index(inplace=True)
        fig.add_trace(go.Scatter(x=dftemp.Month, y=dftemp.PlasticBottles, name=str(year)), row=1, col=c)
        c = c + 1
fig.update_layout(title_text="<b>Plot 3: Side By Side Subplots</b>", template=template)
fig.show()
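One aside on the "Year == {}".format(year) pattern: pandas query() can also reference local variables directly with @, which avoids building the query string by hand:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [2015, 2016, 2017]})
year = 2016
# String-formatting approach, as used in the loop above:
a = toy.query("Year == {}".format(year))
# Equivalent @-reference approach:
b = toy.query("Year == @year")
print(a.equals(b))  # True
```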
Stacked subplots
fig = make_subplots(rows=2, cols=1)
r = 1
for year in df["Year"].unique():
    if year > 2014 and year < 2017:
        dftemp = df.query("Year == {}".format(year)).copy()
        dftemp["Month"] = dftemp["Date"].apply(lambda x: x.strftime('%m'))
        dftemp = dftemp[['Year', 'Month', 'PlasticBottles']].copy()
        dftemp = dftemp.groupby(['Year', 'Month']).sum()
        dftemp.reset_index(inplace=True)
        fig.append_trace(go.Scatter(x=dftemp.Month, y=dftemp.PlasticBottles, name=str(year)), row=r, col=1)
        r = r + 1
fig.update_layout(title_text="<b>Plot 4: Stacked Subplots</b>", template=template)
fig.show()
Gridded subplots with made-up data:
fig = make_subplots(rows=2, cols=2)
fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6]), row=1, col=1)
fig.add_trace(go.Scatter(x=[20, 30, 40], y=[50, 60, 70]), row=1, col=2)
fig.add_trace(go.Scatter(x=[300, 400, 500], y=[600, 700, 800]), row=2, col=1)
fig.add_trace(go.Scatter(x=[4000, 5000, 6000], y=[7000, 8000, 9000]), row=2, col=2)
fig.update_layout(title_text="Grid Subplots", template=template)
fig.show()
There’s only one barge at the moment. I guess they’re hoping to get more?
df["Name"].unique()
array(['Mister Trash Wheel'], dtype=object)
dftemp = df[['Date', 'plastic_dumping_score', 'Name']].copy()
dftemp['Yearmonth'] = df['Date'].apply(lambda x: x.strftime('%Y-%m'))
del dftemp['Date']
dftemp = dftemp.groupby(['Yearmonth', 'Name']).sum()
dftemp.reset_index(inplace=True)

fig_area = px.area(dftemp, x='Yearmonth', y='plastic_dumping_score', color='Name', template=template)
fig_area.update_layout(
    title="<b>Plot 5: Dumping score per year and name</b>",
    xaxis=dict(title='Year and Month'),
    yaxis=dict(title='Total dumping score')
)
fig_area.show()
An interactive treemap
dftemp = df[['Month', 'Year', 'PlasticBottles']].copy()
dftemp = dftemp.groupby(['Month', 'Year']).sum()
dftemp.reset_index(inplace=True)

fig_tree_maps = px.treemap(dftemp, path=['Year', 'Month'], values='PlasticBottles', color_continuous_scale='RdBu', template=template)
fig_tree_maps.update_layout(
    title="<b>Plot 6: Tree map of bottles per year and month</b>"
)
fig_tree_maps.show()
And a 3D plot!
dftemp = df[['Year', 'drinking_smoking_score', 'plastic_dumping_score', 'ID']].copy()
dftemp = dftemp.groupby(['Year', 'ID']).mean()
dftemp.reset_index(inplace=True)

fig_scatter3D = px.scatter_3d(dftemp, x='Year', y='drinking_smoking_score', z='plastic_dumping_score', color='ID', opacity=0.7, template=template)
fig_scatter3D.update_layout(title="<b>Plot 7: Year and plastic and drinking scores</b>")
fig_scatter3D.show()
And a pie chart:
dftemp = df[['Year', 'PlasticBags']].copy()
dftemp = dftemp.groupby(['Year']).sum()
dftemp.reset_index(inplace=True)

fig = go.Figure(
    data=[go.Pie(
        labels=dftemp['Year'],
        values=dftemp['PlasticBags'],
        sort=False)
    ])
fig.update_layout(title="<b>Plot 8: Plastic bags per year</b>", template=template)
fig.show()
Reflections
Google Colab appears to be a good way of getting a Jupyter notebook up and running, accessible on many devices without installing Python and dependencies first.
There were more issues than expected (related to date formatting and package versions) in running both the R and Python code within this Quarto markdown document. Definitely a learning experience!
Katie Pyper had questions about rules-of-thumb/conventions for defining and using outliers (as shown in the box plots) in regressions etc. An important separate topic!
The same Colab/Python training will hopefully be of interest to a broader NHS audience.