import pandas as pd
import datetime
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
pd.options.display.max_colwidth = 300
Introduction
The latest TidyTuesday dataset is on trash collected as part of the Mr Trash Wheel Baltimore Healthy Harbor Initiative.
This session was led by Gatz Osario, and differed from previous TardyTuesday sessions in that both Gatz and Jon looked at the dataset and prepared some materials ahead of the session.
Gatz provided an expert introduction to using Python for data science and data visualisation, using the Plotly libraries for interactive visualisation. Gatz used Google Colab for the session itself, which allows Jupyter notebooks to be created and run online. In this post the same Python chunks are run within Quarto.
Gatz used a subset of the data containing two factor regression scores Jon generated in R. The R code for generating this derived dataset is shown below but was not presented at the (already packed) session.
Factor Analysis in R
Jon started by loading the tidyverse and the most recent dataset:
library(tidyverse)
df <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-05/trashwheel.csv')
glimpse(df)
The dataset contains the numbers of different types of item extracted each time the barge went to collect trash. These item counts aren’t in comparable units (i.e. the items weren’t measured by weight or volume, which could be compared directly).
Jon looked at one- and two-factor solutions to see if there are relationships between the types of items that tend to be collected together. First, the one-factor solution:
f_1 <- factanal(~ PlasticBottles + Polystyrene + CigaretteButts + GlassBottles + PlasticBags + Wrappers + SportsBalls, data = df, factors = 1)
f_1
The single factor has a sum of squared loadings of 3.2, meaning (roughly) that it contains about three variables’ worth of information.
Polystyrene, Plastic Bags and Wrappers all had strong factor loadings. The most unique item (i.e. the one least well captured by the factor) was SportsBalls.
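As a back-of-envelope check on that interpretation: the 3.2 is the sum of squared loadings across the items, and dividing it by the number of items gives the proportion of total variance the factor explains. A sketch with made-up loadings (the real values are in the factanal output above):

```python
# Illustrative loadings for the seven items -- made-up values,
# not the ones fitted by factanal()
loadings = [0.55, 0.90, 0.40, 0.30, 0.88, 0.85, 0.15]

ss_loadings = sum(l ** 2 for l in loadings)  # "SS loadings" in the factanal output
prop_var = ss_loadings / len(loadings)       # proportion of total variance explained
```

A sum of squared loadings near 3 therefore corresponds to the factor capturing roughly 3/7 of the total variance across the seven items.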
Now the two-factor solution:
f_2 <- factanal(~ PlasticBottles + Polystyrene + CigaretteButts + GlassBottles + PlasticBags + Wrappers + SportsBalls, data = df, factors = 2, scores = "regression")
f_2
The first factor has strong loadings on Polystyrene, PlasticBags, and Wrappers. The second factor has strong loadings on GlassBottles and CigaretteButts. (So, smoking- and drinking-related trash?)
The argument scores = "regression" asks factanal() to return factor scores, which can then be attached to every row of the original dataframe for which they could be calculated (i.e. the complete cases):
df2 <- df %>%
  filter(complete.cases(.)) %>%
  mutate(
    plastic_dumping_score = f_2$scores[, 1],
    drinking_smoking_score = f_2$scores[, 2]
  )
The following shows how the contents returned by the trash barge varied in terms of these two factor scores by year:
df2 |>
  mutate(density = Weight / Volume) |>
  ggplot(aes(x = plastic_dumping_score, y = drinking_smoking_score)) +
  geom_point(aes(alpha = density)) +
  facet_wrap(~ Year) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0)
Originally there seemed to be more variation in the types of item returned by the barge, with more glass bottles and cigarette butts. Over the first few years the amount of plastic waste returned seemed to increase, but it declined after peaking in 2017.
To make it easier for Python to read the file with the factor scores we generated, I (Jon) saved it as a CSV file:
write_csv(df2, here::here("posts", "tardy-tuesday", "tidy-tuesday-trash", "df_with_factor_scores.csv"))
Data manipulation and visualisation in Python
First, Gatz imported the relevant libraries (shown at the top of this post).
Just checking Python works in the Quarto document:
1 + 12
Load the data
df = pd.read_csv("df_with_factor_scores.csv")
df['Date'] = pd.to_datetime(df['Date'], format = "%m/%d/%Y", errors = 'coerce')
print(df.shape)
df.head(3)
(629, 18)

|   | ID | Name | Dumpster | Month | Year | Date | Weight | Volume | PlasticBottles | Polystyrene | CigaretteButts | GlassBottles | PlasticBags | Wrappers | SportsBalls | HomesPowered | plastic_dumping_score | drinking_smoking_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mister | Mister Trash Wheel | 1 | May | 2014 | 2014-05-16 | 4.31 | 18 | 1450 | 1820 | 126000 | 72 | 584 | 1162 | 7 | 0 | -1.261913 | 3.589004 |
| 1 | mister | Mister Trash Wheel | 2 | May | 2014 | 2014-05-16 | 2.74 | 13 | 1120 | 1030 | 91000 | 42 | 496 | 874 | 5 | 0 | -0.927719 | 1.558504 |
| 2 | mister | Mister Trash Wheel | 3 | May | 2014 | 2014-05-16 | 3.45 | 15 | 2450 | 3100 | 105000 | 50 | 1080 | 2032 | 6 | 0 | -0.022253 | 1.875345 |
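Note the errors = 'coerce' argument in the parsing step above: any date that doesn’t match the supplied format becomes NaT rather than raising an error. A minimal sketch:

```python
import pandas as pd

raw = pd.Series(["05/16/2014", "not a date"])
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
# the unparseable entry is now NaT (pandas' missing-datetime marker)
```

This is why the Date column ends up with some missing values even though the raw file has an entry in every row.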
More information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 629 entries, 0 to 628
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 629 non-null object
1 Name 629 non-null object
2 Dumpster 629 non-null int64
3 Month 629 non-null object
4 Year 629 non-null int64
5 Date 592 non-null datetime64[ns]
6 Weight 629 non-null float64
7 Volume 629 non-null int64
8 PlasticBottles 629 non-null int64
9 Polystyrene 629 non-null int64
10 CigaretteButts 629 non-null int64
11 GlassBottles 629 non-null int64
12 PlasticBags 629 non-null int64
13 Wrappers 629 non-null int64
14 SportsBalls 629 non-null int64
15 HomesPowered 629 non-null int64
16 plastic_dumping_score 629 non-null float64
17 drinking_smoking_score 629 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(11), object(3)
memory usage: 88.6+ KB
Convert year to integer
df['Year'] = df['Year'].astype(int)
Check for missing observations
df.isna().sum()
ID 0
Name 0
Dumpster 0
Month 0
Year 0
Date 37
Weight 0
Volume 0
PlasticBottles 0
Polystyrene 0
CigaretteButts 0
GlassBottles 0
PlasticBags 0
Wrappers 0
SportsBalls 0
HomesPowered 0
plastic_dumping_score 0
drinking_smoking_score 0
dtype: int64
Drop rows with missing values (the 37 rows whose Date could not be parsed)
df = df.dropna()
print(df.shape)
df.isna().sum()
(592, 18)
ID 0
Name 0
Dumpster 0
Month 0
Year 0
Date 0
Weight 0
Volume 0
PlasticBottles 0
Polystyrene 0
CigaretteButts 0
GlassBottles 0
PlasticBags 0
Wrappers 0
SportsBalls 0
HomesPowered 0
plastic_dumping_score 0
drinking_smoking_score 0
dtype: int64
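A toy illustration of what dropna() did here (column names mirror the dataset, values are made up): any row containing a missing value, including a NaT date, is removed.

```python
import pandas as pd

small = pd.DataFrame({"Date": pd.to_datetime(["2014-05-16", None]),
                      "Weight": [4.31, 2.74]})
cleaned = small.dropna()  # drops the row whose Date is NaT
# dropna(subset=["Date"]) would restrict the check to the Date column
```

Since Date was the only column with missing values, dropping all NAs is here equivalent to dropping rows with unparseable dates.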
Sort by date
df.sort_values(by=['Date'], inplace=True)
Visualisation
Produce list of theme options and select the third
options = ["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white"]
template = options[2]
Look at the new query syntax
df.query("Year < 2017")

|   | ID | Name | Dumpster | Month | Year | Date | Weight | Volume | PlasticBottles | Polystyrene | CigaretteButts | GlassBottles | PlasticBags | Wrappers | SportsBalls | HomesPowered | plastic_dumping_score | drinking_smoking_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mister | Mister Trash Wheel | 1 | May | 2014 | 2014-05-16 | 4.31 | 18 | 1450 | 1820 | 126000 | 72 | 584 | 1162 | 7 | 0 | -1.261913 | 3.589004 |
| 1 | mister | Mister Trash Wheel | 2 | May | 2014 | 2014-05-16 | 2.74 | 13 | 1120 | 1030 | 91000 | 42 | 496 | 874 | 5 | 0 | -0.927719 | 1.558504 |
| 2 | mister | Mister Trash Wheel | 3 | May | 2014 | 2014-05-16 | 3.45 | 15 | 2450 | 3100 | 105000 | 50 | 1080 | 2032 | 6 | 0 | -0.022253 | 1.875345 |
| 3 | mister | Mister Trash Wheel | 4 | May | 2014 | 2014-05-17 | 3.10 | 15 | 2380 | 2730 | 100000 | 52 | 896 | 1971 | 6 | 0 | -0.285576 | 2.064766 |
| 4 | mister | Mister Trash Wheel | 5 | May | 2014 | 2014-05-17 | 4.06 | 18 | 980 | 870 | 120000 | 72 | 368 | 753 | 7 | 0 | -1.664942 | 3.678861 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 161 | mister | Mister Trash Wheel | 162 | November | 2016 | 2016-11-30 | 2.75 | 18 | 3460 | 5840 | 16000 | 42 | 3260 | 3430 | 34 | 46 | 2.801646 | 0.685263 |
| 162 | mister | Mister Trash Wheel | 163 | December | 2016 | 2016-12-01 | 3.41 | 15 | 1840 | 4760 | 23000 | 43 | 3470 | 3800 | 6 | 57 | 2.925675 | 0.713372 |
| 163 | mister | Mister Trash Wheel | 164 | December | 2016 | 2016-12-06 | 2.55 | 15 | 1360 | 3850 | 34000 | 39 | 2340 | 4220 | 24 | 43 | 1.972863 | 0.685088 |
| 164 | mister | Mister Trash Wheel | 165 | December | 2016 | 2016-12-16 | 1.74 | 18 | 1880 | 2890 | 26000 | 59 | 2100 | 4040 | 20 | 29 | 1.233737 | 2.149005 |
| 165 | mister | Mister Trash Wheel | 166 | December | 2016 | 2017-01-02 | 2.13 | 15 | 2460 | 2740 | 32000 | 48 | 3250 | 4430 | 15 | 36 | 2.624921 | 1.110940 |
162 rows × 18 columns
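query() also accepts expressions referencing Python variables via the @ prefix, which is handy when the cutoff isn’t hard-coded. A sketch with a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [2014, 2016, 2018], "Volume": [18, 15, 20]})
cutoff = 2017
early = toy.query("Year < @cutoff")  # @cutoff refers to the Python variable
```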
Produce the first plot
dftemp = df.query("Year < 2017").copy()
fig = px.box(dftemp, x='Year',y='Volume',color = 'ID')
fig.update_layout(
title = "<b>Plot 1: Volume per ID box plot</b>",
xaxis = dict(title='Years available'),
yaxis = dict(title='Volume (m3)'),
template=template
)
fig.show()
Second figure
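The chunk below builds the Yearmonth key with .apply(lambda x: x.strftime(...)); pandas’ .dt accessor gives a vectorised equivalent, sketched here with made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2014-05-16", "2014-05-17", "2014-06-01"]))
ym = dates.dt.strftime("%Y-%m")  # vectorised; same result as the lambda version
```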
dftemp = df[['Date','ID','PlasticBottles']].copy()
dftemp['Yearmonth'] = dftemp['Date'].apply(lambda x: x.strftime('%Y-%m'))
del dftemp['Date']
dftemp=dftemp.groupby(['Yearmonth','ID']).sum()
dftemp.reset_index(inplace=True)
fig_line = px.line(dftemp, x = 'Yearmonth',y = 'PlasticBottles',color = 'ID',
labels = {'PlasticBottles': 'N of bottles', 'ID': 'Identifier', 'Yearmonth': 'Year and month'},template=template
)
fig_line.update_layout(
title = "<b>Plot 2: N of bottles per year and month</b>",
xaxis = dict(title='Time series'),
yaxis = dict(title='Amount (units)')
)
fig_line.show()
Check which unique years we have
df["Year"].unique()
array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023])
Produce a subplot for different years
fig = make_subplots(rows=1, cols=2)
c = 1
for year in df["Year"].unique():
if year > 2014 and year < 2017:
dftemp = df.query("Year == {}".format(year)).copy()
dftemp["Month"] = dftemp["Date"].apply(lambda x: x.strftime('%m'))
dftemp = dftemp[['Year','Month','PlasticBottles']].copy()
dftemp = dftemp.groupby(['Year','Month']).sum()
dftemp.reset_index(inplace=True)
fig.add_trace(go.Scatter(x=dftemp.Month, y=dftemp.PlasticBottles, name=str(year)), row=1, col=c)
c = c + 1
fig.update_layout(title_text="<b>Plot 3: Side By Side Subplots</b>", template=template)
fig.show()
Stacked subplots
fig = make_subplots(rows=2, cols=1)
r = 1
for year in df["Year"].unique():
if year > 2014 and year < 2017:
dftemp = df.query("Year == {}".format(year)).copy()
dftemp["Month"] = dftemp["Date"].apply(lambda x: x.strftime('%m'))
dftemp = dftemp[['Year','Month','PlasticBottles']].copy()
dftemp = dftemp.groupby(['Year','Month']).sum()
dftemp.reset_index(inplace=True)
fig.add_trace(go.Scatter(x=dftemp.Month, y=dftemp.PlasticBottles, name=str(year)), row=r, col=1)
r = r + 1
fig.update_layout(title_text="<b>Plot 4: Stacked Subplots</b>", template=template)
fig.show()
Gridded subplots with made-up data:
fig = make_subplots(rows=2, cols=2)
fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6]), row=1, col=1)
fig.add_trace(go.Scatter(x=[20, 30, 40], y=[50, 60, 70]), row=1, col=2)
fig.add_trace(go.Scatter(x=[300, 400, 500], y=[600, 700, 800]), row=2, col=1)
fig.add_trace(go.Scatter(x=[4000, 5000, 6000], y=[7000, 8000, 9000]), row=2, col=2)
fig.update_layout(title_text="Grid Subplots", template=template)
fig.show()
There’s only one barge at the moment. I guess they’re hoping to get more?
df["Name"].unique()
array(['Mister Trash Wheel'], dtype=object)
dftemp = df[['Date','plastic_dumping_score','Name']].copy()
dftemp['Yearmonth'] = dftemp['Date'].apply(lambda x: x.strftime('%Y-%m'))
del dftemp['Date']
dftemp=dftemp.groupby(['Yearmonth','Name']).sum()
dftemp.reset_index(inplace=True)
fig_area = px.area(dftemp, x = 'Yearmonth',y = 'plastic_dumping_score',color = 'Name', template=template)
fig_area.update_layout(
title = "<b>Plot 5: Dumping score per year and name</b>",
xaxis = dict(title='Year and Month'),
yaxis = dict(title='Total dumping score')
)
fig_area.show()
An interactive treemap
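The treemap code below reuses the groupby(...).sum() then reset_index() pattern seen throughout this post; a toy sketch (made-up counts) of what that pair of calls does:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [2014, 2014, 2015],
                    "Month": ["May", "May", "June"],
                    "PlasticBottles": [1450, 1120, 2450]})

agg = toy.groupby(["Year", "Month"]).sum()  # one row per (Year, Month) group
agg = agg.reset_index()                     # group keys become ordinary columns again
```

reset_index() matters here because Plotly expects Year and Month as columns, not as a MultiIndex.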
dftemp = df[['Month','Year','PlasticBottles']].copy()
dftemp=dftemp.groupby(['Month','Year']).sum()
dftemp.reset_index(inplace=True)
fig_tree_maps = px.treemap(dftemp, path= ['Year','Month'],values ='PlasticBottles',color_continuous_scale='RdBu', template=template)
fig_tree_maps.update_layout(
title = "<b>Plot 6: Tree map of bottles per year and month</b>"
)
fig_tree_maps.show()
And a 3D plot!
dftemp = df[['Year','drinking_smoking_score','plastic_dumping_score','ID']].copy()
dftemp=dftemp.groupby(['Year','ID']).mean()
dftemp.reset_index(inplace=True)
fig_scatter3D = px.scatter_3d(dftemp,x = 'Year',y='drinking_smoking_score', z = 'plastic_dumping_score', color = 'ID',opacity=0.7, template=template)
fig_scatter3D.update_layout(title = "<b>Plot 7: Year and plastic and drinking scores</b>")
fig_scatter3D.show()
And a pie chart:
dftemp = df[['Year','PlasticBags']].copy()
dftemp=dftemp.groupby(['Year']).sum()
dftemp.reset_index(inplace=True)
fig = go.Figure(
data=[go.Pie(
labels=dftemp['Year'],
values=dftemp['PlasticBags'],
sort=False)
])
fig.update_layout(title = "<b>Plot 8: Plastic bags per year</b>", template=template)
fig.show()
Reflections
Google Colab appears to be a good way of getting a Jupyter notebook up and running, accessible on many devices without installing Python and dependencies first.
There were more issues than expected (related to date formatting and package versions) in running both the R and Python code within this Quarto markdown document. Definitely a learning experience!
Katie Pyper had questions about rules of thumb and conventions for defining and handling outliers (as seen in the box plots) in regressions etc. An important topic for a separate session!
The same Colab/Python training will hopefully be of interest to a broader NHS audience.