Tidy Tuesday R performance

R
tidy tuesday
R performance
Authors

Jon Minton

Gatz Osorio

Kennedy Owuso-Afrije

Imran Chowdhury

Myriam Scansetti

Kate Pyper

Andrew Saul

Published

May 22, 2024

Introduction

We used a 1 GB reviews dataset, which is available on datareviews: books review.

Some examples for testing performance are available in the R Markdown file: First_test.Rmd

Further examples are available at this reference link: More examples

Content

  • We discussed microbenchmark, which helps to test and compare execution times.
    • Example to calculate the number of weeks between two variables (cast as dates vs strings) using difftime
    • Example to calculate the mean of a grouped column (base R aggregate vs dplyr group_by and summarise_at)
    • Example to compare vector initialisation (x <- c() vs x <- vector("integer", n)) when calculating a cumulative sum.
    • Example to calculate the mean of a data frame column (mean(dt[dt$b > .5, ]$a) vs mean(dt$a[dt$b > .5]))
    • Example to compare 1:n and seq(n)
    • Example to compare the old pipe (%>%) and the new pipe (|>)
  • We discussed data.table, which speeds up data manipulation.
  • We discussed the different file formats ‘csv’, ‘RDS’ and ‘Parquet’, and their compatibility, vulnerabilities and storage compression.
    • We discussed the Parquet compression types in the ‘arrow’ package: ‘gzip’, ‘snappy’ and ‘uncompressed’.
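
As a rough illustration of the microbenchmark comparisons listed above, here is a minimal sketch. It is not the code from First_test.Rmd: the data frame `dt`, the helper names `cumulative_grow` and `cumulative_prealloc`, and the size `n` are assumptions made up for this example. The benchmark itself only runs if the microbenchmark package is installed.

```r
set.seed(1)
n <- 1e4

# Hypothetical data frame standing in for the columns a and b in the notes
dt <- data.frame(a = runif(n), b = runif(n))

# Cumulative sum with a vector that grows from c() on every iteration
cumulative_grow <- function(n) {
  x <- c()
  total <- 0L
  for (i in seq_len(n)) {
    total <- total + i
    x[i] <- total
  }
  x
}

# Same calculation with the result vector allocated up front
cumulative_prealloc <- function(n) {
  x <- vector("integer", n)
  total <- 0L
  for (i in seq_len(n)) {
    total <- total + i
    x[i] <- total
  }
  x
}

# Run the timings only when microbenchmark is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    grow     = cumulative_grow(n),
    prealloc = cumulative_prealloc(n),
    times = 20
  ))
  print(microbenchmark::microbenchmark(
    subset_rows   = mean(dt[dt$b > .5, ]$a),  # subset the data frame, then take a
    subset_vector = mean(dt$a[dt$b > .5]),    # subset the column vector directly
    colon         = 1:n,
    seq           = seq(n),
    times = 20
  ))
}
```

Both variants in each pair return the same values, so any difference in the printed timings comes from how the work is done rather than from what is computed. Actual numbers depend on the machine and on `n`.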
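
The file format comparison can be sketched in a similar way. This example is an assumption-laden illustration, not the session's own code: it uses a small made-up data frame `df` instead of the 1 GB reviews dataset, writes to temporary files, and only tries Parquet if the arrow package is installed and the codec is available in that build.

```r
set.seed(1)
# Small hypothetical data frame (not the reviews dataset)
df <- data.frame(id = seq_len(1e4), score = round(runif(1e4), 3))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(df, csv_path, row.names = FALSE)  # plain text, no compression
saveRDS(df, rds_path)                       # R-specific format, compressed by default

# File sizes in bytes
sizes <- c(csv = file.size(csv_path), rds = file.size(rds_path))

# Parquet with each compression type mentioned above, if arrow is available
if (requireNamespace("arrow", quietly = TRUE)) {
  for (codec in c("gzip", "snappy", "uncompressed")) {
    if (arrow::codec_is_available(codec)) {
      path <- tempfile(fileext = ".parquet")
      arrow::write_parquet(df, path, compression = codec)
      sizes[paste0("parquet_", codec)] <- file.size(path)
    }
  }
}

print(sizes)
```

The relative sizes depend heavily on the data, so results on a toy data frame like this one should not be read as representative of the reviews dataset.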