Tidy Tuesday Extra - Packages and recursive searching
Author
Jon Minton
Published
April 20, 2024
Introduction
The latest tidytuesday dataset contains information on R packages and how they’re related to each other. The relationship information they contain poses some interesting challenges and opportunities. I (Jon), foolishly and/or sadistically, suggested trying to build a recursive algorithm which, given a given package, traces out the other packages that either depends on it, or that it depends on.
We didn’t quite get where we were hoping to, but hopefully in this post I’ll be able to unpack some of the challenges and opportunities this approach could bring.
Unlike most tidy tuesday challenges, my (horrible) suggestion brought us into the land of computer science, rather than the data science challenges that most tidy tuesday sessions tend to be focused on. Rather than cover what we did and didn’t achieve in that session, this post is largely my attempt to think through the challenge of building and developing a recursive function in R for allowing us to trace through a tree starting from a given node. The post isn’t intended to solve the challenge I initially suggested, but to lay out some of the conceptual groundwork required to do so later.
Recursion as a concept
A recursive function is a function that conditionally evokes itself. It’s a beautiful and horrifying idea - coding Inception - and as the link above suggests is often used when a more complex problem needs to be broken down into ever smaller steps. The fifth example in the above link says that it’s great for exploring and parsing tree and graph structures. And indeed that’s the kind of application I was thinking about when I saw the TidyTuesday dataset.
I’ve only found reason to build a recursive algorithm in R once before, perhaps around a decade ago. My problem was that I had a two dimensional regular matrix of values, but some of the cells contained missing values. I wanted to build a function that, for any missing value in the matrix, would impute a value for the missing cell given the average of the values in the eight cells that surrounded it, something known as a Moore Neighbourhood. The Wikipedia example image used is as follows:
If each missing cell \(C\) was surrounded only by non-missing cells, then there would have been no need for recursion. However there were examples in the data where two or more contiguous/neighbouring cells cells were missing. I used recursion to solve this problem by calling the imputation function on any missing neighbours (Say \(NE\)) of the missing target cell \(C\). The missing neighbour cell would then become the new target cell \(C\), and if any of this target cell’s neighbours were missing, then the imputation function would be called once again, with the last stage’s neighbour cell now the new target cell. Only if the condition that a target cell has no missing neighbours would the imputation function actually impute.
In effect, this use of recursion meant that, for a patch of missing cells, the imputation would occur outside-to-inside, i.e. from the cell with the most non-missing neighbours to the cell with the fewest.
Anyway, with that example in mind, let’s look at the data.
Loading the data
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
As we’ve finally looked enough at the dataset and documentation to know that shiny is the root, let’s work out how many packages are children where dependency type is imports and parent is shiny.
There are almost 15000 packages with this relationship as children.
We can now start to think about the recursive search problem by running the get_children function for each child package, with the name of the child now the name of the parent.
Let’s start with the five first packages who are direct children of shiny.
Let’s now think through a trace_to_root function, that uses recursion, and how it will work.
If trace is null, then start trace with node
If node is root, then return trace
If node is not root, then add each child to a trace, and rerun trace_to_root with the current node and trace parameters.
As part of debugging and development, I’ve added an option verbose, which reports on what the function is doing at each step.
trace_to_root <-function(node, trace =NULL, df, verbose =FALSE){if (verbose){message("Current node is ", node)message("Current trace is ", trace) }if (is.null(trace)){ trace <-list(node) }if (is_root(df, node)){if (verbose) {message("node ", node, " is a root, so returning trace") }return(trace) } else {if (verbose) {message("node ", node, " is not a root, so continuing") } children <- df |>filter(parent == node) |>pull("child")if (verbose) {message("have found ", length(children), " children of ", node) } pass_down <-function(child, trace, verbose =TRUE) {if (verbose) {message("current child is ", child)} trace <-c(trace, child)if (verbose) {message("trace is ", trace)}return(trace_to_root(child, trace, df = df, verbose = verbose)) }# This is where recursion happensreturn(map2(children, trace, pass_down) ) }}
As with many complex functions, this was developed through a number of steps, most of which involved extensive debugging and brow-furrowing. The use of the toy example and the graph, along with the verbose mode, made it easier to see whether the function was doing what I wanted it to, even if what it returns isn’t necessarily in the nicest format.
Let’s start with node ‘H’, which should be identified as a root, with no further children. This should mean the number of operations performed and reported should be short:
trace_to_root("H", df = toy_data, verbose =TRUE)
Current node is H
Current trace is
node H is a root, so returning trace
[[1]]
[1] "H"
This seems to work as expected. Node ‘D’ should be similarly simple:
trace_to_root("D", df = toy_data, verbose =TRUE)
Current node is D
Current trace is
node D is a root, so returning trace
[[1]]
[1] "D"
One step up in complexity/number of operations should be node G, which will be the first use-case that will involve some recursion.
trace_to_root("G", df = toy_data, verbose =TRUE)
Current node is G
Current trace is
node G is not a root, so continuing
have found 1 children of G
current child is J
trace is GJ
Current node is J
Current trace is GJ
node J is a root, so returning trace
[[1]]
[1] "G" "J"
The list returned contains two elements, the first of which is G, and the second of which is J. This is the correct trace sequence.
Now let’s look at node E
trace_to_root("E", df = toy_data, verbose =TRUE)
Current node is E
Current trace is
node E is not a root, so continuing
have found 2 children of E
current child is H
trace is EH
Current node is H
Current trace is EH
node H is a root, so returning trace
current child is I
trace is EI
Current node is I
Current trace is EI
node I is a root, so returning trace
[[1]]
[1] "E" "H"
[[2]]
[1] "E" "I"
This time the outer list is of length two, each of whcih containing two elements. The first sublist denotes the path E to H, and the second the path E to I. Once again this first with what we know about the part of the tree starting at E: it splits into two paths.
The list object contains two sublists. The first sublist indicates the path C to F. The second sublist itself contains two sublists: one denoting a path C to J; the second of which denotes a path G to J.
Now let’s look at the tree as a whole, i.e. start at node A:
This structure is more complex. At the outer level there is a list of length three.
Conclusion
Broadly, it appears the information contained in the list structures would allow the tree structure to be recovered. However, currently no trace returned is of length greater than 2. Before applying a recursive algorithm to the real data, more work should probably be done on defining exactly what type of output should be returned, and then implementing this return type. However, the function does appear to use recursion effectively, delving into various tree structures until roots of the trees are found, rather than either just stopping at an arbitrary depth, or never stopping and evaluating.