Tidy Tuesday Extra - Packages and recursive searching

Author

Jon Minton

Published

April 20, 2024

Introduction

The latest tidytuesday dataset contains information on R packages and how they’re related to each other. The relationship information they contain poses some interesting challenges and opportunities. I (Jon), foolishly and/or sadistically, suggested trying to build a recursive algorithm which, given a given package, traces out the other packages that either depends on it, or that it depends on.

We didn’t quite get where we were hoping to, but hopefully in this post I’ll be able to unpack some of the challenges and opportunities this approach could bring.

Unlike most tidy tuesday challenges, my (horrible) suggestion brought us into the land of computer science, rather than the data science challenges that most tidy tuesday sessions tend to be focused on. Rather than cover what we did and didn’t achieve in that session, this post is largely my attempt to think through the challenge of building and developing a recursive function in R for allowing us to trace through a tree starting from a given node. The post isn’t intended to solve the challenge I initially suggested, but to lay out some of the conceptual groundwork required to do so later.

Recursion as a concept

A recursive function is a function that conditionally evokes itself. It’s a beautiful and horrifying idea - coding Inception - and as the link above suggests is often used when a more complex problem needs to be broken down into ever smaller steps. The fifth example in the above link says that it’s great for exploring and parsing tree and graph structures. And indeed that’s the kind of application I was thinking about when I saw the TidyTuesday dataset.

I’ve only found reason to build a recursive algorithm in R once before, perhaps around a decade ago. My problem was that I had a two dimensional regular matrix of values, but some of the cells contained missing values. I wanted to build a function that, for any missing value in the matrix, would impute a value for the missing cell given the average of the values in the eight cells that surrounded it, something known as a Moore Neighbourhood. The Wikipedia example image used is as follows:

Moore Neighbourhood

If each missing cell \(C\) was surrounded only by non-missing cells, then there would have been no need for recursion. However there were examples in the data where two or more contiguous/neighbouring cells cells were missing. I used recursion to solve this problem by calling the imputation function on any missing neighbours (Say \(NE\)) of the missing target cell \(C\). The missing neighbour cell would then become the new target cell \(C\), and if any of this target cell’s neighbours were missing, then the imputation function would be called once again, with the last stage’s neighbour cell now the new target cell. Only if the condition that a target cell has no missing neighbours would the imputation function actually impute.

In effect, this use of recursion meant that, for a patch of missing cells, the imputation would occur outside-to-inside, i.e. from the cell with the most non-missing neighbours to the cell with the fewest.

Anyway, with that example in mind, let’s look at the data.

Loading the data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytuesdayR)

dta_list <- tidytuesdayR::tt_load(2024, week = 16)
--- Compiling #TidyTuesday Information for 2024-04-16 ----
--- There are 2 files available ---
--- Starting Download ---

    Downloading file 1 of 2: `shiny_revdeps.csv`
    Downloading file 2 of 2: `package_details.csv`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
--- Download complete ---
dta_main <- dta_list$shiny_revdeps
dta_main
# A tibble: 146,135 × 3
   child              dependency_type parent
   <chr>              <chr>           <chr> 
 1 AFheritability     depends         shiny 
 2 AMPLE              depends         shiny 
 3 animalEKF          depends         shiny 
 4 bde                depends         shiny 
 5 BDP2               depends         shiny 
 6 BoneProfileR       depends         shiny 
 7 clinDR             depends         shiny 
 8 CLME               depends         shiny 
 9 cocktailApp        depends         shiny 
10 competitiontoolbox depends         shiny 
# ℹ 146,125 more rows

Of course we didn’t notice the dataset was focused on shiny!

Exploration

What are the types of dependency listed?

unique(dta_main$dependency_type)
[1] "depends"   "imports"   "suggests"  "linkingto"

So, where the parent is shiny, how many types of each dependency are there?

dta_main %>% 
  filter(parent == "shiny") %>% 
  count(dependency_type)
# A tibble: 3 × 2
  dependency_type     n
  <chr>           <int>
1 depends            78
2 imports           793
3 suggests          305

Is shiny its own parent?

dta_main %>% 
  filter(parent == "shiny") |>
  filter(child == "shiny")
# A tibble: 0 × 3
# ℹ 3 variables: child <chr>, dependency_type <chr>, parent <chr>

No, fortunately.

Does the dataset contain examples where shiny is neither the parent nor the child?

dta_main |>
  filter(parent != "shiny" & child != "shiny")
# A tibble: 144,928 × 3
   child                 dependency_type parent  
   <chr>                 <chr>           <chr>   
 1 FAMetA                depends         LipidMS 
 2 teal.modules.clinical depends         teal    
 3 teal.modules.general  depends         teal    
 4 dartR                 depends         adegenet
 5 dartR.base            depends         adegenet
 6 dartR.captive         depends         adegenet
 7 dartR.data            depends         adegenet
 8 dartR.popgen          depends         adegenet
 9 dartR.sim             depends         adegenet
10 dartR.spatial         depends         adegenet
# ℹ 144,918 more rows

Yes it does.

Defining a problem

As we’ve finally looked enough at the dataset and documentation to know that shiny is the root, let’s work out how many packages are children where dependency type is imports and parent is shiny.

dta_main |>
  filter(parent == "shiny" & dependency_type == "imports") |>
  count(child) |>
  nrow()
[1] 793

There appear to be 793 packages that have this relationship.

Let’s say we want to take this list of 793 packages and find all packages that have them as children.

get_children <- function(parent_name) {
    dta_main |> 
    filter(parent == parent_name) |> 
    filter(dependency_type == "imports") |> 
    pull(child) |> 
    unique()
}

child_shiny <- get_children("shiny")

length(child_shiny)
[1] 793

There are almost 15000 packages with this relationship as children.

We can now start to think about the recursive search problem by running the get_children function for each child package, with the name of the child now the name of the parent.

Let’s start with the five first packages who are direct children of shiny.

some_shiny_children <- child_shiny[1:5]

some_shiny_grandchildren <- some_shiny_children |>
  map(~get_children(.))

some_shiny_children
[1] "ABACUS"        "abstractr"     "activAnalyzer" "AdaptGauss"   
[5] "adaptiveGPCA" 
some_shiny_grandchildren
[[1]]
character(0)

[[2]]
character(0)

[[3]]
character(0)

[[4]]
[1] "DistributionOptimization" "opGMMassessment"         
[3] "scapGNN"                  "Umatrix"                 

[[5]]
character(0)

For packages 1, 2, 3 and 5 there are no further children. However for package four there are four packages that are children.

Let’s see if the children of package 4 themselves have children.

great_grandchildren <- some_shiny_grandchildren[[4]] |>
  map(~get_children(.))

great_grandchildren
[[1]]
[1] "opGMMassessment"

[[2]]
[1] "EDOtrans"

[[3]]
character(0)

[[4]]
character(0)

Two of the great grandchildren have children.

Recursive search with a toy example

Let’s try to think through the fundamentals of a recursive function using a toy example.

toy_data <- tribble(
    ~parent, ~child, 
    "A", "B",
    "A", "C",
    "A", "D",
    "B", "E",
    "C", "F",
    "C", "G",
    "G", "J",
    "E", "H",
    "E", "I"
)

This dataset shows the following set of relationships:

flowchart TB

A --> B
A --> C
A --> D
B --> E
C --> F
C --> G
G --> J
E --> H
E --> I

Let’s first see if we can identify which of these nodes are roots. i.e. nodes that are children but have no parents.

is_root <- function(df, node_label){
    res <- df |> filter(parent == node_label) |> nrow()

    if(res == 0){
        return(TRUE)
    } else {
        return(FALSE)
    }
}

Let’s test this for each of the nodes in the toy dataset.

all_nodes <- unique(c(toy_data$parent, toy_data$child))


# run manually for a couple of examples:

is_root(toy_data, "A")
[1] FALSE
is_root(toy_data, "B")
[1] FALSE
is_root(toy_data, "D")
[1] TRUE
is_root(toy_data, "F")
[1] TRUE
# run using functional programming 

roots <-  
  sapply(all_nodes, function(x) is_root(toy_data, x))

roots
    A     B     C     G     E     D     F     J     H     I 
FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE 

Next we want to use the is_root() function inside a find_roots() function that will return all the roots in a dataset.

find_roots <- function(df){
    all_nodes <- unique(c(df$parent, df$child))
    
    roots <-  
      sapply(all_nodes, function(x) is_root(df, x))
    
    return(all_nodes[roots])
}

find_roots(toy_data)
[1] "D" "F" "J" "H" "I"

Recursive Root Finding Function

Let’s now think through a trace_to_root function, that uses recursion, and how it will work.

  • If trace is null, then start trace with node
  • If node is root, then return trace
  • If node is not root, then add each child to a trace, and rerun trace_to_root with the current node and trace parameters.

As part of debugging and development, I’ve added an option verbose, which reports on what the function is doing at each step.

trace_to_root <- function(node, trace = NULL, df, verbose = FALSE){
    if (verbose){
      message("Current node is ", node)
      message("Current trace is ", trace)
    }
   
    if (is.null(trace)){
        trace <- list(node)
    }

    if (is_root(df, node)){
        if (verbose) {
          message("node ", node, " is a root, so returning trace")
        }
        return(trace)
    } else {
        if (verbose) {
          message("node ", node, " is not a root, so continuing")
        }

        children <- df |> filter(parent == node) |> pull("child")

        if (verbose) {
          message("have found ", length(children), " children of ", node)
        }

        pass_down <- function(child, trace, verbose = TRUE) {
            if (verbose) {message("current child is ", child)}
            trace <- c(trace, child)
            if (verbose) {message("trace is ", trace)}
            return(trace_to_root(child, trace, df = df, verbose = verbose))
        }
        # This is where recursion happens
        return(
          map2(children, trace, pass_down)
        )

    }
}

As with many complex functions, this was developed through a number of steps, most of which involved extensive debugging and brow-furrowing. The use of the toy example and the graph, along with the verbose mode, made it easier to see whether the function was doing what I wanted it to, even if what it returns isn’t necessarily in the nicest format.

Let’s start with node ‘H’, which should be identified as a root, with no further children. This should mean the number of operations performed and reported should be short:

trace_to_root("H", df = toy_data, verbose = TRUE)
Current node is H
Current trace is 
node H is a root, so returning trace
[[1]]
[1] "H"

This seems to work as expected. Node ‘D’ should be similarly simple:

trace_to_root("D", df = toy_data, verbose = TRUE)
Current node is D
Current trace is 
node D is a root, so returning trace
[[1]]
[1] "D"

One step up in complexity/number of operations should be node G, which will be the first use-case that will involve some recursion.

trace_to_root("G", df = toy_data, verbose = TRUE)
Current node is G
Current trace is 
node G is not a root, so continuing
have found 1 children of G
current child is J
trace is GJ
Current node is J
Current trace is GJ
node J is a root, so returning trace
[[1]]
[1] "G" "J"

The list returned contains two elements, the first of which is G, and the second of which is J. This is the correct trace sequence.

Now let’s look at node E

trace_to_root("E", df = toy_data, verbose = TRUE)
Current node is E
Current trace is 
node E is not a root, so continuing
have found 2 children of E
current child is H
trace is EH
Current node is H
Current trace is EH
node H is a root, so returning trace
current child is I
trace is EI
Current node is I
Current trace is EI
node I is a root, so returning trace
[[1]]
[1] "E" "H"

[[2]]
[1] "E" "I"

This time the outer list is of length two, each of whcih containing two elements. The first sublist denotes the path E to H, and the second the path E to I. Once again this first with what we know about the part of the tree starting at E: it splits into two paths.

Now node C:

trace_to_root("C", df = toy_data, verbose = TRUE)
Current node is C
Current trace is 
node C is not a root, so continuing
have found 2 children of C
current child is F
trace is CF
Current node is F
Current trace is CF
node F is a root, so returning trace
current child is G
trace is CG
Current node is G
Current trace is CG
node G is not a root, so continuing
have found 1 children of G
current child is J
trace is CJ
Current node is J
Current trace is CJ
node J is a root, so returning trace
current child is J
trace is GJ
Current node is J
Current trace is GJ
node J is a root, so returning trace
[[1]]
[1] "C" "F"

[[2]]
[[2]][[1]]
[1] "C" "J"

[[2]][[2]]
[1] "G" "J"

The list object contains two sublists. The first sublist indicates the path C to F. The second sublist itself contains two sublists: one denoting a path C to J; the second of which denotes a path G to J.

Now let’s look at the tree as a whole, i.e. start at node A:

trace_to_root("A", df = toy_data, verbose = TRUE)
Current node is A
Current trace is 
node A is not a root, so continuing
have found 3 children of A
current child is B
trace is AB
Current node is B
Current trace is AB
node B is not a root, so continuing
have found 1 children of B
current child is E
trace is AE
Current node is E
Current trace is AE
node E is not a root, so continuing
have found 2 children of E
current child is H
trace is AH
Current node is H
Current trace is AH
node H is a root, so returning trace
current child is I
trace is EI
Current node is I
Current trace is EI
node I is a root, so returning trace
current child is E
trace is BE
Current node is E
Current trace is BE
node E is not a root, so continuing
have found 2 children of E
current child is H
trace is BH
Current node is H
Current trace is BH
node H is a root, so returning trace
current child is I
trace is EI
Current node is I
Current trace is EI
node I is a root, so returning trace
current child is C
trace is AC
Current node is C
Current trace is AC
node C is not a root, so continuing
have found 2 children of C
current child is F
trace is AF
Current node is F
Current trace is AF
node F is a root, so returning trace
current child is G
trace is CG
Current node is G
Current trace is CG
node G is not a root, so continuing
have found 1 children of G
current child is J
trace is CJ
Current node is J
Current trace is CJ
node J is a root, so returning trace
current child is J
trace is GJ
Current node is J
Current trace is GJ
node J is a root, so returning trace
current child is D
trace is AD
Current node is D
Current trace is AD
node D is a root, so returning trace
[[1]]
[[1]][[1]]
[[1]][[1]][[1]]
[1] "A" "H"

[[1]][[1]][[2]]
[1] "E" "I"


[[1]][[2]]
[[1]][[2]][[1]]
[1] "B" "H"

[[1]][[2]][[2]]
[1] "E" "I"



[[2]]
[[2]][[1]]
[1] "A" "F"

[[2]][[2]]
[[2]][[2]][[1]]
[1] "C" "J"

[[2]][[2]][[2]]
[1] "G" "J"



[[3]]
[1] "A" "D"

This structure is more complex. At the outer level there is a list of length three.

Conclusion

Broadly, it appears the information contained in the list structures would allow the tree structure to be recovered. However, currently no trace returned is of length greater than 2. Before applying a recursive algorithm to the real data, more work should probably be done on defining exactly what type of output should be returned, and then implementing this return type. However, the function does appear to use recursion effectively, delving into various tree structures until roots of the trees are found, rather than either just stopping at an arbitrary depth, or never stopping and evaluating.