Sunburst plots for taxonomic data

Trees Eukaryota Animalia Chordata

Since version 1.3.1 of galah, it has been possible to download taxonomic data using a ‘tree’ format from the data.tree package. Here I’ll demonstrate some ideas for plotting these trees using circular diagrams.

Martin Westgate
2022-02-17

Author:

Martin Westgate

Date:

17 February, 2022


Taxonomy is pretty important at the ALA. Every occurrence record in the atlas is linked to a unique taxonomic identifier. These identifiers are themselves drawn from expertly curated taxonomic datasets. This system of classification is so important to our infrastructure that we have a special name for it; the ‘taxonomic backbone’. But what does it look like?

Visualising trees is not particularly easy for me; I didn’t train in it, and the data structures involved can be a bit complex. More importantly, until recently it was difficult to download detailed taxonomic information from the ALA. Since version 1.3.1 of galah, however, it has been possible to download taxonomic trees using the atlas_taxonomy() function. Let’s have a go at visualising these trees now.

Downloading taxonomic trees

The first step is to choose a taxonomic group to represent in tree form. I’ve chosen the chordates (Phylum Chordata) because they aren’t too large a group and the names are fairly well-known. We can specify this within galah using the function galah_identify. The second piece of information we need to supply is how far ‘down’ the tree to travel. I’ve chosen the Order level here using galah_down_to(order); while we could have gone to the Family or even Genus, trying to traverse too many levels (i.e. to Genus or Species) would take a very long time. A full list of accepted ranks can be found by calling show_all_ranks().

library(galah)
chordate_orders <- galah_call() |>
  galah_identify("chordata") |>
  galah_down_to(order) |>
  atlas_taxonomy()

The object returned by atlas_taxonomy is slightly unusual; it uses the data.tree package, meaning that the dataset is literally structured like a tree. This is notably different from other representations of networks, such as you might find in igraph, for example. To get an idea of what the data look like, we can use the inbuilt print method for this data type:

library(data.tree)
print(chordate_orders, pruneMethod = "dist", limit = 10)
                            levelName
1  Chordata                          
2   ¦--Cephalochordata               
3   ¦   °--Amphioxi                  
4   ¦       °--... 1 nodes w/ 0 sub  
5   ¦--Tunicata                      
6   ¦   ¦--Appendicularia            
7   ¦   ¦   °--... 1 nodes w/ 0 sub  
8   ¦   ¦--Ascidiacea                
9   ¦   ¦   °--... 5 nodes w/ 0 sub  
10  ¦   °--Thaliacea                 
11  ¦       °--... 3 nodes w/ 0 sub  
12  °--Vertebrata                    
13      ¦--Agnatha                   
14      ¦   °--... 2 nodes w/ 2 sub  
15      °--Gnathostomata             
16          °--... 5 nodes w/ 134 sub

This shows there are three nodes directly beneath Chordata in the taxonomic hierarchy, of which the largest (by number of sub-nodes) is the vertebrates (Vertebrata). There is a lot we could do with this tree; each node contains a unique taxonomic identifer, for example, meaning that we could use individual nodes to make new queries using galah. However, for now a useful task is simply to visualise the structure of the whole tree.

Getting plot-ready data

Taxonomic trees are complex. While all species have a Kingdom, Phylum, Order, Class and Family, there are many intermediate categories that are ‘optional’. In practice, this means that when we convert to a data.frame for plotting, there are a lot of missing values; nodes that apply to some rows but not others.

df_rank <- ToDataFrameTypeCol(chordate_orders, type = "rank")
df_rank[10:20,]
   rank_phylum rank_subphylum rank_class           rank_informal
10    Chordata       Tunicata  Thaliacea                    <NA>
11    Chordata     Vertebrata       <NA>         Myxini, Agnatha
12    Chordata     Vertebrata       <NA> Petromyzontida, Agnatha
13    Chordata     Vertebrata   Amphibia           Gnathostomata
14    Chordata     Vertebrata   Amphibia           Gnathostomata
15    Chordata     Vertebrata   Amphibia           Gnathostomata
16    Chordata     Vertebrata       Aves           Gnathostomata
17    Chordata     Vertebrata       Aves           Gnathostomata
18    Chordata     Vertebrata       Aves           Gnathostomata
19    Chordata     Vertebrata       Aves           Gnathostomata
20    Chordata     Vertebrata       Aves           Gnathostomata
   rank_subclass rank_superorder rank_subinfraclass rank_infraclass
10          <NA>            <NA>               <NA>            <NA>
11          <NA>            <NA>               <NA>            <NA>
12          <NA>            <NA>               <NA>            <NA>
13  Lissamphibia            <NA>               <NA>            <NA>
14          <NA> Labyrinthodonta               <NA>            <NA>
15          <NA>       Salientia               <NA>            <NA>
16    Neognathae            <NA>               <NA>            <NA>
17    Neognathae            <NA>               <NA>            <NA>
18 Palaeognathae            <NA>            Ratitae            <NA>
19 Palaeognathae            <NA>            Ratitae            <NA>
20          <NA>            <NA>               <NA>            <NA>
   rank_subdivision zoology         rank_order
10                     <NA>            Salpida
11                     <NA>       Myxiniformes
12                     <NA> Petromyzontiformes
13                     <NA>              Anura
14                     <NA>      Temnospondyli
15                     <NA>       Sphenodontia
16                     <NA>   Accipititrifomes
17                     <NA>    Phaethontifomes
18                     <NA>      Casuariifomes
19                     <NA>   Dinornithiformes
20                     <NA>       Anseriformes

These missing values will show up as empty sections in the resulting diagram, which isn’t ideal. Instead, we can build this data.frame so as to place all nodes in order by row, with empty ‘levels’ being placed at the end. This also avoids the problem where ‘unnamed’ ranks are grouped in the same column. To achieve this, we simply choose a different node attribute (level in this case) to supply to the type argument.

df_level <- ToDataFrameTypeCol(chordate_orders, type = "level")
df_level[10:20, ]
    level_1    level_2       level_3        level_4
10 Chordata   Tunicata     Thaliacea        Salpida
11 Chordata Vertebrata       Agnatha         Myxini
12 Chordata Vertebrata       Agnatha Petromyzontida
13 Chordata Vertebrata Gnathostomata       Amphibia
14 Chordata Vertebrata Gnathostomata       Amphibia
15 Chordata Vertebrata Gnathostomata       Amphibia
16 Chordata Vertebrata Gnathostomata           Aves
17 Chordata Vertebrata Gnathostomata           Aves
18 Chordata Vertebrata Gnathostomata           Aves
19 Chordata Vertebrata Gnathostomata           Aves
20 Chordata Vertebrata Gnathostomata           Aves
              level_5          level_6          level_7 level_8
10               <NA>             <NA>             <NA>    <NA>
11       Myxiniformes             <NA>             <NA>    <NA>
12 Petromyzontiformes             <NA>             <NA>    <NA>
13       Lissamphibia            Anura             <NA>    <NA>
14    Labyrinthodonta    Temnospondyli             <NA>    <NA>
15          Salientia     Sphenodontia             <NA>    <NA>
16         Neognathae Accipititrifomes             <NA>    <NA>
17         Neognathae  Phaethontifomes             <NA>    <NA>
18      Palaeognathae          Ratitae    Casuariifomes    <NA>
19      Palaeognathae          Ratitae Dinornithiformes    <NA>
20       Anseriformes             <NA>             <NA>    <NA>

Another problem in this dataset is the existence of duplicated taxonomic names. This happens because different authorities place the same taxon in different parts of the tree, and while the ALA tries to clean up these issues, some disagreements remain. The code below assumes that each name is only present once, so we have to remove duplicates to proceed. Fortunately there is a function in package base that flags duplcated values as TRUE and unique values as FALSE. We can use this function to identify rows where order is not unique.

library(dplyr)
keep_rows <- !duplicated(df_rank$rank_order)
df_rank <- filter(df_rank, keep_rows)
df_level <- filter(df_level, keep_rows)

The next step is to determine how to represent this structure in a plot. At the moment we can’t do this, because the data are in ‘wide’ format. Instead, we need to reorder our data so that each node/taxon is represented once, and other plotting aesthetics can be added as additional columns. To achieve this, we first convert to ‘long’ format, preserving information like what row and column each taxonomic label was recorded in.

df_long <- tibble(
  row = rep(seq_len(nrow(df_level)), ncol(df_level)),
  level = rep(seq_len(ncol(df_level)), each = nrow(df_level)),
  taxon = do.call(c, df_level)) |> 
  filter(!is.na(taxon)) # remove missing values

Then, we can summarize this plot so that each row is a single taxon, recording some metadata about rows and columns from the original dataset

df_plot <- df_long |>
  group_by(taxon) |>
  summarize(
    xmin = min(row) - 1, 
    xmax = max(row), 
    ymin = level[1] - 1,
    ymax = level[1])
     
df_plot
# A tibble: 161 x 5
   taxon             xmin  xmax  ymin  ymax
   <chr>            <dbl> <int> <dbl> <int>
 1 Acanthopterygii     66    79     6     7
 2 Accipititrifomes    15    16     5     6
 3 Accipitriformes     21    22     4     5
 4 Actinopterygii      56    96     4     5
 5 Agnatha             10    12     2     3
 6 Albuliformes        62    63     6     7
 7 Amphibia            12    15     3     4
 8 Amphioxi             0     1     2     3
 9 Amphioxiformes       0     1     3     4
10 Anguilliformes      63    64     6     7
# ... with 151 more rows

Drawing

Our dataset now contains all the information we need to plot the structure of our taxonomic tree. As usual, we’re going to plot this with ggplot2.

library(ggplot2)
ggplot(df_plot) +
  geom_rect(
    mapping = aes(
      xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, 
      group = taxon,
      fill = ymax),
    color = "white")

While this is (probably) accurate, it’s not very informative. The most obvious missing element is labels; to add these, we’ll need to determine which nodes are ‘leaves’, and which are ‘branches’. We’ll also want to restrict labelling to larger branches, to avoid the text looking crowded. Finally, there is no need to label leaves with both a rectangle and text; so we’ll remove the leaf rectangles from the plot.

df_plot <- df_plot |> mutate(
  x_dist = xmax - xmin,
  is_leaf = taxon %in% df_rank$rank_order)

p <- ggplot() +
  geom_rect(
    data = filter(df_plot, !is_leaf),
    mapping = aes(
      xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, 
      group = taxon,
      fill = ymax),
    color = "white")

p +
  # branch labels
  geom_text(
    data = filter(df_plot, x_dist > 5),
    mapping = aes(
      x = xmin + (x_dist * 0.5), 
      y = ymin + 0.5,
      label = taxon),
    color  = "white",
    size = 3) +
  # leaf labels
  geom_text(
    data = filter(df_plot, is_leaf),
    aes(x = xmin + 0.5, y = ymin, label = taxon),
    angle = 90,
    hjust = 0,
    size = 2.5,
    color = "grey20")