Sunburst plots for taxonomic data

Since version 1.3.1 of {galah}, it has been possible to download taxonomic data using a ‘tree’ format from the {data.tree} package. Here I’ll demonstrate some ideas for plotting these trees using circular diagrams.

Trees
Eukaryota
Animalia
Chordata
Author

Martin Westgate

Published

February 17, 2022

Author

Martin Westgate

Date

February 17 2022

Taxonomy is pretty important at the ALA. Every occurrence record in the atlas is linked to a unique taxonomic identifier. These identifiers are themselves drawn from expertly curated taxonomic datasets. This system of classification is so important to our infrastructure that we have a special name for it; the ‘taxonomic backbone’. But what does it look like?

Visualising trees is not particularly easy for me; I didn’t train in it, and the data structures involved can be a bit complex. More importantly, until recently it was difficult to download detailed taxonomic information from the ALA. Since version 1.3.1 of galah, however, it has been possible to download taxonomic trees using the atlas_taxonomy() function. Let’s have a go at visualising these trees now.

Downloading taxonomic trees

The first step is to choose a taxonomic group to represent in tree form. I’ve chosen the chordates (Phylum Chordata) because they aren’t too large a group and the names are fairly well-known. We can specify this within galah using the function galah_identify. The second piece of information we need to supply is how far ‘down’ the tree to travel. I’ve chosen the Order level here using galah_down_to(order); while we could have gone to the Family or even Genus, trying to traverse too many levels (i.e. to Genus or Species) would take a very long time. A full list of accepted ranks can be found by calling show_all_ranks().

library(galah)
chordate_orders <- galah_call() |>
  galah_identify("chordata") |>
  galah_down_to(order) |>
  atlas_taxonomy()

The object returned by atlas_taxonomy is slightly unusual; it uses the data.tree package, meaning that the dataset is literally structured like a tree. This is notably different from other representations of networks, such as you might find in igraph, for example. To get an idea of what the data look like, we can use the inbuilt print method for this data type:

library(data.tree)
print(chordate_orders, pruneMethod = "dist", limit = 10)
                            levelName
1  Chordata                          
2   ¦--Cephalochordata               
3   ¦   °--Amphioxi                  
4   ¦       °--... 1 nodes w/ 0 sub  
5   ¦--Tunicata                      
6   ¦   ¦--Appendicularia            
7   ¦   ¦   °--... 1 nodes w/ 0 sub  
8   ¦   ¦--Ascidiacea                
9   ¦   ¦   °--... 5 nodes w/ 0 sub  
10  ¦   °--Thaliacea                 
11  ¦       °--... 3 nodes w/ 0 sub  
12  °--Vertebrata                    
13      ¦--Agnatha                   
14      ¦   °--... 2 nodes w/ 2 sub  
15      °--Gnathostomata             
16          °--... 5 nodes w/ 134 sub

This shows there are three nodes directly beneath Chordata in the taxonomic hierarchy, of which the largest (by number of sub-nodes) is the vertebrates (Vertebrata). There is a lot we could do with this tree; each node contains a unique taxonomic identifer, for example, meaning that we could use individual nodes to make new queries using galah. However, for now a useful task is simply to visualise the structure of the whole tree.

Getting plot-ready data

Taxonomic trees are complex. While all species have a Kingdom, Phylum, Order, Class and Family, there are many intermediate categories that are ‘optional’. In practice, this means that when we convert to a data.frame for plotting, there are a lot of missing values; nodes that apply to some rows but not others.

df_rank <- ToDataFrameTypeCol(chordate_orders, type = "rank")
df_rank[10:20,] |> tibble::as_tibble() |> print(max_footer_lines = 2)
# A tibble: 11 × 10
   rank_phylum rank_su…¹ rank_…² rank_…³ rank_…⁴ rank_…⁵ rank_…⁶ rank_…⁷ rank_…⁸
   <chr>       <chr>     <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
 1 Chordata    Tunicata  Thalia… <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
 2 Chordata    Vertebra… <NA>    Myxini… <NA>    <NA>    <NA>    <NA>    <NA>   
 3 Chordata    Vertebra… <NA>    Petrom… <NA>    <NA>    <NA>    <NA>    <NA>   
 4 Chordata    Vertebra… Amphib… Gnatho… Lissam… <NA>    <NA>    <NA>    <NA>   
 5 Chordata    Vertebra… Amphib… Gnatho… <NA>    Labyri… <NA>    <NA>    <NA>   
 6 Chordata    Vertebra… Amphib… Gnatho… <NA>    Salien… <NA>    <NA>    <NA>   
 7 Chordata    Vertebra… Aves    Gnatho… Neogna… <NA>    <NA>    <NA>    <NA>   
 8 Chordata    Vertebra… Aves    Gnatho… Neogna… <NA>    <NA>    <NA>    <NA>   
 9 Chordata    Vertebra… Aves    Gnatho… Palaeo… <NA>    Ratitae <NA>    <NA>   
10 Chordata    Vertebra… Aves    Gnatho… Palaeo… <NA>    Ratitae <NA>    <NA>   
11 Chordata    Vertebra… Aves    Gnatho… <NA>    <NA>    <NA>    <NA>    <NA>   
# … with 1 more variable: rank_order <chr>, and abbreviated variable names
#   ¹​rank_subphylum, ²​rank_class, ³​rank_informal, ⁴​rank_subclass, …

These missing values will show up as empty sections in the resulting diagram, which isn’t ideal. Instead, we can build this data.frame so as to place all nodes in order by row, with empty ‘levels’ being placed at the end. This also avoids the problem where ‘unnamed’ ranks are grouped in the same column. To achieve this, we simply choose a different node attribute (level in this case) to supply to the type argument.

df_level <- ToDataFrameTypeCol(chordate_orders, type = "level")
df_level[10:20, ] |> tibble::as_tibble()
# A tibble: 11 × 8
   level_1  level_2    level_3       level_4     level_5 level_6 level_7 level_8
   <chr>    <chr>      <chr>         <chr>       <chr>   <chr>   <chr>   <chr>  
 1 Chordata Tunicata   Thaliacea     Salpida     <NA>    <NA>    <NA>    <NA>   
 2 Chordata Vertebrata Agnatha       Myxini      Myxini… <NA>    <NA>    <NA>   
 3 Chordata Vertebrata Agnatha       Petromyzon… Petrom… <NA>    <NA>    <NA>   
 4 Chordata Vertebrata Gnathostomata Amphibia    Lissam… Anura   <NA>    <NA>   
 5 Chordata Vertebrata Gnathostomata Amphibia    Labyri… Temnos… <NA>    <NA>   
 6 Chordata Vertebrata Gnathostomata Amphibia    Salien… Spheno… <NA>    <NA>   
 7 Chordata Vertebrata Gnathostomata Aves        Neogna… Accipi… <NA>    <NA>   
 8 Chordata Vertebrata Gnathostomata Aves        Neogna… Phaeth… <NA>    <NA>   
 9 Chordata Vertebrata Gnathostomata Aves        Palaeo… Ratitae Casuar… <NA>   
10 Chordata Vertebrata Gnathostomata Aves        Palaeo… Ratitae Dinorn… <NA>   
11 Chordata Vertebrata Gnathostomata Aves        Accipi… <NA>    <NA>    <NA>   

Another problem in this dataset is the existence of duplicated taxonomic names. This happens because different authorities place the same taxon in different parts of the tree, and while the ALA tries to clean up these issues, some disagreements remain. The code below assumes that each name is only present once, so we have to remove duplicates to proceed. Fortunately there is a function in package base that flags duplcated values as TRUE and unique values as FALSE. We can use this function to identify rows where order is not unique.

library(dplyr)
keep_rows <- !duplicated(df_rank$rank_order)
df_rank <- filter(df_rank, keep_rows)
df_level <- filter(df_level, keep_rows)

The next step is to determine how to represent this structure in a plot. At the moment we can’t do this, because the data are in ‘wide’ format. Instead, we need to reorder our data so that each node/taxon is represented once, and other plotting aesthetics can be added as additional columns. To achieve this, we first convert to ‘long’ format, preserving information like what row and column each taxonomic label was recorded in.

df_long <- tibble(
  row = rep(seq_len(nrow(df_level)), ncol(df_level)),
  level = rep(seq_len(ncol(df_level)), each = nrow(df_level)),
  taxon = do.call(c, df_level)) |> 
  filter(!is.na(taxon)) # remove missing values

Then, we can summarize this plot so that each row is a single taxon, recording some metadata about rows and columns from the original dataset

df_plot <- df_long |>
  group_by(taxon) |>
  summarize(
    xmin = min(row) - 1, 
    xmax = max(row), 
    ymin = level[1] - 1,
    ymax = level[1])
     
df_plot
# A tibble: 161 × 5
   taxon             xmin  xmax  ymin  ymax
   <chr>            <dbl> <int> <dbl> <int>
 1 Acanthopterygii     61    74     6     7
 2 Accipititrifomes    15    16     5     6
 3 Accipitriformes     19    20     4     5
 4 Actinopterygii      56    96     4     5
 5 Agnatha             10    12     2     3
 6 Albuliformes        57    58     6     7
 7 Amphibia            12    15     3     4
 8 Amphioxi             0     1     2     3
 9 Amphioxiformes       0     1     3     4
10 Anguilliformes      58    59     6     7
# … with 151 more rows

Drawing

Our dataset now contains all the information we need to plot the structure of our taxonomic tree. As usual, we’re going to plot this with ggplot2.

library(ggplot2)
ggplot(df_plot) +
  geom_rect(
    mapping = aes(
      xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, 
      group = taxon,
      fill = ymax),
    color = "white")

While this is (probably) accurate, it’s not very informative. The most obvious missing element is labels; to add these, we’ll need to determine which nodes are ‘leaves’, and which are ‘branches’. We’ll also want to restrict labelling to larger branches, to avoid the text looking crowded. Finally, there is no need to label leaves with both a rectangle and text; so we’ll remove the leaf rectangles from the plot.

df_plot <- df_plot |> mutate(
  x_dist = xmax - xmin,
  is_leaf = taxon %in% df_rank$rank_order)

p <- ggplot() +
  geom_rect(
    data = filter(df_plot, !is_leaf),
    mapping = aes(
      xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, 
      group = taxon,
      fill = ymax),
    color = "white")

p +
  # branch labels
  geom_text(
    data = filter(df_plot, x_dist > 5),
    mapping = aes(
      x = xmin + (x_dist * 0.5), 
      y = ymin + 0.5,
      label = taxon),
    color  = "white",
    size = 3) +
  # leaf labels
  geom_text(
    data = filter(df_plot, is_leaf),
    aes(x = xmin + 0.5, y = ymin, label = taxon),
    angle = 90,
    hjust = 0,
    size = 2.5,
    color = "grey20") 

This is better, but not ideal. A much more pleasing look is to use coord_polar() to generate a circular plot; but this leads to linear text on a circular plot, which looks messy. Fortunately, the new package geomtextpath solves this problem. All we have to do is replace geom_text with geom_textpath, leaving all other code the same, and add coord_polar() at the end.

library(geomtextpath)

p <- p + 
  geom_textpath(
    data = filter(df_plot, x_dist > 5),
    mapping = aes(
      x = xmin + (x_dist * 0.5), 
      y = ymin + 0.5,
      label = taxon),
    color  = "white",
    size = 2.7) +
  geom_textpath(
    data = filter(df_plot, is_leaf),
    aes(x = xmin + 0.5, y = ymin, label = taxon),
    angle = 90,
    hjust = 0,
    size = 2.3,
    color = "grey20") +
  coord_polar()
p

Finally, we can add some finishing touches by changing the color scheme, hiding the background colors and legend, and resizing the y axis so all the labels are visible.

library(viridis)

p +
  scale_fill_viridis(begin = 0, end = 0.9, direction = -1) +
  lims(y = c(0, 9)) +
  theme_void() + 
  theme(legend.position = "none")

Done! This is a fun plot, but there are ways it could be expanded or improved, the most obvious of which is to find ways to add supplementary information. Wouldn’t it be great, for example, to add leaf-level record counts as marginal barplots? Or scale the size of segments to the number of records, rather than the number of clades? While none of these are impossible, I’m going to leave this here for now. I hope you like the result!

Expand for session info
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.2 (2022-10-31 ucrt)
 os       Windows 10 x64 (build 19044)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Australia.utf8
 ctype    English_Australia.utf8
 tz       Australia/Sydney
 date     2023-03-29
 pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version date (UTC) lib source
 data.tree    * 1.0.0   2020-08-03 [1] CRAN (R 4.2.1)
 dplyr        * 1.1.0   2023-01-29 [1] CRAN (R 4.2.2)
 galah        * 1.5.2   2023-03-20 [1] Github (AtlasOfLivingAustralia/galah@1b35520)
 geomtextpath * 0.1.1   2022-08-30 [1] CRAN (R 4.2.1)
 ggplot2      * 3.4.1   2023-02-10 [1] CRAN (R 4.2.2)
 htmltools    * 0.5.4   2022-12-07 [1] CRAN (R 4.2.2)
 sessioninfo  * 1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
 viridis      * 0.6.2   2021-10-13 [1] CRAN (R 4.2.1)
 viridisLite  * 0.4.1   2022-08-22 [1] CRAN (R 4.2.1)

 [1] C:/Users/KEL329/R-packages
 [2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.2.2/library

──────────────────────────────────────────────────────────────────────────────