Taxonomy is pretty important at the ALA. Every occurrence record in the atlas is linked to a unique taxonomic identifier. These identifiers are themselves drawn from expertly curated taxonomic datasets. This system of classification is so important to our infrastructure that we have a special name for it; the ‘taxonomic backbone’. But what does it look like?
Visualising trees is not particularly easy for me; I didn’t train in it, and the data structures involved can be a bit complex. More importantly, until recently it was difficult to download detailed taxonomic information from the ALA. Since version 1.3.1 of galah
, however, it has been possible to download taxonomic trees using the atlas_taxonomy()
function. Let’s have a go at visualising these trees now.
Downloading taxonomic trees
The first step is to choose a taxonomic group to represent in tree form. I’ve chosen the chordates (Phylum Chordata) because they aren’t too large a group and the names are fairly well-known. We can specify this within galah
using the function galah_identify
. The second piece of information we need to supply is how far ‘down’ the tree to travel. I’ve chosen the Order level here using galah_down_to(order)
; while we could have gone to the Family or even Genus, trying to traverse too many levels (i.e. to Genus or Species) would take a very long time. A full list of accepted ranks can be found by calling show_all_ranks()
.
library(galah)
<- galah_call() |>
chordate_orders galah_identify("chordata") |>
galah_down_to(order) |>
atlas_taxonomy()
The object returned by atlas_taxonomy
is slightly unusual; it uses the data.tree
package, meaning that the dataset is literally structured like a tree. This is notably different from other representations of networks, such as you might find in igraph
, for example. To get an idea of what the data look like, we can use the inbuilt print
method for this data type:
library(data.tree)
print(chordate_orders, pruneMethod = "dist", limit = 10)
levelName
1 Chordata
2 ¦--Cephalochordata
3 ¦ °--Amphioxi
4 ¦ °--... 1 nodes w/ 0 sub
5 ¦--Tunicata
6 ¦ ¦--Appendicularia
7 ¦ ¦ °--... 1 nodes w/ 0 sub
8 ¦ ¦--Ascidiacea
9 ¦ ¦ °--... 5 nodes w/ 0 sub
10 ¦ °--Thaliacea
11 ¦ °--... 3 nodes w/ 0 sub
12 °--Vertebrata
13 ¦--Agnatha
14 ¦ °--... 2 nodes w/ 2 sub
15 °--Gnathostomata
16 °--... 5 nodes w/ 134 sub
This shows there are three nodes directly beneath Chordata in the taxonomic hierarchy, of which the largest (by number of sub-nodes) is the vertebrates (Vertebrata). There is a lot we could do with this tree; each node contains a unique taxonomic identifer, for example, meaning that we could use individual nodes to make new queries using galah
. However, for now a useful task is simply to visualise the structure of the whole tree.
Getting plot-ready data
Taxonomic trees are complex. While all species have a Kingdom, Phylum, Order, Class and Family, there are many intermediate categories that are ‘optional’. In practice, this means that when we convert to a data.frame
for plotting, there are a lot of missing values; nodes that apply to some rows but not others.
<- ToDataFrameTypeCol(chordate_orders, type = "rank")
df_rank 10:20,] |> tibble::as_tibble() |> print(max_footer_lines = 2) df_rank[
# A tibble: 11 × 10
rank_phylum rank_su…¹ rank_…² rank_…³ rank_…⁴ rank_…⁵ rank_…⁶ rank_…⁷ rank_…⁸
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Chordata Tunicata Thalia… <NA> <NA> <NA> <NA> <NA> <NA>
2 Chordata Vertebra… <NA> Myxini… <NA> <NA> <NA> <NA> <NA>
3 Chordata Vertebra… <NA> Petrom… <NA> <NA> <NA> <NA> <NA>
4 Chordata Vertebra… Amphib… Gnatho… Lissam… <NA> <NA> <NA> <NA>
5 Chordata Vertebra… Amphib… Gnatho… <NA> Labyri… <NA> <NA> <NA>
6 Chordata Vertebra… Amphib… Gnatho… <NA> Salien… <NA> <NA> <NA>
7 Chordata Vertebra… Aves Gnatho… Neogna… <NA> <NA> <NA> <NA>
8 Chordata Vertebra… Aves Gnatho… Neogna… <NA> <NA> <NA> <NA>
9 Chordata Vertebra… Aves Gnatho… Palaeo… <NA> Ratitae <NA> <NA>
10 Chordata Vertebra… Aves Gnatho… Palaeo… <NA> Ratitae <NA> <NA>
11 Chordata Vertebra… Aves Gnatho… <NA> <NA> <NA> <NA> <NA>
# … with 1 more variable: rank_order <chr>, and abbreviated variable names
# ¹rank_subphylum, ²rank_class, ³rank_informal, ⁴rank_subclass, …
These missing values will show up as empty sections in the resulting diagram, which isn’t ideal. Instead, we can build this data.frame
so as to place all nodes in order by row, with empty ‘levels’ being placed at the end. This also avoids the problem where ‘unnamed’ ranks are grouped in the same column. To achieve this, we simply choose a different node attribute (level
in this case) to supply to the type
argument.
<- ToDataFrameTypeCol(chordate_orders, type = "level")
df_level 10:20, ] |> tibble::as_tibble() df_level[
# A tibble: 11 × 8
level_1 level_2 level_3 level_4 level_5 level_6 level_7 level_8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Chordata Tunicata Thaliacea Salpida <NA> <NA> <NA> <NA>
2 Chordata Vertebrata Agnatha Myxini Myxini… <NA> <NA> <NA>
3 Chordata Vertebrata Agnatha Petromyzon… Petrom… <NA> <NA> <NA>
4 Chordata Vertebrata Gnathostomata Amphibia Lissam… Anura <NA> <NA>
5 Chordata Vertebrata Gnathostomata Amphibia Labyri… Temnos… <NA> <NA>
6 Chordata Vertebrata Gnathostomata Amphibia Salien… Spheno… <NA> <NA>
7 Chordata Vertebrata Gnathostomata Aves Neogna… Accipi… <NA> <NA>
8 Chordata Vertebrata Gnathostomata Aves Neogna… Phaeth… <NA> <NA>
9 Chordata Vertebrata Gnathostomata Aves Palaeo… Ratitae Casuar… <NA>
10 Chordata Vertebrata Gnathostomata Aves Palaeo… Ratitae Dinorn… <NA>
11 Chordata Vertebrata Gnathostomata Aves Accipi… <NA> <NA> <NA>
Another problem in this dataset is the existence of duplicated taxonomic names. This happens because different authorities place the same taxon in different parts of the tree, and while the ALA tries to clean up these issues, some disagreements remain. The code below assumes that each name is only present once, so we have to remove duplicates to proceed. Fortunately there is a function in package base
that flags duplcated values as TRUE
and unique values as FALSE
. We can use this function to identify rows where order
is not unique.
library(dplyr)
<- !duplicated(df_rank$rank_order)
keep_rows <- filter(df_rank, keep_rows)
df_rank <- filter(df_level, keep_rows) df_level
The next step is to determine how to represent this structure in a plot. At the moment we can’t do this, because the data are in ‘wide’ format. Instead, we need to reorder our data so that each node/taxon is represented once, and other plotting aesthetics can be added as additional columns. To achieve this, we first convert to ‘long’ format, preserving information like what row and column each taxonomic label was recorded in.
<- tibble(
df_long row = rep(seq_len(nrow(df_level)), ncol(df_level)),
level = rep(seq_len(ncol(df_level)), each = nrow(df_level)),
taxon = do.call(c, df_level)) |>
filter(!is.na(taxon)) # remove missing values
Then, we can summarize this plot so that each row is a single taxon, recording some metadata about rows and columns from the original dataset
<- df_long |>
df_plot group_by(taxon) |>
summarize(
xmin = min(row) - 1,
xmax = max(row),
ymin = level[1] - 1,
ymax = level[1])
df_plot
# A tibble: 161 × 5
taxon xmin xmax ymin ymax
<chr> <dbl> <int> <dbl> <int>
1 Acanthopterygii 61 74 6 7
2 Accipititrifomes 15 16 5 6
3 Accipitriformes 19 20 4 5
4 Actinopterygii 56 96 4 5
5 Agnatha 10 12 2 3
6 Albuliformes 57 58 6 7
7 Amphibia 12 15 3 4
8 Amphioxi 0 1 2 3
9 Amphioxiformes 0 1 3 4
10 Anguilliformes 58 59 6 7
# … with 151 more rows
Drawing
Our dataset now contains all the information we need to plot the structure of our taxonomic tree. As usual, we’re going to plot this with ggplot2
.
library(ggplot2)
ggplot(df_plot) +
geom_rect(
mapping = aes(
xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
group = taxon,
fill = ymax),
color = "white")
While this is (probably) accurate, it’s not very informative. The most obvious missing element is labels; to add these, we’ll need to determine which nodes are ‘leaves’, and which are ‘branches’. We’ll also want to restrict labelling to larger branches, to avoid the text looking crowded. Finally, there is no need to label leaves with both a rectangle and text; so we’ll remove the leaf rectangles from the plot.
<- df_plot |> mutate(
df_plot x_dist = xmax - xmin,
is_leaf = taxon %in% df_rank$rank_order)
<- ggplot() +
p geom_rect(
data = filter(df_plot, !is_leaf),
mapping = aes(
xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
group = taxon,
fill = ymax),
color = "white")
+
p # branch labels
geom_text(
data = filter(df_plot, x_dist > 5),
mapping = aes(
x = xmin + (x_dist * 0.5),
y = ymin + 0.5,
label = taxon),
color = "white",
size = 3) +
# leaf labels
geom_text(
data = filter(df_plot, is_leaf),
aes(x = xmin + 0.5, y = ymin, label = taxon),
angle = 90,
hjust = 0,
size = 2.5,
color = "grey20")
This is better, but not ideal. A much more pleasing look is to use coord_polar()
to generate a circular plot; but this leads to linear text on a circular plot, which looks messy. Fortunately, the new package geomtextpath
solves this problem. All we have to do is replace geom_text
with geom_textpath
, leaving all other code the same, and add coord_polar()
at the end.
library(geomtextpath)
<- p +
p geom_textpath(
data = filter(df_plot, x_dist > 5),
mapping = aes(
x = xmin + (x_dist * 0.5),
y = ymin + 0.5,
label = taxon),
color = "white",
size = 2.7) +
geom_textpath(
data = filter(df_plot, is_leaf),
aes(x = xmin + 0.5, y = ymin, label = taxon),
angle = 90,
hjust = 0,
size = 2.3,
color = "grey20") +
coord_polar()
p
Finally, we can add some finishing touches by changing the color scheme, hiding the background colors and legend, and resizing the y axis so all the labels are visible.
library(viridis)
+
p scale_fill_viridis(begin = 0, end = 0.9, direction = -1) +
lims(y = c(0, 9)) +
theme_void() +
theme(legend.position = "none")
Done! This is a fun plot, but there are ways it could be expanded or improved, the most obvious of which is to find ways to add supplementary information. Wouldn’t it be great, for example, to add leaf-level record counts as marginal barplots? Or scale the size of segments to the number of records, rather than the number of clades? While none of these are impossible, I’m going to leave this here for now. I hope you like the result!
Expand for session info
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.2 (2022-10-31 ucrt)
os Windows 10 x64 (build 19044)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_Australia.utf8
ctype English_Australia.utf8
tz Australia/Sydney
date 2023-03-29
pandoc 2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
data.tree * 1.0.0 2020-08-03 [1] CRAN (R 4.2.1)
dplyr * 1.1.0 2023-01-29 [1] CRAN (R 4.2.2)
galah * 1.5.2 2023-03-20 [1] Github (AtlasOfLivingAustralia/galah@1b35520)
geomtextpath * 0.1.1 2022-08-30 [1] CRAN (R 4.2.1)
ggplot2 * 3.4.1 2023-02-10 [1] CRAN (R 4.2.2)
htmltools * 0.5.4 2022-12-07 [1] CRAN (R 4.2.2)
sessioninfo * 1.2.2 2021-12-06 [1] CRAN (R 4.2.1)
viridis * 0.6.2 2021-10-13 [1] CRAN (R 4.2.1)
viridisLite * 0.4.1 2022-08-22 [1] CRAN (R 4.2.1)
[1] C:/Users/KEL329/R-packages
[2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.2.2/library
──────────────────────────────────────────────────────────────────────────────