Since version 1.3.1 of galah
, it has been possible to download taxonomic data using a ‘tree’ format from the data.tree
package. Here I’ll demonstrate some ideas for plotting these trees using circular diagrams.
Taxonomy is pretty important at the ALA. Every occurrence record in the atlas is linked to a unique taxonomic identifier. These identifiers are themselves drawn from expertly curated taxonomic datasets. This system of classification is so important to our infrastructure that we have a special name for it; the ‘taxonomic backbone’. But what does it look like?
Visualising trees is not particularly easy for me; I didn’t train in it, and the data structures involved can be a bit complex. More importantly, until recently it was difficult to download detailed taxonomic information from the ALA. Since version 1.3.1 of galah
, however, it has been possible to download taxonomic trees using the atlas_taxonomy()
function. Let’s have a go at visualising these trees now.
The first step is to choose a taxonomic group to represent in tree form. I’ve chosen the chordates (Phylum Chordata) because they aren’t too large a group and the names are fairly well-known. We can specify this within galah
using the function galah_identify
. The second piece of information we need to supply is how far ‘down’ the tree to travel. I’ve chosen the Order level here using galah_down_to(order)
; while we could have gone to the Family or even Genus, trying to traverse too many levels (i.e. to Genus or Species) would take a very long time. A full list of accepted ranks can be found by calling show_all_ranks()
.
library(galah)
chordate_orders <- galah_call() |>
galah_identify("chordata") |>
galah_down_to(order) |>
atlas_taxonomy()
The object returned by atlas_taxonomy
is slightly unusual; it uses the data.tree
package, meaning that the dataset is literally structured like a tree. This is notably different from other representations of networks, such as you might find in igraph
, for example. To get an idea of what the data look like, we can use the inbuilt print
method for this data type:
levelName
1 Chordata
2 ¦--Cephalochordata
3 ¦ °--Amphioxi
4 ¦ °--... 1 nodes w/ 0 sub
5 ¦--Tunicata
6 ¦ ¦--Appendicularia
7 ¦ ¦ °--... 1 nodes w/ 0 sub
8 ¦ ¦--Ascidiacea
9 ¦ ¦ °--... 5 nodes w/ 0 sub
10 ¦ °--Thaliacea
11 ¦ °--... 3 nodes w/ 0 sub
12 °--Vertebrata
13 ¦--Agnatha
14 ¦ °--... 2 nodes w/ 2 sub
15 °--Gnathostomata
16 °--... 5 nodes w/ 134 sub
This shows there are three nodes directly beneath Chordata in the taxonomic hierarchy, of which the largest (by number of sub-nodes) is the vertebrates (Vertebrata). There is a lot we could do with this tree; each node contains a unique taxonomic identifer, for example, meaning that we could use individual nodes to make new queries using galah
. However, for now a useful task is simply to visualise the structure of the whole tree.
Taxonomic trees are complex. While all species have a Kingdom, Phylum, Order, Class and Family, there are many intermediate categories that are ‘optional’. In practice, this means that when we convert to a data.frame
for plotting, there are a lot of missing values; nodes that apply to some rows but not others.
df_rank <- ToDataFrameTypeCol(chordate_orders, type = "rank")
df_rank[10:20,]
rank_phylum rank_subphylum rank_class rank_informal
10 Chordata Tunicata Thaliacea <NA>
11 Chordata Vertebrata <NA> Myxini, Agnatha
12 Chordata Vertebrata <NA> Petromyzontida, Agnatha
13 Chordata Vertebrata Amphibia Gnathostomata
14 Chordata Vertebrata Amphibia Gnathostomata
15 Chordata Vertebrata Amphibia Gnathostomata
16 Chordata Vertebrata Aves Gnathostomata
17 Chordata Vertebrata Aves Gnathostomata
18 Chordata Vertebrata Aves Gnathostomata
19 Chordata Vertebrata Aves Gnathostomata
20 Chordata Vertebrata Aves Gnathostomata
rank_subclass rank_superorder rank_subinfraclass rank_infraclass
10 <NA> <NA> <NA> <NA>
11 <NA> <NA> <NA> <NA>
12 <NA> <NA> <NA> <NA>
13 Lissamphibia <NA> <NA> <NA>
14 <NA> Labyrinthodonta <NA> <NA>
15 <NA> Salientia <NA> <NA>
16 Neognathae <NA> <NA> <NA>
17 Neognathae <NA> <NA> <NA>
18 Palaeognathae <NA> Ratitae <NA>
19 Palaeognathae <NA> Ratitae <NA>
20 <NA> <NA> <NA> <NA>
rank_subdivision zoology rank_order
10 <NA> Salpida
11 <NA> Myxiniformes
12 <NA> Petromyzontiformes
13 <NA> Anura
14 <NA> Temnospondyli
15 <NA> Sphenodontia
16 <NA> Accipititrifomes
17 <NA> Phaethontifomes
18 <NA> Casuariifomes
19 <NA> Dinornithiformes
20 <NA> Anseriformes
These missing values will show up as empty sections in the resulting diagram, which isn’t ideal. Instead, we can build this data.frame
so as to place all nodes in order by row, with empty ‘levels’ being placed at the end. This also avoids the problem where ‘unnamed’ ranks are grouped in the same column. To achieve this, we simply choose a different node attribute (level
in this case) to supply to the type
argument.
df_level <- ToDataFrameTypeCol(chordate_orders, type = "level")
df_level[10:20, ]
level_1 level_2 level_3 level_4
10 Chordata Tunicata Thaliacea Salpida
11 Chordata Vertebrata Agnatha Myxini
12 Chordata Vertebrata Agnatha Petromyzontida
13 Chordata Vertebrata Gnathostomata Amphibia
14 Chordata Vertebrata Gnathostomata Amphibia
15 Chordata Vertebrata Gnathostomata Amphibia
16 Chordata Vertebrata Gnathostomata Aves
17 Chordata Vertebrata Gnathostomata Aves
18 Chordata Vertebrata Gnathostomata Aves
19 Chordata Vertebrata Gnathostomata Aves
20 Chordata Vertebrata Gnathostomata Aves
level_5 level_6 level_7 level_8
10 <NA> <NA> <NA> <NA>
11 Myxiniformes <NA> <NA> <NA>
12 Petromyzontiformes <NA> <NA> <NA>
13 Lissamphibia Anura <NA> <NA>
14 Labyrinthodonta Temnospondyli <NA> <NA>
15 Salientia Sphenodontia <NA> <NA>
16 Neognathae Accipititrifomes <NA> <NA>
17 Neognathae Phaethontifomes <NA> <NA>
18 Palaeognathae Ratitae Casuariifomes <NA>
19 Palaeognathae Ratitae Dinornithiformes <NA>
20 Anseriformes <NA> <NA> <NA>
Another problem in this dataset is the existence of duplicated taxonomic names. This happens because different authorities place the same taxon in different parts of the tree, and while the ALA tries to clean up these issues, some disagreements remain. The code below assumes that each name is only present once, so we have to remove duplicates to proceed. Fortunately there is a function in package base
that flags duplcated values as TRUE
and unique values as FALSE
. We can use this function to identify rows where order
is not unique.
library(dplyr)
keep_rows <- !duplicated(df_rank$rank_order)
df_rank <- filter(df_rank, keep_rows)
df_level <- filter(df_level, keep_rows)
The next step is to determine how to represent this structure in a plot. At the moment we can’t do this, because the data are in ‘wide’ format. Instead, we need to reorder our data so that each node/taxon is represented once, and other plotting aesthetics can be added as additional columns. To achieve this, we first convert to ‘long’ format, preserving information like what row and column each taxonomic label was recorded in.
Then, we can summarize this plot so that each row is a single taxon, recording some metadata about rows and columns from the original dataset
df_plot <- df_long |>
group_by(taxon) |>
summarize(
xmin = min(row) - 1,
xmax = max(row),
ymin = level[1] - 1,
ymax = level[1])
df_plot
# A tibble: 161 x 5
taxon xmin xmax ymin ymax
<chr> <dbl> <int> <dbl> <int>
1 Acanthopterygii 66 79 6 7
2 Accipititrifomes 15 16 5 6
3 Accipitriformes 21 22 4 5
4 Actinopterygii 56 96 4 5
5 Agnatha 10 12 2 3
6 Albuliformes 62 63 6 7
7 Amphibia 12 15 3 4
8 Amphioxi 0 1 2 3
9 Amphioxiformes 0 1 3 4
10 Anguilliformes 63 64 6 7
# ... with 151 more rows
Our dataset now contains all the information we need to plot the structure of our taxonomic tree. As usual, we’re going to plot this with ggplot2
.
library(ggplot2)
ggplot(df_plot) +
geom_rect(
mapping = aes(
xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
group = taxon,
fill = ymax),
color = "white")
While this is (probably) accurate, it’s not very informative. The most obvious missing element is labels; to add these, we’ll need to determine which nodes are ‘leaves’, and which are ‘branches’. We’ll also want to restrict labelling to larger branches, to avoid the text looking crowded. Finally, there is no need to label leaves with both a rectangle and text; so we’ll remove the leaf rectangles from the plot.
df_plot <- df_plot |> mutate(
x_dist = xmax - xmin,
is_leaf = taxon %in% df_rank$rank_order)
p <- ggplot() +
geom_rect(
data = filter(df_plot, !is_leaf),
mapping = aes(
xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
group = taxon,
fill = ymax),
color = "white")
p +
# branch labels
geom_text(
data = filter(df_plot, x_dist > 5),
mapping = aes(
x = xmin + (x_dist * 0.5),
y = ymin + 0.5,
label = taxon),
color = "white",
size = 3) +
# leaf labels
geom_text(
data = filter(df_plot, is_leaf),
aes(x = xmin + 0.5, y = ymin, label = taxon),
angle = 90,
hjust = 0,
size = 2.5,
color = "grey20")