Since version 1.3.1 of
galah, it has been possible to download taxonomic data using a ‘tree’ format from the
data.tree package. Here I’ll demonstrate some ideas for plotting these trees using circular diagrams.
Taxonomy is pretty important at the ALA. Every occurrence record in the atlas is linked to a unique taxonomic identifier. These identifiers are themselves drawn from expertly curated taxonomic datasets. This system of classification is so important to our infrastructure that we have a special name for it; the ‘taxonomic backbone’. But what does it look like?
Visualising trees is not particularly easy for me; I didn’t train in it, and the data structures involved can be a bit complex. More importantly, until recently it was difficult to download detailed taxonomic information from the ALA. Since version 1.3.1 of
galah, however, it has been possible to download taxonomic trees using the
atlas_taxonomy() function. Let’s have a go at visualising these trees now.
The first step is to choose a taxonomic group to represent in tree form. I’ve chosen the chordates (Phylum Chordata) because they aren’t too large a group and the names are fairly well-known. We can specify this within
galah using the function
galah_identify. The second piece of information we need to supply is how far ‘down’ the tree to travel. I’ve chosen the Order level here using
galah_down_to(order); while we could have gone to the Family or even Genus, trying to traverse too many levels (i.e. to Genus or Species) would take a very long time. A full list of accepted ranks can be found by calling
The object returned by
atlas_taxonomy is slightly unusual; it uses the
data.tree package, meaning that the dataset is literally structured like a tree. This is notably different from other representations of networks, such as you might find in
igraph, for example. To get an idea of what the data look like, we can use the inbuilt
levelName 1 Chordata 2 ¦--Cephalochordata 3 ¦ °--Amphioxi 4 ¦ °--... 1 nodes w/ 0 sub 5 ¦--Tunicata 6 ¦ ¦--Appendicularia 7 ¦ ¦ °--... 1 nodes w/ 0 sub 8 ¦ ¦--Ascidiacea 9 ¦ ¦ °--... 5 nodes w/ 0 sub 10 ¦ °--Thaliacea 11 ¦ °--... 3 nodes w/ 0 sub 12 °--Vertebrata 13 ¦--Agnatha 14 ¦ °--... 2 nodes w/ 2 sub 15 °--Gnathostomata 16 °--... 5 nodes w/ 134 sub
This shows there are three nodes directly beneath Chordata in the taxonomic hierarchy, of which the largest (by number of sub-nodes) is the vertebrates (Vertebrata). There is a lot we could do with this tree; each node contains a unique taxonomic identifer, for example, meaning that we could use individual nodes to make new queries using
galah. However, for now a useful task is simply to visualise the structure of the whole tree.
Taxonomic trees are complex. While all species have a Kingdom, Phylum, Order, Class and Family, there are many intermediate categories that are ‘optional’. In practice, this means that when we convert to a
data.frame for plotting, there are a lot of missing values; nodes that apply to some rows but not others.
df_rank <- ToDataFrameTypeCol(chordate_orders, type = "rank") df_rank[10:20,]
rank_phylum rank_subphylum rank_class rank_informal 10 Chordata Tunicata Thaliacea <NA> 11 Chordata Vertebrata <NA> Myxini, Agnatha 12 Chordata Vertebrata <NA> Petromyzontida, Agnatha 13 Chordata Vertebrata Amphibia Gnathostomata 14 Chordata Vertebrata Amphibia Gnathostomata 15 Chordata Vertebrata Amphibia Gnathostomata 16 Chordata Vertebrata Aves Gnathostomata 17 Chordata Vertebrata Aves Gnathostomata 18 Chordata Vertebrata Aves Gnathostomata 19 Chordata Vertebrata Aves Gnathostomata 20 Chordata Vertebrata Aves Gnathostomata rank_subclass rank_superorder rank_subinfraclass rank_infraclass 10 <NA> <NA> <NA> <NA> 11 <NA> <NA> <NA> <NA> 12 <NA> <NA> <NA> <NA> 13 Lissamphibia <NA> <NA> <NA> 14 <NA> Labyrinthodonta <NA> <NA> 15 <NA> Salientia <NA> <NA> 16 Neognathae <NA> <NA> <NA> 17 Neognathae <NA> <NA> <NA> 18 Palaeognathae <NA> Ratitae <NA> 19 Palaeognathae <NA> Ratitae <NA> 20 <NA> <NA> <NA> <NA> rank_subdivision zoology rank_order 10 <NA> Salpida 11 <NA> Myxiniformes 12 <NA> Petromyzontiformes 13 <NA> Anura 14 <NA> Temnospondyli 15 <NA> Sphenodontia 16 <NA> Accipititrifomes 17 <NA> Phaethontifomes 18 <NA> Casuariifomes 19 <NA> Dinornithiformes 20 <NA> Anseriformes
These missing values will show up as empty sections in the resulting diagram, which isn’t ideal. Instead, we can build this
data.frame so as to place all nodes in order by row, with empty ‘levels’ being placed at the end. This also avoids the problem where ‘unnamed’ ranks are grouped in the same column. To achieve this, we simply choose a different node attribute (
level in this case) to supply to the
df_level <- ToDataFrameTypeCol(chordate_orders, type = "level") df_level[10:20, ]
level_1 level_2 level_3 level_4 10 Chordata Tunicata Thaliacea Salpida 11 Chordata Vertebrata Agnatha Myxini 12 Chordata Vertebrata Agnatha Petromyzontida 13 Chordata Vertebrata Gnathostomata Amphibia 14 Chordata Vertebrata Gnathostomata Amphibia 15 Chordata Vertebrata Gnathostomata Amphibia 16 Chordata Vertebrata Gnathostomata Aves 17 Chordata Vertebrata Gnathostomata Aves 18 Chordata Vertebrata Gnathostomata Aves 19 Chordata Vertebrata Gnathostomata Aves 20 Chordata Vertebrata Gnathostomata Aves level_5 level_6 level_7 level_8 10 <NA> <NA> <NA> <NA> 11 Myxiniformes <NA> <NA> <NA> 12 Petromyzontiformes <NA> <NA> <NA> 13 Lissamphibia Anura <NA> <NA> 14 Labyrinthodonta Temnospondyli <NA> <NA> 15 Salientia Sphenodontia <NA> <NA> 16 Neognathae Accipititrifomes <NA> <NA> 17 Neognathae Phaethontifomes <NA> <NA> 18 Palaeognathae Ratitae Casuariifomes <NA> 19 Palaeognathae Ratitae Dinornithiformes <NA> 20 Anseriformes <NA> <NA> <NA>
Another problem in this dataset is the existence of duplicated taxonomic names. This happens because different authorities place the same taxon in different parts of the tree, and while the ALA tries to clean up these issues, some disagreements remain. The code below assumes that each name is only present once, so we have to remove duplicates to proceed. Fortunately there is a function in package
base that flags duplcated values as
TRUE and unique values as
FALSE. We can use this function to identify rows where
order is not unique.
The next step is to determine how to represent this structure in a plot. At the moment we can’t do this, because the data are in ‘wide’ format. Instead, we need to reorder our data so that each node/taxon is represented once, and other plotting aesthetics can be added as additional columns. To achieve this, we first convert to ‘long’ format, preserving information like what row and column each taxonomic label was recorded in.
Then, we can summarize this plot so that each row is a single taxon, recording some metadata about rows and columns from the original dataset
# A tibble: 161 x 5 taxon xmin xmax ymin ymax <chr> <dbl> <int> <dbl> <int> 1 Acanthopterygii 66 79 6 7 2 Accipititrifomes 15 16 5 6 3 Accipitriformes 21 22 4 5 4 Actinopterygii 56 96 4 5 5 Agnatha 10 12 2 3 6 Albuliformes 62 63 6 7 7 Amphibia 12 15 3 4 8 Amphioxi 0 1 2 3 9 Amphioxiformes 0 1 3 4 10 Anguilliformes 63 64 6 7 # ... with 151 more rows
Our dataset now contains all the information we need to plot the structure of our taxonomic tree. As usual, we’re going to plot this with
While this is (probably) accurate, it’s not very informative. The most obvious missing element is labels; to add these, we’ll need to determine which nodes are ‘leaves’, and which are ‘branches’. We’ll also want to restrict labelling to larger branches, to avoid the text looking crowded. Finally, there is no need to label leaves with both a rectangle and text; so we’ll remove the leaf rectangles from the plot.
df_plot <- df_plot |> mutate( x_dist = xmax - xmin, is_leaf = taxon %in% df_rank$rank_order) p <- ggplot() + geom_rect( data = filter(df_plot, !is_leaf), mapping = aes( xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, group = taxon, fill = ymax), color = "white") p + # branch labels geom_text( data = filter(df_plot, x_dist > 5), mapping = aes( x = xmin + (x_dist * 0.5), y = ymin + 0.5, label = taxon), color = "white", size = 3) + # leaf labels geom_text( data = filter(df_plot, is_leaf), aes(x = xmin + 0.5, y = ymin, label = taxon), angle = 90, hjust = 0, size = 2.5, color = "grey20")