In ecology and biology research, box plots are a wonderfully simple and efficient way to summarise data from different groups (e.g., species, populations, experimental conditions). However, this simplicity can sometimes hide the underlying structure of data, unintentionally misleading readers.
Luckily, there are alternatives to displaying data with box plots, and these options have grown increasingly easy to make in R. This post shows more transparent ways to summarise data, seen through an ecological lens.
In our example, we’ll compare several species of plants using a leaf trait commonly used to predict a plant’s performance: leaf dry mass per area (or more simply, the size of a leaf relative to its surface area). We’ll explain how to summarise this trait across a set of species using box plots, and show two ways of also displaying the distribution of the data to be more transparent.
Why not box plots?
Box plots can mask differences in sample size and underlying data structure, which can make them prone to misinterpretation. One such example, shown below, is based on Cedric Scherer’s blog post on the same topic, which I encourage you to check out.
Code
library(ggplot2)
library(dplyr)
library(glue)
library(ggtext)
library(gganimate)
# generate sample data
set.seed(2021)
<- tibble(
data group = factor(c(rep("Group 1", 100), rep("Group 2", 250), rep("Group 3", 25))),
value = c(seq(0, 20, length.out = 100),
c(rep(0, 5), rnorm(30, 2, .1), rnorm(90, 5.4, .1), rnorm(90, 14.6, .1), rnorm(30, 18, .1), rep(20, 5)),
rep(seq(0, 20, length.out = 5), 5))
%>%
) rowwise() |>
mutate(value = if_else(group == "Group 2", value + rnorm(1, 0, .4), value)) |>
group_by(group) |>
mutate(sample_size = n())
<- ggplot(data) +
anim geom_boxplot(aes(y = value),
fill = "grey92") +
geom_point(aes(x = 0, y = value),
size = 2,
alpha = .3,
position = position_jitter(seed = 1, width = .2)) +
labs(title = "N = {closest_state}") +
::theme_pilot() +
pilotscale_x_discrete() +
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_markdown(),
plot.title = element_text(hjust = 0.5)) +
transition_states(sample_size,
transition_length = 1,
state_length = 2) +
ease_aes('cubic-in-out') +
enter_fly(x_loc = 0, y_loc = 15) +
exit_fly(x_loc = 0, y_loc = 10)
To understand how to summarise data more transparently, let’s see how adding data points and distributions can help improve our visualisations.
Download data
Let’s start by downloading some data. We’ll use the {austraits} package to get data of Australian plant traits from AusTraits, an open-source database of nearly 500 traits across more than 30,000 taxa from field surveys, published papers, and taxonomic books.
We can download the entire austraits dataset with the load_austraits()
function.
To make analyses more reproducible, load_austraits()
requires users to specify a version. You can see available versions using the get_versions()
function.
library(tidyverse)
library(here)
library(austraits)
<- load_austraits(version = "4.1.0",
austraits path = here::here("posts", "data", "austraits"))
There’s a huge range of available plant traits, and if you are interested in seeing what else there is, you should check out their AusTraits Plant Dictionary for a complete list. For a brief breakdown of trait names and the data available for each, you can use summarise_austraits("trait_name")
. Rather than showing the entire list, here’s a random sample to give you an idea.
# show a sample of available traits
%>%
austraits summarise_austraits("trait_name") |>
slice_sample(n = 10) |>
as_tibble()
# A tibble: 10 × 5
trait_name n_records n_dataset n_taxa percent_total
<chr> <int> <int> <int> <dbl>
1 leaf_work_to_shear 192 5 137 0.000153
2 leaf_lobation 4309 5 4193 0.00344
3 soil_seedbank 522 4 515 0.000417
4 leaf_cell_wall_N_per_cell_wall_dry_… 29 1 22 0.0000231
5 seed_shape 2965 8 2722 0.00237
6 fire_and_establishing 1618 2 1586 0.00129
7 leaflet_area 424 5 51 0.000338
8 flower_perianth_merism 268 1 202 0.000214
9 water_potential_88percent_lost_cond… 81 2 79 0.0000646
10 wood_delta15N 256 3 35 0.000204
Leaf mass per area (LMA)
The plant trait data we’ll download is leaf dry mass per area (LMA), which measures how big and dense a leaf is compared to its surface area. This example from Butrim & Royer (2020) shows that the small, dense leaf on the top has much higher LMA than the two large, light leaves on the bottom.
Leaf mass per area is a morphological trait that can indicate a plant’s survival strategy. LMA can vary widely across species:
High-LMA
Medium-LMA
Low-LMA
On one end of the spectrum, plants with small, dense leaves (High-LMA) acquire resources (like nutrients and energy) gradually and grow slowly, aiming to conserve what resources they have. These plants tend to have an advantage in unproductive environments, where they can efficiently store whatever limited resources are available.
On the other end of the spectrum, plants with big, light leaves (Low-LMA) acquire resources quickly and grow fast, aiming to out-compete others for the resources on offer. These plants tend to have an advantage in highly productive environments (where they can get lots of resources and use them quickly) (Poorter et al. 2009).
Extract trait data
To download LMA trait data, we’ll use extract_trait()
. This downloads a list
of data and metadata about authors, collection methods, locations, taxonomic information, and data structure. We’ll then use purrr::pluck()
to grab the data.frame
we want from our list.
You can use lookup_trait()
to search for traits
# lookup_trait(austraits, "leaf_mass")
# Get trait data
<- austraits |>
leaf_mass extract_trait("leaf_mass_per_area") |>
pluck("traits")
leaf_mass
# A tibble: 27,946 × 24
dataset_id taxon_name observation_id trait_name value unit entity_type
<chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Ahrens_2019 Corymbia calop… 001 leaf_mass… 185. g/m2 individual
2 Ahrens_2019 Corymbia calop… 002 leaf_mass… 138. g/m2 individual
3 Ahrens_2019 Corymbia calop… 003 leaf_mass… 169. g/m2 individual
4 Ahrens_2019 Corymbia calop… 004 leaf_mass… 151. g/m2 individual
5 Ahrens_2019 Corymbia calop… 005 leaf_mass… 142. g/m2 individual
6 Ahrens_2019 Corymbia calop… 006 leaf_mass… 156. g/m2 individual
7 Ahrens_2019 Corymbia calop… 007 leaf_mass… 150. g/m2 individual
8 Ahrens_2019 Corymbia calop… 008 leaf_mass… 175. g/m2 individual
9 Ahrens_2019 Corymbia calop… 009 leaf_mass… 148. g/m2 individual
10 Ahrens_2019 Corymbia calop… 010 leaf_mass… 135. g/m2 individual
# ℹ 27,936 more rows
# ℹ 17 more variables: value_type <chr>, basis_of_value <chr>,
# replicates <chr>, basis_of_record <chr>, life_stage <chr>,
# population_id <chr>, individual_id <chr>, temporal_id <chr>,
# source_id <chr>, location_id <chr>, entity_context_id <chr>, plot_id <chr>,
# treatment_id <chr>, collection_date <chr>, measurement_remarks <chr>,
# method_id <chr>, original_name <chr>
Sample species
Let’s get a sample of our 6 species to summarise. To make sure that our sample species have enough data points to summarise, let’s filter leaf_mass
to only taxa with at least 10 data points. To do this, we’ll count the number of data points by taxon_name
, filter to only those with 10 data points or more, and use pull()
to extract them. We’ll save this list of names as filtered_names
.
# get species with more than 10 records
<- leaf_mass |>
filtered_names group_by(taxon_name) |>
count() |>
filter(n >= 10) |>
pull(taxon_name)
|> head(10L) filtered_names
[1] "Acacia acinacea" "Acacia acuminata" "Acacia anceps"
[4] "Acacia aneura" "Acacia argyrophylla" "Acacia aulacocarpa"
[7] "Acacia auriculiformis" "Acacia baileyana" "Acacia bidwillii"
[10] "Acacia brachybotrya"
Next we’ll filter our leaf_mass
data to include only those in filtered_names
.
<- leaf_mass |>
leaf_mass_filtered filter(taxon_name %in% filtered_names)
This has removed more than 3,000 taxa from our dataset.
Code
<- leaf_mass |> distinct(taxon_name) |> count()
n_taxa <- leaf_mass_filtered |> distinct(taxon_name) |> count()
n_taxa_filtered
- n_taxa_filtered n_taxa
n
1 3196
Next, let’s get a sample of these taxa to plot. One way to get a random sample of 6 taxa is to use filter
to return data of only 6 unique taxa names.
|>
leaf_mass_filtered filter(taxon_name %in% sample(unique(taxon_name), 6))
Here’s a random sample of taxa we prepared earlier, which we’ll specify just so we can look at some extra maps after we summarise our plant trait data as well. We’ll call our sample leaf_mass_sample
.
<- c("Cryptocarya rigida", "Pteridium esculentum",
sample_names "Eucalyptus baxteri", "Melaleuca armillaris",
"Eucalyptus wandoo", "Eucalyptus piperita")
<- leaf_mass_filtered |>
leaf_mass_sample filter(taxon_name %in% sample_names)
Beeswarm plot
The first plot we’ll learn to make is a beeswarm plot using the {ggbeeswarm} package. Beeswarm plots are useful because they allow you to plot points next to each other that would normally overlap. These plots are especially nice because the points are plotted in a way that visualises density much like a violin plot while still showing each individual point.
Let’s load {ggbeeswarm} and try plotting our plant species’ trait values using geom_quasirandom()
, which uses a sequencing algorithm to place points nicely next to each other.
library(ggbeeswarm)
ggplot(data = leaf_mass_sample,
aes(x = taxon_name,
y = value,
colour = taxon_name)) +
::geom_quasirandom(size = 2) ggbeeswarm
Now let’s take a few steps to clean this plot up. We’ll use one of our favourite theme packages, {pilot}, to get some nice colours and make the plot look neater. We’ll also use stringr::str_wrap()
and reorder()
to wrap the long species names on the x-axis and order our species in ascending order.
# remotes::install_github("olihawkins/pilot")
library(pilot)
library(stringr)
ggplot(data = leaf_mass_sample,
aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value),
y = value,
colour = taxon_name)) +
::geom_quasirandom(size = 2, # size of points
ggbeeswarmwidth = .3, # width of points
alpha = .7) + # opacity of points
labs(y = "Leaf mass per area (g/m<sup>2</sup>)",
x = "Species") +
::scale_color_pilot() +
pilot::theme_pilot(grid = "h",
pilotaxes = "b") +
theme(legend.position = "none",
axis.title.y = ggtext::element_markdown()) # allows html in axis title
Because {ggbeeswarm} places points in a semi-random but still calculated spot, the appearance is less confusing to readers than placing points randomly using other functions like geom_jitter()
.
For extra readability, we can add a box plot along with our points to help summarise their spread, too. To help make our boxes stand out while still maintaining our colour palette, we’ll use the {colorspace} package to lighten each species’ colour.
The {colorspace} package is a toolbox designed to select clear colour palettes while accounting for visual colour deficiencies. To use {colorspace} to adjust colours after calculating each box, we’ll use after_scale()
and lighten()
on our fill
and colour
arguments in geom_boxplot()
.
ggplot(data = leaf_mass_sample,
aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value),
y = value,
colour = taxon_name,
fill = taxon_name)) +
geom_boxplot(
aes(fill = taxon_name,
fill = after_scale(colorspace::lighten(fill, .9)),
colour = taxon_name,
colour = after_scale(colorspace::lighten(colour, .3))),
size = 1,
outlier.shape = NA
+
) ::geom_quasirandom(size = 4,
ggbeeswarmwidth = .25,
alpha = .7) +
# scale_y_continuous(expand=c(0,0)) +
scale_y_continuous(breaks = c(0, 100, 200, 300, 400),
labels = c(0, 100, 200, 300, 400),
limits = c(0, 400),
expand = c(0,0)) +
labs(y = "Leaf mass per area (g/m<sup>2</sup>)",
x = "Species") +
::scale_color_pilot() +
pilot::scale_fill_pilot() +
pilot::theme_pilot(grid = "h",
pilotaxes = "b") +
theme(legend.position = "none",
axis.title.y = ggtext::element_markdown(),
axis.text.x = element_text(face = "italic"))
Raincloud plots
In 2019, raincloud plots were proposed as one way to visualise data with “maximal statistical information while preserving the desired ‘inference at a glance’ nature of barplots and other similar visualization devices.” By displaying points, densities and summary stats like median, mean and confidence intervals, raincloud plots are robust, informative, and (dare I say it) stunning.
Let’s plot the same data of our 6 species above as a raincloud plot. We’ll load the {ggdist} package and the {gghalves} package to allow us to plot different parts of our raincloud plot.
library(ggdist)
library(gghalves)
To begin, let’s plot the distribution on one half of our raincloud plot using stat_halfeye()
, and the points that make up that distribution on the other half of our plot with geom_half_point()
. We’ll also flip our plot on its side using coord_flip()
to give our plot and labels more space (and to eventually help make our raincloud effect)1.
ggplot(data = leaf_mass_sample,
aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value),
y = value,
colour = taxon_name,
fill = taxon_name)) +
::stat_halfeye(adjust = .4, # smoothness of distribution
ggdistwidth = .87, # height of distribution
colour = NA) +
::geom_half_point(side = "l", # choose right or left side
gghalvesrange_scale = .3, # spread of points
alpha = .6,
size = 2.2) +
coord_flip() +
labs(x = "Species",
y = "Leaf mass to area")
Now we can add our box plot, using our after_stat()
colorspace trick within geom_boxplot()
, this time to darken the lines of each box. We’ll also use {pilot} to add some nice colours and styling to our plot.
And just like that we can make it rain!2
ggplot(data = leaf_mass_sample,
aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value),
y = value,
colour = taxon_name,
fill = taxon_name)) +
::stat_halfeye(adjust = .4,
ggdistwidth = .87,
colour = NA) +
::geom_half_point(side = "l",
gghalvesrange_scale = .3,
alpha = .6,
size = 2.2) +
geom_boxplot(
aes(colour = taxon_name,
colour = after_scale(colorspace::darken(colour, .7))),
width = .12, # adjust box width
fill = NA,
size = 1.1, # size of box line
outlier.shape = NA # remove outlier points
+
) coord_flip() +
labs(x = "Species",
y = "Leaf mass per area (g/m<sup>2</sup>)") +
scale_y_continuous(breaks = c(0, 100, 200, 300, 400),
labels = c(0, 100, 200, 300, 400),
limits = c(0, 400),
expand = c(0,0)) +
::scale_color_pilot() +
pilot::scale_fill_pilot() +
pilot::theme_pilot(grid = "",
pilotaxes = "b") +
theme(legend.position = "none",
axis.title.x = ggtext::element_markdown(),
axis.text.y = element_text(face = "italic"))
Bonus: Plant distribution
Let’s briefly revisit leaf dry mass per area (LMA) before ending. The relationship between LMA and plant functioning means we can infer the type of ecosystem a plant lives in based on its LMA. So let’s see where the 6 plant species that we summarised above live!
Remember, low-LMA species like Cryptocarya rigida have big, light leaves that thrive in wetter ecosystems with lots of nutrients. Alternatively, high-LMA species like Eucalyptus wandoo have small, dense leaves that thrive in drier ecosystems with little nutrients. We’ve added the median LMA of each species to help compare.
Is this where you expected these plants to be found?
Code
library(galah)
library(sf)
library(ozmaps)
# Download data
galah_config(email = "dax.kellie@csiro.au", verbose = FALSE)
<- galah_call() |>
plants galah_identify(sample_names) |>
galah_apply_profile(ALA) |>
atlas_occurrences()
# Recategorise subspecies into species categories
<- plants |>
plants drop_na(decimalLatitude, decimalLatitude) |>
mutate(names = case_when(
str_detect(scientificName, "Eucalyptus wandoo") ~ "Eucalyptus wandoo",
str_detect(scientificName, "Pentameris airoides") ~ "Pentameris airoides",
str_detect(scientificName, "Melaleuca armillaris") ~ "Melaleuca armillaris",
str_detect(scientificName, "Pteridium esculentum") ~ "Pteridium esculentum",
.default = scientificName)
)
# Join median LMAs for each species to `plants` tibble
<- leaf_mass_sample |>
plants_lma group_by(taxon_name) |>
summarise(median_lma = median(value) |> round(1)) |>
right_join(plants, by = join_by(taxon_name == scientificName)) |>
rename(scientificName = taxon_name) |>
drop_na(median_lma) # remove NAs for unmatched subspecies
# Australia map
<- ozmaps::ozmap_country |>
aus st_transform(crs = st_crs(4326))
# Map points
ggplot() +
geom_sf(data = aus,
colour = "grey60",
fill = NA) +
geom_point(data = plants_lma,
aes(x = decimalLongitude,
y = decimalLatitude,
colour = names),
shape = 16,
alpha = 0.4) +
::scale_color_pilot() +
pilot::theme_pilot() +
pilotcoord_sf(xlim = c(110, 155),
ylim = c(-45, -10)) +
facet_wrap( ~ names, ncol = 3) +
geom_text(data = plants_lma,
mapping = aes(x = 116, y = -11,
label = glue("LMA = {median_lma}"),
group = names),
colour = "grey40",
family = theme_get()$text$family, # use theme settings
size = 3.5,
lineheight = 0.92) +
theme(legend.position = "none",
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
panel.border = element_rect(linewidth = 1,
colour = "grey90",
fill = NA)
)
Final thoughts
Box plots are great because they are simple and can be interpreted quickly. These days, there are alternative options that are still simple but more robust and transparent. We hope you feel inspired to try out a beeswarm or raincloud plot yourself! (Maybe you even learned a little something about plants and their leaves as well)
There might be other statistics you might like to use to summarise your data. It’s possible to add means and confidence intervals to raincloud plots, too, though methods differ slightly to other visualisation tools3. Experiment with what works best with your data.
For other great resources on alternatives to box and bar plots in R, check out this article by Cedric Scherer which we found really helpful.
Expand for session info
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31 ucrt)
os Windows 10 x64 (build 19045)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_Australia.utf8
ctype English_Australia.utf8
tz Australia/Sydney
date 2024-03-07
pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
austraits * 2.1.2 2023-07-11 [1] Github (traitecoevo/austraits@53b637c)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.2)
galah * 2.0.1 2024-02-06 [1] CRAN (R 4.3.2)
gganimate * 1.0.8 2022-09-08 [1] CRAN (R 4.3.2)
ggbeeswarm * 0.7.2 2023-04-29 [1] CRAN (R 4.3.2)
ggdist * 3.3.1 2023-11-27 [1] CRAN (R 4.3.2)
gghalves * 0.1.4 2022-11-20 [1] CRAN (R 4.3.2)
ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.1)
ggtext * 0.1.2 2022-09-16 [1] CRAN (R 4.3.2)
glue * 1.6.2 2022-02-24 [1] CRAN (R 4.3.2)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.2)
htmltools * 0.5.7 2023-11-03 [1] CRAN (R 4.3.2)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.2)
ozmaps * 0.4.5 2021-08-03 [1] CRAN (R 4.3.2)
pilot * 4.0.0 2022-07-13 [1] Github (olihawkins/pilot@f08cc16)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.2)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.2)
RefManageR * 1.4.0 2022-09-30 [1] CRAN (R 4.3.2)
sessioninfo * 1.2.2 2021-12-06 [1] CRAN (R 4.3.2)
sf * 1.0-14 2023-07-11 [1] CRAN (R 4.3.2)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.2)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.2)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.2)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.2)
[1] C:/Users/KEL329/R-packages
[2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.3.2/library
──────────────────────────────────────────────────────────────────────────────
Footnotes
Are plots with bars or boxes any easier to read when oriented up and down (ie. vertically) than side-to-side (ie. horizontally)? You might not have an immediate answer, and so it’s slightly strange (when you think about it) how deep-set the trend of vertical barplots and box plots is in science. Flipping your plot can make longer names easier to read, make group differences easier to spot when stacked in order, and give space to other elements in your plot!↩︎
Fun fact: “How to make it rain” is an actual subheading in the paper↩︎
If you would like to add mean and confidence intervals (rather than median and quantile intervals) to your raincloud plot, you can use
stat_pointinterval()
to do this. However, {ggdist} calculates confidence intervals using a Bayesian method to find point estimates and Highest Density Intervals (HDI). This method returns different estimates to Frequentist confidence intervals, so it’s worth looking up the difference before using HDIs. If you are plotting estimates after running a model, the suggested way by the creator of {ggdist} is to pass this information usingdist_student_t()
and model parameters frombroom::tidy()
output. This stack overflow thread we found helpful for getting started.↩︎