Author

Date

28 August 2023

In ecology and biology research, box plots are a wonderfully simple and efficient way to summarise data from different groups (e.g., species, populations, experimental conditions). However, this simplicity can sometimes hide the underlying structure of data, unintentionally misleading readers.

Luckily, there are alternatives to displaying data with box plots, and these options have grown increasingly easy to make in R. This post shows more transparent ways to summarise data, seen through an ecological lens.

In our example, we’ll compare several species of plants using a leaf trait commonly used to predict a plant’s performance: leaf dry mass per area (or more simply, the size of a leaf relative to its surface area). We’ll explain how to summarise this trait across a set of species using box plots, and show two ways of also displaying the distribution of the data to be more transparent.

Why not box plots?

Box plots can mask differences in sample size and underlying data structure, which can make them prone to misinterpretation. One such example, shown below, is based on Cedric Scherer’s blog post on the same topic, which I encourage you to check out.

Code

library(ggplot2)
library(dplyr)
library(glue)
library(ggtext)
library(gganimate)

# generate sample data
set.seed(2021)

data <- tibble(
  group = factor(c(rep("Group 1", 100), rep("Group 2", 250), rep("Group 3", 25))),
  value = c(seq(0, 20, length.out = 100),
            c(rep(0, 5), rnorm(30, 2, .1), rnorm(90, 5.4, .1), rnorm(90, 14.6, .1), rnorm(30, 18, .1), rep(20, 5)),
            rep(seq(0, 20, length.out = 5), 5))
  ) %>% 
  rowwise() |>
  mutate(value = if_else(group == "Group 2", value + rnorm(1, 0, .4), value)) |> 
  group_by(group) |> 
  mutate(sample_size = n())

anim <- ggplot(data) +
  geom_boxplot(aes(y = value), 
               fill = "grey92") +
   geom_point(aes(x = 0, y = value),
              size = 2,
              alpha = .3,
              position = position_jitter(seed = 1, width = .2)) +
  labs(title = "N = {closest_state}") +
  pilot::theme_pilot() +
  scale_x_discrete() +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_markdown(),
        plot.title = element_text(hjust = 0.5)) +
  transition_states(sample_size,
                    transition_length = 1,
                    state_length = 2) +
  ease_aes('cubic-in-out') +
  enter_fly(x_loc = 0, y_loc = 15) +
  exit_fly(x_loc = 0, y_loc = 10)

To understand how to summarise data more transparently, let’s see how adding data points and distributions can help improve our visualisations.

Download data

Let’s start by downloading some data. We’ll use the {austraits} package to get data of Australian plant traits from AusTraits, an open-source database of nearly 500 traits across more than 30,000 taxa from field surveys, published papers, and taxonomic books.

We can download the entire austraits dataset with the load_austraits() function.

Note

To make analyses more reproducible, load_austraits() requires users to specify a version. You can see available versions using the get_versions() function.

library(tidyverse)
library(here)
library(austraits)

austraits <- load_austraits(version = "4.1.0", 
                            path = here::here("posts", "data", "austraits"))

There’s a huge range of available plant traits, and if you are interested in seeing what else there is, you should check out their AusTraits Plant Dictionary for a complete list. For a brief breakdown of trait names and the data available for each, you can use summarise_austraits("trait_name"). Rather than showing the entire list, here’s a random sample to give you an idea.

# show a sample of available traits
austraits %>% 
  summarise_austraits("trait_name") |>
  slice_sample(n = 10) |>
  as_tibble()

# A tibble: 10 × 5
   trait_name                           n_records n_dataset n_taxa percent_total
   <chr>                                    <int>     <int>  <int>         <dbl>
 1 leaf_work_to_shear                         192         5    137     0.000153 
 2 leaf_lobation                             4309         5   4193     0.00344  
 3 soil_seedbank                              522         4    515     0.000417 
 4 leaf_cell_wall_N_per_cell_wall_dry_…        29         1     22     0.0000231
 5 seed_shape                                2965         8   2722     0.00237  
 6 fire_and_establishing                     1618         2   1586     0.00129  
 7 leaflet_area                               424         5     51     0.000338 
 8 flower_perianth_merism                     268         1    202     0.000214 
 9 water_potential_88percent_lost_cond…        81         2     79     0.0000646
10 wood_delta15N                              256         3     35     0.000204

Leaf mass per area (LMA)

The plant trait data we’ll download is leaf dry mass per area (LMA), which measures how big and dense a leaf is compared to its surface area. This example from Butrim & Royer (2020) shows that the small, dense leaf on the top has much higher LMA than the two large, light leaves on the bottom.

Leaves from 34-33 million years ago, taken from a post by the Botanical Society of America

Leaf mass per area is a morphological trait that can indicate a plant’s survival strategy. LMA can vary widely across species:

High-LMA

Medium-LMA

Low-LMA

Left: Dodonaea stenozyga (Arthur Chapman CC-BY-NC-SA 4.0 (Int), Middle: Melaleuca decora (Greg Tasney CC-BY-SA 4.0 (Int)), Right: Drymophila cyanocarpa (Pete Woodall CC-BY-NC 4.0 (Int))

On one end of the spectrum, plants with small, dense leaves (High-LMA) acquire resources (like nutrients and energy) gradually and grow slowly, aiming to conserve what resources they have. These plants tend to have an advantage in unproductive environments, where they can efficiently store whatever limited resources are available.

On the other end of the spectrum, plants with big, light leaves (Low-LMA) acquire resources quickly and grow fast, aiming to out-compete others for the resources on offer. These plants tend to have an advantage in highly productive environments (where they can get lots of resources and use them quickly) (Poorter et al. 2009).

Extract trait data

To download LMA trait data, we’ll use extract_trait(). This downloads a list of data and metadata about authors, collection methods, locations, taxonomic information, and data structure. We’ll then use purrr::pluck() to grab the data.frame we want from our list.

Tip

You can use lookup_trait() to search for traits

# lookup_trait(austraits, "leaf_mass")

# Get trait data
leaf_mass <- austraits |> 
  extract_trait("leaf_mass_per_area") |> 
  pluck("traits")

leaf_mass

# A tibble: 27,946 × 24
   dataset_id  taxon_name      observation_id trait_name value unit  entity_type
   <chr>       <chr>           <chr>          <chr>      <dbl> <chr> <chr>      
 1 Ahrens_2019 Corymbia calop… 001            leaf_mass…  185. g/m2  individual 
 2 Ahrens_2019 Corymbia calop… 002            leaf_mass…  138. g/m2  individual 
 3 Ahrens_2019 Corymbia calop… 003            leaf_mass…  169. g/m2  individual 
 4 Ahrens_2019 Corymbia calop… 004            leaf_mass…  151. g/m2  individual 
 5 Ahrens_2019 Corymbia calop… 005            leaf_mass…  142. g/m2  individual 
 6 Ahrens_2019 Corymbia calop… 006            leaf_mass…  156. g/m2  individual 
 7 Ahrens_2019 Corymbia calop… 007            leaf_mass…  150. g/m2  individual 
 8 Ahrens_2019 Corymbia calop… 008            leaf_mass…  175. g/m2  individual 
 9 Ahrens_2019 Corymbia calop… 009            leaf_mass…  148. g/m2  individual 
10 Ahrens_2019 Corymbia calop… 010            leaf_mass…  135. g/m2  individual 
# ℹ 27,936 more rows
# ℹ 17 more variables: value_type <chr>, basis_of_value <chr>,
#   replicates <chr>, basis_of_record <chr>, life_stage <chr>,
#   population_id <chr>, individual_id <chr>, temporal_id <chr>,
#   source_id <chr>, location_id <chr>, entity_context_id <chr>, plot_id <chr>,
#   treatment_id <chr>, collection_date <chr>, measurement_remarks <chr>,
#   method_id <chr>, original_name <chr>

Sample species

Let’s get a sample of our 6 species to summarise. To make sure that our sample species have enough data points to summarise, let’s filter leaf_mass to only taxa with at least 10 data points. To do this, we’ll count the number of data points by taxon_name, filter to only those with 10 data points or more, and use pull() to extract them. We’ll save this list of names as filtered_names.

# get species with more than 10 records
filtered_names <- leaf_mass |>
  group_by(taxon_name) |>
  count() |>
  filter(n >= 10) |>
  pull(taxon_name)

filtered_names |> head(10L)

 [1] "Acacia acinacea"       "Acacia acuminata"      "Acacia anceps"        
 [4] "Acacia aneura"         "Acacia argyrophylla"   "Acacia aulacocarpa"   
 [7] "Acacia auriculiformis" "Acacia baileyana"      "Acacia bidwillii"     
[10] "Acacia brachybotrya"

Next we’ll filter our leaf_mass data to include only those in filtered_names.

leaf_mass_filtered <- leaf_mass |>
  filter(taxon_name %in% filtered_names)

This has removed more than 3,000 taxa from our dataset.

Code

n_taxa <- leaf_mass |> distinct(taxon_name) |> count()
n_taxa_filtered <- leaf_mass_filtered |> distinct(taxon_name) |> count()

n_taxa - n_taxa_filtered

     n
1 3196

Next, let’s get a sample of these taxa to plot. One way to get a random sample of 6 taxa is to use filter to return data of only 6 unique taxa names.

leaf_mass_filtered |> 
  filter(taxon_name %in% sample(unique(taxon_name), 6))

Here’s a random sample of taxa we prepared earlier, which we’ll specify just so we can look at some extra maps after we summarise our plant trait data as well. We’ll call our sample leaf_mass_sample.

sample_names <- c("Cryptocarya rigida", "Pteridium esculentum", 
                  "Eucalyptus baxteri", "Melaleuca armillaris",
                  "Eucalyptus wandoo", "Eucalyptus piperita")

leaf_mass_sample <- leaf_mass_filtered |>
  filter(taxon_name %in% sample_names)

Beeswarm plot

The first plot we’ll learn to make is a beeswarm plot using the {ggbeeswarm} package. Beeswarm plots are useful because they allow you to plot points next to each other that would normally overlap. These plots are especially nice because the points are plotted in a way that visualises density much like a violin plot while still showing each individual point.

Let’s load {ggbeeswarm} and try plotting our plant species’ trait values using geom_quasirandom(), which uses a sequencing algorithm to place points nicely next to each other.

library(ggbeeswarm)

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name, 
           y = value, 
           colour = taxon_name)) +
  ggbeeswarm::geom_quasirandom(size = 2)

Now let’s take a few steps to clean this plot up. We’ll use one of our favourite theme packages, {pilot}, to get some nice colours and make the plot look neater. We’ll also use stringr::str_wrap() and reorder() to wrap the long species names on the x-axis and order our species in ascending order.

# remotes::install_github("olihawkins/pilot")
library(pilot) 
library(stringr)

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value), 
           y = value, 
           colour = taxon_name)) +
  ggbeeswarm::geom_quasirandom(size = 2,      # size of points
                               width = .3,    # width of points
                               alpha = .7) +  # opacity of points
  labs(y = "Leaf mass per area (g/m<sup>2</sup>)",
       x = "Species") +
  pilot::scale_color_pilot() +
  pilot::theme_pilot(grid = "h",
                     axes = "b") +
  theme(legend.position = "none",
        axis.title.y = ggtext::element_markdown()) # allows html in axis title

Because {ggbeeswarm} places points in a semi-random but still calculated spot, the appearance is less confusing to readers than placing points randomly using other functions like geom_jitter().

For extra readability, we can add a box plot along with our points to help summarise their spread, too. To help make our boxes stand out while still maintaining our colour palette, we’ll use the {colorspace} package to lighten each species’ colour.

The {colorspace} package is a toolbox designed to select clear colour palettes while accounting for visual colour deficiencies. To use {colorspace} to adjust colours after calculating each box, we’ll use after_scale() and lighten() on our fill and colour arguments in geom_boxplot().

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value), 
           y = value, 
           colour = taxon_name, 
           fill = taxon_name)) +
  geom_boxplot(
    aes(fill = taxon_name,
        fill = after_scale(colorspace::lighten(fill, .9)),
        colour = taxon_name,
        colour = after_scale(colorspace::lighten(colour, .3))),
    size = 1,
    outlier.shape = NA
    ) +
  ggbeeswarm::geom_quasirandom(size = 4, 
                               width = .25, 
                               alpha = .7) +
  # scale_y_continuous(expand=c(0,0)) +
  scale_y_continuous(breaks = c(0, 100, 200, 300, 400),
                     labels = c(0, 100, 200, 300, 400),
                     limits = c(0, 400),
                     expand = c(0,0)) +
  labs(y = "Leaf mass per area (g/m<sup>2</sup>)",
       x = "Species") +
  pilot::scale_color_pilot() +
  pilot::scale_fill_pilot() +
  pilot::theme_pilot(grid = "h",
                     axes = "b") + 
  theme(legend.position = "none",
        axis.title.y = ggtext::element_markdown(),
        axis.text.x = element_text(face = "italic"))

Raincloud plots

In 2019, raincloud plots were proposed as one way to visualise data with “maximal statistical information while preserving the desired ‘inference at a glance’ nature of barplots and other similar visualization devices.” By displaying points, densities and summary stats like median, mean and confidence intervals, raincloud plots are robust, informative, and (dare I say it) stunning.

Let’s plot the same data of our 6 species above as a raincloud plot. We’ll load the {ggdist} package and the {gghalves} package to allow us to plot different parts of our raincloud plot.

library(ggdist)
library(gghalves)

To begin, let’s plot the distribution on one half of our raincloud plot using stat_halfeye(), and the points that make up that distribution on the other half of our plot with geom_half_point(). We’ll also flip our plot on its side using coord_flip() to give our plot and labels more space (and to eventually help make our raincloud effect)¹.

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value), 
           y = value, 
           colour = taxon_name, 
           fill = taxon_name)) +
  ggdist::stat_halfeye(adjust = .4,           # smoothness of distribution
                       width = .87,           # height of distribution
                       colour = NA) +
  gghalves::geom_half_point(side = "l",       # choose right or left side
                            range_scale = .3, # spread of points
                            alpha = .6,
                            size = 2.2) +
  coord_flip() +
  labs(x = "Species",
       y = "Leaf mass to area")

Now we can add our box plot, using our after_stat() colorspace trick within geom_boxplot(), this time to darken the lines of each box. We’ll also use {pilot} to add some nice colours and styling to our plot.

And just like that we can make it rain!²

ggplot(data = leaf_mass_sample, 
       aes(x = taxon_name |> stringr::str_wrap(10) |> reorder(value), 
           y = value, 
           colour = taxon_name, 
           fill = taxon_name)) +
  ggdist::stat_halfeye(adjust = .4,
                       width = .87,
                       colour = NA) +
  gghalves::geom_half_point(side = "l",
                            range_scale = .3,
                            alpha = .6,
                            size = 2.2) +
  geom_boxplot(
    aes(colour = taxon_name,
        colour = after_scale(colorspace::darken(colour, .7))),
    width = .12,        # adjust box width
    fill = NA,
    size = 1.1,         # size of box line
    outlier.shape = NA  # remove outlier points
    ) +
  coord_flip() +
  labs(x = "Species",
       y = "Leaf mass per area (g/m<sup>2</sup>)") +
  scale_y_continuous(breaks = c(0, 100, 200, 300, 400),
                     labels = c(0, 100, 200, 300, 400),
                     limits = c(0, 400),
                     expand = c(0,0)) +
  pilot::scale_color_pilot() +
  pilot::scale_fill_pilot() +
  pilot::theme_pilot(grid = "",
                     axes = "b") + 
  theme(legend.position = "none",
        axis.title.x = ggtext::element_markdown(),
        axis.text.y = element_text(face = "italic"))

Bonus: Plant distribution

Let’s briefly revisit leaf dry mass per area (LMA) before ending. The relationship between LMA and plant functioning means we can infer the type of ecosystem a plant lives in based on its LMA. So let’s see where the 6 plant species that we summarised above live!

Remember, low-LMA species like Cryptocarya rigida have big, light leaves that thrive in wetter ecosystems with lots of nutrients. Alternatively, high-LMA species like Eucalyptus wandoo have small, dense leaves that thrive in drier ecosystems with little nutrients. We’ve added the median LMA of each species to help compare.

Is this where you expected these plants to be found?

Code

library(galah)
library(sf)
library(ozmaps)

# Download data
galah_config(email = "dax.kellie@csiro.au", verbose = FALSE)

plants <- galah_call() |>
  galah_identify(sample_names) |>
  galah_apply_profile(ALA) |>
  atlas_occurrences()

# Recategorise subspecies into species categories
plants <- plants |>
  drop_na(decimalLatitude, decimalLatitude) |>
  mutate(names = case_when(
    str_detect(scientificName, "Eucalyptus wandoo") ~ "Eucalyptus wandoo",
    str_detect(scientificName, "Pentameris airoides") ~ "Pentameris airoides",
    str_detect(scientificName, "Melaleuca armillaris") ~ "Melaleuca armillaris",
    str_detect(scientificName, "Pteridium esculentum") ~ "Pteridium esculentum",
    .default = scientificName)
    )

# Join median LMAs for each species to `plants` tibble
plants_lma <- leaf_mass_sample |>
  group_by(taxon_name) |>
  summarise(median_lma = median(value) |> round(1)) |>
  right_join(plants, by = join_by(taxon_name == scientificName)) |>
  rename(scientificName = taxon_name) |>
  drop_na(median_lma) # remove NAs for unmatched subspecies

# Australia map
aus <- ozmaps::ozmap_country |>
  st_transform(crs = st_crs(4326))

# Map points
ggplot() +
  geom_sf(data = aus,
          colour = "grey60",
          fill = NA) +
  geom_point(data = plants_lma,
             aes(x = decimalLongitude,
                 y = decimalLatitude,
                 colour = names),
             shape = 16,
             alpha = 0.4) +
  pilot::scale_color_pilot() +
  pilot::theme_pilot() +
  coord_sf(xlim = c(110, 155), 
           ylim = c(-45, -10)) +
  facet_wrap( ~ names, ncol = 3) + 
  geom_text(data = plants_lma,
            mapping = aes(x = 116, y = -11, 
                          label = glue("LMA = {median_lma}"), 
                          group = names),
            colour = "grey40",
            family = theme_get()$text$family, # use theme settings
            size = 3.5,
            lineheight = 0.92) +
  theme(legend.position = "none",
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.border = element_rect(linewidth = 1, 
                                    colour = "grey90", 
                                    fill = NA)
        )

Figures display each species’ median LMA value.

Final thoughts

Box plots are great because they are simple and can be interpreted quickly. These days, there are alternative options that are still simple but more robust and transparent. We hope you feel inspired to try out a beeswarm or raincloud plot yourself! (Maybe you even learned a little something about plants and their leaves as well)

There might be other statistics you might like to use to summarise your data. It’s possible to add means and confidence intervals to raincloud plots, too, though methods differ slightly to other visualisation tools³. Experiment with what works best with your data.

For other great resources on alternatives to box and bar plots in R, check out this article by Cedric Scherer which we found really helpful.

Expand for session info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Australia.utf8
 ctype    English_Australia.utf8
 tz       Australia/Sydney
 date     2024-03-07
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 austraits   * 2.1.2   2023-07-11 [1] Github (traitecoevo/austraits@53b637c)
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.2)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.2)
 galah       * 2.0.1   2024-02-06 [1] CRAN (R 4.3.2)
 gganimate   * 1.0.8   2022-09-08 [1] CRAN (R 4.3.2)
 ggbeeswarm  * 0.7.2   2023-04-29 [1] CRAN (R 4.3.2)
 ggdist      * 3.3.1   2023-11-27 [1] CRAN (R 4.3.2)
 gghalves    * 0.1.4   2022-11-20 [1] CRAN (R 4.3.2)
 ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
 ggtext      * 0.1.2   2022-09-16 [1] CRAN (R 4.3.2)
 glue        * 1.6.2   2022-02-24 [1] CRAN (R 4.3.2)
 here        * 1.0.1   2020-12-13 [1] CRAN (R 4.3.2)
 htmltools   * 0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.2)
 ozmaps      * 0.4.5   2021-08-03 [1] CRAN (R 4.3.2)
 pilot       * 4.0.0   2022-07-13 [1] Github (olihawkins/pilot@f08cc16)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.2)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.2)
 RefManageR  * 1.4.0   2022-09-30 [1] CRAN (R 4.3.2)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.3.2)
 sf          * 1.0-14  2023-07-11 [1] CRAN (R 4.3.2)
 stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.2)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.2)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.2)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.2)

 [1] C:/Users/KEL329/R-packages
 [2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.3.2/library

──────────────────────────────────────────────────────────────────────────────

Footnotes

Are plots with bars or boxes any easier to read when oriented up and down (ie. vertically) than side-to-side (ie. horizontally)? You might not have an immediate answer, and so it’s slightly strange (when you think about it) how deep-set the trend of vertical barplots and box plots is in science. Flipping your plot can make longer names easier to read, make group differences easier to spot when stacked in order, and give space to other elements in your plot!↩︎
Fun fact: “How to make it rain” is an actual subheading in the paper↩︎
If you would like to add mean and confidence intervals (rather than median and quantile intervals) to your raincloud plot, you can use stat_pointinterval() to do this. However, {ggdist} calculates confidence intervals using a Bayesian method to find point estimates and Highest Density Intervals (HDI). This method returns different estimates to Frequentist confidence intervals, so it’s worth looking up the difference before using HDIs. If you are plotting estimates after running a model, the suggested way by the creator of {ggdist} is to pass this information using dist_student_t() and model parameters from broom::tidy() output. This stack overflow thread we found helpful for getting started.↩︎