Quantify geographic sampling bias with {sampbias}

Human biases play a large role in the data we collect about species. Here we show a simple method to quantify the bias of roads, cities, rivers and airports on species observations of legless lizards in the Northern Territory

Eukaryota
Animalia
Chordata
Summaries
R
Author

Dax Kellie

Published

August 8, 2022

Author

Dax Kellie

Date

8 August 2022

Being human plays a big role in the species we observe, when we observe them and where we observe them. In particular, we tend to collect more data in areas that are closer to places we live (or have access to) because there are more opportunities to see species in areas we spend more time in than areas that are far away or inaccessible.

Large, public datasets like the Atlas of Living Australia are especially prone to this sampling bias because they largely reflect opportunistic observations rather than systematic monitoring programs. However, not all species observations are affected equally by these biases, and it’s useful to quantify how biased data are before interpreting them.

Thanks to the sampbias package, we can easily quantify and compare the effects of these biases on our data, specifically whether data are influenced by cities, roads, airports and rivers.

This post expands on a Twitter thread by Dr Ian Brennan to show how sampling bias affects museum records of reptiles. Dr Brennan is currently a Post Doctoral researcher at the Australian National University (ANU). Check out his website to learn more about him and his cool research.

Download data

To begin, we will look at some of Dr Brennan’s favourite reptiles: legless lizards from the genus Delma.

First let’s load the packages we need.

library(sampbias)
library(galah)
library(viridis)
library(tidyverse)
library(ozmaps)
library(sf)

Next we will use the galah package to download occurrence records of Delma in the Northern Territory from the Atlas of Living Australia (ALA).

# Add registered email (register at ala.org.au)
galah_config(email = "your-email@email.com")
# Download Delma occurrence records in NT
delma_records <- galah_call() |>
  galah_identify("Delma") |>
  galah_filter(stateProvince == "Northern Territory") |>
  atlas_occurrences()
Retrying in 1 seconds.
# See first 10 rows
delma_records |> head(10L)
# A tibble: 10 × 8
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 001354e3-f7f1… Delma haroldi  https://biodi…           -25.2             131.
 2 0028eb21-485d… Delma austral… https://biodi…           -25.2             131.
 3 0049d92b-0a54… Delma tincta   https://biodi…           -18.8             136.
 4 00aaf4b3-310a… Delma borea    https://biodi…           -20.5             130.
 5 00acfb8c-e37f… Delma nasuta   https://biodi…           -23.7             134.
 6 00d25271-7a9e… Delma tincta   https://biodi…           -13.7             137.
 7 00e1247b-e1f1… Delma borea    https://biodi…           -12.4             131.
 8 00f3df89-04b6… Delma borea    https://biodi…           -14.1             134.
 9 00f653d4-49c7… Delma borea    https://biodi…           -11.9             136.
10 010f5f58-750b… Delma borea    https://biodi…           -12.2             137.
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>

Now we can edit our data to select the columns we want, remove pesky NA values, and filter our records to make sure there are no outliers outside of the Northern Territory

delma_records_filtered <- delma_records |>
  dplyr::select(scientificName, decimalLatitude, decimalLongitude) |>
  drop_na() |>
  filter(decimalLatitude < -10,
         decimalLatitude >= -26,
         decimalLongitude >= 129,
         decimalLongitude <= 138) 

Make a map

The next step is to get a map of the Northern Territory that we will use to calculate sample bias in our data. The ozmaps package has excellent quality datasets of the Australian coastline, state outlines, local municipality boundaries and electoral boundaries. We can use filter() to extract mapping info of the Northern Territory, and st_transform() and st_crs() to make sure the map has the correct WGS84 projection for ALA data.

# Get map
nt_wgs84 <- ozmap_data(data = "states") |>
  filter(NAME == "Northern Territory") |>
  sf::st_transform(crs = sf::st_crs("WGS84"))

## check map
ggplot(nt_wgs84) + geom_sf(fill = "transparent")

Run the model

It’s time to quantify the influence of roads, cities, airports and rivers on Delma records!

{sampbias} uses a probabilistic method to quantify accessibility bias. You can check out their published paper if you’d like to know more about the Bayesian methods underlying these estimates.

We’ll use the calculate_bias() function to run our model. We’ll specify that our model make estimates at a fine spatial scale resolution of res = 0.05 and add a buffer, which helps our model account for any neighbouring features just outside of our specified area that might affect our data. We’ll also make sure to restrict our model to within our map of the Northern Territory, ensuring it is in the correct format by using sf:::as_Spatial(nt_wgs84).

Important Note: Fine scale resolutions increase how long it takes for the model to run. This example took around 30 minutes. If you are short on time, I’d suggest starting with a lower resolution like res = 0.5.

model_bias_delma <- sampbias::calculate_bias(
  x = delma_records_filtered,
  res = 0.05,   # scale of spatial resolution
  buffer = 0.5, # account for neighbouring features
  restrict_sample = sf:::as_Spatial(nt_wgs84)
)

Check your bias

Once the model finishes running, we can use plot() to see the results.

Plot A shows the strength of sampling bias from each spatial feature. A higher posterior weight indicates greater sampling bias. For Delma records, we can see airports (in yellow) are the largest source of bias, followed by roads (in green), with rivers and cities adding relatively smaller amounts of bias.

Plot B shows the effect of each biasing factor on the sampling rate. Lines that curve more steeply indicate a larger drop-off in the likelihood of an observation as we get further from each spatial feature. For Delma, getting further than 250 km from an airport effectively cuts the expected number of observations by around half.

It’s worth pointing out that {sampbias} uses data of most major roads in Australia, but there are many smaller roads missing from the data informing this model (which you’ll be able to see in the next section). As a result, the output may not be a perfect reflection of movement or bias in this region.

Note: It’s good to keep in mind the scale of the x axis when reading Plot A, as small numbers suggest these factors don’t explain much of the overall variation in our data. A posterior weight of ~0.004 suggests that airports explain some variation, but not very much, of Delma observations.

plot(model_bias_delma)

You can also use summary() to view the model’s summary statistics.

summary(model_bias_delma)
Number of occurences:  1900 
Raster resolution:  0.05 
Convexhull:  
Geographic extent:
class      : Extent 
xmin       : 129.03 
xmax       : 137.98 
ymin       : -26.02 
ymax       : -11.02 
Bias weights:
            bias_weight      std_dev
w_airports 4.212005e-03 1.657795e-04
w_cities   2.926758e-04 1.817098e-04
w_rivers   4.331625e-04 8.453850e-05
w_roads    1.903174e-03 3.787051e-04
hp_rate    6.438165e+02 2.931110e+02

Map the bias

Before viewing the mapped output from {sampbias}, it would be useful to see where the airports, cities, roads and rivers are. Observations of Delma are in orange.

Code
# Load in data for landscape features
data(airports)
data(waterbodies)
data(cities)
data(roads)


# Combine data
features <- c(airports, cities, waterbodies, roads)

# Convert to spatial features, set coordinate system, filter to within NT
# Add feature ID to each for plotting
features_sf <- features |>
  set_names(c("Airport", "City", "River", "Road")) |>
  map_dfr(~ st_as_sf(.) |> 
            st_set_crs(st_crs("WGS84")) |>
            st_intersection(nt_wgs84),
          .id = "feature")


# Plot all the points on a map alongside the features
features_map <- ggplot() +
  # NT map
  geom_sf(data = nt_wgs84,
          fill = "grey98", color = "grey40") +
  # Rivers & Roads
  geom_sf(data = features_sf |> filter(feature == "River" | feature == "Road"),
          mapping = aes(color = feature),
          size = 1.1,
          show.legend = "line") +
  # Airports
  geom_sf(data = features_sf |> filter(feature == "Airport" ),
          mapping = aes(color = feature),
          shape = 17,
          size = 6,
          show.legend = "point") +
  # Cities
  geom_sf(data = features_sf |> filter(feature == "City"),
          mapping = aes(color = feature),
          shape = 16,
          size = 4,
          show.legend = "point") +
  # Observations
  geom_point(data = delma_records_filtered,
             mapping = aes(x = decimalLongitude, y = decimalLatitude),
             color = "#E06E53",
             size = 1.1,
             alpha = 0.3) +
  # Specify colours
  scale_color_manual(values = c(
    "River" = "#249db5",
    "Road" = "#ffc517",
    "Airport" = "#9956db",
    "City" = "#30c788"
  ),
  # Create custom line/point legend
  guide = guide_legend(
    override.aes = list(linetype = c("solid", "solid", "blank", "blank"), 
                        shape = c(NA, NA, 17, 16),
                        size = c(1.1, 1.1, 6, 4)),
    title = NULL)) +
  theme_void() +
  theme(legend.position = "bottom")

You can use the map_bias() function to see the projected effect of bias on Delma records.

The map_bias() function has several types of outputs that you can experiment with (e.g. diff_to_max, sampling_rate, log_sampling_rate). In our case, because the difference in record numbers is quite large between locations with lots of records and locations with few, the log_sampling_rate plot gives a more balanced view of where observations are missing.

First, you’ll notice that the highest sampling rates centre around airports, highlighted in bright yellow and green. You will also notice a dark blue line between the two major NT airports, showing a somewhat surprisingly clear spatial gap in observations of Delma between the two major Northern Territory airports! Even accounting for roads, rivers and cities does little to erase this dark blue line.

# Project the bias effect in space
delma <- sampbias::project_bias(model_bias_delma)

# Map
sampbias::map_bias(delma, type="log_sampling_rate")

Other species

What about other species? Here are a few I have selected for comparison.

Select each tab to view their results.

Final thoughts

The {sampbias} package offers a cool, simple way to understand spatial biases in species observations. Research has found these data biases are very common across taxa, with an estimated 90% of global observations recorded within 2.5 km of a road (Hughes et al. 2021)! So, it is worth considering this when interpreting any data on species distributions.

Spatial biases aren’t the only ones that affect observations either. There are potential taxonomic, temporal and environmental biases that can influence them, too. You might find other R packages like the {ocuAccess} package more useful if you are interested in calculating the effects of other types of biases on your data.

We hope you’ll be able to use {sampbias} in your own analyses so you can check yourself1 before you wreck yourself2!

Expand for session info
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Australia.utf8
 ctype    English_Australia.utf8
 tz       Australia/Sydney
 date     2024-02-12
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.2)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.2)
 galah       * 2.0.1   2024-02-06 [1] CRAN (R 4.3.2)
 ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
 htmltools   * 0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.2)
 ozmaps      * 0.4.5   2021-08-03 [1] CRAN (R 4.3.2)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.2)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.2)
 rgeos       * 0.5-9   2021-12-15 [1] CRAN (R 4.2.1)
 sampbias    * 1.0.4   2022-07-07 [1] Github (azizka/sampbias@b1d951b)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.3.2)
 sf          * 1.0-14  2023-07-11 [1] CRAN (R 4.3.2)
 sp          * 2.1-2   2023-11-26 [1] CRAN (R 4.3.2)
 stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.2)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.2)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.2)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.2)
 viridis     * 0.6.4   2023-07-22 [1] CRAN (R 4.3.2)
 viridisLite * 0.4.2   2023-05-02 [1] CRAN (R 4.3.1)

 [1] C:/Users/KEL329/R-packages
 [2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.3.2/library

──────────────────────────────────────────────────────────────────────────────

Footnotes

  1. for biases↩︎

  2. from over-interpreting your data↩︎