Being human plays a big role in the species we observe, when we observe them and where we observe them. In particular, we tend to collect more data in areas that are closer to places we live (or have access to) because there are more opportunities to see species in areas we spend more time in than areas that are far away or inaccessible.
Large, public datasets like the Atlas of Living Australia are especially prone to this sampling bias because they largely reflect opportunistic observations rather than systematic monitoring programs. However, not all species observations are affected equally by these biases, and it’s useful to quantify how biased data are before interpreting them.
Thanks to the sampbias
package, we can easily quantify and compare the effects of these biases on our data, specifically whether data are influenced by cities, roads, airports and rivers.
This post expands on a Twitter thread by Dr Ian Brennan to show how sampling bias affects museum records of reptiles. Dr Brennan is currently a Post Doctoral researcher at the Australian National University (ANU). Check out his website to learn more about him and his cool research.
Download data
To begin, we will look at some of Dr Brennan’s favourite reptiles: legless lizards from the genus Delma.
First let’s load the packages we need.
library(sampbias)
library(galah)
library(viridis)
library(tidyverse)
library(ozmaps)
library(sf)
Next we will use the galah package to download occurrence records of Delma in the Northern Territory from the Atlas of Living Australia (ALA).
# Add registered email (register at ala.org.au)
galah_config(email = "your-email@email.com")
# Download Delma occurrence records in NT
<- galah_call() |>
delma_records galah_identify("Delma") |>
galah_filter(stateProvince == "Northern Territory") |>
atlas_occurrences()
Retrying in 1 seconds.
# See first 10 rows
|> head(10L) delma_records
# A tibble: 10 × 8
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 001354e3-f7f1… Delma haroldi https://biodi… -25.2 131.
2 0028eb21-485d… Delma austral… https://biodi… -25.2 131.
3 0049d92b-0a54… Delma tincta https://biodi… -18.8 136.
4 00aaf4b3-310a… Delma borea https://biodi… -20.5 130.
5 00acfb8c-e37f… Delma nasuta https://biodi… -23.7 134.
6 00d25271-7a9e… Delma tincta https://biodi… -13.7 137.
7 00e1247b-e1f1… Delma borea https://biodi… -12.4 131.
8 00f3df89-04b6… Delma borea https://biodi… -14.1 134.
9 00f653d4-49c7… Delma borea https://biodi… -11.9 136.
10 010f5f58-750b… Delma borea https://biodi… -12.2 137.
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>
Now we can edit our data to select the columns we want, remove pesky NA
values, and filter our records to make sure there are no outliers outside of the Northern Territory
<- delma_records |>
delma_records_filtered ::select(scientificName, decimalLatitude, decimalLongitude) |>
dplyrdrop_na() |>
filter(decimalLatitude < -10,
>= -26,
decimalLatitude >= 129,
decimalLongitude <= 138) decimalLongitude
Make a map
The next step is to get a map of the Northern Territory that we will use to calculate sample bias in our data. The ozmaps
package has excellent quality datasets of the Australian coastline, state outlines, local municipality boundaries and electoral boundaries. We can use filter()
to extract mapping info of the Northern Territory, and st_transform()
and st_crs()
to make sure the map has the correct WGS84 projection for ALA data.
# Get map
<- ozmap_data(data = "states") |>
nt_wgs84 filter(NAME == "Northern Territory") |>
::st_transform(crs = sf::st_crs("WGS84"))
sf
## check map
ggplot(nt_wgs84) + geom_sf(fill = "transparent")
Run the model
It’s time to quantify the influence of roads, cities, airports and rivers on Delma records!
{sampbias
} uses a probabilistic method to quantify accessibility bias. You can check out their published paper if you’d like to know more about the Bayesian methods underlying these estimates.
We’ll use the calculate_bias()
function to run our model. We’ll specify that our model make estimates at a fine spatial scale resolution of res = 0.05
and add a buffer, which helps our model account for any neighbouring features just outside of our specified area that might affect our data. We’ll also make sure to restrict our model to within our map of the Northern Territory, ensuring it is in the correct format by using sf:::as_Spatial(nt_wgs84)
.
Important Note: Fine scale resolutions increase how long it takes for the model to run. This example took around 30 minutes. If you are short on time, I’d suggest starting with a lower resolution like res = 0.5
.
<- sampbias::calculate_bias(
model_bias_delma x = delma_records_filtered,
res = 0.05, # scale of spatial resolution
buffer = 0.5, # account for neighbouring features
restrict_sample = sf:::as_Spatial(nt_wgs84)
)
Check your bias
Once the model finishes running, we can use plot()
to see the results.
Plot A shows the strength of sampling bias from each spatial feature. A higher posterior weight indicates greater sampling bias. For Delma records, we can see airports (in yellow) are the largest source of bias, followed by roads (in green), with rivers and cities adding relatively smaller amounts of bias.
Plot B shows the effect of each biasing factor on the sampling rate. Lines that curve more steeply indicate a larger drop-off in the likelihood of an observation as we get further from each spatial feature. For Delma, getting further than 250 km from an airport effectively cuts the expected number of observations by around half.
It’s worth pointing out that {sampbias
} uses data of most major roads in Australia, but there are many smaller roads missing from the data informing this model (which you’ll be able to see in the next section). As a result, the output may not be a perfect reflection of movement or bias in this region.
Note: It’s good to keep in mind the scale of the x axis when reading Plot A, as small numbers suggest these factors don’t explain much of the overall variation in our data. A posterior weight of ~0.004 suggests that airports explain some variation, but not very much, of Delma observations.
plot(model_bias_delma)
You can also use summary()
to view the model’s summary statistics.
summary(model_bias_delma)
Number of occurences: 1900
Raster resolution: 0.05
Convexhull:
Geographic extent:
class : Extent
xmin : 129.03
xmax : 137.98
ymin : -26.02
ymax : -11.02
Bias weights:
bias_weight std_dev
w_airports 4.212005e-03 1.657795e-04
w_cities 2.926758e-04 1.817098e-04
w_rivers 4.331625e-04 8.453850e-05
w_roads 1.903174e-03 3.787051e-04
hp_rate 6.438165e+02 2.931110e+02
Map the bias
Before viewing the mapped output from {sampbias
}, it would be useful to see where the airports, cities, roads and rivers are. Observations of Delma are in orange.
Code
# Load in data for landscape features
data(airports)
data(waterbodies)
data(cities)
data(roads)
# Combine data
<- c(airports, cities, waterbodies, roads)
features
# Convert to spatial features, set coordinate system, filter to within NT
# Add feature ID to each for plotting
<- features |>
features_sf set_names(c("Airport", "City", "River", "Road")) |>
map_dfr(~ st_as_sf(.) |>
st_set_crs(st_crs("WGS84")) |>
st_intersection(nt_wgs84),
.id = "feature")
# Plot all the points on a map alongside the features
<- ggplot() +
features_map # NT map
geom_sf(data = nt_wgs84,
fill = "grey98", color = "grey40") +
# Rivers & Roads
geom_sf(data = features_sf |> filter(feature == "River" | feature == "Road"),
mapping = aes(color = feature),
size = 1.1,
show.legend = "line") +
# Airports
geom_sf(data = features_sf |> filter(feature == "Airport" ),
mapping = aes(color = feature),
shape = 17,
size = 6,
show.legend = "point") +
# Cities
geom_sf(data = features_sf |> filter(feature == "City"),
mapping = aes(color = feature),
shape = 16,
size = 4,
show.legend = "point") +
# Observations
geom_point(data = delma_records_filtered,
mapping = aes(x = decimalLongitude, y = decimalLatitude),
color = "#E06E53",
size = 1.1,
alpha = 0.3) +
# Specify colours
scale_color_manual(values = c(
"River" = "#249db5",
"Road" = "#ffc517",
"Airport" = "#9956db",
"City" = "#30c788"
),# Create custom line/point legend
guide = guide_legend(
override.aes = list(linetype = c("solid", "solid", "blank", "blank"),
shape = c(NA, NA, 17, 16),
size = c(1.1, 1.1, 6, 4)),
title = NULL)) +
theme_void() +
theme(legend.position = "bottom")
You can use the map_bias()
function to see the projected effect of bias on Delma records.
The map_bias()
function has several types of outputs that you can experiment with (e.g. diff_to_max
, sampling_rate
, log_sampling_rate
). In our case, because the difference in record numbers is quite large between locations with lots of records and locations with few, the log_sampling_rate
plot gives a more balanced view of where observations are missing.
First, you’ll notice that the highest sampling rates centre around airports, highlighted in bright yellow and green. You will also notice a dark blue line between the two major NT airports, showing a somewhat surprisingly clear spatial gap in observations of Delma between the two major Northern Territory airports! Even accounting for roads, rivers and cities does little to erase this dark blue line.
# Project the bias effect in space
<- sampbias::project_bias(model_bias_delma)
delma
# Map
::map_bias(delma, type="log_sampling_rate") sampbias
Other species
What about other species? Here are a few I have selected for comparison.
Select each tab to view their results.
Quolls
Quolls & Mulgaras
Little Kingfisher
Little Kingfisher
Mantids
Mantids
Green birdflower
Green birdflower
Final thoughts
The {sampbias
} package offers a cool, simple way to understand spatial biases in species observations. Research has found these data biases are very common across taxa, with an estimated 90% of global observations recorded within 2.5 km of a road (Hughes et al. 2021)! So, it is worth considering this when interpreting any data on species distributions.
Spatial biases aren’t the only ones that affect observations either. There are potential taxonomic, temporal and environmental biases that can influence them, too. You might find other R packages like the {ocuAccess
} package more useful if you are interested in calculating the effects of other types of biases on your data.
We hope you’ll be able to use {sampbias
} in your own analyses so you can check yourself1 before you wreck yourself2!
Expand for session info
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31 ucrt)
os Windows 10 x64 (build 19045)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_Australia.utf8
ctype English_Australia.utf8
tz Australia/Sydney
date 2024-02-12
pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.2)
galah * 2.0.1 2024-02-06 [1] CRAN (R 4.3.2)
ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.1)
htmltools * 0.5.7 2023-11-03 [1] CRAN (R 4.3.2)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.2)
ozmaps * 0.4.5 2021-08-03 [1] CRAN (R 4.3.2)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.2)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.2)
rgeos * 0.5-9 2021-12-15 [1] CRAN (R 4.2.1)
sampbias * 1.0.4 2022-07-07 [1] Github (azizka/sampbias@b1d951b)
sessioninfo * 1.2.2 2021-12-06 [1] CRAN (R 4.3.2)
sf * 1.0-14 2023-07-11 [1] CRAN (R 4.3.2)
sp * 2.1-2 2023-11-26 [1] CRAN (R 4.3.2)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.2)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.2)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.2)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.2)
viridis * 0.6.4 2023-07-22 [1] CRAN (R 4.3.2)
viridisLite * 0.4.2 2023-05-02 [1] CRAN (R 4.3.1)
[1] C:/Users/KEL329/R-packages
[2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.3.2/library
──────────────────────────────────────────────────────────────────────────────