Species distribution models (SDMs) offer a way to quantify relationships between biodiversity observations and environmental variables, and are widely used in biogeography, conservation biology, and macroecology. SDMs allow us to assess how suitable an area is for a species, which has implications for predicting range shifts of invasive or threatened species, and understanding habitat suitability under different climate change scenarios.
The process of fitting and choosing the best model is iterative: often, several modelling approaches are used to make multiple predictions that can be averaged into a final result. Here, we walk through the steps of building an SDM for the laughing kookaburra in a small area of New South Wales, Australia, using tidymodels and tidysdm. tidymodels is a collection of packages for modelling using tidyverse principles, and tidysdm implements SDMs within the tidymodels framework.
Pros

- Machine learning models are easier to build and use: Tidymodels packages are well suited to machine learning in R, and many types of SDMs, like MaxEnt models, fit naturally into the tidymodels framework.
- Data transformations are more transparent: Steps like log-transformations, scaling and centring occur within the model-building workflow (rather than beforehand during data wrangling), making these manipulations more transparent to the model's results.
- Functions are modular: Many different types of models can be run with the same functions and workflow.
- Many different models can be run at once: Several types of models can be fit simultaneously using a workflow_set(). These models can then be compared and even combined into a final predictive model.
- Auto-detects visualisations: Helper functions like autoplot() make visualising model performance less cumbersome by choosing which visual to display based on the previous object's output.
- Make something pretty good quickly: Users can build fairly good predictive models quickly, streamlining the process of going from raw data to prediction.
Cons

- Learning the workflow can be hard: Learning tidymodels can be confusing and time-consuming because the workflow differs quite a lot from other popular statistical modelling packages in R.
- Confusing function names: The {recipes} package uses function names like prep(), bake() and juice(), which make it difficult to know what they do or when to use them.
- Learning function outputs takes time: It takes time to learn what output to expect from each function and which helper functions are available at each step.
- Tidymodels isn't good at everything: Tidymodels is optimised for machine learning, but it isn't necessarily "better" than other packages like {dismo}, {lme4}, {biomod2}, or packages for Bayesian models like {tidybayes} or {brms}.
To begin, we can load some packages.
library(galah)
library(tidyverse)
library(tidymodels)
library(tidysdm) # devtools::install_github("EvolEcolGroup/tidysdm")
library(terra)
library(tidyterra)
library(here)
library(sf)
library(ozmaps)
Download data
Download biological data
Our model will use occurrence data of laughing kookaburras (Dacelo novaeguineae) in a small area of New South Wales. The laughing kookaburra is the largest kingfisher in the world.
The laughing kookaburra also has one of the most recognisable bird calls. This video is an amazing example.
We’ll choose a region in New South Wales…
# define geographic region
custom_bbox <- tibble(ymin = -35,
                      ymax = -32,
                      xmin = 149,
                      xmax = 152.1)
…and download records from the ALA.
galah_config(email = "your-email-here") # Registered ALA email
# download occurrence records
kookaburras <- galah_call() |>
  identify("Dacelo novaeguineae") |>
  filter(year == 2023) |>
  galah_apply_profile(ALA) |>
  galah_geolocate(custom_bbox, type = "bbox") |>
  atlas_occurrences()
For many of the later steps we'll need the coordinates formatted as a spatial object (i.e., geometry). So, let's convert our occurrence data to a spatial object (sf) defined by the longitude and latitude coordinates, and set the Coordinate Reference System (CRS) to EPSG:4326 (WGS84) [1].
# convert to sf
kookaburras_sf <- kookaburras |>
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude")) |>
  st_set_crs(4326)
Download environmental data
To help with our model prediction, we’ll also download the BioClim variables, a list of 19 biologically relevant environmental variables, for all of Australia as a raster.
A raster is a spatial grid of cells, where each cell contains a value representing information such as temperature or elevation. This information is often visualised by mapping colours to values in the raster (see image below). The resolution of the raster depends on the size of cells within the grid, with smaller cells corresponding to higher resolution (just like how the resolution of a television screen is determined by the number of pixels).
plot_raster <- function(r) {
  plot(r, axes = FALSE, legend = FALSE)
  plot(as.polygons(r, dissolve = FALSE, trunc = FALSE), add = TRUE)
  text(r, digits = 2)
}
# Create a 4 x 4 matrix
m <- matrix(1:16, ncol = 4, nrow = 4)

# Convert the matrix into a raster
r16 <- rast(m)

plot_raster(r16)
# Download world climate data
bioclim <- geodata::worldclim_country(
  country = "Australia",
  var = "bio",
  res = 5,
  path = here::here("folder-name",
                    "subfolder-name")
)
To narrow our BioClim data to only within the extent of our defined bounding box, we'll create an extent object bbox_ext, then crop our bioclim layers to within bbox_ext and project the cropped BioClim data to the same CRS as our kookaburra occurrence points.
# Set the coordinates to our bounding box
bbox_ext <- terra::ext(
  c(custom_bbox[["xmin"]],
    custom_bbox[["xmax"]],
    custom_bbox[["ymin"]],
    custom_bbox[["ymax"]])
)

# Crop our worldclim data to within our bounding box coordinates
aus <- bioclim |>
  terra::crop(bbox_ext) |>
  terra::project(crs("EPSG:4326"))
To make sure everything looks correct, let's plot one of the variables with geom_spatraster() from the tidyterra package [2].
# Download NSW map, set CRS projection
nsw <- ozmaps::ozmap_states |>
  filter(NAME == "New South Wales") |>
  st_transform(crs = st_crs(4326))

# Map of Annual temperature + points
first_map <- ggplot() +
  geom_spatraster(data = aus,
                  aes(fill = wc2.1_30s_bio_1)) +
  geom_sf(data = kookaburras_sf,
          colour = "#312108",
          size = 2) +
  scale_fill_whitebox_c(palette = "muted",
                        na.value = NA) +
  guides(fill = guide_colorbar(title = "Annual Mean\nTemperature")) +
  theme_void()
first_map
Prepare data
Now that we have our occurrence data and environmental data, there are a few steps we’ll complete to prepare our data for modelling.
Thinning
The first step is to remove data where many points overlap. The resolution of our prediction depends on the resolution of our environmental data. If each cell in our environmental data gives the average value of one square kilometre, then even if we had kookaburra observations at a higher resolution (for example, one for every square metre of our defined area), we could still only detect differences down to the resolution of our grid cells.
So, we can thin our data so that there is only one observation per cell of our raster [3]. This step reduces spatial bias and lowers the risk that spatial autocorrelation affects the final predictions of our model.
# thin
set.seed(12345)
kookaburras_thin <- tidysdm::thin_by_cell(kookaburras_sf,
                                          raster = aus)
# number of observations
tibble(
  before = nrow(kookaburras_sf),
  after = nrow(kookaburras_thin)
)
# A tibble: 1 × 2
before after
<int> <int>
1 5624 1575
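As an aside, it's also possible to thin by distance rather than by raster cell (see footnote 3). A minimal sketch, using an illustrative 1 km minimum distance between points:

# thin by distance instead of by cell (1 km is an illustrative choice)
kookaburras_thin_dist <- tidysdm::thin_by_dist(kookaburras_sf,
                                               dist_min = km2m(1))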
Pseudo-absences
Our data from the ALA is presence-only data. So, the second step is to create pseudo-absences (also called background points) that represent the full extent of the area where kookaburras haven't been observed (yet) in our data. Importantly, these are not the same as true absences, and this should be taken into account when interpreting results [4].
Let's add 3 times the number of presences to fill our grid, making sure the pseudo-absences fall no closer than 5 km to any other point, including our occurrence points.
kookaburras_pseudoabs <- tidysdm::sample_pseudoabs(
  kookaburras_thin,
  n = 3 * nrow(kookaburras_thin),
  raster = aus,
  method = c("dist_min", km2m(5))
)
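It's worth a quick sanity check that we got the class balance we asked for (three pseudo-absences for every presence). A minimal sketch:

# count points in each class; pseudoabs should be ~3x presence
kookaburras_pseudoabs |>
  st_drop_geometry() |>
  count(class)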
Extract environmental values
Now that we have our presence and pseudo-absence points, the third step is to extract the environmental data for each point location in aus and bind the resulting values to our points data in kookaburras_pseudoabs. The result is a tibble with our points, their class, and the values of all 19 BioClim variables.
kookaburras_bioclim <- kookaburras_pseudoabs |>
  bind_cols(
    terra::extract(aus,
                   kookaburras_pseudoabs,
                   ID = FALSE)
  )

kookaburras_bioclim
Simple feature collection with 6300 features and 20 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 149.0042 ymin: -34.99952 xmax: 152.0958 ymax: -32.00211
Geodetic CRS: WGS 84
# A tibble: 6,300 × 21
class geometry wc2.1_30s_bio_1 wc2.1_30s_bio_2
* <fct> <POINT [°]> <dbl> <dbl>
1 presence (151.1115 -33.66282) 16.6 9.32
2 presence (151.1556 -33.68888) 16.7 9.15
3 presence (149.7309 -34.75422) 13.4 12.8
4 presence (150.4131 -33.72526) 13.2 9.78
5 presence (151.1531 -33.82185) 17.8 9.29
6 presence (150.6711 -34.0971) 16.7 12.0
7 presence (151.0117 -33.64909) 16.6 10.5
8 presence (150.2013 -33.38858) 11.2 9.94
9 presence (150.4419 -34.62589) 12.9 10.1
10 presence (150.1324 -33.3652) 11.2 10.1
# ℹ 6,290 more rows
# ℹ 17 more variables: wc2.1_30s_bio_3 <dbl>, wc2.1_30s_bio_4 <dbl>,
# wc2.1_30s_bio_5 <dbl>, wc2.1_30s_bio_6 <dbl>, wc2.1_30s_bio_7 <dbl>,
# wc2.1_30s_bio_8 <dbl>, wc2.1_30s_bio_9 <dbl>, wc2.1_30s_bio_10 <dbl>,
# wc2.1_30s_bio_11 <dbl>, wc2.1_30s_bio_12 <dbl>, wc2.1_30s_bio_13 <dbl>,
# wc2.1_30s_bio_14 <dbl>, wc2.1_30s_bio_15 <dbl>, wc2.1_30s_bio_16 <dbl>,
# wc2.1_30s_bio_17 <dbl>, wc2.1_30s_bio_18 <dbl>, wc2.1_30s_bio_19 <dbl>
Select predictor variables
The fourth and final step is to choose predictor variables for our model. These are variables we think explain variation in our outcome (i.e., the probability that a kookaburra could live in a given location). When choosing predictor variables, it is good practice to use theory and previous research to inform your choices [5].
In species distribution models, multicollinearity (high correlation between several independent variables in a model) can have unintended effects that bias predictions [6]. Data science tools can also help refine your choice of predictor variables, including some functions in tidysdm that we use below.
A good start is to choose variables that differentiate between presences and pseudo-absences, which in the plot below are variables with less overlap between their red and blue distributions. To help choose the variables with the highest non-overlapping distributions, we can set a cut-off of 55% non-overlap, leaving us with the top 3 variables in the table below.
kookaburras_bioclim |>
  rename_with(
    ~ str_remove_all(.x, "wc2.1_30s_"),
    starts_with("wc2.1_30s_")
  ) |>
  plot_pres_vs_bg(class)
overlap <- kookaburras_bioclim |>
  dist_pres_vs_bg(class) |>
  enframe("bioclim", "percent_non_overlap") |>
  arrange(desc(percent_non_overlap))

overlap |> gt::gt()
| bioclim          | percent_non_overlap |
|------------------|---------------------|
| wc2.1_30s_bio_12 | 0.5638973 |
| wc2.1_30s_bio_4  | 0.5600276 |
| wc2.1_30s_bio_16 | 0.5534120 |
| wc2.1_30s_bio_11 | 0.5394637 |
| wc2.1_30s_bio_6  | 0.5210402 |
| wc2.1_30s_bio_7  | 0.5113068 |
| wc2.1_30s_bio_13 | 0.4997009 |
| wc2.1_30s_bio_9  | 0.4827577 |
| wc2.1_30s_bio_18 | 0.4820482 |
| wc2.1_30s_bio_1  | 0.4599437 |
| wc2.1_30s_bio_2  | 0.4581933 |
| wc2.1_30s_bio_17 | 0.4464991 |
| wc2.1_30s_bio_19 | 0.4221545 |
| wc2.1_30s_bio_15 | 0.4202146 |
| wc2.1_30s_bio_14 | 0.4099419 |
| wc2.1_30s_bio_8  | 0.3411940 |
| wc2.1_30s_bio_10 | 0.3182850 |
| wc2.1_30s_bio_5  | 0.2343475 |
| wc2.1_30s_bio_3  | 0.1002332 |
Now we can use pair plots to view the relationship between each pair of variables. Variables bio_12 and bio_16 are very highly correlated (94%). To avoid multicollinearity in our model, we'll include only bio_12 (annual precipitation) and bio_4 (temperature seasonality).
aus |>
  select(wc2.1_30s_bio_12,
         wc2.1_30s_bio_4,
         wc2.1_30s_bio_16) |>
  terra::pairs()
Here, we selected two BioClim variables that we thought were reasonable environmental predictors (using mainly data science techniques):
- BIO4: Temperature Seasonality [7]
- BIO12: Annual Precipitation
We’ll filter our point data and BioClim raster data to only include our two variables.
# predictor variable names
vars <- c("wc2.1_30s_bio_4", "wc2.1_30s_bio_12")

# filter point data columns
kookaburras_bioclim_filtered <- kookaburras_bioclim |>
  select(all_of(c(vars, "class")))

kookaburras_bioclim_filtered |> head(5L)
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 149.7309 ymin: -34.75422 xmax: 151.1556 ymax: -33.66282
Geodetic CRS: WGS 84
# A tibble: 5 × 4
wc2.1_30s_bio_4 wc2.1_30s_bio_12 class geometry
<dbl> <dbl> <fct> <POINT [°]>
1 391. 1204 presence (151.1115 -33.66282)
2 390. 1275 presence (151.1556 -33.68888)
3 504. 673 presence (149.7309 -34.75422)
4 424. 1273 presence (150.4131 -33.72526)
5 393. 1164 presence (151.1531 -33.82185)
# filter bioclim data columns
aus_filtered <- aus[[vars]]

aus_filtered
class : SpatRaster
dimensions : 360, 372, 2 (nrow, ncol, nlyr)
resolution : 0.008333333, 0.008333333 (x, y)
extent : 149, 152.1, -35, -32 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : wc2.1_30s_bio_4, wc2.1_30s_bio_12
min values : 306.3223, 620
max values : 582.4574, 1657
bioclim4 <- ggplot() +
  geom_spatraster(data = aus_filtered,
                  aes(fill = wc2.1_30s_bio_4)) +
  geom_rect(data = custom_bbox,
            mapping = aes(xmin = xmin,
                          ymin = ymin,
                          xmax = xmax,
                          ymax = ymax),
            colour = "grey50",
            fill = NA) +
  scale_fill_whitebox_c(palette = "muted",
                        na.value = NA) +
  guides(fill = guide_colorbar(title = "Annual Range in\nTemperature\n(°C)")) +
  theme_void()

bioclim12 <- ggplot() +
  geom_spatraster(data = aus_filtered,
                  aes(fill = wc2.1_30s_bio_12)) +
  geom_rect(data = custom_bbox,
            mapping = aes(xmin = xmin,
                          ymin = ymin,
                          xmax = xmax,
                          ymax = ymax),
            colour = "grey50",
            fill = NA) +
  scale_fill_whitebox_c(palette = "deep",
                        na.value = NA) +
  guides(fill = guide_colorbar(title = "Precipitation (mm)")) +
  theme_void()
Fit model
Tidymodels is designed to build a model workflow, train the model’s performance, then test the model’s ability to predict data accurately. This workflow might be slightly different to what many research scientists are used to.
In machine learning models, data is like a limited resource that we must divide using a “data budget” for two main purposes: training a reasonable model, and testing the final model.
Split data
The first step in allocating your "data budget" is splitting your data. We can use initial_split() to allocate a reasonable budget across these categories (typically a 75-25% split).
# set training and testing data
set.seed(100)
kookaburras_split <- kookaburras_bioclim_filtered |>
  initial_split()

kookaburras_split
<Training/Testing/Total>
<4725/1575/6300>
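The printed numbers are the training, testing, and total rows. If you'd like a different data budget, initial_split() accepts a prop argument (the default is 3/4). A minimal sketch:

# an 80/20 split instead of the default 75/25
kookaburras_split_80 <- kookaburras_bioclim_filtered |>
  initial_split(prop = 0.8)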
Now we can save these data as separate objects using training() and testing().
kookaburras_train <- training(kookaburras_split)
kookaburras_test <- testing(kookaburras_split)
We are left with two dataframes with identical columns but different points.
kookaburras_train |> head(5L)
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 149.0125 ymin: -34.21327 xmax: 150.894 ymax: -32.3875
Geodetic CRS: WGS 84
# A tibble: 5 × 4
wc2.1_30s_bio_4 wc2.1_30s_bio_12 class geometry
<dbl> <dbl> <fct> <POINT [°]>
1 498. 648 pseudoabs (150.4792 -32.3875)
2 374. 1442 presence (150.894 -34.21327)
3 561. 693 pseudoabs (149.0125 -32.7875)
4 486. 835 pseudoabs (150.1625 -32.72917)
5 435. 890 pseudoabs (150.7542 -33.1125)
kookaburras_test |> head(5L)
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 149.6909 ymin: -33.82185 xmax: 151.6858 ymax: -32.9229
Geodetic CRS: WGS 84
# A tibble: 5 × 4
wc2.1_30s_bio_4 wc2.1_30s_bio_12 class geometry
<dbl> <dbl> <fct> <POINT [°]>
1 393. 1164 presence (151.1531 -33.82185)
2 461. 1085 presence (150.2013 -33.38858)
3 513. 802 presence (149.6909 -33.14061)
4 408. 1144 presence (151.6858 -32.9229)
5 463. 1073 presence (150.1981 -33.37893)
Resampling
Now let's resample our training data so we can use it to optimise and evaluate our model. One way to resample is using cross-validation, a well-established method of resampling that randomly assigns points to analysis and assessment groups. These randomly resampled and split data sets are known as folds. We can use the spatialsample package to create 5 v-folds with spatial_block_cv(), a function for resampling spatial data.
set.seed(100)
kookaburras_cv <- spatial_block_cv(kookaburras_train, v = 5)
spatial_block_cv() uses a type of resampling called block cross-validation, which creates a grid of "blocks" and attempts to keep these blocked groups together when resampling data points. Block cross-validation is important because spatial data is not completely random: data from neighbouring locations probably relate in some way, and block cross-validation attempts to preserve this spatial relationship. The plots below demonstrate the general process: the plot on the left shows the blocks, and the animation on the right shows the resulting 5 folds.
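If you'd like to draw a plot like this for your own data, spatialsample provides an autoplot() method for its folds. A minimal sketch:

# plot the spatial blocks, coloured by fold
autoplot(kookaburras_cv)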
Define our model
Next let's make our model's "recipe". This is the tidymodels term for any pre-processing steps applied to our data before it goes into a model. A recipe includes our model formula and any transformations or standardisations we might wish to apply [8].
In our case, let's define that our model's outcome variable is the class of presence or absence. We'll then add our predictor variables to the model with the formula class ~ ., which is equivalent to class ~ bio4 + bio12.
kookaburras_recipe <- recipe(
  kookaburras_train,
  formula = class ~ .
)

kookaburras_recipe
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 2
coords: 2
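Our recipe has no pre-processing steps, but this is where they would go. As a sketch (not needed for the models below), the predictors could be centred and scaled by adding a step from the recipes package:

# the same recipe with normalised predictors
kookaburras_recipe_scaled <- recipe(
  kookaburras_train,
  formula = class ~ .
) |>
  step_normalize(all_numeric_predictors())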
Now we can set our workflow, which merges our formula, any data pre-processing steps, and the models we'll use. One of the strengths of tidymodels is that we can run several different types of models in a single workflow_set() to train and optimise them.
kookaburras_models <-
  # create the workflow_set
  workflow_set(
    preproc = list(default = kookaburras_recipe),
    models = list(
      glm = sdm_spec_glm(),        # the standard glm specs
      rf = sdm_spec_rf(),          # rf specs with tuning
      gbm = sdm_spec_boost_tree(), # boosted tree model (gbm) specs with tuning
      maxent = sdm_spec_maxent()   # maxent specs with tuning
    ),
    cross = TRUE # make all combinations of preproc and models
  ) |>
  # tweak controls to store information needed later to create the ensemble
  option_add(control = control_ensemble_grid())
kookaburras_models
# A workflow set/tibble: 4 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 default_glm <tibble [1 × 4]> <opts[1]> <list [0]>
2 default_rf <tibble [1 × 4]> <opts[1]> <list [0]>
3 default_gbm <tibble [1 × 4]> <opts[1]> <list [0]>
4 default_maxent <tibble [1 × 4]> <opts[1]> <list [0]>
Fit our model
Next, we'll determine which parameters optimise our model's performance by tuning. Tuning uses trial and error to figure out which type of model, under which hyperparameter settings, makes reasonable predictions.
Let's tune our models using our resampled folds (kookaburras_cv).
Tuning is the process of simulating many different ways to fit lines made by our models to our training data. The number of curves in a model's line of best fit increases as a model becomes more complex, determined by its degrees of freedom. By using functions that set reasonable weighting parameters to penalise a model when its line curves too much, tuning optimises the model's performance, balancing fit to the current training data against correct prediction of new values. For more information and a visual explanation, check out the tune package's Getting Started vignette.
set.seed(2345678) # for reproducibility
kookaburras_models_tune <- kookaburras_models |>
  workflow_map("tune_grid",
               resamples = kookaburras_cv,
               grid = 6, # increase for more iterations
               metrics = sdm_metric_set(),
               verbose = TRUE,
               control = stacks::control_stack_grid()
  )
kookaburras_models_tune
# A workflow set/tibble: 4 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 default_glm <tibble [1 × 4]> <opts[4]> <rsmp[+]>
2 default_rf <tibble [1 × 4]> <opts[4]> <tune[+]>
3 default_gbm <tibble [1 × 4]> <opts[4]> <tune[+]>
4 default_maxent <tibble [1 × 4]> <opts[4]> <tune[+]>
beepr::beep(2)
We can use autoplot() to visualise which models performed best according to a set of common performance metrics for species distribution models. Models with higher values and smaller confidence intervals performed better.
autoplot(kookaburras_models_tune)
We can also collect each model's metrics and rank the models by performance. (As a general tidymodels tip, many columns with a . at the start of the column name can be retrieved with a collect_ function, e.g., collect_metrics() or collect_parameters().)

# see metrics
collect_metrics(kookaburras_models_tune)
# A tibble: 45 × 9
wflow_id .config preproc model .metric .estimator mean n std_err
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
1 default_glm Preprocesso… spatia… logi… boyce_… binary 0.890 5 0.0374
2 default_glm Preprocesso… spatia… logi… roc_auc binary 0.861 5 0.0327
3 default_glm Preprocesso… spatia… logi… tss_max binary 0.633 5 0.0617
4 default_rf Preprocesso… spatia… rand… boyce_… binary 0.753 5 0.0907
5 default_rf Preprocesso… spatia… rand… roc_auc binary 0.792 5 0.0400
6 default_rf Preprocesso… spatia… rand… tss_max binary 0.493 5 0.0681
7 default_rf Preprocesso… spatia… rand… boyce_… binary 0.761 5 0.0841
8 default_rf Preprocesso… spatia… rand… roc_auc binary 0.799 5 0.0394
9 default_rf Preprocesso… spatia… rand… tss_max binary 0.505 5 0.0670
10 default_gbm Preprocesso… spatia… boos… boyce_… binary 0.745 5 0.107
# ℹ 35 more rows
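workflowsets can also rank the workflows by a single chosen metric, which is easier to scan than the full table. A minimal sketch:

# rank all tuned workflows by area under the ROC curve
rank_results(kookaburras_models_tune, rank_metric = "roc_auc")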
Our tuning results show that several types of models and parameter settings performed quite well. Rather than choosing only one model to use for predictions, it's possible to use several as a "stacked ensemble model"! The stacks package in tidymodels lets us blend the predictions of a few good candidate models (based on whatever metric you choose) to make better overall estimates [9].
library(stacks)

set.seed(123456)
kookaburras_stacked <-
  stacks() |>                                # initialize the stack
  add_candidates(kookaburras_models_tune) |> # add candidate members
  blend_predictions() |>                     # determine how to combine their predictions
  fit_members()                              # fit the candidates with nonzero stacking coefficients

kookaburras_stacked
# A tibble: 2 × 3
member type weight
<chr> <chr> <dbl>
1 .pred_pseudoabs_default_maxent_1_1 maxent 2.25
2 .pred_pseudoabs_default_glm_1_1 logistic_reg 0.875
Here is a nice visual of how these two member models are weighted to inform our predictions.
autoplot(kookaburras_stacked, type = "weights")
Assess model
Our model kookaburras_stacked is now ready, and we can use it to make predictions about our test data. We'll bind the predicted values to the true values of our test data…

kookaburras_test_predictions <-
  kookaburras_test %>%
  bind_cols(predict(kookaburras_stacked, .,
                    type = "prob",
                    save_pred = TRUE))
…which allows us to assess how good our model is at making correct predictions of the “true” classes.
kookaburras_test_predictions |>
  sdm_metric_set()(truth = class, .pred_presence)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 boyce_cont binary 0.968
2 roc_auc binary 0.848
3 tss_max binary 0.565
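To see the trade-off behind that roc_auc value, we can also plot a ROC curve with yardstick. A minimal sketch:

# ROC curve of the test set predictions
kookaburras_test_predictions |>
  st_drop_geometry() |>
  roc_curve(truth = class, .pred_presence) |>
  autoplot()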
We can also visualise how well the model has predicted the class of each point by predicting "class" rather than "prob" and mapping correct vs incorrect predictions.
# predict class
kookaburras_test_predictions_class <-
  kookaburras_test %>%
  bind_cols(predict(kookaburras_stacked, .,
                    type = "class",
                    save_pred = TRUE))

# plot correct vs incorrect predictions
kookaburras_test_predictions_class |>
  mutate(correct = case_when(
    class == .pred_class ~ "Correct",
    TRUE ~ "Incorrect"
  )) |>
  ggplot() +
  geom_sf(aes(geometry = geometry, colour = correct)) +
  labs(color = NULL) +
  scale_color_manual(values = c("darkred", "lightpink")) +
  geom_spatraster(data = aus,
                  aes(fill = wc2.1_30s_bio_4),
                  alpha = 0.1) +
  scale_fill_whitebox_c(palette = "muted",
                        na.value = NA) +
  theme_void()
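The same comparison can be summarised numerically with a confusion matrix from yardstick. A minimal sketch:

# cross-tabulate true vs predicted classes
kookaburras_test_predictions_class |>
  st_drop_geometry() |>
  conf_mat(truth = class, estimate = .pred_class)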
tidysdm offers its own wrapper function simple_ensemble() to run a stacked-model workflow, along with some helpful ways to assess performance.
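As a rough sketch of that alternative route (see the tidysdm documentation for details), an ensemble can be built directly from our tuned workflow_set, selecting each workflow's best configuration by a chosen metric:

# a tidysdm ensemble from the tuned workflows,
# selecting each workflow's best configuration by the Boyce continuous index
kookaburras_ensemble <- simple_ensemble() |>
  add_member(kookaburras_models_tune, metric = "boyce_cont")

autoplot(kookaburras_ensemble)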
Final prediction
Finally, we can use our model to predict the habitat suitability of laughing kookaburras over our area. We'll predict an entire surface of values within our aus_filtered area using the incredible predict_raster() function from tidysdm (which saves us quite a few wrangling steps to work nicely with terra).
# predict
prediction_present <- predict_raster(kookaburras_stacked,
                                     aus_filtered,
                                     type = "prob")
# map
ggplot() +
  geom_spatraster(data = prediction_present,
                  aes(fill = .pred_presence)) +
  scale_fill_whitebox_c(palette = "purple",
                        na.value = NA) +
  guides(
    fill = guide_colorbar(title = "Relative\nHabitat\nSuitability")
  ) +
  # plot presences used in the model
  geom_sf(data = kookaburras_sf,
          alpha = 0.3) +
  labs(title = "Predicted distribution of laughing kookaburras") +
  pilot::theme_pilot(grid = "hv") +
  theme(
    legend.text = element_text(hjust = 0.5)
  )
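If you need a presence/absence map rather than a continuous suitability surface, the probability raster can be thresholded. A minimal sketch using an illustrative 0.5 cut-off (in practice, thresholds are usually chosen to optimise a metric such as TSS):

# convert the probability surface to a binary presence/absence map
prediction_binary <- prediction_present$.pred_presence >= 0.5
plot(prediction_binary)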
And there we have our predictions of our species distribution model!
Final thoughts
We hope this article has made the steps of species distribution modelling and interpretation clearer. Species distribution models remain one of the most powerful statistical tools for making inferences about species and their habitat range. tidymodels, tidysdm, and tidyterra offer a useful toolset for running these models in R.
Although our model performed decently under several model performance metrics, no model is perfect. For example, you can see that many of the values towards the centre of Australia have low relative habitat suitability despite quite a few kookaburra occurrences. This is a limitation likely caused by our data and our choice of predictor variables. Testing different subsets of predictors and trying environmental layers at different levels of spatial resolution will help to improve the performance of the model. Another option (if collecting more data is not feasible) is to explore other datasets that could be aggregated to enhance the quality of training data.
Our model above is quite minimalist (for simplicity). If you’d like an example list of variables used in more performant models, check out this paper.
To learn more on ALA Labs, check out our posts on spatial bias and mapping multiple overlapping species distributions.
Session info
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31 ucrt)
os Windows 10 x64 (build 19045)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_Australia.utf8
ctype English_Australia.utf8
tz Australia/Sydney
date 2024-05-02
pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
broom * 1.0.5 2023-06-09 [1] CRAN (R 4.3.1)
dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.2)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.2)
galah * 2.0.2 2024-04-12 [1] CRAN (R 4.3.3)
ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.1)
glmnet * 4.1-8 2023-08-22 [1] CRAN (R 4.3.1)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.2)
htmltools * 0.5.7 2023-11-03 [1] CRAN (R 4.3.2)
infer * 1.0.5 2023-09-06 [1] CRAN (R 4.3.1)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.2)
Matrix * 1.6-4 2023-11-30 [1] CRAN (R 4.3.2)
maxnet * 0.1.4 2021-07-09 [1] CRAN (R 4.3.2)
modeldata * 1.2.0 2023-08-09 [1] CRAN (R 4.3.1)
ozmaps * 0.4.5 2021-08-03 [1] CRAN (R 4.3.2)
parsnip * 1.1.1 2023-08-17 [1] CRAN (R 4.3.1)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.2)
ranger * 0.16.0 2023-11-12 [1] CRAN (R 4.3.2)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.3.3)
recipes * 1.0.8 2023-08-25 [1] CRAN (R 4.3.1)
rsample * 1.2.0 2023-08-23 [1] CRAN (R 4.3.1)
scales * 1.3.0 2023-11-28 [1] CRAN (R 4.3.2)
sessioninfo * 1.2.2 2021-12-06 [1] CRAN (R 4.3.2)
sf * 1.0-16 2024-03-24 [1] CRAN (R 4.3.3)
spatialsample * 0.5.1 2023-11-08 [1] CRAN (R 4.3.2)
stacks * 1.0.3 2023-11-06 [1] CRAN (R 4.3.3)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.2)
terra * 1.7-55 2023-10-13 [1] CRAN (R 4.3.1)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.2)
tidymodels * 1.1.1 2023-08-24 [1] CRAN (R 4.3.1)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.3)
tidysdm * 0.9.2 2023-11-13 [1] CRAN (R 4.3.2)
tidyterra * 0.5.0 2023-11-21 [1] CRAN (R 4.3.2)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.2)
tune * 1.1.2 2023-08-23 [1] CRAN (R 4.3.1)
workflows * 1.1.3 2023-02-22 [1] CRAN (R 4.3.2)
workflowsets * 1.0.1 2023-04-06 [1] CRAN (R 4.3.2)
xgboost * 1.7.6.1 2023-12-06 [1] CRAN (R 4.3.1)
yardstick * 1.2.0 2023-04-21 [1] CRAN (R 4.3.2)
[1] C:/Users/KEL329/R-packages
[2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.3.2/library
──────────────────────────────────────────────────────────────────────────────
Footnotes

1. ALA data is projected using CRS EPSG:4326 (the same one used by Google Earth).
2. tidyterra follows the Grammar of Graphics made popular in R by ggplot2, and allows rasters to be plotted using the same syntax. In contrast, the terra package requires users to plot using base R styling (using plot()).
3. It's also possible to thin data by distance rather than cell size using tidysdm::thin_by_dist().
4. A true absence has quite a different meaning than a pseudo-absence to a species distribution model. The main difference is in the value a known absence provides, compared to a simulated one, for our interpretation of the results. A true absence is a point where, at a specific time, an organism was not found. A pseudo-absence, by contrast, is a point that acts as if we haven't found an animal there, but we don't actually have data for that location! The model, however, doesn't know whether a point represents a true absence or a pseudo-absence. It only knows the information it is given, and will interpret that information using the parameters it is provided (in this way, models are a reflection of the real world, but never a substitute). Collecting true absence data is difficult, typically requiring expert knowledge, surveys with stricter methodologies, and repeated measures of the same areas over time. Pseudo-absences are much easier to collect (you simply simulate them on a computer), but they are less informative. Keep this trade-off in mind as you interpret your model's results.
5. Keep in mind that the strength of variables depends on the scale of your prediction. If you wish to make predictions at a broad scale, variables like temperature and rainfall will likely be strong predictors, whereas at a fine scale, variables like food scarcity and competition might be stronger predictors for your outcome.
6. A model can only use the information it is provided to make inferences about the world. If multiple variables in a model correlate, the model can place too much weight on those values when determining the outcome! The model isn't aware of the many other environmental variables that affect the real-world outcome.
7. Measured as the standard deviation of the mean monthly temperature.
8. For example, log transformations, centring, scaling, or setting dummy variables.
9. To learn more about how putting together a stack works, check out this helpful article on the stacks website.