Species distribution models (SDMs) offer a way to quantify relationships between biodiversity observations and environmental variables, and are widely used in biogeography, conservation biology, and macroecology. SDMs allow us to assess how suitable an area is for a species, which has implications for predicting range shifts of invasive or threatened species, and understanding habitat suitability under different climate change scenarios.
The process of fitting and choosing the best model is iterative: often, several modelling approaches are used to make multiple predictions that can be averaged into a final result. Here, we walk through the steps of building an SDM for the laughing kookaburra in a small area of New South Wales, Australia, using tidymodels and tidysdm. tidymodels is a collection of packages for modelling using tidyverse principles, and tidysdm implements SDMs within the tidymodels framework.
Pros

- Machine learning models are easier to build and use: Tidymodels packages are well suited to machine learning in R, and many types of SDMs, like MaxEnt models, fit naturally into the tidymodels framework.
- Data transformations are more transparent: Steps like log-transformations, scaling and centring occur within the model-building workflow (rather than beforehand during data wrangling), making these manipulations more transparent to the model's results.
- Functions are modular: Many different types of models can be run with the same functions and workflow.
- Many different models can be run at once: Several types of models can be fit simultaneously using a workflow_set(). These models can then be compared and even combined into a final predictive model.
- Auto-detects visualisations: Helper functions like autoplot() make visualising model performance less cumbersome by choosing which visual to display based on the previous object's output.
- Make something pretty good quickly: Users can build fairly good predictive models quickly, streamlining the process of going from raw data to prediction.
Cons

- Learning the workflow can be hard: Learning tidymodels can be confusing and time-consuming because the workflow differs quite a lot from other popular statistical modelling packages in R.
- Confusing function names: The {recipes} package uses function names like prep(), bake() and juice(), which make it difficult to know what they do or when to use them.
- Learning function outputs takes time: It takes time to learn what output to expect from each function and which helper functions are available at each step.
- Tidymodels isn't good at everything: Tidymodels is optimised for machine learning, but it isn't necessarily "better" than other packages like {dismo}, {lme4}, {biomod2}, or packages for Bayesian models like {tidybayes} or {brms}.
To begin, we can load some packages.
library(galah)
library(tidyverse)
library(tidymodels)
library(tidysdm) # devtools::install_github("EvolEcolGroup/tidysdm")
library(terra)
library(tidyterra)
library(here)
library(sf)
library(ozmaps)
Download data
Download biological data
Our model will use occurrence data of laughing kookaburras (Dacelo novaeguineae) in a small area of New South Wales. The laughing kookaburra is the largest kingfisher in the world.
The laughing kookaburra also has one of the most recognisable bird calls. This video is an amazing example.
We’ll choose a region in New South Wales…
# define geographic region
custom_bbox <- tibble(ymin = -35,
                      ymax = -32,
                      xmin = 149,
                      xmax = 152.1)
…and download records from the ALA.
galah_config(email = "your-email-here") # Registered ALA email
# download occurrence records
kookaburras <- galah_call() |>
  identify("Dacelo novaeguineae") |>
  filter(year == 2023) |>
  galah_apply_profile(ALA) |>
  galah_geolocate(custom_bbox, type = "bbox") |>
  atlas_occurrences()
For many of the later steps we'll need the coordinates formatted as a spatial object (i.e., geometry). So, let's convert our occurrence data to a spatial object (sf) defined by the longitude and latitude coordinates, and set the Coordinate Reference System (CRS) to EPSG:4326 (WGS84) [1].
# convert to sf
kookaburras_sf <- kookaburras |>
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude")) |>
  st_set_crs(4326)
Download environmental data
To help with our model prediction, we’ll also download the BioClim variables, a list of 19 biologically relevant environmental variables, for all of Australia as a raster.
A raster is a spatial grid of cells, where each cell contains a value representing information such as temperature or elevation. This information is often visualised by mapping colours to values in the raster (see image below). The resolution of the raster depends on the size of cells within the grid, with smaller cells corresponding to higher resolution (just like how the resolution of a television screen is determined by the number of pixels).
plot_raster <- function(r) {
  plot(r, axes = FALSE, legend = FALSE)
  plot(as.polygons(r, dissolve = FALSE, trunc = FALSE), add = TRUE)
  text(r, digits = 2)
}
# Create a 4 x 4 matrix
m <- matrix(1:16, ncol = 4, nrow = 4)

# Convert the matrix into a raster
r16 <- rast(m)

plot_raster(r16)
# Download world climate data
bioclim <- geodata::worldclim_country(
  country = "Australia",
  var = "bio",
  res = 5,
  path = here::here("folder-name",
                    "subfolder-name")
)
To narrow our BioClim data to only within the extent of our defined bounding box, we'll create an extent object bbox_ext, then crop our bioclim layers to within bbox_ext and project the cropped BioClim data to the same CRS as our kookaburra occurrence points.
# Set the coordinates to our bounding box
bbox_ext <- terra::ext(
  c(custom_bbox[["xmin"]],
    custom_bbox[["xmax"]],
    custom_bbox[["ymin"]],
    custom_bbox[["ymax"]])
)

# Crop our worldclim data to within our bounding box coordinates
aus <- bioclim |>
  terra::crop(bbox_ext) |>
  terra::project(crs("EPSG:4326"))
To make sure everything looks correct, let's plot one of the variables with geom_spatraster() from the tidyterra package [2].
# Download NSW map, set CRS projection
nsw <- ozmaps::ozmap_states |>
  filter(NAME == "New South Wales") |>
  st_transform(crs = st_crs(4326))

# Map of Annual temperature + points
first_map <- ggplot() +
  geom_spatraster(data = aus,
                  aes(fill = wc2.1_30s_bio_1)) +
  geom_sf(data = kookaburras_sf,
          colour = "#312108",
          size = 2) +
  scale_fill_whitebox_c(palette = "muted",
                        na.value = NA) +
  guides(fill = guide_colorbar(title = "Annual Mean\nTemperature")) +
  theme_void()
first_map
Prepare data
Now that we have our occurrence data and environmental data, there are a few steps we’ll complete to prepare our data for modelling.
Thinning
The first step is to remove data where many points overlap. The resolution of our prediction depends on the resolution of our environmental data. If each cell in our environmental data gives the average value of one square kilometre, then even if we had kookaburra observations at a higher resolution (for example, one for every square metre of our defined area), we could still only detect differences down to the resolution of our grid cells.
So, we can thin our data so that there is only one observation per cell of our raster [3]. This step reduces spatial bias and lowers the risk that spatial autocorrelation affects the final predictions of our model.
# thin
set.seed(12345)
kookaburras_thin <- tidysdm::thin_by_cell(kookaburras_sf,
                                          raster = aus)
# number of observations
tibble(
  before = nrow(kookaburras_sf),
  after = nrow(kookaburras_thin)
)
# A tibble: 1 × 2
before after
<int> <int>
1 5624 1575
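As an aside, it's also possible to thin by distance rather than by raster cell (see footnote 3). A minimal sketch, using an illustrative 1 km minimum distance between points:

# thin by distance instead of by cell (1 km is an illustrative choice)
kookaburras_thin_dist <- tidysdm::thin_by_dist(kookaburras_sf,
                                               dist_min = km2m(1))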
Pseudo-absences
Our data from the ALA is presence-only data. So, the second step is to create pseudo-absences (also called background points) that represent the full extent of the area where kookaburras haven't been observed (yet) in our data. Importantly, these are not the same as true absences, and this should be taken into account when interpreting results [4].
Let's add 3 times the number of presences to fill our grid, making sure the pseudo-absences fall no closer than 5 km to any other point, including our occurrence points.
kookaburras_pseudoabs <- tidysdm::sample_pseudoabs(
  kookaburras_thin,
  n = 3 * nrow(kookaburras_thin),
  raster = aus,
  method = c("dist_min", km2m(5))
)
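It's worth a quick sanity check that we got the class balance we asked for (three pseudo-absences for every presence). A minimal sketch:

# count points in each class; pseudoabs should be ~3x presence
kookaburras_pseudoabs |>
  st_drop_geometry() |>
  count(class)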
Extract environmental values
Now that we have our presence and pseudo-absence points, the third step is to extract the environmental data for each point location in aus and bind the resulting values to our points data in kookaburras_pseudoabs. The result is a tibble with our points, their class, and the values of all 19 BioClim variables.
kookaburras_bioclim <- kookaburras_pseudoabs |>
  bind_cols(
    terra::extract(aus,
                   kookaburras_pseudoabs,
                   ID = FALSE)
  )

kookaburras_bioclim
Simple feature collection with 6300 features and 20 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 149.0042 ymin: -34.99952 xmax: 152.0958 ymax: -32.00211
Geodetic CRS: WGS 84
# A tibble: 6,300 × 21
class geometry wc2.1_30s_bio_1 wc2.1_30s_bio_2
* <fct> <POINT [°]> <dbl> <dbl>
1 presence (151.1115 -33.66282) 16.6 9.32
2 presence (151.1556 -33.68888) 16.7 9.15
3 presence (149.7309 -34.75422) 13.4 12.8
4 presence (150.4131 -33.72526) 13.2 9.78
5 presence (151.1531 -33.82185) 17.8 9.29
6 presence (150.6711 -34.0971) 16.7 12.0
7 presence (151.0117 -33.64909) 16.6 10.5
8 presence (150.2013 -33.38858) 11.2 9.94
9 presence (150.4419 -34.62589) 12.9 10.1
10 presence (150.1324 -33.3652) 11.2 10.1
# ℹ 6,290 more rows
# ℹ 17 more variables: wc2.1_30s_bio_3 <dbl>, wc2.1_30s_bio_4 <dbl>,
# wc2.1_30s_bio_5 <dbl>, wc2.1_30s_bio_6 <dbl>, wc2.1_30s_bio_7 <dbl>,
# wc2.1_30s_bio_8 <dbl>, wc2.1_30s_bio_9 <dbl>, wc2.1_30s_bio_10 <dbl>,
# wc2.1_30s_bio_11 <dbl>, wc2.1_30s_bio_12 <dbl>, wc2.1_30s_bio_13 <dbl>,
# wc2.1_30s_bio_14 <dbl>, wc2.1_30s_bio_15 <dbl>, wc2.1_30s_bio_16 <dbl>,
# wc2.1_30s_bio_17 <dbl>, wc2.1_30s_bio_18 <dbl>, wc2.1_30s_bio_19 <dbl>
Select predictor variables
The fourth and final step is to choose predictor variables for our model. These are variables we think explain variation in our outcome (i.e., the probability that a kookaburra could live in a given location). When choosing predictor variables, it is good practice to use theory and previous research to inform your choices [5].
In species distribution models, multicollinearity (high correlation between several independent variables in a model) can have unintended effects that bias predictions [6]. Data science tools can also help refine your choice of predictor variables, including some functions in tidysdm that we use below.
A good start is to choose variables that differentiate between presences and pseudo-absences, which in the plot below are variables with less overlap between their red and blue distributions. To help choose the variables with the highest non-overlapping distributions, we can set a cut-off of 55% non-overlap, leaving us with the top 3 variables in the table below.
kookaburras_bioclim |>
  rename_with(
    ~ str_remove_all(.x, "wc2.1_30s_"),
    starts_with("wc2.1_30s_")
  ) |>
  plot_pres_vs_bg(class)
overlap <- kookaburras_bioclim |>
  dist_pres_vs_bg(class) |>
  enframe("bioclim", "percent_non_overlap") |>
  arrange(desc(percent_non_overlap))

overlap |> gt::gt()
| bioclim          | percent_non_overlap |
|------------------|---------------------|
| wc2.1_30s_bio_12 | 0.5638973 |
| wc2.1_30s_bio_4  | 0.5600276 |
| wc2.1_30s_bio_16 | 0.5534120 |
| wc2.1_30s_bio_11 | 0.5394637 |
| wc2.1_30s_bio_6  | 0.5210402 |
| wc2.1_30s_bio_7  | 0.5113068 |
| wc2.1_30s_bio_13 | 0.4997009 |
| wc2.1_30s_bio_9  | 0.4827577 |
| wc2.1_30s_bio_18 | 0.4820482 |
| wc2.1_30s_bio_1  | 0.4599437 |
| wc2.1_30s_bio_2  | 0.4581933 |
| wc2.1_30s_bio_17 | 0.4464991 |
| wc2.1_30s_bio_19 | 0.4221545 |
| wc2.1_30s_bio_15 | 0.4202146 |
| wc2.1_30s_bio_14 | 0.4099419 |
| wc2.1_30s_bio_8  | 0.3411940 |
| wc2.1_30s_bio_10 | 0.3182850 |
| wc2.1_30s_bio_5  | 0.2343475 |
| wc2.1_30s_bio_3  | 0.1002332 |
Now we can use pair plots to view the relationship between each pair of variables. Variables bio_12 and bio_16 are very highly correlated (94%). To avoid multicollinearity in our model, we'll include only bio_12 (annual precipitation) and bio_4 (temperature seasonality).
aus |>
  select(wc2.1_30s_bio_12,
         wc2.1_30s_bio_4,
         wc2.1_30s_bio_16) |>
  terra::pairs()
Here, we selected two BioClim variables that we thought were reasonable environmental predictors (using mainly data science techniques):
- BIO4: Temperature Seasonality [7]
- BIO12: Annual Precipitation
We’ll filter our point data and BioClim raster data to only include our two variables.
# predictor variable names
vars <- c("wc2.1_30s_bio_4", "wc2.1_30s_bio_12")

# filter point data columns
kookaburras_bioclim_filtered <- kookaburras_bioclim |>
  select(all_of(c(vars, "class")))

kookaburras_bioclim_filtered |> head(5L)
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 149.7309 ymin: -34.75422 xmax: 151.1556 ymax: -33.66282
Geodetic CRS: WGS 84
# A tibble: 5 × 4
wc2.1_30s_bio_4 wc2.1_30s_bio_12 class geometry
<dbl> <dbl> <fct> <POINT [°]>
1 391. 1204 presence (151.1115 -33.66282)
2 390. 1275 presence (151.1556 -33.68888)
3 504. 673 presence (149.7309 -34.75422)
4 424. 1273 presence (150.4131 -33.72526)
5 393. 1164 presence (151.1531 -33.82185)
# filter bioclim data columns
aus_filtered <- aus[[vars]]

aus_filtered
class : SpatRaster
dimensions : 360, 372, 2 (nrow, ncol, nlyr)
resolution : 0.008333333, 0.008333333 (x, y)
extent : 149, 152.1, -35, -32 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
names : wc2.1_30s_bio_4, wc2.1_30s_bio_12
min values : 306.3223, 620
max values : 582.4574, 1657
bioclim4 <- ggplot() +
  geom_spatraster(data = aus_filtered,
                  aes(fill = wc2.1_30s_bio_4)) +
  geom_rect(data = custom_bbox,
            mapping = aes(xmin = xmin,
                          ymin = ymin,
                          xmax = xmax,
                          ymax = ymax),
            colour = "grey50",
            fill = NA) +
  scale_fill_whitebox_c(palette = "muted",
                        na.value = NA) +
  guides(fill = guide_colorbar(title = "Annual Range in\nTemperature\n(°C)")) +
  theme_void()

bioclim12 <- ggplot() +
  geom_spatraster(data = aus_filtered,
                  aes(fill = wc2.1_30s_bio_12)) +
  geom_rect(data = custom_bbox,
            mapping = aes(xmin = xmin,
                          ymin = ymin,
                          xmax = xmax,
                          ymax = ymax),
            colour = "grey50",
            fill = NA) +
  scale_fill_whitebox_c(palette = "deep",
                        na.value = NA) +
  guides(fill = guide_colorbar(title = "Precipitation (mm)")) +
  theme_void()
Fit model
Tidymodels is designed to build a model workflow, train the model’s performance, then test the model’s ability to predict data accurately. This workflow might be slightly different to what many research scientists are used to.
In machine learning models, data is like a limited resource that we must divide using a “data budget” for two main purposes: training a reasonable model, and testing the final model.
Split data
The first step in allocating your "data budget" is splitting your data. We can use initial_split() to allocate a reasonable budget across these categories (typically a 75-25% split).
# set training and testing data
set.seed(100)
kookaburras_split <- kookaburras_bioclim_filtered |>
  initial_split()

kookaburras_split
<Training/Testing/Total>
<4725/1575/6300>
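The printed numbers are the training, testing, and total rows. If you'd like a different data budget, initial_split() accepts a prop argument (the default is 3/4). A minimal sketch:

# an 80/20 split instead of the default 75/25
kookaburras_split_80 <- kookaburras_bioclim_filtered |>
  initial_split(prop = 0.8)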
Now we can save these data as separate objects using training() and testing().
kookaburras_train <- training(kookaburras_split)
kookaburras_test <- testing(kookaburras_split)
We are left with two dataframes with identical columns but different points.
kookaburras_train |> head(5L)
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 149.0125 ymin: -34.21327 xmax: 150.894 ymax: -32.3875
Geodetic CRS: WGS 84
# A tibble: 5 × 4
wc2.1_30s_bio_4 wc2.1_30s_bio_12 class geometry
<dbl> <dbl> <fct> <POINT [°]>
1 498. 648 pseudoabs (150.4792 -32.3875)
2 374. 1442 presence (150.894 -34.21327)
3 561. 693 pseudoabs (149.0125 -32.7875)
4 486. 835 pseudoabs (150.1625 -32.72917)
5 435. 890 pseudoabs (150.7542 -33.1125)
kookaburras_test |> head(5L)
Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 149.6909 ymin: -33.82185 xmax: 151.6858 ymax: -32.9229
Geodetic CRS: WGS 84
# A tibble: 5 × 4
wc2.1_30s_bio_4 wc2.1_30s_bio_12 class geometry
<dbl> <dbl> <fct> <POINT [°]>
1 393. 1164 presence (151.1531 -33.82185)
2 461. 1085 presence (150.2013 -33.38858)
3 513. 802 presence (149.6909 -33.14061)
4 408. 1144 presence (151.6858 -32.9229)
5 463. 1073 presence (150.1981 -33.37893)
Resampling
Now let's resample our training data so we can use it to optimise and evaluate our model. One way to resample is using cross-validation, a well-established method of resampling that randomly assigns points to analysis and assessment groups. These randomly resampled and split data sets are known as folds. We can use the spatialsample package to create 5 v-folds with spatial_block_cv(), a function for resampling spatial data.
set.seed(100)
kookaburras_cv <- spatial_block_cv(kookaburras_train, v = 5)
spatial_block_cv() uses a type of resampling called block cross-validation, which creates a grid of "blocks" and attempts to keep these blocked groups together when resampling data points. Block cross-validation is important because spatial data is not completely random: data from neighbouring locations probably relate in some way, and block cross-validation attempts to preserve this spatial relationship. The plots below demonstrate the general process: the plot on the left shows the blocks, and the animation on the right shows the resulting 5 folds.
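If you'd like to draw a plot like this for your own data, spatialsample provides an autoplot() method for its folds. A minimal sketch:

# plot the spatial blocks, coloured by fold
autoplot(kookaburras_cv)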
Define our model
Next let's make our model's "recipe". This is the tidymodels term for any pre-processing steps applied to our data before it goes into a model. A recipe includes our model formula and any transformations or standardisations we might wish to apply [8].
In our case, let's define that our model's outcome variable is the class of presence or absence. We'll then add our predictor variables to the model with the formula class ~ ., which is equivalent to class ~ bio4 + bio12.
kookaburras_recipe <- recipe(
  kookaburras_train,
  formula = class ~ .
)

kookaburras_recipe
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 2
coords: 2
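Our recipe has no pre-processing steps, but this is where they would go. As a sketch (not needed for the models below), the predictors could be centred and scaled by adding a step from the recipes package:

# the same recipe with normalised predictors
kookaburras_recipe_scaled <- recipe(
  kookaburras_train,
  formula = class ~ .
) |>
  step_normalize(all_numeric_predictors())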
Now we can set our workflow, which merges our formula, any data pre-processing steps, and the models we'll use. One of the strengths of tidymodels is that we can run several different types of models in a single workflow_set() to train and optimise them.
kookaburras_models <-
  # create the workflow_set
  workflow_set(
    preproc = list(default = kookaburras_recipe),
    models = list(
      glm = sdm_spec_glm(),        # the standard glm specs
      rf = sdm_spec_rf(),          # rf specs with tuning
      gbm = sdm_spec_boost_tree(), # boosted tree model (gbm) specs with tuning
      maxent = sdm_spec_maxent()   # maxent specs with tuning
    ),
    cross = TRUE # make all combinations of preproc and models
  ) |>
  # tweak controls to store information needed later to create the ensemble
  option_add(control = control_ensemble_grid())
kookaburras_models
# A workflow set/tibble: 4 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 default_glm <tibble [1 × 4]> <opts[1]> <list [0]>
2 default_rf <tibble [1 × 4]> <opts[1]> <list [0]>
3 default_gbm <tibble [1 × 4]> <opts[1]> <list [0]>
4 default_maxent <tibble [1 × 4]> <opts[1]> <list [0]>
Fit our model
Next, we'll determine which parameters optimise our model's performance by tuning. Tuning uses trial and error to figure out which type of model, under which hyperparameter settings, makes reasonable predictions.
Let's tune our models using our resampled folds (kookaburras_cv).
Tuning is the process of simulating many different ways to fit lines made by our models to our training data. The number of curves in a model's line of best fit increases as a model becomes more complex, determined by its degrees of freedom. By using functions that set reasonable weighting parameters to penalise a model when its line curves too much, tuning optimises the model's performance, balancing fit to the current training data against correct prediction of new values. For more information and a visual explanation, check out the tune package's Getting Started vignette.
set.seed(2345678) # for reproducibility
kookaburras_models_tune <- kookaburras_models |>
  workflow_map("tune_grid",
               resamples = kookaburras_cv,
               grid = 6, # increase for more iterations
               metrics = sdm_metric_set(),
               verbose = TRUE,
               control = stacks::control_stack_grid()
  )
kookaburras_models_tune
# A workflow set/tibble: 4 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 default_glm <tibble [1 × 4]> <opts[4]> <rsmp[+]>
2 default_rf <tibble [1 × 4]> <opts[4]> <tune[+]>
3 default_gbm <tibble [1 × 4]> <opts[4]> <tune[+]>
4 default_maxent <tibble [1 × 4]> <opts[4]> <tune[+]>
beepr::beep(2)
We can use autoplot() to visualise which models performed best according to a set of common performance metrics for species distribution models. Models with higher values and smaller confidence intervals performed better.
autoplot(kookaburras_models_tune)
We can also collect each model's metrics and rank the models by performance. (As a general tidymodels tip, many columns with a . at the start of the column name can be retrieved with a collect_ function, e.g., collect_metrics() or collect_parameters().)

# see metrics
collect_metrics(kookaburras_models_tune)
# A tibble: 45 × 9
wflow_id .config preproc model .metric .estimator mean n std_err
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
1 default_glm Preprocesso… spatia… logi… boyce_… binary 0.890 5 0.0374
2 default_glm Preprocesso… spatia… logi… roc_auc binary 0.861 5 0.0327
3 default_glm Preprocesso… spatia… logi… tss_max binary 0.633 5 0.0617
4 default_rf Preprocesso… spatia… rand… boyce_… binary 0.753 5 0.0907
5 default_rf Preprocesso… spatia… rand… roc_auc binary 0.792 5 0.0400
6 default_rf Preprocesso… spatia… rand… tss_max binary 0.493 5 0.0681
7 default_rf Preprocesso… spatia… rand… boyce_… binary 0.761 5 0.0841
8 default_rf Preprocesso… spatia… rand… roc_auc binary 0.799 5 0.0394
9 default_rf Preprocesso… spatia… rand… tss_max binary 0.505 5 0.0670
10 default_gbm Preprocesso… spatia… boos… boyce_… binary 0.745 5 0.107
# ℹ 35 more rows
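workflowsets can also rank the workflows by a single chosen metric, which is easier to scan than the full table. A minimal sketch:

# rank all tuned workflows by area under the ROC curve
rank_results(kookaburras_models_tune, rank_metric = "roc_auc")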
Our tuning results show that several types of models and parameter settings performed quite well. Rather than choosing only one model to use for predictions, it's possible to use several as a "stacked ensemble model"! The stacks package in tidymodels lets us blend the predictions of a few good candidate models (based on whatever metric you choose) to make better overall estimates [9].
library(stacks)

set.seed(123456)
kookaburras_stacked <-
  stacks() |>                                # initialize the stack
  add_candidates(kookaburras_models_tune) |> # add candidate members
  blend_predictions() |>                     # determine how to combine their predictions
  fit_members()                              # fit the candidates with nonzero stacking coefficients

kookaburras_stacked
# A tibble: 2 × 3
member type weight
<chr> <chr> <dbl>
1 .pred_pseudoabs_default_maxent_1_1 maxent 2.25
2 .pred_pseudoabs_default_glm_1_1 logistic_reg 0.875
Here is a nice visual of how these two member models are weighted to inform our predictions.
autoplot(kookaburras_stacked, type = "weights")
Assess model
Our model kookaburras_stacked is now ready, and we can use it to make predictions about our test data. We'll bind the predicted values to the true values of our test data…

kookaburras_test_predictions <-
  kookaburras_test %>%
  bind_cols(predict(kookaburras_stacked, .,
                    type = "prob",
                    save_pred = TRUE))
…which allows us to assess how good our model is at making correct predictions of the “true” classes.
kookaburras_test_predictions |>
  sdm_metric_set()(truth = class, .pred_presence)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 boyce_cont binary 0.968
2 roc_auc binary 0.848
3 tss_max binary 0.565
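To see the trade-off behind that roc_auc value, we can also plot a ROC curve with yardstick. A minimal sketch:

# ROC curve of the test set predictions
kookaburras_test_predictions |>
  st_drop_geometry() |>
  roc_curve(truth = class, .pred_presence) |>
  autoplot()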
We can also visualise how well the model has predicted the class of each point by predicting "class" rather than "prob" and mapping correct vs incorrect predictions.
# predict class
kookaburras_test_predictions_class <-
  kookaburras_test %>%
  bind_cols(predict(kookaburras_stacked, .,
                    type = "class",
                    save_pred = TRUE))

# plot correct vs incorrect predictions
kookaburras_test_predictions_class |>
  mutate(correct = case_when(
    class == .pred_class ~ "Correct",
    TRUE ~ "Incorrect"
  )) |>
  ggplot() +
  geom_sf(aes(geometry = geometry, colour = correct)) +
  labs(color = NULL) +
  scale_color_manual(values = c("darkred", "lightpink")) +
  geom_spatraster(data = aus,
                  aes(fill = wc2.1_30s_bio_4),
                  alpha = 0.1) +
  scale_fill_whitebox_c(palette = "muted",
                        na.value = NA) +
  theme_void()
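The same comparison can be summarised numerically with a confusion matrix from yardstick. A minimal sketch:

# cross-tabulate true vs predicted classes
kookaburras_test_predictions_class |>
  st_drop_geometry() |>
  conf_mat(truth = class, estimate = .pred_class)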
tidysdm offers its own wrapper function simple_ensemble() to run a stacked-model workflow, along with some helpful ways to assess performance.
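As a rough sketch of that alternative route (see the tidysdm documentation for details), an ensemble can be built directly from our tuned workflow_set, selecting each workflow's best configuration by a chosen metric:

# a tidysdm ensemble from the tuned workflows,
# selecting each workflow's best configuration by the Boyce continuous index
kookaburras_ensemble <- simple_ensemble() |>
  add_member(kookaburras_models_tune, metric = "boyce_cont")

autoplot(kookaburras_ensemble)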
Final prediction
Finally, we can use our model to predict the habitat suitability of laughing kookaburras over our area. We'll predict an entire surface of values within our aus_filtered area using the incredible predict_raster() function from tidysdm (which saves us quite a few wrangling steps to work nicely with terra).
# predict
prediction_present <- predict_raster(kookaburras_stacked,
                                     aus_filtered,
                                     type = "prob")
# map
ggplot() +
  geom_spatraster(data = prediction_present,
                  aes(fill = .pred_presence)) +
  scale_fill_whitebox_c(palette = "purple",
                        na.value = NA) +
  guides(
    fill = guide_colorbar(title = "Relative\nHabitat\nSuitability")
  ) +
  # plot presences used in the model
  geom_sf(data = kookaburras_sf,
          alpha = 0.3) +
  labs(title = "Predicted distribution of laughing kookaburras") +
  pilot::theme_pilot(grid = "hv") +
  theme(
    legend.text = element_text(hjust = 0.5)
  )
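If you need a presence/absence map rather than a continuous suitability surface, the probability raster can be thresholded. A minimal sketch using an illustrative 0.5 cut-off (in practice, thresholds are usually chosen to optimise a metric such as TSS):

# convert the probability surface to a binary presence/absence map
prediction_binary <- prediction_present$.pred_presence >= 0.5
plot(prediction_binary)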
And there we have our predictions of our species distribution model!
Final thoughts
We hope this article has made the steps of species distribution modelling and interpretation clearer. Species distribution models remain one of the most powerful statistical tools for making inferences about species and their habitat range. tidymodels, tidysdm, and tidyterra offer a useful toolset for running these models in R.
Although our model performed decently under several model performance metrics, no model is perfect. For example, you can see that many of the values towards the centre of Australia have low relative habitat suitability despite quite a few kookaburra occurrences. This is a limitation likely caused by our data and our choice of predictor variables. Testing different subsets of predictors and trying environmental layers at different levels of spatial resolution will help to improve the performance of the model. Another option (if collecting more data is not feasible) is to explore other datasets that could be aggregated to enhance the quality of training data.
Our model above is quite minimalist (for simplicity). If you’d like an example list of variables used in more performant models, check out this paper.
To learn more on ALA Labs, check out our posts on spatial bias and mapping multiple overlapping species distributions.
Session info
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31 ucrt)
os Windows 10 x64 (build 19045)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_Australia.utf8
ctype English_Australia.utf8
tz Australia/Sydney
date 2024-05-02
pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
broom * 1.0.5 2023-06-09 [1] CRAN (R 4.3.1)
dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.2)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.2)
galah * 2.0.2 2024-04-12 [1] CRAN (R 4.3.3)
ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.1)
glmnet * 4.1-8 2023-08-22 [1] CRAN (R 4.3.1)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.2)
htmltools * 0.5.7 2023-11-03 [1] CRAN (R 4.3.2)
infer * 1.0.5 2023-09-06 [1] CRAN (R 4.3.1)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.2)
Matrix * 1.6-4 2023-11-30 [1] CRAN (R 4.3.2)
maxnet * 0.1.4 2021-07-09 [1] CRAN (R 4.3.2)
modeldata * 1.2.0 2023-08-09 [1] CRAN (R 4.3.1)
ozmaps * 0.4.5 2021-08-03 [1] CRAN (R 4.3.2)
parsnip * 1.1.1 2023-08-17 [1] CRAN (R 4.3.1)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.2)
ranger * 0.16.0 2023-11-12 [1] CRAN (R 4.3.2)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.3.3)
recipes * 1.0.8 2023-08-25 [1] CRAN (R 4.3.1)
rsample * 1.2.0 2023-08-23 [1] CRAN (R 4.3.1)
scales * 1.3.0 2023-11-28 [1] CRAN (R 4.3.2)
sessioninfo * 1.2.2 2021-12-06 [1] CRAN (R 4.3.2)
sf * 1.0-16 2024-03-24 [1] CRAN (R 4.3.3)
spatialsample * 0.5.1 2023-11-08 [1] CRAN (R 4.3.2)
stacks * 1.0.3 2023-11-06 [1] CRAN (R 4.3.3)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.2)
terra * 1.7-55 2023-10-13 [1] CRAN (R 4.3.1)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.2)
tidymodels * 1.1.1 2023-08-24 [1] CRAN (R 4.3.1)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.3)
tidysdm * 0.9.2 2023-11-13 [1] CRAN (R 4.3.2)
tidyterra * 0.5.0 2023-11-21 [1] CRAN (R 4.3.2)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.2)
tune * 1.1.2 2023-08-23 [1] CRAN (R 4.3.1)
workflows * 1.1.3 2023-02-22 [1] CRAN (R 4.3.2)
workflowsets * 1.0.1 2023-04-06 [1] CRAN (R 4.3.2)
xgboost * 1.7.6.1 2023-12-06 [1] CRAN (R 4.3.1)
yardstick * 1.2.0 2023-04-21 [1] CRAN (R 4.3.2)
[1] C:/Users/KEL329/R-packages
[2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.3.2/library
──────────────────────────────────────────────────────────────────────────────
Footnotes

1. ALA data is projected using CRS EPSG:4326 (the same one used by Google Earth).
2. tidyterra follows the Grammar of Graphics made popular in R by ggplot2, and allows rasters to be plotted using the same syntax. In contrast, the terra package requires users to plot using base R styling (using plot()).
3. It's also possible to thin data by distance rather than cell size using tidysdm::thin_by_dist().
4. A true absence has quite a different meaning than a pseudo-absence to a species distribution model. The main difference is in the value a known absence provides, compared to a simulated one, for our interpretation of the results. A true absence is a point where, at a specific time, an organism was not found. A pseudo-absence, by contrast, is a point that acts as if we haven't found an animal there, but we don't actually have data for that location! The model, however, doesn't know whether a point represents a true absence or a pseudo-absence. It only knows the information it is given, and will interpret that information using the parameters it is provided (in this way, models are a reflection of the real world, but never a substitute). Collecting true absence data is difficult, typically requiring expert knowledge, surveys with stricter methodologies, and repeated measures of the same areas over time. Pseudo-absences are much easier to collect (you simply simulate them on a computer), but they are less informative. Keep this trade-off in mind as you interpret your model's results.
5. Keep in mind that the strength of variables depends on the scale of your prediction. If you wish to make predictions at a broad scale, variables like temperature and rainfall will likely be strong predictors, whereas at a fine scale, variables like food scarcity and competition might be stronger predictors for your outcome.
6. A model can only use the information it is provided to make inferences about the world. If multiple variables in a model correlate, the model can place too much weight on those values when determining the outcome! The model isn't aware of the many other environmental variables that affect the real-world outcome.
7. Measured as the standard deviation of the mean monthly temperature.
8. For example, log transformations, centring, scaling, or setting dummy variables.
9. To learn more about how putting together a stack works, check out this helpful article on the stacks website.