Author

Date

6 February 2023

Choropleth maps are an excellent way to visualise differences in variables (eg. number of species observed) across several geographic regions (eg. countries, states, study areas). Often, creating a choropleth map from species observations requires two things:

A shapefile - a file with vector data of a specific geographic location, with detailed info of its geographic features
Species observations recorded as point data - the specific longitude and latitude coordinates of where an individual was observed

However, to create a choropleth map of species observations requires us to summarise our points to a single statistic for each polygon of our shapefile. This conversion from points to polygons can sometimes be tricky!

Here, we show you how to extract and count the number of points inside each polygon of a shapefile to create a choropleth map of the number of species observations per km² in each suburb of the Australian Capital Territory (ACT).

Download data

First we will load the R packages that we need:

library(galah)
library(here) 
library(rmapshaper) 
library(tidyverse) 
library(sf)
library(ggtext)

Download shapefile

Next we will need a shapefile. You can find many shapefiles online from reputable sources. For this example, I’ve downloaded a shapefile of suburb boundaries in the city of Canberra, ACT from the ACT’s open-access map database.

Usually when you download a shapefile, it is compressed within a zip folder. Save this downloaded zip folder in a local folder inside your current R project. If you need to unzip your folder, you can do so with the following code:

zip_folder <- here("folder-name", "shapefile-folder-name.zip")
output_dir <- "folder-name-to-save-unzipped-files" 
unzip(zip_folder, exdir = output_dir)

Now we load this unzipped shapefile into R. To save space, we’ll remove some complexity from our shapefile polygons with ms_simplify() from the {rmapshaper} package.

The actsuburbs shapefile contains both suburb boundaries and “district” boundaries, which can encompass several suburbs. To avoid confusion, we will remove districts using filter(LOC_CLASS != "District"). We’ll also use st_make_valid() to make sure any weird invalid geometries in our shapefile are made valid, and therefore plot correctly.

actsuburbs <- st_read(here("folder-name",
                           "folder-name-2",
                           "shapefilename.shp")) |>
                     ms_simplify(keep = 0.1) |> 
  st_transform(crs = st_crs("WGS84")) |> 
  st_make_valid() |> 
  filter(LOC_CLASS != "District")

Now to see if our shapefile plots correctly, we can use geom_sf() (and it looks like it does!)

ggplot() +
  geom_sf(data = actsuburbs) +
  theme(axis.text.x = element_text(angle = -90, hjust = 0))

Download species observations

Next let’s use the {galah} package to download bird occurrence records from the Atlas of Living Australia (ALA).

We can download all Aves (bird) data provided by BirdLife Australia within the ACT by using galah_filter() to narrow our download. We’ll also add ALA’s data profile, or what the ALA calls a set of data quality filters to remove suspicious records, using galah_apply_profile(ALA).

You will need to provide a registered email with the ALA using galah_config() before retrieving records.

galah_config(email = "your-email@email.com")

birdocc <- galah_call() |> 
  galah_identify("Aves") |> 
  galah_apply_profile(ALA) |>
  galah_filter(stateProvince == "Australian Capital Territory",
               dataProviderName == "BirdLife Australia") |>  
atlas_occurrences()

Retrying in 1 seconds.
Retrying in 2 seconds.
Retrying in 4 seconds.
Retrying in 8 seconds.

birdocc |> head(8L)

# A tibble: 8 × 8
  recordID        scientificName taxonConceptID decimalLatitude decimalLongitude
  <chr>           <chr>          <chr>                    <dbl>            <dbl>
1 0001983b-e50e-… Rhipidura (Rh… https://biodi…           -35.3             149.
2 0001cc23-26e4-… Tachybaptus n… https://biodi…           -35.2             149.
3 0001d4d1-761e-… Malurus (Malu… https://biodi…           -35.2             149.
4 0001e1d0-ce75-… Philemon (Tro… https://biodi…           -35.2             149.
5 0001e509-54d7-… Turdus merula  https://biodi…           -35.2             149.
6 00026573-8ea6-… Strepera (Str… https://biodi…           -35.2             149.
7 0002d2ef-54b8-… Platycercus (… https://biodi…           -35.2             149.
8 000301c9-b3b1-… Cacatua (Caca… https://biodi…           -35.2             149.
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>

Count points in each polygon

To prepare our data, we’ll convert each observation into a format suitable for spatial mapping. st_as_sf() transforms each point into an sf spatial object (which plots nicely with {ggplot2}). We’ll also make sure the points are projected to crs = set_crs("WGS84"), the same as our shapefile, so that the points line up correctly.

bird_points_sf <- birdocc |> 
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), 
  crs = st_crs("WGS84"))

Now we’ll find and count how many points are in each of our suburbs.

The st_intersects() function checks whether each point is within, or “intersects”, a specified POLYGON and then marks it as TRUE or FALSE in a matrix. Using st_intersects() in a loop with pmap() allows us to run st_intersects() on each row of a supplied list.

In our case, because each row of actsuburbs$geometry corresponds to each suburb, pmap_dbl() recursively checks which points are within each of our ACT suburbs! Adding lengths() around st_intersects() will count the number of rows returned for each suburb list, returning the total number of points that intersect each suburb. ¹. We’ve saved this count in a new column bird_count.

Warning

This function takes ~3.5 minutes to run

act_counts <- actsuburbs |> 
  mutate(bird_count = pmap_dbl(.l = list(x = actsuburbs$geometry),
                           .f = function(x) {
                             lengths(st_intersects(x, bird_points_sf))
                             }))

act_counts |> 
  select(LOC_NAME, bird_count) |> 
  head(8L) # see sample of counts

Simple feature collection with 8 features and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 149.0563 ymin: -35.48032 xmax: 149.2188 ymax: -35.15856
Geodetic CRS:  WGS 84
   LOC_NAME bird_count                       geometry
1     Acton        693 POLYGON ((149.1269 -35.2929...
2   Ainslie        121 POLYGON ((149.1555 -35.2504...
3    Amaroo         50 POLYGON ((149.1152 -35.1677...
4    Aranda         36 POLYGON ((149.0895 -35.2568...
5     Banks          0 POLYGON ((149.0908 -35.4751...
6    Barton        667 POLYGON ((149.1291 -35.3063...
7     Beard          0 POLYGON ((149.2188 -35.3413...
8 Belconnen       1518 POLYGON ((149.0804 -35.2309...

Showing the total number of bird observations on a choropleth map can be misleading because areas that are larger might have more records simply because they are large areas! It’s a good idea to standardise your data to avoid this bias. In this case, we will show the number of observations per square kilometer.

To do this, we will use sf::st_area() to help us get the area per m² of our suburbs & convert it to km² by dividing by 1000, saving this in a new column area_km2. Then we’ll divide our bird_count by area_km2.

act_counts <- act_counts |>
  rowwise() |> 
  mutate(area_km2 = as.integer(st_area(geometry))/1000,
         counts_km2 = bird_count/area_km2) |>
  replace_na(list(counts_km2 = 0))

act_counts |> rmarkdown::paged_table() # final data frame

It’s a good idea to check the distribution of our data before we plot so we know what we should expect it to look like. If we check our bird counts, we can notice that our count data is skewed because many regions have lower numbers of observations, and only a few regions have very high numbers of observations.

hist(act_counts$bird_count, main = "bird_count distribution")

Log transformation will reduce the skew in our data, ultimately making our final choropleth map easier to interpret. We will handle this when we make our ggplot in the final step!

Make a map

Now we can use {ggplot2} to make our choropleth map!

One easy way we can log-transform our counts_km2 column is to use the trans argument within scale_fill_stepsn(), allowing us to transform our data when we are setting the color palette!

# make galah palette
galah <- colorRampPalette(c("#FFD2CF", 
                            "#EC9FAC", 
                            "#D96D89", 
                            "#B55474", 
                            "#80556D"))(5)

ggplot() +
  geom_sf(
    data = act_counts,
    mapping = aes(fill = counts_km2), 
    colour = "grey90" 
  ) +
  scale_fill_stepsn(
    name = "Number of <br>observations* <br>per km^2", 
    breaks = c(0.001, .01, .1, 1),
    labels = c("0.001", "0.01", "0.01", "1"),
    trans = "log10", # log-transform
    colours = galah,
    na.value = "#FFD2CF",
    guide = guide_colorsteps( 
      direction = "vertical", 
      label.position = "right",
      title.position = "top",
      draw.ulim = FALSE, 
      draw.llim = FALSE,
    )
  ) +
  labs( 
    title = "Bird Records by Suburb",
    subtitle = "Canberra (ACT)",
    caption = "*ALA BirdLife Australia data"
  ) + 
  theme_void() + 
  theme(legend.title = element_markdown())

Final thoughts

There’s not much better than a galah-themed choropleth map!

One way you can extend this map is to add floating labels with geom_text_repel() or geom_label_repel() from the {ggrepel} package, which we added for our final version of our map we posted for National Bird Week 2022. Code to add floating labels can be found here.

You can also learn how to create a choropleth map with two different colour scales in this ALA Labs post!

Expand for session info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Australia.utf8
 ctype    English_Australia.utf8
 tz       Australia/Sydney
 date     2024-02-12
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date (UTC) lib source
 dplyr       * 1.1.4      2023-11-17 [1] CRAN (R 4.3.2)
 feathers    * 0.0.0.9000 2022-10-11 [1] Github (shandiya/feathers@4be766d)
 forcats     * 1.0.0      2023-01-29 [1] CRAN (R 4.3.2)
 galah       * 2.0.1      2024-02-06 [1] CRAN (R 4.3.2)
 ggplot2     * 3.4.4      2023-10-12 [1] CRAN (R 4.3.1)
 ggtext      * 0.1.2      2022-09-16 [1] CRAN (R 4.3.2)
 here        * 1.0.1      2020-12-13 [1] CRAN (R 4.3.2)
 htmltools   * 0.5.7      2023-11-03 [1] CRAN (R 4.3.2)
 lubridate   * 1.9.3      2023-09-27 [1] CRAN (R 4.3.2)
 ozmaps      * 0.4.5      2021-08-03 [1] CRAN (R 4.3.2)
 purrr       * 1.0.2      2023-08-10 [1] CRAN (R 4.3.2)
 readr       * 2.1.4      2023-02-10 [1] CRAN (R 4.3.2)
 rmapshaper  * 0.5.0      2023-04-11 [1] CRAN (R 4.3.2)
 sessioninfo * 1.2.2      2021-12-06 [1] CRAN (R 4.3.2)
 sf          * 1.0-14     2023-07-11 [1] CRAN (R 4.3.2)
 stringr     * 1.5.1      2023-11-14 [1] CRAN (R 4.3.2)
 tibble      * 3.2.1      2023-03-20 [1] CRAN (R 4.3.2)
 tidyr       * 1.3.0      2023-01-24 [1] CRAN (R 4.3.2)
 tidyverse   * 2.0.0      2023-02-22 [1] CRAN (R 4.3.2)

 [1] C:/Users/KEL329/R-packages
 [2] C:/Users/KEL329/AppData/Local/Programs/R/R-4.3.2/library

──────────────────────────────────────────────────────────────────────────────

Footnotes

Many thanks to Shandiya Balasubramaniam for suggesting this method, and for many other very helpful edits!↩︎