Justin's Code Blog: CTDS For Groups

Justin G Cally

Setup

Load relevant packages into environment and include some custom functions.

Show code

knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(cmdstanr)
check_cmdstan_toolchain(fix = TRUE, quiet = TRUE)
register_knitr_engine(override = FALSE)
library(bayesplot)
library(data.table)
library(kableExtra)

get_closest<- function(n, max_distance) {
  rho <- sqrt(runif(n)) * max_distance
  theta <- runif(n, 0, 2*base::pi)
  x <- rho * cos(theta)
  y <- rho * sin(theta)
  d<- sqrt(x^2 + y^2)
  return(min(d))
}

get_closest_x_y <- function(n, max_distance) {
  rho <- sqrt(runif(n)) * max_distance
  theta <- runif(n, 0, 2*base::pi)
  x <- rho * cos(theta)
  y <- rho * sin(theta)
  d <- sqrt(x^2 + y^2)
  mind <- which(d == min(d))
  return(data.frame(x = x[[mind]], y = y[[mind]]))
}

midpoints <- seq(from = 0.5, to = 12.5)
delta <- 1
max_distance <- 13
breaks <- 0:13

Background

Camera trap distance sampling (CTDS) can be used to measure abundance of species. CTDS employs conventional distance sampling (DS) methods at a point and offsets the surveyed area by the time period a camera is deployed for. Key to CTDS are DS assumptions that detection rates decrease further from the observer (camera) and that the availability of an individual in the surveyed area (camera to truncation distance) follows a uniform distribution per metre-squared.

However, when CTDS is used to measure abundance of a species where group size is not fixed at 1 then this assumption is likely violated in practice. For larger groups, CTDS should account for the availability of the closest individual rather than the availability of all individuals. When group size increases it is more likely the the closest individual from the group is closer to the camera. If we assume that the triggering of the camera is dependent upon the closest individual than we must adjust our estimated availability to account for variable group sizes. If we do not adjust for group size and only use the distance to the closest individual for our DS models then we will likely underestimate the detection rate as distances will be smaller than reality. Alternatively, if we record distances to multiple individuals in the same photo and take an average or model them independently we would likely overestimate detection probability because individuals at further distances are recorded because a closer individual has triggered the camera trap.

Here we simulate camera trap distance sampling data with variable group sizes to compare the effect of taking into account an adjusted availability, which estimates the availability of the closest individual given a group of a size >= 1.

Simulations

Visualisation

For illustration purposes we can visualize the probability of the closest individual in various group sizes falling within a given distance bin. For group sizes of 1, probability is uniform for a given area. However as group size increases the probability that the closest individual is closer to the camera (centre) increases.

Show code

#### Test with a simulation ####
# emp_mult <- matrix(data = NA, nrow = length(midpoints), ncol = 5)
#  
# for(i in 1:5) {
#   cat("ind ",i,"\n")
#   d<- replicate(1e6, get_closest(n=i, max_distance))
#   bins<- cut(d, breaks = breaks)
#   emp_mult[,i]<- table(bins)/sum(table(bins))
# }
# 
# list(Simultion = round(emp_mult, 3),
# Theoretical = round(pi_mult, 3))

xy <- list()

for(i in 1:5) {
  cat("ind ",i,"\n")
  xy[[i]] <- replicate(1e4, get_closest_x_y(n=i, max_distance)) %>% 
    str2str::a2lm() %>%
    lapply(., function(x) as.data.frame(matrix(unlist(x, recursive = TRUE), ncol = 2, byrow = T))) %>%
    bind_rows() %>%
    `colnames<-`(c("x", "y")) %>%
    mutate(`Group size` = i)
}

ind  1 
ind  2 
ind  3 
ind  4 
ind  5

Show code

xy_joined <- bind_rows(xy)


# Area + contour
ggplot(xy_joined, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", colour="white") +
  facet_wrap(~ `Group size`, nrow = 3) +
  scale_fill_viridis_c(option = "A") +
  labs(x = "distance (x)", y = "distance (y)", fill = "Probability \nClosest individual \n at Location") +
  theme_minimal()

Figure 1: Probability of availability under various group size scenarios

Functions for availability calculations at various points, with various bins and individuals

For convenience, we also create functions for the calculation of availability of the closest individual. This is valuable for more general applications to different datasets,

Show code

#' Availability calculation for a point transect
#'
#' @param delta bin width
#' @param midpoint bin midpoint
#' @param max_distance maximum distance of survey (truncation distance)
#'
#' @return numeric
availability_fn <- function(delta, midpoint, max_distance) {
  (2 * delta * midpoint)/max_distance^2
}

#' Probability closest animal is in a given bin, given a provided bin and number of individuals in a photo
#'
#' @param bin_start start point of bin
#' @param bin_end end point of bin
#' @param max_distance maximum distance of survey (truncation distance)
#' @param n number of individuals in photo
#'
#' @return numeric
pr_closest <- function(bin_start, bin_end, max_distance, n) {
    if(bin_start == 0) {
      prob_closer <- 0
    } else {
    closer_midpoint <- (bin_start)/2
    prob_closer <- 1 - (1 - availability_fn(bin_start, closer_midpoint, max_distance = max_distance))^n
    }
    if(bin_end == max_distance) {
      prob_further <- 0
    } else {
      further_midpoint <- (bin_end+max_distance)/2
      further_delta <- max_distance-bin_end
      prob_further <- availability_fn(further_delta, further_midpoint, max_distance = max_distance)
    }
    
    # Combined probability 
# probability that the closest indiv is in this bin is 1 minus the probability that the closest was in a closer bin, minus the probability that all individuals are in further bins
      pi_m <- 1 - (prob_closer + prob_further^n)
    
      return(pi_m)
}

Data Generation

Below we simulate our data for the model. Here we simulate point-transect data for camera traps across 10 sites. Each site has 250 entries by a species or groups of a species, with group sizes in these cases varying based on a gamma distribution with variable alpha parameters across different simulated datasets.

We simulate detection probability using a half-normal detection function with a sigma value of 5.

A matrix of the availability for detections is generated using the ‘probability of closest individual’ described earlier.

Show code

# Simulate number of sites and number of entries for each site
sites <- 10
entries_per_site <- rep(250, sites)
total_entries <- sum(entries_per_site)
density_average <- total_entries/base::pi*max_distance^2
group_lambdas <- c(1,2,3,4,5)
counts <- lapply(group_lambdas, function(x) {ceiling(rgamma(n = total_entries, shape = x, scale = 0.5))})
unique_groups <- list()
length_unique_groups <- list()
group_sizes <- sapply(counts, max)
length_group_sizes <- length(group_sizes)

camera_data_base <- data.frame(Site = rep(1:sites, times = entries_per_site)) 

detected_prob <- function(distance, key = "hn", scale = 5) {
  if(key == "hn") {
    sigma <- scale
    pr <- exp(-(distance)^2/(2*sigma^2))
  }
  return(pr)
}

# Create data for 3 scenarios with different group sizes
camera_data <- vector("list", length_group_sizes)
y <- vector("list", length_group_sizes)
y_a <- vector("list", length_group_sizes)
# y_dets <- vector("list", 5)
# pa <- vector("list", 5)
n_obs <- vector("list", length_group_sizes)
log_gs_cont <- vector("list", length_group_sizes)

for (i in 1:length(group_sizes)) {
  camera_data[[i]] <- camera_data_base
  camera_data[[i]]$count <- counts[[i]]
  camera_data[[i]]$distance <- sapply(camera_data[[i]]$count, function(x) get_closest(n=x, max_distance))
  camera_data[[i]]$detected_prob <- detected_prob(camera_data[[i]]$distance, scale = 5)
  camera_data[[i]]$detected <- rbinom(n = total_entries, size = 1, prob = camera_data[[i]]$detected_prob)
  
  camera_data[[i]] <- camera_data[[i]] %>%
    mutate(detected_group_size = detected*count, 
           midpoint = case_when(detected == 1 ~ as.numeric(cut(distance, breaks = 0:max_distance, include.lowest = TRUE, right = FALSE, labels = midpoints)), 
                                detected == 0 ~ NA_real_))
  
  # replace unique groups base on detections
  unique_groups[[i]] <- camera_data[[i]] %>%
    filter(detected == 1) %>%
    pull(detected_group_size) %>%
    unique() %>% 
    sort()
  
  length_unique_groups[[i]] <- length(unique_groups[[i]])
  
   y[[i]] <- list()
   
  
  for(j in 1:length(unique_groups[[i]])) {
  
  y[[i]][[j]] <- camera_data[[i]] %>% 
  mutate(midpoint = factor(midpoint, levels = 1:length(midpoints)), 
         Site = factor(Site, levels = 1:sites)) %>%
  group_by(Site, midpoint, .drop = FALSE) %>% 
  filter(detected_group_size == unique_groups[[i]][j]) %>%
  summarise(n = length(detected_group_size)) %>%
  as.data.table() %>%
  dcast(Site ~ midpoint, value.var = "n") %>%
  dplyr::select(-Site, - `NA`)
  
  y[[i]][[j]][is.na(y[[i]][[j]])] <- 0
  
  }
   
   y_a[[i]] <- str2str::ld2a(y[[i]])
  
  n_obs[[i]] <- camera_data[[i]] %>% 
    filter(detected == 1) %>%
    mutate(Site = factor(Site, levels = 1:sites),
           dgs = factor(detected_group_size, levels = unique_groups[[i]])) %>%
    group_by(Site, dgs, .drop = TRUE) %>%
    summarise(n = sum(detected, na.rm = T)) %>%
    as.data.table() %>%
    dcast(Site ~ dgs, value.var = "n") %>%
    dplyr::select(-Site) %>%
    as.matrix() 
  
  n_obs[[i]][is.na(n_obs[[i]])] <- 0
  
  log_gs_cont[[i]] <- log((n_obs[[i]] %>% colSums())/(sum(n_obs[[i]])/(n_obs[[i]] %>% ncol())))
  
}

  pa <- matrix(data = NA, nrow = max(group_sizes), ncol = length(midpoints))
  pa_static <- matrix(data = NA, nrow = max(group_sizes), ncol = length(midpoints))

  for(j in 1:max(group_sizes)) {

  pa[j, ] <- sapply(midpoints, function(x) {pr_closest(x-(0.5*delta), 
                                                           x+(0.5*delta), 
                                                           max_distance = max_distance, 
                                                           n = j)})
  
  pa_static[j, ] <- sapply(midpoints, function(x) {pr_closest(x-(0.5*delta), 
                                                           x+(0.5*delta), 
                                                           max_distance = max_distance, 
                                                           n = 1)})
    
  
  }

Model

Using a model written in the STAN programming language we test the effect of accounting for variable group size with modified availability functions. The model can be described as:

Show code

data {
  real delta;                          // bin width
  int<lower=1> n_site;                 // site
  int<lower=1> n_distance_bins;        // distance bins
  int<lower=1> n_gs;                   // number of group sizes
  vector[n_gs] gs;      // group sizes
  vector[n_distance_bins] midpts;      // midpt of each interval
  real<lower=1> max_distance; 
  int<lower=1> max_int_dist;          // max distance as an integer
  array[n_site, n_distance_bins, n_gs] int y; // observations matrix 
  array[n_gs, n_distance_bins] real pi ; // averaged availability for multiple individuals
  array[n_site, n_gs] int n_obs;           //number of observations
}

parameters {

  // detection parameters 
  real log_sigma;  
  
 // abundance parameters 
  real beta_intercept; 
  vector[n_site] eps_site;
  simplex[n_gs] eps_ngs; // random variation in group size 
}

transformed parameters {
  // distance parameters
  
  real sigma = exp(log_sigma);
  array[n_distance_bins] real<lower=0,upper=1> p_raw; // detection probability
  array[n_distance_bins, n_gs] real<lower=0,upper=1> p_raw_scale; // detection probability for point independence model and accounting for availability
  array[n_distance_bins, n_gs] real<upper=0> log_p_raw; // log detection probability accounting for availability
  array[n_gs] real log_p; // mean site detection probability for a group size
  array[n_gs] real<lower=0,upper=1> p; // log probability with summed exponential for the multinomial logit model
  // real log_theta = log(theta)
  array[n_site, n_gs] real lambda;

  for (i in 1:n_distance_bins) {
      // assuming a half-normal detection fn from line 
       p_raw[i] = exp(-(midpts[i])^2/(2*sigma^2)); // half-normal function (pg 419 of AHM)
      // hazard function
      // p_raw[i] = 1 - exp(-(midpts[i]/theta)^(-sigma)); // hazard function (pg 507 of AHM)
        for(j in 1:n_gs) {
      p_raw_scale[i,j] = p_raw[i]*pi[j,i]; //  pr(animal occurs and is detected in bin i)
      log_p_raw[i,j] = log(p_raw_scale[i,j]);
        }
  }

  for(j in 1:n_gs) {
    log_p[j] = log_sum_exp(log_p_raw[,j]);
    p[j] = exp(log_p[j]);
  }
  
      // model site abundance   
  for(n in 1:n_site) {
    for(j in 1:n_gs) {
    lambda[n, j] = exp(beta_intercept + log_p[j] + log(eps_ngs[j]) + eps_site[n]);
    }
  }

}

model {
  
  log_sigma ~ normal(0, 5); // prior for sigma
  eps_site ~ student_t(4, 0, 1); // prior for site effect 
  eps_ngs ~ uniform(0, 1); // prior for group size effect
  beta_intercept ~ student_t(4, 0, 1); // prior for abundance
  
for(n in 1:n_site) {
  for(j in 1:n_gs) {
  n_obs[n,j] ~ poisson(lambda[n,j]);
    }
  }
  
  for(n in 1:n_site) {
    for(j in 1:n_gs) {
      y[n,,j] ~ multinomial_logit(to_vector(log_p_raw[,j]));
    }
  }
}

generated quantities {
  array[max_int_dist+1] real DetCurve;
  array[n_site, n_gs] real log_lik;
  array[n_site, n_gs] real n_obs_true;
  array[n_site] real N_site;
  real N;
  real D;
  
  for(i in 1:n_site) {
    for(j in 1:n_gs) {
  log_lik[i,j] = multinomial_logit_lpmf(y[i,,j] | to_vector(log_p_raw[,j])); //for loo
  n_obs_true[i, j] = gs[j] * (poisson_rng(exp(beta_intercept + log(1-p[j]) + log(eps_ngs[j]) + eps_site[i])) + n_obs[i,j]);
    }
    N_site[i] = sum(n_obs_true[i,]);
  }
  N = sum(N_site);
  D = N/(pi()*max_distance^2);
  // loop over distance bins
  for(j in 0:max_int_dist) { // get DS predictions for distance 0 to max bin distance
    DetCurve[j+1] = exp(-(j+0.5)^2/(2*sigma^2)); // half normal
    // DetCurve[j+1] =  1 - exp(-(j+0.5/theta)^(-sigma)); //hazard rate
    }
  
}

Model data

We generate two model data sets. One with a supplied adjusted availability matrix for the various group sizes and the other with an unadjusted availaility matrix (i.e., modelled as group size equalling 1); which assumes availability is uniform (adjusted for point transect however).

Show code

modelpartial <- list()
modelcrook <- list()

for(i in 1:length(group_sizes)) {
  modelpartial[[i]] <- list(delta = delta,
                    n_site = sites,
                    n_distance_bins = length(midpoints),
                    n_gs = length_unique_groups[[i]],
                    gs = unique_groups[[i]],
                    midpts = midpoints,
                    max_distance = max_distance,
                    max_int_dist = as.integer(round(max_distance)),
                    y = y_a[[i]],
                    pi = pa[unique_groups[[i]],,drop = F],
                    n_obs = n_obs[[i]])
}

for(i in 1:length(group_sizes)) {
  
  modelcrook[[i]] <- list(delta = delta,
                    n_site = sites,
                    n_distance_bins = length(midpoints),
                    n_gs = length_unique_groups[[i]],
                    gs = unique_groups[[i]],
                    midpts = midpoints,
                    max_distance = max_distance,
                    max_int_dist = as.integer(round(max_distance)),
                    y = y_a[[i]],
                    pi = pa_static[unique_groups[[i]],,drop = F],
                    n_obs = n_obs[[i]])
}

Model Runs

Using cmdstanr we run the models for 1,500 iterations with a warmup of 500 iterations across 3 parallel chains.

Show code

ni <- 1500 
nw <- 500
nc <- 3

inits = lapply(1:nc, function(i) list(log_sigma=runif(1), 
                                      beta_intercept=runif(1), 
                                      eps_ngs = runif(1)))

fit_partial <- list()

for(i in 1:length(group_sizes)) {
fit_partial[[i]] <- model$sample(data = modelpartial[[i]], 
                    chains = nc,
                  parallel_chains = nc,
                  show_messages = FALSE, 
                  save_warmup = FALSE,
                  iter_sampling = ni, 
                  iter_warmup = nw, refresh = 0)
}

fit_crook <- list()

for(i in 1:length(group_sizes)) {
fit_crook[[i]] <- model$sample(data = modelcrook[[i]], 
                    chains = nc,
                  parallel_chains = nc, 
                  show_messages = FALSE, 
                  save_warmup = FALSE,
                  iter_sampling = ni, 
                  iter_warmup = nw, refresh = 0)
}

data_combined <- c(modelpartial, modelcrook)
fits_combined <- c(fit_partial, fit_crook)

Summary of sigmas

From the models we obtain estimates of sigma (half-normal detection function shape parameter), and the total abundance of individuals that entered the sampling area of the camera trap. We compare these values against the true simulated values.

Show code

sigma_values <- list()
model_names <- c("Adjusted availability, group size = gamma(1,0.5)", 
                 "Adjusted availability, group size = gamma(2,0.5)",
                 "Adjusted availability, group size = gamma(3,0.5)",
                 "Adjusted availability, group size = gamma(4,0.5)",
                 "Adjusted availability, group size = gamma(5,0.5)", 
                 "Unadjusted availability, group size = gamma(1,0.5)", 
                 "Unadjusted availability, group size = gamma(2,0.5)",
                 "Unadjusted availability, group size = gamma(3,0.5)",
                 "Unadjusted availability, group size = gamma(4,0.5)",
                 "Unadjusted availability, group size = gamma(5,0.5)")

cam_data_rep <- rep(camera_data, 2)

total_counts <- vector()

for(i in 1:length(fits_combined)) {
  
  max_modelled_distance <- data_combined[[i]][["max_distance"]]
  
  camera_data_corrected <- cam_data_rep[[i]] %>% 
    filter(distance < max_modelled_distance)
  
average_count <-  mean(camera_data_corrected$count)
average_det <- sum(camera_data_corrected$detected == 1) / nrow(camera_data_corrected)
total_counts[i] <- sum(camera_data_corrected$count)
  
  sigma_values[[i]] <- fits_combined[[i]]$summary(c("sigma" , "N")) %>%
    mutate(Model = model_names[i], 
           `Average count` = average_count,
           `True value` = c(5, total_counts[i]))
}

summary_bound <- bind_rows(sigma_values) 

summary_bound %>%
  select(Model, `Average count`, variable, `True value`, mean, sd, q5, q95) %>%
  kbl(format = "html", 
      digits = 2, 
      caption = "Comparison of estimated half-normal shape (sigma) and the total population size (N) to the simulated true value") %>% 
  kable_styling(bootstrap_options = "striped") %>%
  collapse_rows(1:3)

Table 1: Comparison of estimated half-normal shape (sigma) and the total population size (N) to the simulated true value
Model	Average count	variable	True value	mean	sd	q5	q95
Adjusted availability, group size = gamma(1,0.5)	1.16	sigma	5	4.98	0.12	4.79	5.19
Adjusted availability, group size = gamma(1,0.5)	1.16	N	2912	3051.89	150.01	2814.00	3306.00
Adjusted availability, group size = gamma(2,0.5)	1.53	sigma	5	5.00	0.12	4.80	5.21
Adjusted availability, group size = gamma(2,0.5)	1.53	N	3827	3826.52	171.26	3553.95	4112.00
Adjusted availability, group size = gamma(3,0.5)	1.99	sigma	5	4.91	0.13	4.71	5.14
Adjusted availability, group size = gamma(3,0.5)	1.99	N	4970	5187.71	212.06	4842.00	5527.05
Adjusted availability, group size = gamma(4,0.5)	2.49	sigma	5	5.05	0.14	4.83	5.29
Adjusted availability, group size = gamma(4,0.5)	2.49	N	6229	6290.95	230.62	5914.00	6685.00
Adjusted availability, group size = gamma(5,0.5)	3.01	sigma	5	4.92	0.14	4.70	5.16
Adjusted availability, group size = gamma(5,0.5)	3.01	N	7535	7670.72	273.11	7240.00	8127.00
Unadjusted availability, group size = gamma(1,0.5)	1.16	sigma	5	4.77	0.10	4.60	4.95
Unadjusted availability, group size = gamma(1,0.5)	1.16	N	2912	3779.90	199.48	3461.95	4119.00
Unadjusted availability, group size = gamma(2,0.5)	1.53	sigma	5	4.43	0.08	4.30	4.57
Unadjusted availability, group size = gamma(2,0.5)	1.53	N	3827	6531.21	311.24	6031.95	7054.00
Unadjusted availability, group size = gamma(3,0.5)	1.99	sigma	5	4.07	0.07	3.95	4.18
Unadjusted availability, group size = gamma(3,0.5)	1.99	N	4970	11634.74	520.04	10786.95	12497.10
Unadjusted availability, group size = gamma(4,0.5)	2.49	sigma	5	3.88	0.06	3.78	3.98
Unadjusted availability, group size = gamma(4,0.5)	2.49	N	6229	17710.37	756.76	16492.90	18976.05
Unadjusted availability, group size = gamma(5,0.5)	3.01	sigma	5	3.64	0.05	3.56	3.74
Unadjusted availability, group size = gamma(5,0.5)	3.01	N	7535	25384.43	1015.44	23783.95	27089.00

Below we plot the posterior estimates for the total abundance against the true simulated value (red line).

Show code

N_dist <- list()
for(i in 1:length(fits_combined)) {
  N_dist[[i]] <- fits_combined[[i]]$draws(c("N")) %>%
    bayesplot::mcmc_areas(prob = 0.9, prob_outer = 0.99) +
    geom_vline(xintercept = total_counts[i], linetype = "dashed", colour = "red") +
    scale_x_continuous(labels = function(x) format(x, scientific = TRUE)) + 
    ggtitle(model_names[i])
}

cowplot::plot_grid(plotlist = N_dist, ncol = 2, byrow = FALSE)

Conclusion

Camera Trap Distance Sampling of species, where group size is not always equa to 1 should account for adjusted availability of the closest individual. Our simulation shows that abundance estimates are likely to be biased (overestimated) if group size is not taken into account.

We also provide a model that allows for the joint estimation of parameters across various sites and group sizes.

CTDS For Groups

Setup

Background

Simulations

Visualisation

Functions for availability calculations at various points, with various bins and individuals

Data Generation

Model

Model data

Model Runs

Summary of sigmas

Conclusion

Corrections

Citation