This vignette is associated with the manuscript titled, ‘A plain language review and guidance for modeling animal habitat-selection’. We will be demonstrating some of the points from the manuscript on fitting models to estimate traditional habitat selection functions (HSF) at the third-order of selection (e.g. selection within the home range).
Note: some of the code-chunks are not automatically displayed. To show the code, select ‘show’.
#Load packages
library(ResourceSelection)
library(glmmTMB)
library(plotrix)
library(broom.mixed)
library(knitr)
library(amt)
library(remotes)
library(ggplot2)
library(raster)
# A github repository to install
# remotes::install_github("Pakillo/grateful")
library(grateful)
# Source power analysis function
source("./functions/sample.size.used.locs.r")
# Source bootstrapping function
source("./functions/mean_ci_boot.r")
# Load spatial covariate data (used for power analysis)
load("./data/Covs")
We will consider a habitat selection analysis of individuals within a broad geographic area (e.g., home range; third-order of selection). Within each individual’s home range the effects of predictors (i.e., \(\beta_1\) and \(\beta_2\)) on habitat selection are assumed to come from a distribution of effects that can be characterized by a mean and standard deviation (i.e., random effect).
# Number of Sampled individuals (e.g., tracked via GPS telemetry)
n.indiv = 20
# Define the true population-level coefficients
# and use these to simulate individual-level coefficients
#Intercept - Make them all the same, except the last individual
# This difference is used below.
beta0 = c(rep(1.5,(n.indiv-1)),0.2)
# Selection against at population-level
# Low variation among individuals
beta1.mu = -1
beta1.sd = 0.2
set.seed(543543)
beta1 = rnorm(n.indiv,
mean = beta1.mu,
sd = beta1.sd
)
# No selection at population-level
# Wide variation among individuals
beta2.mu = 0
beta2.sd = 1
set.seed(5435431)
beta2 = rnorm(n.indiv,
mean = beta2.mu,
sd = beta2.sd
)
# Selection for this feature at population-level
# Low variation among individuals
beta3.mu = 1
beta3.sd = 0.2
set.seed(1543543)
beta3 = rnorm(n.indiv,
mean = beta3.mu,
sd = beta3.sd
)
# Combine coefficients and plot these values
betas = data.frame(b0 = beta0,
b1 = beta1,
b2 = beta2,
b3 = beta3
)
par(mfrow=c(3,1))
hist(betas[,2],xlab=bquote(beta[1]),xlim=c(-3,3),main="True Individual Coefficient Values",breaks=10,freq=FALSE)
abline(v=beta1.mu,lwd=2,col=2)
curve(dnorm(x,beta1.mu,beta1.sd),lwd=3,col=3,add=TRUE)
legend("topright",lwd=3,col=c("gray","red","green"),legend=c("Indiv. Coefs", "Pop. Mean","True Distribution"))
hist(betas[,3],xlab=bquote(beta[2]),xlim=c(-3,3),main="",breaks=20,freq=FALSE)
abline(v=beta2.mu,lwd=2,col=2)
curve(dnorm(x,beta2.mu,beta2.sd),lwd=3,col=3,add=TRUE)
legend("topright",lwd=3,col=c("gray","red","green"),legend=c("Indiv. Coefs", "Pop. Mean","True Distribution"))
hist(betas[,4],xlab=bquote(beta[3]),xlim=c(-3,3),main="",breaks=20,freq=FALSE)
abline(v=beta3.mu,lwd=2,col=2)
curve(dnorm(x,beta3.mu,beta3.sd),lwd=3,col=3,add=TRUE)
legend("topright",lwd=3,col=c("gray","red","green"),legend=c("Indiv. Coefs", "Pop. Mean","True Distribution"))
We are now ready to simulate individual-level data. We will do this
with the simulateUsedAvail
function from the R package
ResourceSelection
. Because of this, we are not simulating
spatial (x-y) data from explicit spatial layers, but we will connect how
these data represent typical GPS data and setup when working with
spatial variables.
# How many used samples per individual
n.used = 500
# Ratio of used:available
m = 10
#Create available samples
n=n.used*m
# Create spatial covariate variables
# dist.dev = Euclidean distance to nearest development
# forest = percent forest cover at a relevant spatial scale
# shrub = shrub cover at a relevant spatial scale
# These are continuous variables that are scaled to a mean of 0 and stdev of 1
set.seed(51234)
x = data.frame(dist.dev=rnorm(n),forest=runif(n),shrub=runif(n))
# Create one dataset per each individual with equal used samples per individual
# Note that in the simulation function (simulateUsedAvail) we are specifyng
# a link of 'log' indicating we will be fitting a model to estimate
# RELATIVE habitat selection rather than absolute selection.
sims = apply(betas,1,FUN=function(b){
ResourceSelection::simulateUsedAvail(data=x,
parms=b,
n.used=n.used,
m=m,
link="log"
)
})
For more information on why a log link indicates relative selection, see Fieberg et al. (2021):
“Instead of focusing on \(p_i\), as is typical in applications to presence–absence data, logistic regression applied to use-availability data should simply be viewed as a convenient tool for estimating coefficients in a habitat-selection function, \(w(X(s); \beta) = \text{exp}(X_{1}(s)\beta_1 + ... X_{k}(s)\beta_k)\) (Boyce & McDonald, 1999; Boyce et al., 2002), where we have X(s) to highlight that the predictors correspond to measurements at specific point locations in geographical space, s. As we will see in the next section, this expression is equivalent to the intensity function of an IPP model but with the intercept (the log of the baseline intensity) removed; the baseline intensity gives the expected density of points when all covariates are 0. Because habitat-selection functions do not include this baseline intensity, they are said to measure ‘relative probabilities of use’, or alternatively, said to be ‘proportional to the probability of use’ (Manly et al., 2002).”
Before moving forward lets understand the data for one individual.
# Check the dimensions of the data
# This should have the same length as n.indiv
length(sims)
## [1] 20
# Look at one individual's dataset
dim(sims[[1]])
## [1] 5500 4
head(sims[[1]])
## status dist.dev forest shrub
## 1969 1 0.14004930 0.162122994 0.2868476
## 3827 1 -0.43224008 0.006257207 0.8402856
## 3266 1 -3.18972176 0.905023379 0.2473926
## 2556 1 -1.73908829 0.714554879 0.6718038
## 2241 1 -0.73270288 0.738071773 0.5666879
## 498 1 0.09237958 0.246186710 0.8294609
table(sims[[1]]$status)
##
## 0 1
## 5000 500
We see from this individuals’ data frame that we have the columns
status, dist.dev, forest, shrub. The column status
is our
response variable. A 1
indicates an animal location (used
point) and a 0
indicates a potentially used location (i.e.,
available location) within the individual’s home range. Each 1 and 0
have corresponding spatial variables, which we have called
dist.dev
(Euclidean distance to nearest development),
forest
(percent of landcover that is forested), and
shrub
(percent of landcover that is covered in shrub). Each
of these is a continuous spatial variable and has been standardized to
have a mean of 0 and standard deviation of 1 (also called Normalizing).
This is a common setup in statistical models to put all the variables on
an equivalent scale, i.e., the scale of one standard deviation. Note
that we have a large available sample, much more than the used sample.
This is correct. Remember that the available sample (the 0’s) is what is
going to allow us to use logistic regression to approximate the IPP
model.
We now want to package our simulated data into a single data frame.
We will use the objects sims
and sims2
below
when we want to fit models to each individual separately and together,
respectively.
# Combine data into a single data.frame
sims2 = do.call("rbind", sims)
head(sims2)
## status dist.dev forest shrub
## 1969 1 0.14004930 0.162122994 0.2868476
## 3827 1 -0.43224008 0.006257207 0.8402856
## 3266 1 -3.18972176 0.905023379 0.2473926
## 2556 1 -1.73908829 0.714554879 0.6718038
## 2241 1 -0.73270288 0.738071773 0.5666879
## 498 1 0.09237958 0.246186710 0.8294609
dim(sims2)
## [1] 110000 4
#Create ID vector for individual's data
ID = rep(1:n.indiv, each = nrow(sims[[1]]))
sims2$ID = ID
head(sims2)
## status dist.dev forest shrub ID
## 1969 1 0.14004930 0.162122994 0.2868476 1
## 3827 1 -0.43224008 0.006257207 0.8402856 1
## 3266 1 -3.18972176 0.905023379 0.2473926 1
## 2556 1 -1.73908829 0.714554879 0.6718038 1
## 2241 1 -0.73270288 0.738071773 0.5666879 1
## 498 1 0.09237958 0.246186710 0.8294609 1
dim(sims2)
## [1] 110000 5
# Create a vector to weight the 0's by 1000 and the 1's by 1
# This is discussed below (see manuscript section 10. The model being approximated)
sims2$weight = 1000^(1-sims2$status)
head(sims2)
## status dist.dev forest shrub ID weight
## 1969 1 0.14004930 0.162122994 0.2868476 1 1
## 3827 1 -0.43224008 0.006257207 0.8402856 1 1
## 3266 1 -3.18972176 0.905023379 0.2473926 1 1
## 2556 1 -1.73908829 0.714554879 0.6718038 1 1
## 2241 1 -0.73270288 0.738071773 0.5666879 1 1
## 498 1 0.09237958 0.246186710 0.8294609 1 1
We are now ready to fit models to estimate our habitat selection parameters and demonstrate points made in the manuscript. We have organized the sections below to generally match with the sections of the manuscript.
Definitions only.
Definitions only.
Definitions only.
We want to remind the reader of the nuance between a habitat selection function and the statistical model we will be fitting. The model we are fitting (as coded) and the true underlying model (Inhomogeneous Poisson Point process; IPP) is not the habitat selection function. The habitat selection function is a component of the IPP model. We will be using a logistic regression modeling (or Generalized Binomial Regression Model) to approximate the IPP model, in which our estimated parameters will be the parameters within the habitat selection function. To understand the model, see equation 1 of Gerber and Northrup, 2020.
It is also important to differentiate a habitat selection function from a habitat selection analysis. As mentioned above, a habit selection function is a component of the model we want to fit, while a habitat analysis refers to all the steps that are being considered; this includes, for example, how the data are setup, the model fit, the algorithms used, the inference or predictions from the model, and how the model is evaluated.
Definitions only.
Definitions only.
In our data setup and analysis we will be estimating habitat selection at the third-order of the scales defined in Johnson (1980). Most commonly, we can think of this as selection of habitat within a defined home-range. Our data generating process above was not specific about the spatial bounds of each individual’s home-range, but this is how you can think of the data. The animal telemetry locations were used to create the 1’s in our data and the corresponding spatial covariates (e.g., dist.dev) as well as the home range boundary. We then make a grid of points within the home range to extract the spatial covariate values and we assign these to the response of 0 and are the available sample. These two sets of values (1’s and covariate values and 0’s and covariate values) are appended together in a single data frame.
There are many home-range estimators in the literature. Choose one that meets the requirements of your data, for example, whether you have assumed independence between consecutive spatial locations (e.g., kernel density estimation) or whether spatial autocorrelation is due to a high-frequency of sampling (e.g., short fix interval between relocations relative to the animals speed; movement-based kernel estimator).
This is the hardest part of the whole habitat selection study. How do you decide on the what, where, and how many individuals and relocations to answer the question at hand. Talk to many people. Ask questions to many people.
The modeling mentioned here is about differences between study goals, such as inference and prediction. Modeling for inference versus prediction is not a straightforward distinction. There are lots of opinions from the statistical folks (which are great and everyone should read them, e.g., Shmueli, 2010 and Scholz and Burner, 2022) and the science philosophy folks.
We reference the manuscript Gerber and Northrup, 2020 in regards to when the study goal is prediction. Associated with this manuscript is code in the Supporting Information file (ecy2953-sup-0004-DataS1.Zip) that pertains to optimizing for predicting spatial distributions of habitat selection (i.e., making a map). This process can jeopardize inference, e.g., make the interpretation of estimated effects unreliable. In contrast, if inference is sought you should think hard about a single model that includes the most important variables for the species and for your question that you want to consider so that estimated effects and measures of uncertainty are reliable (Bolker 2023, Tredennick et al. 2021).
In this section, we discuss some mechanics of inference and prediction, as in interpreting coefficients and predicting relative habitat selection. We will walk through the basics here, but refer the reader to full in depth coverage in Fieberg et al. 2021 and Avgar et al. 2017. We also mention the goal of using habitat selection to predict animal abundance. Demonstrating this is beyond our work. We suggest starting with the reading of section 1.6, “When is species density a reliable reflection of habitat suitability?” of Matthiopoulos, Fieberg, and Aarts, 2023.
Let’s fit a model to one individual’s used and available data and discuss the coefficients.
# We will use data from individual #20
indiv.data20 = sims[[20]]
# Let's look at the data to remind us what it is
head(indiv.data20)
## status dist.dev forest shrub
## 4750 1 -1.7175434 0.8433439 0.5885524
## 2483 1 -1.4853814 0.2001264 0.7956416
## 4802 1 0.6492396 0.2110496 0.4721530
## 2363 1 -1.3261416 0.9622746 0.9805034
## 1674 1 -0.7316487 0.6683905 0.1342194
## 4756 1 -2.3376280 0.9587345 0.1744920
We are now ready to approximate our true model using the generalized
linear modeling machinery of logistic regression. We will do this using
two different packages and functions. First, we will use the
glmmTMB
function in package glmmTMB
.
# We have related our response variable of the used and available sample (status) to our covariates. This is an additive model, as opposed to having interactions.
model1 = glmmTMB::glmmTMB(status ~ dist.dev + forest + shrub,
family = binomial(),
data = indiv.data20
)
# Look at estimated coefficients
summary(model1)
## Family: binomial ( logit )
## Formula: status ~ dist.dev + forest + shrub
## Data: indiv.data20
##
## AIC BIC logLik deviance df.resid
## 2817.6 2844.1 -1404.8 2809.6 5496
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4354 0.1518 -22.634 < 2e-16 ***
## dist.dev -1.1067 0.0542 -20.418 < 2e-16 ***
## forest 0.1244 0.1709 0.728 0.467
## shrub 0.8778 0.1781 4.929 8.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s first consider the Intercept
. As mentioned in the
section ‘12. Population Inference’, this parameter is largely the ratio
of used to available sample. If we increase our available sample, this
estimate will get smaller. Its value has a lot to do with how we setup
the data to approximate the true underlying model. It has no biological
interpretation, so we can skip it.
Next, we have estimated the effect of dist.dev
as -1.11.
This estimate is negative, indicating that as the value of
dist.dev
increases from its mean (defined at 0) while the
values of forest
and shrub
(the other
variables in the model) are held at their mean that habitat selection
decreases. In other words, habitat use relative to what is available to
this individual (as we defined it!) decreases the further from
development. That’s a lot of qualifiers to understand this estimate.
These are important though. In terms of evidence of an effect, we can
say that at a defined probability of Type I error of 0.05, we have a statistically clear
effect and that that it is not likely zero. The evidence of this is the
very small p-value of 1.1558137^{-92}.
Next, we have estimated the effect of forest
as 0.12.
This estimate is positive, indicating that as the value of
forest
increases from its mean (defined at 0) while the
values of dist.dev
and shrub
(the other
variables in the model) are held at their mean that habitat selection
increases relative to what is available to this individual. At our
defined Type I error we do not have statistical clarity and there is a
reasonable chance that it is zero. (p-value = 0.4665834).
Next, we have estimated the effect of shrub
as 0.88.
This estimate is positive, indicating that as the value of
shrub
increases from its mean (defined at 0) while the
values of dist.dev
and forest
(the other
variables in the model) are held at their mean that habitat selection
increases relative to what is available to this individual. At our
defined Type I error we do have statistical clarity of an effect that is
not zero (p-value = 8.2627753^{-7}).
We can estimate the parameters with other functions that implement
the logistic regression model. For example, the glm
function in the stats
package as:
model1.glm = stats::glm(status ~ dist.dev + forest + shrub,
family = binomial(),
data = indiv.data20
)
summary(model1.glm)
##
## Call:
## stats::glm(formula = status ~ dist.dev + forest + shrub, family = binomial(),
## data = indiv.data20)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4354 0.1518 -22.634 < 2e-16 ***
## dist.dev -1.1067 0.0542 -20.418 < 2e-16 ***
## forest 0.1244 0.1709 0.728 0.467
## shrub 0.8778 0.1781 4.929 8.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3351.0 on 5499 degrees of freedom
## Residual deviance: 2809.6 on 5496 degrees of freedom
## AIC: 2817.6
##
## Number of Fisher Scoring iterations: 6
Notice that we estimate the exact same quantities.
The package amt
can do the same thing using the function
fit_rsf
which is a wrapper function for using the
stats glm
function.
amt::fit_rsf
## function (data, formula, ...)
## {
## m <- stats::glm(formula, data = data, family = stats::binomial(link = "logit"),
## ...)
## m <- list(model = m)
## class(m) <- c("fit_logit", "glm", class(m))
## m
## }
## <bytecode: 0x00000237297814d8>
## <environment: namespace:amt>
How do we quantitatively evaluate two locations in terms of habitat selection using our model results? We can do so using Equation 8 of Fieberg et al. 2021.
Perhaps we want to compare a location that is very near development (dist.dev = -2) at high forest and shrub cover (forest = 2, shrub = 2) with that of a location far from development (dist.dev = 2) and also at high forest but low shrub cover (forest =2, shrub = -2).
# Get estimates without the intercept
coef = fixef(model1)[[1]][-1]
RS = exp(-2*coef[1] + 2*coef[2] + 2*coef[3]) /
exp(2*coef[1] + 2*coef[2] + -2*coef[3])
RS
## dist.dev
## 2801.048
If given equivalent availability between these two types of sites ( 1) near development at high forest and shrub cover, 2) far from development at high forest and shrub cover ), this individual would relatively select the first location by a factor of 2801.05.
Why are we exponentiating our linear combinations of terms (additions of covariates times estimated coefficients)? This is because in this setup, we have assumed the habitat selection process follows an exponential form (see section 10. The model being approximated) and equation 1 of Gerber and Northrup, 2020.
Avgar et al. 2017 refers to the relative selection strength (RSS) as the exponentiation of each coefficient.
For example,
# Coefficient for dist.dev
exp(coef[2])
## forest
## 1.132473
Given two locations that differ by 1 standard deviation of dist.dev, but are otherwise equal, this individual would be 1.1324727 as likely to choose the location with higher dist.dev (or, equivalently, 0.8830235 times more likely to choose the location with the lower dist.dev.
Lets combine our covariates now to predict relative habitat selection over more scenarios. To do this, we need to remember we used generalized linear modeling functions to approximate our model. We need to take care when predicting. First, we need to remember to drop the intercept. Second, we need to remember our true link function for relative habitat selection is the log-link (inverse-link is the exponential function) and not the logit-link, which is the default when fitting logistic regression models.
# Force the intercept to be zero without any error
model1$fit$par[1] = 0
model1$sdr$par.fixed[1] = 0
model1$sdr$cov.fixed[1,] = 0
model1$sdr$cov.fixed[,1] = 0
# Create the dataframe for the combinations of variables for prediction
newdata=data.frame(dist.dev = seq(-2,2, by=0.1),
forest = 0,
shrub = 0
)
# Next, predict
preds = predict(model1,
type = "link", # NOTE: this is where we tell the 'predict' function to use the same link function as the habitat selection function we fit above
newdata = newdata,
se.fit = TRUE
)
# Use the Normal Approximation to get confidence intervals (other approachces are possible too!)
preds$LCL = preds$fit-1.96*preds$se.fit
preds$UCL = preds$fit+1.96*preds$se.fit
preds = data.frame(preds)
# Exponentiate predicted values
preds.exp = exp(preds[,-2])
# Plot
plot(newdata$dist.dev,preds.exp$fit,type="l",lwd=3,xlab="Distance to Development (Scaled)",
ylab="Relative Intensity of Selection")
lines(newdata$dist.dev,preds.exp$LCL,lwd=3,col=2,lty=4)
lines(newdata$dist.dev,preds.exp$UCL,lwd=3,col=2,lty=4)
abline(h=1, lwd=4, col=4)
text(1.5,1.4, "Selection")
text(1.5,0.6, "Avoidance")
We see that for this individual, given equal availability and
forest
and shrub
at their mean values, avoid
areas far from development (high values on x-axis) and select for areas
close to development (low values on x-axis).
We can use this same process to extract values across spatial layers to predict relative habitat selection that is mapped.
In fitting these data using a traditional HSF, we are making two important assumptions. First, we are assuming that the available locations for an individual are accessible and could be used if the individual chooses. Note that in our data setup, we have not assumed the set of available locations are the same for each individual. Each individual has a different set, based on their home-range (remember this is not explicitly how we sampled, but it is implied). Second, we are assuming that each used location is independent from each other. Lets consider the implications of violating these assumptions.
We can demonstrate this effect by comparing results from individual 20 (above) with that of the same model but additional available locations are added to the available sample. Essentially, we are going to consider the effect of assuming these new locations are available to this individual.
# Create a new (small) set of available locations
n.avail.new = 50
set.seed(215464)
avail.new = data.frame(status = rep(0,n.avail.new),
dist.dev = rnorm(n.avail.new,-0.5,3),
forest = rnorm(n.avail.new,0.5,3),
shrub = rnorm(n.avail.new,1.5,3)
)
# Append the new data to the original data
indiv.data20.appended = rbind(indiv.data20,avail.new)
# Fit the same model with appended data
model1.appended = glmmTMB(status ~ dist.dev + forest + shrub,
family = binomial(),
data = indiv.data20.appended
)
# Re-fit the original data
model1 = glmmTMB::glmmTMB(status ~ dist.dev + forest + shrub,
family = binomial(),
data = indiv.data20
)
# Compare coefficients using the appended data and the original data
coef.df = rbind(fixef(model1)[[1]][-1],
fixef(model1.appended)[[1]][-1]
)
colnames(coef.df) = c("dist.dev","forest","shrub")
rownames(coef.df) = c("Original Data","Data with Additional Locations")
knitr::kable(coef.df)
dist.dev | forest | shrub | |
---|---|---|---|
Original Data | -1.1066673 | 0.1244035 | 0.8777699 |
Data with Additional Locations | -0.9722748 | 0.0645601 | 0.2185216 |
What did we find? Simply, that our estimates are quite different when considering these additional available locations that were inaccessible to the individual. That is a challenge for us when setting up the data. Our estimates can change quite a bit. This is an important reminder that our estimated effects highly depend on our assumptions of what is available to each animal. We want to avoid including available locations that are in fact inaccessible to the animal because it will change our inference in unknown ways.
We can evaluate this assumption by including non-independent data. We can easily do this by replicating our used animal location data. It’s as if we were tracking an individual and took two locations every 4 hours, where the 2 location fixes were only separated by 5 minutes. While 4 hours may be enough for the individual to travel throughout its home range (if it chooses), 2 minutes is certainly not enough time. Thus, we have two data points really close in space and time for every 4 hour interval.
# Lets use indidual 7's data
indiv.data7 = sims[[7]]
# Find the used locations
index1 = which(indiv.data7$status==1)
# Replicate the used locations
indiv.data7.replicated = rbind(indiv.data7,
indiv.data7[index1,]
)
fit.replicated = glmmTMB(status ~ dist.dev + forest + shrub ,
family = binomial(),
data = indiv.data7.replicated,
)
# Fit the original data
model1 = glmmTMB::glmmTMB(status ~ dist.dev+forest+shrub,
family = binomial(),
data = indiv.data7
)
# Compare coefficients when using the replicated data and the original data
coef.df[1,] = fixef(model1)[[1]][-1]
coef.df[2,] = fixef(fit.replicated)[[1]][-1]
rownames(coef.df)[2] = "Non-Independent Data"
knitr::kable(coef.df)
dist.dev | forest | shrub | |
---|---|---|---|
Original Data | -0.6883261 | -0.2891435 | 1.135597 |
Non-Independent Data | -0.6905494 | -0.2946979 | 1.147856 |
What did we find? Well, the point estimates look similar. That’s good. Now let’s look at measures of uncertainty.
# Compare measures of uncertainty
summary(model1) # independent data
## Family: binomial ( logit )
## Formula: status ~ dist.dev + forest + shrub
## Data: indiv.data7
##
## AIC BIC logLik deviance df.resid
## 3110.5 3136.9 -1551.2 3102.5 5496
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.01865 0.14274 -21.147 < 2e-16 ***
## dist.dev -0.68833 0.04961 -13.875 < 2e-16 ***
## forest -0.28914 0.16725 -1.729 0.0838 .
## shrub 1.13560 0.17299 6.565 5.21e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit.replicated) # replicated (dependent) data
## Family: binomial ( logit )
## Formula: status ~ dist.dev + forest + shrub
## Data: indiv.data7.replicated
##
## AIC BIC logLik deviance df.resid
## 4963.9 4990.7 -2478.0 4955.9 5996
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.33068 0.10605 -21.978 <2e-16 ***
## dist.dev -0.69055 0.03768 -18.326 <2e-16 ***
## forest -0.29470 0.12541 -2.350 0.0188 *
## shrub 1.14786 0.12909 8.892 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Okay, so this is not so good. We can see the standard errors of the
estimates from the model fit.replicated
are smaller because
we inappropriately doubled our sample size and treated dependent data as
independent. This effect will translate into much too small confidence
intervals for our estimates and predictions. It also means that our
p-values are too small. Notice that our coefficient for forest is now
statistically clearly different than zero (p-value of 0.0187789) while
it was not so using only independent data (p-value of 0.0838469).
Context only.
It is the responsibility of each researcher to make sure their modeling process is done such that they are approximating the true underlying Inhomogeneous Poisson Point process (IPP) model well. To demonstrate this, we will consider how estimated coefficients change with increasing numbers of available samples. We will also be able to see how the intercept changes, due its partial (mostly) interpretation as a measure of the ratio of used to available samples.
In practice, there are two components that need to be considered in the approximation. First, is how to sample the complexity of variation in the spatial covariates being considered. Simply, you want to make sure that the spatial variation is captured with these discrete locations (points placed within the defined spatial region of what is available that will be the available sample and respective covariate values) by spreading them out well. Second, is to have enough of these samples such that the approximation works well. Capturing the spatial variation is best done by making a systematic grid of locations within the area that is considered available and extracting the covariate values. Having enough available locations can be achieved by having a high density of points for the systematic grid and also by weighting the available locations within the model fitting process.
Since we are not explicitly spatially sampling here, we can’t directly evaluate how to capture spatial variation, but we can demonstrate how to weight the available locations. Essentially, you want to tell the model that for every used location to weight this data by 1, indicating that there is one of them. For the available locations, you want to weight each location by a large number (e.g.,1000). This tells the model to consider this data as having 1000 of them. You can think of it as a shortcut to replicating this available location and its covariate values 1000 times. Either way works, but weighting is more computationally efficient. By default, weighting should be done in all analyses to help with reducing approximation error. Note that this type of replicating of the available sample is a good thing, while replicating the used data (as seen in Critical Assumption 2: Independence of location data) is not a good thing.
# Grab individual one's data
indiv.dat1 = sims[[1]]
# index the used (1) and available (0) data
index.0 = which(indiv.dat1==0)
index.1 = which(indiv.dat1==1)
# Create a sequence of the number of available locations to consider
n.available = seq(10,length(index.0),by=100)
# Create storage objects
coef.save = NULL
# Loop through and modify the data and fit each model with increasing
# number of available samples
for(i in 1:length(n.available)){
dat.temp = rbind(indiv.dat1[index.1,],
indiv.dat1[index.0[1:n.available[i]],]
)
#Create the weighting vector (a 1 for each 1 and a 1000 for each 0)
dat.temp$weight = 1000^(1-dat.temp$status)
model.fit = glmmTMB(status ~ dist.dev + forest + shrub,
family = binomial(),
data = dat.temp,
weight = weight # add the weighting vector
)
coef.save = rbind(coef.save,
fixef(model.fit)[[1]]
)
}
coef.save = data.frame(n.available,coef.save)
colnames(coef.save) = c("N.Avail","Intercept","beta1","beta2","beta3")
par(mfrow=c(2,1))
plot(coef.save$N.Avail, coef.save$beta1,lwd=3,type="l",col=2,ylim=c(-1.5,1.5),
main="Slope Coefficients",
xlab="Number of Available Samples",ylab="Coefficient Estimate")
lines(coef.save$N.Avail, coef.save$beta2,lwd=3,col=2)
lines(coef.save$N.Avail, coef.save$beta3,lwd=3,col=2)
plot(coef.save$N.Avail, coef.save$Intercept,lwd=3,type="l",col=1,main="Intercept",
xlab="Number of Available Samples",ylab="Coefficient Estimate")
We can see that our estimates of the slope coefficients are sensitive to the number of available locations when there are less than a few thousand available locations; there is still a bit of jumping around until the available samples are above 4000. In contrast, we can see the intercept just keeps getting smaller and smaller because our available sample size is getting larger and larger. The intercept is not biologically meaningful.
Another way to look at this sensitivity is by showing the variability of the coefficients within and across different sizes of available samples.
# Size of available samples per used locaiton
n.avail2 = c(10,50,100,1000,2000,4000)
# Number of sample replications
n.sample = 20
#Save coefficients
coef.save=NULL
#Loop across available sample sizes
for(i in 1:length(n.avail2)){
#re-sample each available sample size 20 times
for (z in 1:n.sample){
#Loop through each used location and reduce
#the number of available samples and do this n.sample times
ind.dat = NULL
dat.temp = rbind(indiv.dat1[index.1,],
indiv.dat1[sample(index.0,n.avail2[i]),]
)
#Create the weighting vector (a 1 for each 1 and a 1000 for each 0)
dat.temp$weight = 1000^(1-dat.temp$status)
model.fit = glmmTMB(status ~ dist.dev + forest + shrub,
family = binomial(),
data = dat.temp,
weight = weight # add the weighting vector
)
coef.save= rbind(coef.save,
model.fit$fit$par
)
}#end z sample loop
}#end i loop
one = data.frame(N.Avail = rep(n.avail2,each = n.sample),
Sample = rep(1:n.sample,length(n.avail2)),
beta = coef.save[,1]
)
two = data.frame(N.Avail = rep(n.avail2,each = n.sample),
Sample = rep(1:n.sample,length(n.avail2)),
beta = coef.save[,2]
)
three = data.frame(N.Avail = rep(n.avail2,each = n.sample),
Sample = rep(1:n.sample,length(n.avail2)),
beta = coef.save[,3]
)
four = data.frame(N.Avail = rep(n.avail2,each = n.sample),
Sample = rep(1:n.sample,length(n.avail2)),
beta = coef.save[,4]
)
plot.data = rbind(one,two,three,four)
plot.data$Name = c(rep("Intercept",nrow(one)),
rep("dist.dev",nrow(one)),
rep("forest",nrow(one)),
rep("shrub",nrow(one))
)
plot.data$N.Avail = as.factor(plot.data$N.Avail)
colnames(plot.data) = c("N.Available.Sample","Sample","Coefficient.Estimate","Name")
ggplot(plot.data, aes(N.Available.Sample, Coefficient.Estimate, fill=factor(Name))) +
theme_bw()+
geom_boxplot()
We see more easily in this plot that there is high variability in the estimated coefficients when the available sample is small. This variability decreases as the available sample grows, showing how the estimates are stabilizing (except the intercept). What this means is that you could grab many different but large available samples and would get almost the same estimated coefficients. This is not true when using a small available sample; your estimated coefficients will vary and we do not want that to happen. The intercept estimates decrease in the variability but also continues to decline. It is up to each researcher to conduct a similar sensitivity analysis to see how many available samples are necessary for the coefficient estimates to stabilize.
We should assume there is individual variation in habitat selection estimates. This variation is masked when data are pooled. Lets consider the implications of pooling all data vs fitting a model separately to each individual.
# Pooled model - no consideration of individual variability
model.pool = glmmTMB(status ~ dist.dev + forest + shrub ,
family = binomial(),
data = sims2,
)
# Separate models to each individual
indiv.fit = lapply(sims, FUN = function(x){
glmmTMB(status ~ dist.dev + forest + shrub ,
family = binomial(),
data = x,
)
})
# Look at estimates when pooling the data
summary(model.pool)
## Family: binomial ( logit )
## Formula: status ~ dist.dev + forest + shrub
## Data: sims2
##
## AIC BIC logLik deviance df.resid
## 58028.7 58067.1 -29010.4 58020.7 109996
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.26310 0.03319 -98.33 < 2e-16 ***
## dist.dev -0.97998 0.01167 -83.96 < 2e-16 ***
## forest -0.19943 0.03825 -5.21 1.85e-07 ***
## shrub 1.05867 0.03942 26.86 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the pooled data, we see that the standard errors of the coefficients and p-values are very small. We are using a lot of information, assuming no variation among individuals, and thus assuming every individual’s effects can be characterized by these estimates. We are also forcing the intercept to be the same for each individual. That’s fine here because we have the same number of used locations for each individual and thus the ratio of used to available is the same. But, if you have differing number of used sampled per individual, this would not be a good thing. Now let’s, look at just two individuals’ estimates when fitting separate models.
summary(indiv.fit[[2]])
## Family: binomial ( logit )
## Formula: status ~ dist.dev + forest + shrub
## Data: x
##
## AIC BIC logLik deviance df.resid
## 3009.6 3036.0 -1500.8 3001.6 5496
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.50477 0.15003 -23.360 < 2e-16 ***
## dist.dev -0.81633 0.05057 -16.141 < 2e-16 ***
## forest 0.61806 0.17213 3.591 0.00033 ***
## shrub 0.97917 0.17399 5.628 1.83e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First, notice that the estimated effects are a bit different than the pooled estimates. Importantly, also notice that the standard errors and p-values are larger.
summary(indiv.fit[[20]])
## Family: binomial ( logit )
## Formula: status ~ dist.dev + forest + shrub
## Data: x
##
## AIC BIC logLik deviance df.resid
## 2817.6 2844.1 -1404.8 2809.6 5496
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4354 0.1518 -22.634 < 2e-16 ***
## dist.dev -1.1067 0.0542 -20.418 < 2e-16 ***
## forest 0.1244 0.1709 0.728 0.467
## shrub 0.8778 0.1781 4.929 8.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now lets look at all the estimates together.
# Extract pooled estimates with Confidence intervals
pool.effect = broom.mixed::tidy(model.pool, effects = "fixed", conf.int = TRUE)
# Extract separate individual estimates
indiv.coef = lapply(indiv.fit,FUN=function(x){
temp=broom.mixed::tidy(x, effects = "fixed",
conf.int = TRUE
)
data.frame(temp[-1,c(4,8,9)])
})
# These are all coefs (1:3) for each individual 1:n.indiv
estimates = sapply(indiv.coef,"[[",1)
LCI = sapply(indiv.coef,"[[",2)
UCI = sapply(indiv.coef,"[[",3)
layout(matrix(c(1,2), nrow = 1, ncol = 2, byrow = TRUE),
width=c(2,1)
)
plotCI(1:n.indiv,y=estimates[1,],
ui=UCI[1,],
li=LCI[1,],
ylab="beta",
xlab="Individual",
lwd=2,
ylim=c(-2.5,4)
)
legend("topright",lwd=3,col=c(1,2,3),legend=c("dist.dev","forest","shrub"),box.lty=0,
y.intersp=0.8)
plotCI(1:n.indiv,y=estimates[2,],
ui=UCI[2,],
li=LCI[2,],
ylab="beta_1",
xlab="Individual",
lwd=2,
col=2,
add=TRUE
)
plotCI(1:n.indiv,y=estimates[3,],
ui=UCI[3,],
li=LCI[3,],
ylab="beta_1",
xlab="Individual",
lwd=2,
col=3,
add=TRUE
)
plotCI(c(1,1,1),y=pool.effect$estimate[2:4],
xlim=c(0.95,1.05),
ui=pool.effect$conf.high[2:4],
li=pool.effect$conf.low[2:4],
ylab="beta",
xlab="Population",
lwd=2,
ylim=c(-2.5,4),
col=c(1,2,3),
xaxt='n'
)
The plot on the right are the pooled estimates for beta1 (black), beta2 (red), and beta3 (green). The plot on the left are each individual’s estimates (colored the same). By ignoring individual variation, we are much too confident in our estimates of uncertainty and are ignoring a lot of clear variation by individual. The pooled estimate’s certainty has to do with psuedoreplication - we are treating a subunit (each location) as the main unit of replication. Our true unit of replication is the individual.
Since our sample sizes for each individual are equal, we see that the pooled estimates generally relate to the average of each estimate across all individuals. When the number of used locations varies by individual this won’t be the case. The individuals with more used locations will disproportionately influence the pooled estimates.
A common question is how many used locations for a given individual (or pooled across individuals) are needed to provide statistical clarity about a habitat selection variable? We can determine this using the methods of Street et al. 2021; their data/code can be found at figshare.
We will explicitly use spatial data in this section. First, lets grab one spatial covariates that was already loaded.
S = covs[[1]]
values(S) = scale(values(S))
Now, we can define our intercept and slope/effect for this spatial covariate in determining the true habitat selection values.
# Create habitat selection function values
# We will use a slope of -0.2 for our spatial covariate
beta = matrix(c(1,-0.2),
nrow = 2,
ncol = 1
)
# Design matrix
X = cbind(1,values(S))
#Intensity of selection
lambda = exp(X%*%beta)
# Create spatial layer with HSF values
S.HSF = S
values(S.HSF) = lambda
# Plots
par(mfrow=c(1,2),mar=c(4,4,4,4))
plot(S, main = "Spatial Covariate")
plot(S.HSF, main = "Relative Intensity of Selection")
You can think of this covariate as the Normalized Difference Vegetation Index (NDVI) if that makes it easier to visualize. To determine the minimum number of used samples for the covariate and habitat selection values (with coefficient of \(\beta[2]\)), we need to define our criteria for statistical clarity. Specifically, we are interested in finding N:
“Recall that \(N_{\alpha, p}(\beta)\) was defined to be the minimum number of samples, N, required so that we expect to reject the null hypothesis of \(\beta = 0\) (vs. alternative \(\beta \neq 0\)), at a significance level of \(p\), in \(100(1 − \alpha)\%\) of experiments”. Street et al. 2021 (Supplementary Appendix A)
# Define Inputs
alpha = 0.05
p_val = 0.05
We are now ready to determine the number of used locations required to achieve our goal.
N.used = sample.size.used.locs(alpha = alpha,
p_val = p_val,
HSF.True = S.HSF,
S = S,
beta = beta[2]
)
N.used$Nalphapbetas
## [1] 353.5197
Given the spatial heterogeneity in our covariate, a coefficient of -0.2 and the above levels of statistical clarity, we only need 353.5196997 used locations.
Lets consider varying our Type I error rate (\(\alpha\)) to see how our sample size changes. We will hold all other variables constant.
n.combs = 20
alpha = seq(0.0001,0.3,length.out = n.combs)
N.used = rep(0,n.combs)
#Loop through
for(i in 1:n.combs){
temp = sample.size.used.locs(alpha=alpha[i],
p_val = p_val,
HSF.True = S.HSF,
S = S,
beta = beta[2])
N.used[i] = temp$Nalphapbetas
}
plot(alpha,N.used,col=2,type="l",lwd=3,xlab="Type I Error Rate",ylab="Sample Size of Used Locations Needed")
We we might expect, as we relax the Type I error rate, it is easier to determine statistical clarity at lower sample sizes of used locations.
Next, lets reset \(\alpha = 0.05\) and evaluate our sample size requirements as we change the effect size (i.e., coefficient) of our spatial variable
n.combs = 40
N.used = rep(0,n.combs)
beta.combs=seq(-2,2,length.out=n.combs)
for(i in 1:n.combs){
beta = matrix(c(-1,beta.combs[i]),
nrow=,ncol=1)
X = cbind(1,values(S))
lambda = exp(X%*%beta)
HSF.True = S
values(HSF.True)= lambda
temp = sample.size.used.locs(alpha = 0.05,
p_val = 0.05,
HSF.True = HSF.True,
S = S,
beta = beta.combs[i]
)
N.used[i] = temp$Nalphapbetas
}
plot(beta.combs,N.used,col=2,type="b",lwd=3,xlab="Coefficient",ylab="Sample Size of Used Locations Needed")
We see that as the coefficient gets closer to zero, we require more samples to determine that it is statistically clearly different from zero. This should hopefully make some logical sense.
Last, we will examine how the heterogeneity of the spatial covariate affects our required sample size. We will specifically do this by changing the values of the spatial covariate in terms of the standard deviation but will measure it in terms of ‘landscape complexity’, as one in Street et al. 2021. Note that we will return to a slope coefficient of (\(\beta = -0.2\)).
n.combs = 20
n.reps = 30
sd.combs = seq(0.1,0.5,length.out = n.combs)
N.used = N.variance = NULL
save.hsf = vector("list",20)
# Loop over sd.combs
for(i in 1:n.combs){
# Replicates
for(j in 1:n.reps){
# Simulate Spatial support of point process
S = raster(xmn=0,xmx=1.5,ymn=0,ymx=1,res=0.05)
# Random spatial covariate
phi = 0.7 # spatial decay parameter
n = ncell(S)
xy = xyFromCell(S, 1:n)
d = as.matrix(dist(xy))
Sigma = exp(-d/phi)
Sigma = t(chol(Sigma))
values(S) = Sigma %*% rnorm(n,0,sd=sd.combs[i])
values(S) = scale(values(S))
# Smooth resource covariate
S = disaggregate(S,fact=2,method="bilinear")
beta = matrix(c(-1,-0.2),nrow=2,ncol=1) #coefs
X = cbind(1,values(S)) # design matrix
lambda = exp(X%*%beta) #intensity of selection
HSF.True=S
values(HSF.True)= lambda
N.temp = sample.size.used.locs(alpha = 0.05,
p_val = 0.05,
HSF.True = HSF.True,
S = S,
beta = beta[2]
)
N.used = c(N.used,N.temp$Nalphapbetas)
N.variance = c(N.variance, N.temp$variance)
}
}
N.used.plot = data.frame(N.used = N.used,
N.variance = N.variance,
sd = rep(sd.combs,each=n.reps)
)
N.used.plot = N.used.plot[order(N.used.plot$N.variance),]
plot(N.used.plot$N.variance,N.used.plot$N.used,lwd=3,col=2,type="l",xlab="Landscape Complexity",ylab="Sample Size of Used Locations Needed")
We see how the number of used locations declines with increasing landscape complexity. As there is more variation on the landscape, we have more statistical power at lower sample sizes. Intuitively, as the landscape becomes more heterogeneous, it should be easier (i.e., fewer used samples needed) to identify their relative selection of each habitat type.
If we are interested in obtaining inference to the individual- and population-level effects, we can consider bootstrapping the results from individual models or jointly fitting a single model with a random effect across individuals. We will first consider the bootstrapping method. Second, we will demonstrate the use of random intercepts and slopes, as well as how to fix the variance of the random intercept so there is no shrinkage.
Bootstrapping is a resampling technique used often in statistics to obtain a sampling distribution of possible values. We will resample the individual-level coefficents to estimate the population-level mean and variation.
The core functions for bootstrapping are adopted from Dr. Guillaume Bastille-Rousseau’s github repository for the R package IndRSA.
We first need to get all estimated coefficients from each individual.
coef.list = lapply(indiv.fit,FUN=function(x){fixef(x)[[1]]})
coef.df=do.call(rbind.data.frame, coef.list)
colnames(coef.df)=names(fixef(indiv.fit[[1]])[[1]])
# Remove intercept
coef.df = coef.df[,-1]
# Add name (could be modified to keep track if animals have an ID)
coef.df$name = 1:nrow(coef.df)
# Frequency could be greater than one if there were multiple estimates of a single individual, perhaps across years
coef.df$freq = 1
We are ready to bootstrap the coefficient estimates.
#How many bootstraps to do? More will lead to results with less error
nboot = 1000
#Which columns have coefficients in coef.df
col.locs=1:3
boot=list()
for(i in 1:nboot){
boot[[i]]=apply(
coef.df[sample(nrow(coef.df), nrow(coef.df),
replace=TRUE,
prob=coef.df$freq),
col.locs
],
2,
median,
na.rm=TRUE
)
}
We now want to summarize our bootstrapped results. The population mean and 95% lower and upper confidence intervals are outputted.
boot.pop.esimates=mean_ci_boot(boot)
knitr::kable(boot.pop.esimates)
Mean | LCI | UCI | |
---|---|---|---|
dist.dev | -0.962506 | -1.1033896 | -0.8794034 |
forest | -0.233231 | -0.6817081 | 0.2474493 |
shrub | 1.044056 | 0.9049891 | 1.2042720 |
The most common use of random effects in habitat selection analysis
is the use of a random intercept. As described in the manuscript, this
does not account for variation in the slope coefficients, which is
problematic. Here, we will allow the intercepts to vary, which
accommodates different sample sizes and thus different ratios of used to
available samples per each individual (not a major issue in this
contrived setting of all individuals having the same sample size).
However, if we allowed the random effect variance to be estimated we
would get shrinkage of the individual-level intercept estimates towards
the mean, which we do not want in this case. Therefore, we will fix the
random effect variance to a large value to ensure there is no shrinkage.
We can do this using a step wise approach, 1) setting up the model and
not fitting it, 2) coding the parameter as NA
and 3) then
lastly fitting the model via optimization.
While we are using common language in describing the models and model
fitting process, note that there are very different meanings of
fixing a parameter
(i.e., random intercept variance) versus
a fixed effect
. The former refers to setting a parameter at
a given value and is not estimated. The later refers to a description of
a type of parameter that is estimated.
# Setup HSF with logistic regression approximation - random intercept model
# Do not fit the model
re_int = glmmTMB(status ~ dist.dev + forest + shrub + (1|ID),
family = binomial(),
data = sims2,
doFit = FALSE
)
# Make the intercept variance large and fixed (i.e., do not estimate).
re_int$parameters$theta[1] = log(1e3)
# Tell glmmTMB to not estimate the intercept
re_int$mapArg = list(theta=factor(c(NA)))
# Now ready to fit the model
re_int = glmmTMB:::fitTMB(re_int)
# The fixed components - these are the population level (across individual) effects.
fixef(re_int)
##
## Conditional model:
## (Intercept) dist.dev forest shrub
## -7.418e-05 -9.820e-01 -2.011e-01 1.060e+00
# The random components- these are the effect differences for each individual from the population mean estimate (referred to as the fixed effect in the 'conditional model' statement)
ranef(re_int)
## $ID
## (Intercept)
## 1 -3.277075
## 2 -3.237661
## 3 -3.220824
## 4 -3.326315
## 5 -3.261974
## 6 -3.304170
## 7 -3.217700
## 8 -3.237493
## 9 -3.233313
## 10 -3.270460
## 11 -3.323575
## 12 -3.234716
## 13 -3.249997
## 14 -3.274132
## 15 -3.225749
## 16 -3.319067
## 17 -3.243254
## 18 -3.234961
## 19 -3.330002
## 20 -3.267983
Notice the lack of variability in these estimated individual differences from the overall mean intercept. The reason for this is because we simulated data with the same number of used and available locations, therefore the ratio of used to available is the same. These estimates are likely to be more variable when you have different used sample sizes per individual (total individual animal locations) and are applying the same ratio of available samples. There are many reasons individuals end up having a different number of locations, such as battery life.
# Summarize results
summary(re_int)
## Family: binomial ( logit )
## Formula: status ~ dist.dev + forest + shrub + (1 | ID)
## Data: sims2
##
## AIC BIC logLik deviance df.resid
## 58414.3 58452.7 -29203.2 58406.3 109996
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## ID (Intercept) 1e+06 1000
## Number of obs: 110000, groups: ID, 20
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.418e-05 2.236e+02 0.00 1
## dist.dev -9.820e-01 1.169e-02 -83.99 < 2e-16 ***
## forest -2.011e-01 3.829e-02 -5.25 1.52e-07 ***
## shrub 1.060e+00 3.944e-02 26.88 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A random intercepts only model is not recommended. This is because we are not dealing with the expected variation of responses to hypothesized spatial factors across individuals.
The use of random effects is really not particularly useful. We are really looking to estimate slopes across all individuals (pooling) while allowing the intercepts by individual to vary to accommodate different sample sizes. We can accomplish the same thing using a fixed-effect only model and thus not need to fix any random effect variance. Specifically, we can do that by setting up a design matrix using contrast sums, also know as effecting coding.
# Make ID a factor
sims2$ID=as.factor(sims2$ID)
# Setup a design matrix with individual ID's
X = model.matrix(~ID,
data = sims2,
contrasts.arg =list(ID="contr.sum")
)
# Fixed effect intercepts that vary by individual
fix_int = glmmTMB(status ~ 0 + X + dist.dev + forest + shrub,
family = binomial(),
data = sims2
)
Now lets compare the estimated effects from fix_int
and
re_int
. First lets look at the slopes for
re_int
…
kable(summary(re_int)[[6]][[1]][2:4,])
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
dist.dev | -0.9820027 | 0.0116923 | -83.987436 | 0e+00 |
forest | -0.2010643 | 0.0382935 | -5.250612 | 2e-07 |
shrub | 1.0600931 | 0.0394368 | 26.880794 | 0e+00 |
And here we have the same estimates for fix_int
,
kable(summary(fix_int)[[6]][[1]][(n.indiv+1):(n.indiv+1+2),])
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
dist.dev | -0.9817199 | 0.0116898 | -83.980850 | 0e+00 |
forest | -0.2010046 | 0.0382881 | -5.249799 | 2e-07 |
shrub | 1.0597804 | 0.0394311 | 26.876733 | 0e+00 |
Notice that they are practically the same. The intercepts are the same as well, but we need to reorganize the results a tad to make our comparison…
int.re = ranef(re_int)[[1]][[1]]
int.fe = c(summary(fix_int)[[6]][[1]][1,1]+
summary(fix_int)[[6]][[1]][2:20],
summary(fix_int)[[6]][[1]][1,1]
)
kable(round(data.frame(individual = 1:20,int.fe=int.fe,int.re=int.re),digits=2))
individual | int.fe | X.Intercept. |
---|---|---|
1 | -3.28 | -3.28 |
2 | -3.24 | -3.24 |
3 | -3.22 | -3.22 |
4 | -3.33 | -3.33 |
5 | -3.26 | -3.26 |
6 | -3.30 | -3.30 |
7 | -3.22 | -3.22 |
8 | -3.24 | -3.24 |
9 | -3.23 | -3.23 |
10 | -3.27 | -3.27 |
11 | -3.32 | -3.32 |
12 | -3.23 | -3.23 |
13 | -3.25 | -3.25 |
14 | -3.27 | -3.27 |
15 | -3.23 | -3.23 |
16 | -3.32 | -3.32 |
17 | -3.24 | -3.24 |
18 | -3.23 | -3.23 |
19 | -3.33 | -3.33 |
20 | -3.26 | -3.27 |
We get the same estimates, again. The point is that the random effect is not being used in the manner we usually might use it. Choosing to use a random effect with a large fixed variance (i.e., do not estimate). or fixed effects can be chosen based on convenience. There may be some convenience to the random effect process when there are many individuals.
This is the recommended model structure using random effects.
#Fit RSF with intercept and slopes with random effect
re_int_slopes = glmmTMB(status ~ dist.dev + forest + shrub +
(1|ID) + (0+dist.dev|ID) +
(0+forest|ID) + (0+shrub|ID),
family = binomial(),
data = sims2,
doFit = FALSE,
weights = weight
)
# make the intercept large and fixed
re_int_slopes$parameters$theta[1] = log(1e3)
# Tell glmmTMB to not estimate the intercept, but
# to do so for the other three variables
re_int_slopes$mapArg = list(theta=factor(c(NA,1:3)))
# Now ready to fit the model
re_int_slopes = glmmTMB:::fitTMB(re_int_slopes)
Let’s look at the results
fixef(re_int_slopes)
##
## Conditional model:
## (Intercept) dist.dev forest shrub
## -0.001188 -0.983164 -0.194497 1.040308
broom.mixed::tidy(re_int_slopes, effects = "fixed", conf.int = TRUE)[-1,c(3,4,8,9)]
## # A tibble: 3 × 4
## term estimate conf.low conf.high
## <chr> <dbl> <dbl> <dbl>
## 1 dist.dev -0.983 -1.06 -0.903
## 2 forest -0.194 -0.577 0.188
## 3 shrub 1.04 0.951 1.13
Here are estimated population-level coefficients (across individual-level effects). Compared to the bootstrapped results above, they are generally similar. We should not expect them to be the same, as the random effect model shares information across individuals and the bootstrapped estimates do not.
We can also extract the estimated difference by individual from the population level coefficients.
ranef(re_int_slopes)
## $ID
## (Intercept) dist.dev forest shrub
## 1 -10.299348 -0.11130168 -0.02090383 0.06758999
## 2 -10.414969 0.15779918 0.76843245 -0.02645743
## 3 -9.882139 0.28789052 -0.23574983 0.12923279
## 4 -10.570038 -0.25249448 0.18065484 0.03798059
## 5 -10.648244 0.02743712 0.97919430 -0.01236556
## 6 -9.801698 -0.10886945 -1.06614415 -0.02520669
## 7 -9.897331 0.28606535 -0.09030067 0.03004482
## 8 -9.755803 0.11628282 -0.56285305 -0.04205768
## 9 -10.129800 0.13680787 0.08445766 0.11262520
## 10 -10.556768 0.08693897 0.96851949 -0.12290572
## 11 -10.399506 -0.30346128 -0.11944704 -0.09799560
## 12 -10.015957 -0.11198466 -0.35259407 -0.05684952
## 13 -11.130693 0.15244917 1.82792587 0.08317337
## 14 -9.529007 -0.03325338 -1.41168291 -0.08380647
## 15 -9.587807 0.08392103 -1.05863919 -0.01054412
## 16 -10.633384 -0.12491422 0.52343503 0.05682853
## 17 -9.570631 0.05667112 -1.27550215 0.04337108
## 18 -10.761501 0.06733219 1.12346753 0.12540716
## 19 -10.212692 -0.31274622 -0.56843852 -0.11791448
## 20 -10.361531 -0.10115549 0.31052304 -0.09381608
broom.mixed::tidy(re_int_slopes, effects = "ran_vals", conf.int = TRUE)[-c(1:20),c(5,6,8,9)]
## # A tibble: 60 × 4
## term estimate conf.low conf.high
## <chr> <dbl> <dbl> <dbl>
## 1 dist.dev -0.111 -0.227 0.00449
## 2 dist.dev 0.158 0.0453 0.270
## 3 dist.dev 0.288 0.174 0.402
## 4 dist.dev -0.252 -0.365 -0.140
## 5 dist.dev 0.0274 -0.0838 0.139
## 6 dist.dev -0.109 -0.222 0.00379
## 7 dist.dev 0.286 0.173 0.399
## 8 dist.dev 0.116 0.00245 0.230
## 9 dist.dev 0.137 0.0232 0.250
## 10 dist.dev 0.0869 -0.0243 0.198
## # ℹ 50 more rows
Lastly, we can see a full summary of the model.
summary(re_int_slopes)
## Family: binomial ( logit )
## Formula:
## status ~ dist.dev + forest + shrub + (1 | ID) + (0 + dist.dev |
## ID) + (0 + forest | ID) + (0 + shrub | ID)
## Data: sims2
## Weights: weight
##
## AIC BIC logLik deviance df.resid
## 193336.4 193403.7 -96661.2 193322.4 109993
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## ID (Intercept) 1.000e+06 1000.0000
## ID.1 dist.dev 3.114e-02 0.1765
## ID.2 forest 7.368e-01 0.8584
## ID.3 shrub 1.619e-02 0.1273
## Number of obs: 110000, groups: ID, 20
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.001188 223.621785 0.000 1.000
## dist.dev -0.983164 0.040690 -24.163 <2e-16 ***
## forest -0.194497 0.195184 -0.996 0.319
## shrub 1.040308 0.045784 22.722 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note the variance and standard deviation estimates (fourth column of
under the “Conditional Model” section), indicating how variable
coefficients are across individuals. We see that forest
is
estimated as the most variable. This corresponds well to how the data
were generated with forest coefficients having the highest standard
deviation of beta.sd
which was set as 1.
Another thing to notice is that the population level effect for forest is not statistically clearly different than zero. Thus, at the population level, percent forest is being selected in proportion to what is available to each individual. Let’s dive into this a bit more. Let’s plot the individual estimates along with the population-level estimate.
indiv.coef.popModel = broom.mixed::tidy(re_int_slopes,
effects = "ran_vals",
conf.int = TRUE)
indiv.forest.popModel = data.frame(indiv.coef.popModel[41:60,c(5,6,8,9)])
# Add back mean to get individual estimates
indiv.forest.popModel[,2:4] = indiv.forest.popModel[,2:4]+
rep(fixef(re_int_slopes)[[1]][3],each=20)
# Extract population mean and uncertainty for forest effect
pop.coef.popModel = data.frame(broom.mixed::tidy(re_int_slopes,
effects = "fixed",
conf.int = TRUE)
)
pop.coef.popModel=pop.coef.popModel[3,c(4,8,9)]
#Plot
plotCI(x = 1:20,
y = indiv.forest.popModel$estimate,
ui = indiv.forest.popModel$conf.high,
li = indiv.forest.popModel$conf.low,
lwd = 3,col=1,
xlab="Individual",ylab="Forest Coefficient")
abline(h=pop.coef.popModel$estimate,lwd=3,col=1,lty=4)
abline(h=pop.coef.popModel$conf.low,lwd=3,col=2,lty=4)
abline(h=pop.coef.popModel$conf.high,lwd=3,col=2,lty=4)
Our plot shows the individual effects of forest
along
with the population-level. Compare this output and inference to the
figure (right-side) for the pooled model fit at the end of Section 3.11.
In the pooled case we would conclude that at the population level, this
species avoids forest (i.e., negative coefficient), and we would say
that with some level of statistical certainty because the confidence
intervals do not include zero. Here we get a more nuanced picture of
what’s happening, and that leads to a very different conclusion.
What is clear from this plot is that the reason there is no statistically clear difference of the population-level effect from zero is because there is a wide range of effects across individuals. Some individuals have positive coefficients and some have negative. Since these are generally equal, they balance out to a population-level effect of zero. The story is more complicated!
The point of this model is that we have effects per each individual and at the population-level for each estimated effect/slope. We also are accommodating different sample sizes with the random intercepts. However, one could use the trick from above to allow intercepts to vary by individual using fixed effects and then only use random effects for the spatial variables.
An important question to ask is how many individuals should be tracked to be able to provide statistical clarity about a population-level coefficient for a habitat variable? We can determine this using the methods of Street et al. 2021; their data/code can be found at figshare. For details see the equation 8.
To evaluate this question, we need to consider the population level
mean of the coefficient, variation across individuals, landscape
complexity, and the number of used locations per individual. For
simplicity, we will use a single covariate (covs[[1]]
) for
each individual;.
#Number of individuals
M = 30
# Population-level coefs
beta_mu = 0.2
beta_s = 0.5 # note the amount of variation
beta_s2 = beta_s^2
# Individual level coefs
set.seed(4254)
beta.ind <- rnorm(M, beta_mu,beta_s)
# Need to calculate landscape complexity or Var[R(X_beta)] for each individual, given the covariate and the individual coefficient
N.variance = NULL
for(i in 1:M){
beta = matrix(c(-1,beta.ind[i]),nrow=2,ncol=1) #coefs
X = cbind(1,values(covs[[1]])) # design matrix
lambda = exp(X%*%beta) #intensity of selection
HSF.True=covs[[1]]
values(HSF.True)= lambda
N.temp = sample.size.used.locs(alpha = 0.05,
p_val = 0.05,
HSF.True = HSF.True,
S = covs[[1]],
beta = beta[2]
)
N.variance = c(N.variance, N.temp$variance)
}
# Assume the same number of locations per individual
N = rep(1000,M)
# Equation 5 of Street et al. 2021
sigma <- 1/sqrt(N*N.variance)
# Get from function
zalpha = N.temp$zalpha
zp = N.temp$zp
# We need M to be greater than or equal to this value; (Equation 8; Street et al. 2021):
Eq8 = (beta_s2*(zalpha+zp)^2 + sqrt((beta_s^4)*(zalpha+zp)^4 +
4*(beta_mu^2)*(zalpha+zp)^2*sum(sigma))) / (2*beta_mu^2)
paste("M (tracked individuals) should be greater than or equal to ", round(Eq8,digits=0))
## [1] "M (tracked individuals) should be greater than or equal to 85"
Lets consider how the number of individuals needed changes with different population-mean coefficients.
M = 30
beta_mu = seq(0.1,0.75,length.out=10)
beta_s = 0.5
beta_s2 = beta_s^2
min.indiv = rep(NA, length(beta_mu))
for(z in 1:length(beta_mu)){
# Individual level coefs
set.seed(4254)
beta.ind <- rnorm(M, beta_mu[z],beta_s)
N.variance = NULL
for(i in 1:M){
beta = matrix(c(-1,beta.ind[i]),nrow=2,ncol=1)
X = cbind(1,values(covs[[1]]))
lambda = exp(X%*%beta)
HSF.True=covs[[1]]
values(HSF.True)= lambda
N.temp = sample.size.used.locs(alpha = 0.05,
p_val = 0.05,
HSF.True = HSF.True,
S = covs[[1]],
beta = beta[2]
)
N.variance = c(N.variance, N.temp$variance)
} #end i loop
N = rep(1000,M)
sigma <- 1/sqrt(N*N.variance)
zalpha = N.temp$zalpha
zp = N.temp$zp
Eq8 = (beta_s2*(zalpha+zp)^2 + sqrt((beta_s^4)*(zalpha+zp)^4 +
4*(beta_mu[z]^2)*(zalpha+zp)^2*sum(sigma))) / (2*beta_mu[z]^2)
min.indiv[z] = Eq8
}#end z loop
plot(beta_mu,min.indiv,ylab="Minimum Number of Individuals Tracked",lwd=3,col=4,type="b")
We can see that as the population mean gets larger it becomes easier to have statistical clarity with fewer individuals tracked.
Next, lets consider how the number of individuals needed changes with
the changing the variation of the coefficient across indivduals (i.e,
beta_s
(standard deviation) and beta_s2
(variance). We will reset the population mean back to 0.2.
M = 30
beta_mu = 0.2
beta_s = seq(0.01,0.5,length.out=10)
beta_s2 = beta_s^2
min.indiv = rep(NA, length(beta_mu))
for(z in 1:length(beta_s)){
set.seed(4254)
beta.ind <- rnorm(M, beta_mu,beta_s)
N.variance = NULL
for(i in 1:M){
beta = matrix(c(-1,beta.ind[i]),nrow=2,ncol=1)
X = cbind(1,values(covs[[1]]))
lambda = exp(X%*%beta)
HSF.True=covs[[1]]
values(HSF.True)= lambda
N.temp = sample.size.used.locs(alpha = 0.05,
p_val = 0.05,
HSF.True = HSF.True,
S = covs[[1]],
beta = beta[2]
)
N.variance = c(N.variance, N.temp$variance)
} #end i loop
N = rep(1000,M)
sigma <- 1/sqrt(N*N.variance)
zalpha = N.temp$zalpha
zp = N.temp$zp
Eq8 = (beta_s2[z]*(zalpha+zp)^2 + sqrt((beta_s[z]^4)*(zalpha+zp)^4 +
4*(beta_mu^2)*(zalpha+zp)^2*sum(sigma))) / (2*beta_mu^2)
min.indiv[z] = Eq8
}#end z loop
plot(beta_s2,min.indiv,ylab="Minimum Number of Individuals Tracked",lwd=3,col=4,type="b")
Our results agree with the statement from Street et al. 2021 that “Note that [Equation 8] is a non-decreasing function of \(\sigma^2\), meaning that more variation among individuals is likely to mean one has to sample a higher number of animals.
In this section, we discuss ways of considering behavior in habitat selection analyses. To demonstrate the effect of ignoring behavior, we will simulate data where selection is different for two sets of animal locations. Imagine an animal is highly selective of being in forest landcover when it is resting, but when foraging (and otherwise) it selects against forest in proportion to its availability.
# Setup simulation input where there is more data from the second behavior
n.used.behav1 = 200
n.used.behav2 = 200
# Ratio of used:available
m = 10
#Create available samples
n1 = n.used.behav1*m
n2 = n.used.behav2*m
# Consider a single categorical variable of forest and not-forest
set.seed(51234)
x1 = data.frame(forest = rbinom(n1,1,0.4))
x2 = data.frame(forest = rbinom(n2,1,0.4))
# Intercept and effect of forest
parms.behav1 = c(0.5,2) # the 2 indicates selection
parms.behav2 = c(0.5,-2) # the -0 indicates avoidance
# simulate data
sim.behav1 = ResourceSelection::simulateUsedAvail(data = x1,
parms = parms.behav1,
n.used = n.used.behav1,
m = m,
link = "log"
)
sim.behav2 = ResourceSelection::simulateUsedAvail(data = x2,
parms = parms.behav2,
n.used = n.used.behav2,
m = m,
link = "log"
)
# Combine the data
data.ignore = rbind(sim.behav1,
sim.behav2
)
Let’s fit a model where we ignore the behavioral differences in selection and fit a single model with all the data.
model.ignore.behav = glmmTMB::glmmTMB(status ~ forest,
family = binomial(),
data = data.ignore
)
summary(model.ignore.behav)
## Family: binomial ( logit )
## Formula: status ~ forest
## Data: data.ignore
##
## AIC BIC logLik deviance df.resid
## 2684.0 2696.8 -1340.0 2680.0 4398
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.34189 0.06949 -33.70 <2e-16 ***
## forest 0.09334 0.10592 0.88 0.378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the estimated forest coefficient is not near the true values of -2 or 2. It’s somewhat in between - near zero. Essentially, when animals are selecting different habitat features because of behavior, mixing across behaviors can lead to a muddled inference.
Interpreting coefficients and predicting is outlined above in
subsection
3.7. What are habitat-selection functions used for?
.
If focused on inference, fitting a single model is not only okay, but desirable.
We hope this vignette provided some utility. If so, let us know with an email (brian.gerber@colostate.edu).
Have a nice day.
This report was generated from the R Statistical Software (v4.4.2; R Core Team 2021) using the Markdown language and RStudio. The R packages used are acknowledged below.
Package | Version | Citation |
---|---|---|
amt | 0.3.0.0 | @amt |
base | 4.4.2 | @base |
broom.mixed | 0.2.9.6 | @broommixed |
circular | 0.5.1 | @circular |
geoR | 1.9.4 | @geoR |
glmmTMB | 1.1.10 | @glmmTMB |
knitr | 1.49 | @knitr2014; @knitr2015; @knitr2024 |
plotrix | 3.8.4 | @plotrix |
raster | 3.6.30 | @raster |
remotes | 2.5.0 | @remotes |
ResourceSelection | 0.3.6 | @ResourceSelection |
Rfast | 2.1.0 | @Rfast |
rmarkdown | 2.29 | @rmarkdown2018; @rmarkdown2020; @rmarkdown2024 |
sp | 2.1.4 | @sp2005; @sp2013 |
survival | 3.7.0 | @survival-book; @survival-package |
tidyverse | 2.0.0 | @tidyverse |