Let's explore the Bernoulli/Binomial probability mass function (PMF). Imagine doing a tagging study to estimate the survival of sandhill cranes in New Mexico over the winter. We want to use the PMF to connect the idea of generating data to a probability distribution that is defined by parameters.
In this situation, we are interested in simulating samples or observations (or, more generally, generating synthetic data) of whether cranes either lived (\(y = 1\)) or died (\(y = 0\)) over our period of interest (winter). If we use the Bernoulli PMF, our focus is typically on each \(i^{th}\) crane, while if we are interested in the sum of the number of cranes that lived (\(\sum_{i}^{n} y_i\)), then we would use the Binomial PMF. Either way, we need to define the parameter associated with both functions, which is the probability of observing a 1 (\(P(Y=1)\)), which we will define here as \(\theta\). To make this a useful exercise, we want to decide on what \(\theta\) could be in real life.
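For reference, the two PMFs are (with \(N\) tagged cranes and \(y_i\) the outcome for crane \(i\)):
\[ \text{Bernoulli: } f(y_i|\theta) = \theta^{y_i}(1-\theta)^{1-y_i}, \quad y_i \in \{0,1\} \]
\[ \text{Binomial: } f(y|N,\theta) = \binom{N}{y}\theta^{y}(1-\theta)^{N-y}, \quad y = \sum_{i=1}^{N} y_i \]
Setting \(N = 1\) in the Binomial PMF recovers the Bernoulli PMF, which is why a single `rbinom()` function can handle both.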
Remember, estimation turns the process around: the data inform the parameter. Before we do estimation, we want to understand probability functions through simulating/generating synthetic data, so we have a better idea of what we are assuming about our empirical data when we use probability functions to estimate parameters. Plus, this process can help guide the design of future studies.
# Define sandhill crane over winter survival probability (P(Y=1))
theta = 0.8
The Bernoulli PMF is useful when thinking about a single event happening once. Remember, this is a special case of the Binomial PMF (a single trial). We will use the same function, but change the inputs.
Let's observe/generate a single data point where a crane dies (\(y = 0\)) or lives (\(y = 1\)). In a sense, we are flipping a theoretical coin with probability \(\theta\) (i.e., \(P(Y = 1) = \theta\)) to observe/sample whether this tagged crane dies or lives.
rbinom(n = 1, size = 1, prob = theta)
## [1] 1
Consider 10 tagged cranes to get a sample of the number of alive cranes (successes, or 1's) at the end of the study.
rbinom(n = 1, size = 10, prob = theta)
## [1] 9
Now, consider 10 tagged cranes where you want to keep track of each bird and whether it lived or died (not just the sum of the alive ones).
rbinom(n = 10, size = 1, prob = theta)
## [1] 1 1 1 0 0 1 1 1 1 0
Notice that changing the arguments `size` and `n` either focuses on the individual events of 0's and 1's (size = 1) or on the sum of the 1's (size > 1); `n` controls how many of those outputs you would like to have.
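To make the connection concrete, here is a quick sketch (outputs will vary run to run): summing 10 Bernoulli outcomes has the same distribution as a single Binomial draw with `size = 10`.
# Sum of 10 Bernoulli draws (size = 1)...
sum(rbinom(n = 10, size = 1, prob = theta))
# ...has the same distribution as one Binomial draw with 10 trials
rbinom(n = 1, size = 10, prob = theta)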
What if we wanted to generate synthetic data of mortality, rather than survival? Well, we just change our setup. Still assuming survival probability is 0.80, the equivalent mortality probability would be \(\theta = 1 - 0.80 = 0.20\), such that we are saying the probability a crane dies is \(P(Y = 1) = \theta = 0.2\). Now, we can generate data of cranes that have died (1) and that have lived (0).
theta = 0.2
rbinom(n = 10, size = 1, prob = theta)
## [1] 0 1 0 1 1 1 0 0 1 0
What is `rbinom` really doing? Well, the code is actually calling a function written in the C programming language because it is fast. But we can understand more simply what the binomial sampling process looks like by using the generic `sample` function.
The code here defines the outcomes of the stochastic process as either a 0 or 1; `x` is the argument name of the function. We are asking for 10 trials using `size`. The `replace` argument is telling the sampler that we are allowing it to sample each outcome (0 and 1) more than one time. Lastly, we are defining \(P(y = 1)\) to be 0.80 and \(P(y = 0)\) to be 0.20. This code replicates the same process as the earlier survival code, `rbinom(n = 10, size = 1, prob = theta)` with `theta = 0.8`.
sample(x = c(1,0),
size = 10,
replace=TRUE,
prob = c(0.8,0.2)
)
## [1] 1 1 1 1 1 1 1 1 1 1
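As a quick check that the two approaches agree (a sketch; exact proportions will vary run to run), we can compare the proportion of 1's from many draws of each:
# Both should return a proportion of 1's near 0.8
mean(rbinom(n = 100000, size = 1, prob = 0.8))
mean(sample(x = c(1, 0), size = 100000, replace = TRUE, prob = c(0.8, 0.2)))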
Use your knowledge of the Binomial PMF to investigate the study design trade-offs of a tagging study to estimate the survival of sandhill cranes. Consider 3 sample sizes of tagged individuals (`N`): 100, 250, and 500. Define the survival probability as \(\theta = 0.80\). For each sample, estimate the probability of survival using the estimator \(\sum(y)/N\), where `y` is the number of alive cranes at the end of the study and `N` is the number of total tagged cranes. We will justify this decision at a later point, but logically this should be a good estimator. For each sample size, you want to approximate the sampling distribution of the estimator. Once you have each sampling distribution, plot them together and compare them to the true value (`theta`). Note: make sure you use enough replicates such that the sampling distribution appears well approximated (i.e., symmetric).
n.sim = 5000
n = c(100,250,500)
theta = 0.8
theta.hat=matrix(NA, nrow=n.sim, ncol=length(n))
theta.hat[,1]=rbinom(n.sim, n[1], theta)/n[1]
theta.hat[,2]=rbinom(n.sim, n[2], theta)/n[2]
theta.hat[,3]=rbinom(n.sim, n[3], theta)/n[3]
plot.data = data.frame(theta.hat = c(theta.hat), N = c(rep("100", n.sim), rep("250", n.sim), rep("500", n.sim)))
library(ggplot2)  # for ggplot()
ggplot(plot.data, aes(theta.hat, fill = N)) +
  scale_fill_manual(values = c("red", "blue", "green")) +
  geom_density(alpha = 0.5, position = "identity") +
  theme_bw()
The estimated bias and relative bias show that the estimator is essentially unbiased; the small nonzero values are Monte Carlo error.
bias = apply(theta.hat,2,mean)-theta
bias
## [1] -0.0013560 -0.0000464 -0.0005352
r.bias = (apply(theta.hat,2,mean)-theta)/theta
r.bias
## [1] -0.001695 -0.000058 -0.000669
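As a quick check that these small nonzero values are consistent with simulation noise (a sketch using `theta.hat` and `bias` from above), we can compare each bias estimate to its Monte Carlo standard error:
# Monte Carlo standard error of each bias estimate
mc.se = apply(theta.hat, 2, sd) / sqrt(n.sim)
# Values within a few standard errors of zero are what we
# expect from an unbiased estimator
bias / mc.se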
A sample size of 250 tags is the best choice to minimize costs and meet the objective of having a 0.90 probability of being within 10% of the true value of \(\theta\) (here interpreted as an interval of total width 10%, i.e., \(\theta \pm 0.05\theta\)).
# Interval of total width 10% around the truth (theta +/- 5%)
limits.truth = c(theta - theta*0.05,
                 theta + theta*0.05)
# Proportion of simulated estimates falling within the limits
my.func = function(x){
  mean(x >= limits.truth[1] & x <= limits.truth[2])
}
prob.truth = apply(theta.hat, 2,FUN = my.func)
data.frame(n = n, prob = prob.truth)
## n prob
## 1 100 0.7318
## 2 250 0.9024
## 3 500 0.9758
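As a cross-check (a sketch, not part of the original exercise), these simulated probabilities can be computed exactly from the Binomial CDF with `pbinom()`, by converting the limits on \(\hat{\theta}\) back to counts of alive cranes:
# Exact P(0.76 <= theta.hat <= 0.84) for each sample size
sapply(c(100, 250, 500), function(N){
  pbinom(floor(0.84 * N), size = N, prob = 0.8) -
    pbinom(ceiling(0.76 * N) - 1, size = N, prob = 0.8)
})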
What if we are wrong about the assumed probability of survival? Well, we can evaluate this assumption! Rather than considering a single truth (\(\theta = 0.80\)), we can instead consider a range of true values. This way, we can evaluate the robustness of our findings. Meaning, if we find that a sample size of 250 meets our objective (0.90 probability of being within 10% of the true \(\theta\)) across a range of assumed survival probabilities, we can be confident in our recommendation of tagging 250 cranes. Keep in mind there are a lot of real-world factors we have not accounted for yet!
Using a sample size of 250, evaluate whether there is at least 0.90 probability of being within 10% of the true \(\theta\), where \(\theta\) varies from 0.80 to 1.0 by increments of 0.05. For example, `theta = seq(0.8, 1, by = 0.05)`.
n.sim = 5000
n = 250
theta = seq(0.8, 1, by=0.05)
prob.obj=rep(NA, length(theta))
for(i in 1:length(theta)){
theta.hat=rbinom(n.sim, n, theta[i])/n
limits.truth=c(theta[i]-theta[i]*0.05,
theta[i]+theta[i]*0.05
)
prob.obj[i] = length(which(theta.hat>=limits.truth[1] &
theta.hat<=limits.truth[2]))/length(theta.hat)
}
plot(theta, prob.obj,type="b",lwd=3,ylim=c(0.5,1),
xlab=expression(paste(theta)),
ylab="Probability of Objective")
abline(h=0.90,lwd=3,col=3)
We see that our objective (green horizontal line) is met or exceeded at all values of \(\theta\) ranging from 0.80 to 1.0.
The Poisson PMF is often the first distribution used when considering count data, for example, counts of plants or animals within a given area. It's also a good choice when needing to consider rates, such as counts per unit effort. Note that we are not going to think about detection probability at this point. Assuming constant detection probability, we can think of these counts as a measure of relative abundance.
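For reference, the Poisson PMF for a count \(y\) is:
\[ f(y|\lambda) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y \in \{0, 1, 2, ...\}, \quad \lambda > 0 \]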
Let’s say we are interested in counts of American Pika in 10x10 m plots throughout boulder fields in the Rocky Mountains. First, we need to define the mean count, defined as \(\lambda\). Then, we need to decide on how many plots to sample (`n`).
Sample 10 plots, count pika in each, where the true mean is 100.
rpois(n = 10, lambda = 100)
## [1] 110 124 99 112 108 115 108 77 91 105
Sample 10 plots where the mean is 2.
rpois(n = 10, lambda = 2)
## [1] 2 3 1 0 3 3 1 2 3 1
When considering the Poisson distribution, notice how few plots have zero pika. In ecological and wildlife sampling, zero counts are common for a variety of reasons, often more common than the Poisson alone predicts. This is one reason people tend to model counts with two components: a Poisson distribution and a zero-inflation part. More about this later.
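As a preview (a minimal simulation sketch; the zero-inflation probability `p.zero` is made up for illustration), a zero-inflated Poisson can be generated in two steps: a Bernoulli draw for whether a plot is an excess (structural) zero, then a Poisson count for the plots that are not:
# Hypothetical zero-inflation probability, assumed for illustration
p.zero = 0.3
# Step 1: is each of 20 plots an excess (structural) zero?
excess.zero = rbinom(n = 20, size = 1, prob = p.zero)
# Step 2: Poisson counts, set to zero for the excess-zero plots
y = (1 - excess.zero) * rpois(n = 20, lambda = 2)
y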
One of the interesting properties of the Poisson PMF is that it can be fully described with only one parameter (\(\lambda\)), which is both the mean and the variance. This single parameter governs the central tendency of the observations and how much they vary.
Let's look at this property by simulating 100,000 counts and estimating the mean and variance.
set.seed(43243)
y = rpois(n = 100000, lambda = 10)
mean(y)
## [1] 10.00741
var(y)
## [1] 10.0551
So, when using the Poisson distribution, we need to consider that the mean and variance will be assumed equal.
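We can also confirm the property from the PMF itself with `dpois()`, computing the mean and variance as expectations over the support (truncated at 100, where essentially all of the probability mass lies for \(\lambda = 10\)):
# Theoretical mean and variance from the PMF
x = 0:100
mean.pois = sum(x * dpois(x, lambda = 10))
var.pois = sum((x - mean.pois)^2 * dpois(x, lambda = 10))
c(mean.pois, var.pois)  # both equal lambda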
Evaluate the sampling distribution of the estimator \(\sum(y)/N\), where `y` are the counts and `N` is the number of counts. Assuming \(\lambda = 100\), determine how many sample plots are needed such that the probability a sample is within 5% of the mean is at least 95%. Once this is completed, change to \(\lambda = 25\). How does your evaluation of the sample size change? Do you need the same number of samples, more, or fewer to achieve the same objective?
#Define inputs and outputs
lambda = 100
n = seq(10,500,by=5)
n.sim = 10000
mean.mat =matrix(NA, nrow=n.sim, ncol=length(n))
# loop over the different sample sizes - n
for(i in 1:length(n)){
y.temp = matrix(
replicate(n.sim,
rpois(n[i],lambda=lambda)
), nrow=n[i],byrow = TRUE
)
#Estimate mean of each sample
mean.temp = apply(y.temp, 2, mean)
mean.mat[,i] = mean.temp
}
dim(mean.mat)
## [1] 10000 99
# Define interval of total width 5% around the truth (lambda +/- 2.5%)
limits.truth=c(lambda-lambda*0.025,
lambda+lambda*0.025
)
#Estimate probability of being within 5%
probs = apply(mean.mat,2,FUN=function(x){length(which(x>=limits.truth[1] &
x<= limits.truth[2]
)
) / n.sim
}
)
# Plot the probability versus the sample size
plot(n,probs,type="l",lwd=3,col=1)
abline(h=0.95,lwd=3,col=3)
From the plot, we need 65 samples to achieve our objective.
Changing to \(\lambda = 25\), we see that a lot more samples are needed to achieve the same objective (well over 100). Specifically, we need 250.
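These simulation results line up with a rough analytic sketch (an added cross-check, not part of the original exercise) based on the normal approximation \(\hat{\lambda} \approx \text{Normal}(\lambda, \lambda/n)\). Setting \(1.96\sqrt{\lambda/n} = 0.025\lambda\) and solving gives \(n \approx (1.96/0.025)^2/\lambda\), so the required sample size scales inversely with \(\lambda\):
# Approximate n for 0.95 probability of being within +/- 2.5% of lambda
(1.96 / 0.025)^2 / c(100, 25)
This gives roughly 61 and 246 plots, consistent with the simulation-based answers of 65 and 250.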
There are a lot of probability functions commonly used in wildlife and ecological modeling. Generally, many of these functions fall within the exponential family of probability distributions. The objective is to become a bit more familiar with a new probability function and to use the nice features of markdown to describe the function and its various properties.
Create a summary of a probability function of your choosing. If you aren't sure, pick from here: Binomial, Poisson, Log-Normal, Beta, or Gamma. If you feel comfortable with these distributions, then choose from the long list of options on Wikipedia.
Follow the example below for the Discrete Uniform probability distribution and …
For simplicity, use the setup below and swap in code and notation for your chosen probability function.
The discrete uniform distribution defines probabilities for a set of integer values ranging from a lower bound to an upper bound.
\[ y \sim \text{Uniform}(a,b) \]
The support of the random variable \(y\) is \[ y \in \{a, a+1,..., b-1, b\} \]
The support of the parameters \(a\) and \(b\) is all integers, where \(b \geq a\); the number of values in the support is \(n = b-a+1\).
\[ f(y|a,b) = \frac{1}{n} \]
\[ E[y] = \frac{a+b}{2}\\ Var[y] = \frac{n^2-1}{12} \]
Example 1
#Parameters
a=1; b=5
# Sample Space of RV
x=seq(a,b,by=1)
n=length(x)
#plotting probability mass
# Note: dunif() is the continuous uniform density, 1/(b-a); the discrete
# uniform PMF is 1/n, so we plot 1/n directly
plot(x, rep(1/n, n), col=2, pch=18, cex=2,
     type='b',
     ylab="f(y)", xlab=c("a = 1, b = 5"))
abline(v=(a+b)/2,lwd=3,col=4)
legend("topright",lwd=3,col=4,legend=c("Mean"))
Example 2
#Parameters
a=10; b=100
#Sample Space
x=seq(a,b,by=1)
n=length(x)
#plotting probability mass (1/n, as above)
plot(x, rep(1/n, n), col=2, pch=18, cex=2,
     type="b",
     ylab="f(y)", xlab=c("a = 10, b = 100"))
abline(v=(a+b)/2,lwd=3,col=4)
legend("topright",lwd=3,col=4,legend=c("Mean"))
The probability of observing samples \(>\) 60 when \(a\) = 10 and \(b\) = 100 is,
#Parameters
a=10; b=100
# Note: punif() is the continuous uniform CDF; for the discrete uniform the
# exact tail probability is 40/91 (about 0.4396), so this is an approximation
punif(60,a,b,lower.tail = FALSE)
## [1] 0.4444444
The value associated with the upper tail probability of 0.4444444 when \(a\) = 10 and \(b\) = 100 is,
#Parameters
a=10; b=100
qunif(0.4444444,a,b,lower.tail = FALSE)
## [1] 60
#Parameters
a=10; b=100
#Sample at different sample sizes
sample.size=100
sample.size2=10000
# Note: rounding a continuous uniform gives the endpoints a and b half the
# probability of interior values; sample(a:b, size, replace = TRUE) samples
# the discrete uniform exactly
y=round(runif(sample.size,a,b),digits=0)
y2=round(runif(sample.size2,a,b),digits=0)
#plotting samples with True probability
par(mfrow=c(1,2))
hist(y,main="N = 100",freq = FALSE)
hist(y2,main="N = 10000",freq = FALSE)
Even at a sample size of 10000, the sample probabilities still deviate a bit from the truth, but they are close. Certainly a lot closer than at a sample size of 100.
The discrete uniform is not a very useful probability distribution for data. However, it is commonly used as a prior probability distribution when fitting a model using Bayesian inference for parameters that can only be integers within a given range.
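As a small illustration of that use (a hypothetical grid-approximation sketch; the numbers here are made up, not from the exercises above): suppose we observed \(y = 7\) successes out of an unknown number of trials \(N\), with a known success probability of 0.5, and we place a discrete uniform prior on \(N\) over 7 to 30. Bayes' rule over the grid gives the posterior:
# Hypothetical example: discrete uniform prior on an integer parameter N
N.grid = 7:30
prior = rep(1/length(N.grid), length(N.grid))
# Binomial likelihood of observing y = 7 successes when p = 0.5
lik = dbinom(7, size = N.grid, prob = 0.5)
# Posterior probabilities, normalized over the grid
posterior = prior * lik / sum(prior * lik)
plot(N.grid, posterior, type = "h", lwd = 3,
     xlab = "N", ylab = "Posterior probability")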