Pond | egg.mass | cluster |
---|---|---|
A | 2 | 1 |
B | 6 | 2 |
C | 8 | 1 |
D | 10 | 2 |
E | 10 | 3 |
F | 12 | 3 |
Cluster and stratified sampling appear to be opposites
Both partition the sampling frame into primary and secondary units
\[ y_i = \sum_{j=1}^{M_i} y_{ij} \]
\[ \bar{y} = \hat{\mu}_{\text{primary}} = \frac{1}{n}\sum_{i=1}^{n} y_{i} \]
\[ \hat{\sigma}^2_{\mu} = \frac{1}{n-1} \sum_{i=1}^n(y_i -\hat{\mu}_{\text{primary}})^2 \]
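As a minimal sketch, the quantities defined above can be computed from the pond table in R. Treating clusters 1 and 3 as the sampled primary units is an assumption made purely for illustration, and the object names (`y.i`, `y.bar`, `s2.primary`) are hypothetical.

```r
# Pond data from the table above
egg.mass = c(2, 6, 8, 10, 10, 12)
cluster  = c(1, 2, 1, 2, 3, 3)

# Primary-unit totals: y_i is the sum of the secondary units in cluster i
y.i = tapply(egg.mass, cluster, sum)

# Assumption for illustration only: clusters 1 and 3 were the ones sampled
sampled = c("1", "3")
y = as.vector(y.i[sampled])
n = length(y)

# Mean of the sampled primary-unit totals
y.bar = sum(y) / n

# Sample variance among the sampled primary-unit totals
s2.primary = sum((y - y.bar)^2) / (n - 1)
```

The last line is the same calculation that `var(y)` performs, which is what the simulation code later in these slides relies on.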
clusterA | member1 | member2 | mean | dev2 | var |
---|---|---|---|---|---|
1 | 2 | 8 | 5 | 9 | 4.5 |
2 | 6 | 10 | 8 | 0 | 0.0 |
3 | 10 | 12 | 11 | 9 | 4.5 |
\[\sum_{i=1}^3(\hat{\mu}_{i}-\mu) = 0 \]
\[ E[\hat{\sigma}^{2}_{\mu}] = 6 \]
Superficial resemblance to stratification: ‘clustered’ sample units are grouped like a stratum
Selection process is different
Cluster sampling is really SRS applied to groups of population members
Make clusters of similar values
Pond | egg.mass | clusterB |
---|---|---|
A | 2 | 1 |
B | 6 | 1 |
C | 8 | 2 |
D | 10 | 2 |
E | 10 | 3 |
F | 12 | 3 |
clusterB | member1 | member2 | mean | dev2 | var |
---|---|---|---|---|---|
1 | 2 | 6 | 4 | 16 | 8.0 |
2 | 8 | 10 | 9 | 1 | 0.5 |
3 | 10 | 12 | 11 | 9 | 4.5 |
\[\sum_{i=1}^3(\hat{\mu}_{i}-\mu) = 0 \]
\[ E[\hat{\sigma}^{2}_{\mu}] = 8.67 \]
Make clusters of dissimilar values
Pond | egg.mass | clusterC |
---|---|---|
A | 2 | 1 |
B | 6 | 2 |
C | 8 | 3 |
D | 10 | 3 |
E | 10 | 2 |
F | 12 | 1 |
clusterC | member1 | member2 | mean | dev2 | var |
---|---|---|---|---|---|
1 | 2 | 12 | 7 | 1 | 0.5 |
2 | 6 | 10 | 8 | 0 | 0.0 |
3 | 8 | 10 | 9 | 1 | 0.5 |
\[\sum_{i=1}^3(\hat{\mu}_{i}-\mu) = 0 \]
\[ E[\hat{\sigma}^{2}_{\mu}] = 0.67 \]
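The contrast between grouping similar and dissimilar values can be checked by enumerating every possible one-cluster sample. This is a sketch only: the helper `cluster.summary` is a hypothetical name, and it assumes a single cluster is drawn at random, so the sampling distribution of the cluster mean has exactly three equally likely values.

```r
egg.mass = c(2, 6, 8, 10, 10, 12)
clusterB = c(1, 1, 2, 2, 3, 3)  # similar values grouped together
clusterC = c(1, 2, 3, 3, 2, 1)  # dissimilar values grouped together

# Hypothetical helper: summarise the sampling distribution of the cluster mean
cluster.summary = function(y, cluster){
  mu     = mean(y)                   # true population mean
  mu.hat = tapply(y, cluster, mean)  # one estimate per possible sample
  list(bias = mean(mu.hat) - mu,     # 0 when the estimator is unbiased
       var  = mean((mu.hat - mu)^2)) # variance of the sampling distribution
}

res.B = cluster.summary(egg.mass, clusterB)
res.C = cluster.summary(egg.mass, clusterC)

round(c(similar = res.B$var, dissimilar = res.C$var), 2)
# similar: 8.67, dissimilar: 0.67
```

Both groupings give an unbiased estimate of the mean; only the spread of the possible estimates changes.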
Clusters of dissimilar values and size
Pond | egg.mass | clusterD |
---|---|---|
A | 2 | 1 |
B | 6 | 1 |
C | 8 | 1 |
D | 10 | 2 |
E | 10 | 2 |
F | 12 | 3 |
clusterD | member1 | member2 | member3 | mean |
---|---|---|---|---|
1 | 2 | 6 | 8 | 5.33 |
2 | 10 | 10 | NA | 10.00 |
3 | 12 | NA | NA | 12.00 |
Need to incorporate cluster size
clusterD | mean | clusterSize | SizeXMean | divide.avg.cluster.size | dev2 | var |
---|---|---|---|---|---|---|
1 | 5.33 | 3 | 16 | 8 | 0 | 0 |
2 | 10.00 | 2 | 20 | 10 | 4 | 2 |
3 | 12.00 | 1 | 12 | 6 | 4 | 2 |
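The size adjustment in the table can be reproduced with a short sketch. It assumes one cluster is drawn at random and that the average cluster size is known; the object names are illustrative.

```r
egg.mass = c(2, 6, 8, 10, 10, 12)
clusterD = c(1, 1, 1, 2, 2, 3)

mu          = mean(egg.mass)                    # true population mean (8)
cluster.tot = tapply(egg.mass, clusterD, sum)   # SizeXMean: 16, 20, 12
cluster.avg = tapply(egg.mass, clusterD, mean)  # 5.33, 10, 12
M.bar       = mean(table(clusterD))             # average cluster size (2)

# Ignoring cluster size: the plain cluster means are biased on average
naive.bias = mean(cluster.avg) - mu

# Dividing each cluster total by the average cluster size removes the bias
mu.hat   = cluster.tot / M.bar                  # 8, 10, 6
adj.bias = mean(mu.hat) - mu                    # 0
dev2     = (mu.hat - mu)^2                      # 0, 4, 4
```

Averaging the raw cluster means overstates the population mean here because the single-pond cluster with the largest value gets the same weight as the three-pond cluster.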
We randomly select the primary units
You want to know something about households
In our field we often think of forming clusters by location
First Law of Geography: “everything is related to everything else, but near things are more related than distant things.” - Waldo Tobler
This works against the principle and utility of cluster sampling, which benefits from dissimilar units within each cluster.
But we also need to consider the relative costs of SRS and cluster sampling.
Which do you think is likely to be less costly?
We are interested in the total population size of pika across 10 mountains
Goal: compare cluster sampling and simple random sampling (SRS)
# Simulate a true population
N.plots = 100  # talus slope plots per mountain
N.mtns = 10    # mountains
# Create matrix of counts (rows = plots, columns = mountains)
pop = matrix(NA, N.plots, N.mtns)
# Give each mountain a different mean abundance
# This forces the mountains to vary in pika abundance
mu = seq(1, 100, length.out = N.mtns)
# Loop over each mountain and draw N.plots random values around that mountain's mean
# Draw on the log scale, then exponentiate and round to make them counts
# and to ensure values are never negative
for(j in 1:N.mtns){
  set.seed(143453543 + j)
  pop[, j] = round(exp(rnorm(N.plots, log(mu[j]), 0.1)), digits = 0)
}
Both designs sample the same total number of units: 100
\(\mu =\) 50.67
\(\tau =\) 50670
# Number of simulated studies / replicate samples
n.sim = 10000
# Population size and sample size
# (not defined in the code shown; assumed here from the text:
# 100 plots x 10 mountains, with 100 units sampled)
N = N.plots * N.mtns
n = 100
# Create SRS function
srs.fun = function(pop){
  # Get a sample of indices
  index = sample(1:N, size = n)
  # Use those indices to get the population counts in each sampled unit
  y = c(pop)[index]
  # Estimate of the total
  tau.est = mean(y) * N
  # Estimated standard deviation of the total estimate
  tau.sd = sqrt(N * (N - n) * (var(y) / n))
  list(tau.est = tau.est,
       tau.sd = tau.sd)
}
# Replicate!
srs.total.dist = replicate(n.sim, srs.fun(pop))
\(\sigma_{\hat{\tau},SRS} =\) 3109.66
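# Number of primary units (rows of pop) and number of primary units sampled
# (not defined in the code shown; assumed here from the 10x10 sample below)
N.primary = N.plots
n.primary = 10
cluster.fun = function(pop){
  # Get index of plots (clusters) to sample across mountain ranges,
  # and sample every unit within each selected cluster
  index = sample(1:N.primary, size = n.primary)
  y = pop[index, ]  # this is 10x10 (total of 100 sample units)
  y.sum = apply(y, 1, sum)
  tau.est = N.primary * mean(y.sum)
  # Estimated standard deviation of the total estimate,
  # based on the number of primary units sampled
  var.primary = var(y.sum)
  tau.sd = sqrt(N.primary * (N.primary - n.primary) * (var.primary / n.primary))
  list(tau.est = tau.est,
       tau.sd = tau.sd)
}
# Replicate!
cluster.total.dist = replicate(n.sim, cluster.fun(pop))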
\(\sigma_{\hat{\tau},Cluster} =\) 554.26
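One way to obtain sampling-distribution standard deviations like those reported above is to take the standard deviation of the replicate estimates. The following is a compact, self-contained sketch of that comparison. The seed, the smaller `n.sim`, and the stand-in population are assumptions, so its numbers will not match the values above.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility

# Assumed stand-in population: 100 plots (rows) x 10 mountains (columns)
N.plots = 100
N.mtns  = 10
mu  = seq(1, 100, length.out = N.mtns)
pop = sapply(mu, function(m) round(exp(rnorm(N.plots, log(m), 0.1))))

tau   = sum(pop)  # true total
n.sim = 2000      # fewer replicates than above, to keep the sketch fast

# SRS: 100 individual plots drawn without replacement
srs.est = replicate(n.sim, {
  y = sample(c(pop), size = 100)
  length(pop) * mean(y)
})

# Cluster sample: 10 rows, each contributing all 10 of its plots
cluster.est = replicate(n.sim, {
  rows = sample(1:N.plots, size = 10)
  N.plots * mean(rowSums(pop[rows, ]))
})

# Standard deviation of each sampling distribution
c(srs = sd(srs.est), cluster = sd(cluster.est))
```

Each row spans all ten mountains, so the variation between mountains sits inside every cluster and the cluster totals differ little from one another.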
Thompson pg. 137: “The relative efficiency of the cluster (or systematic) sample to the simple random sample of equivalent sample size, defined as the ratio of variances”,
\[ \frac{\text{var}(\hat{\tau}_{srs})}{\text{var}(\hat{\tau}_{cluster})} = \frac{\text{var}(\text{across all values in population})}{\text{var}(\text{across primary unit totals})/M} \]
where \(M\) is the number of secondary units in each cluster.
[1] 1045.07
[1] 342.1313
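A sketch of where the variance ratio comes from, assuming equal cluster sizes \(M\), \(N\) clusters of which \(n\) are sampled, and an SRS of the same \(nM\) secondary units. The symbols \(\sigma^2\) (variance among all \(NM\) secondary units), \(\sigma^2_u\) (variance among the \(N\) cluster totals), and \(\hat{\tau}_{cluster}\) are notation introduced here for illustration. Both variances use the same form as the `tau.sd` lines in the code above.

\[ \text{var}(\hat{\tau}_{srs}) = NM(NM - nM)\frac{\sigma^2}{nM} = NM(N-n)\frac{\sigma^2}{n} \]
\[ \text{var}(\hat{\tau}_{cluster}) = N(N-n)\frac{\sigma^2_u}{n} \]
\[ \frac{\text{var}(\hat{\tau}_{srs})}{\text{var}(\hat{\tau}_{cluster})} = \frac{M\sigma^2}{\sigma^2_u} \]

Under these assumptions the cluster design is more efficient when the variance among cluster totals is small relative to \(M\) times the overall variance, that is, when most of the variation sits within clusters.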
Cluster sampling will lead to more precise estimates than SRS (at the same sample size) when units within clusters vary more, on average, than units across the whole sampling frame.
The greater the variation within clusters, the greater the precision of cluster sampling; this is the opposite of stratified sampling.
Beware
Clusters require good knowledge of the system.
A poor choice of clusters could increase the variance of the sampling distribution.
Black-footed ferret breeding program for release into the wild.
How do we choose which two individuals to release into the wild?
You have systematically selected individuals 3 and 7 for release from each colony/group.