Pond | egg.mass |
---|---|
A | 2 |
B | 6 |
C | 8 |
D | 10 |
E | 10 |
F | 12 |
Simple Random Sampling
Goal
Goal: to know the mean number of boreal toad egg masses per pond in RMNP
- different egg masses per pond is meaningful why?
SRS
Conceptual Walkthrough
We have a known population of ponds, N = 6
We have enough to money for n = 2
Will use SRS
SRS
Population Parameters:
- \(\mu = 8\)
- \(N = 6\)
- \(\sigma^2 = 12.8\)
SRS
How many possible unique samples are there (w/o replacement)
\[ {N}\choose{n} \]
. . .
\[ \frac{N!}{n!(N-n)!} \]
. . .
R
choose(6,2)
[1] 15
SRS
What is the probability of any one particular sample?
. . .
SRS: all samples have the same probability!
. . .
Convenient Sampling: What is the probability of one particular sample?
SRS
What is the probability pond “A” will be sampled”? Pond “B”?
. . .
Look at all possible combinations:
utils::combn(LETTERS[1:6],2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] "A" "A" "A" "A" "A" "B" "B" "B" "B" "C" "C" "C" "D" "D"
[2,] "B" "C" "D" "E" "F" "C" "D" "E" "F" "D" "E" "F" "E" "F"
[,15]
[1,] "E"
[2,] "F"
SRS
Sample.Number | First Pond | Second Pond | First Value | Second Value |
---|---|---|---|---|
1 | A | B | 2 | 6 |
2 | A | C | 2 | 8 |
3 | A | D | 2 | 10 |
4 | A | E | 2 | 10 |
5 | A | F | 2 | 12 |
6 | B | C | 6 | 8 |
7 | B | D | 6 | 10 |
8 | B | E | 6 | 10 |
9 | B | F | 6 | 12 |
10 | C | D | 8 | 10 |
11 | C | E | 8 | 10 |
12 | C | F | 8 | 12 |
13 | D | E | 10 | 10 |
14 | D | F | 10 | 12 |
15 | E | F | 10 | 12 |
. . .
What is the probability pond “A” will be sampled”? Pond “B”?
SRS
Is it okay to not like your random sample and resample?
sample(LETTERS[1:6],2)
[1] "C" "F"
. . .
sample(LETTERS[1:6],2)
[1] "A" "B"
. . .
sample(LETTERS[1:6],2)
[1] "C" "D"
. . .
Yes, but why don’t you like it? Maybe SRS is not what you want.
SRS
Consider the sample mean \(\hat{\mu}\) for each sample
Sample.Number | First Pond | Second Pond | First Value | Second Value | Sample.Mean | Absolute.Deviation |
---|---|---|---|---|---|---|
1 | A | B | 2 | 6 | 4 | 4 |
2 | A | C | 2 | 8 | 5 | 3 |
3 | A | D | 2 | 10 | 6 | 2 |
4 | A | E | 2 | 10 | 6 | 2 |
5 | A | F | 2 | 12 | 7 | 1 |
6 | B | C | 6 | 8 | 7 | 1 |
7 | B | D | 6 | 10 | 8 | 0 |
8 | B | E | 6 | 10 | 8 | 0 |
9 | B | F | 6 | 12 | 9 | 1 |
10 | C | D | 8 | 10 | 9 | 1 |
11 | C | E | 8 | 10 | 9 | 1 |
12 | C | F | 8 | 12 | 10 | 2 |
13 | D | E | 10 | 10 | 10 | 2 |
14 | D | F | 10 | 12 | 11 | 3 |
15 | E | F | 10 | 12 | 11 | 3 |
SRS
Sample.Number | Sample.Mean | Deviance.Truth |
---|---|---|
1 | 4 | -4 |
2 | 5 | -3 |
3 | 6 | -2 |
4 | 6 | -2 |
5 | 7 | -1 |
6 | 7 | -1 |
7 | 8 | 0 |
8 | 8 | 0 |
9 | 9 | 1 |
10 | 9 | 1 |
11 | 9 | 1 |
12 | 10 | 2 |
13 | 10 | 2 |
14 | 11 | 3 |
15 | 11 | 3 |
\[ \frac{1}{15}\times\sum_{i=1}^{15} (\hat{\mu}_{i}) = 8\]
\[\sum_{i=1}^N(\hat{\mu}_{i}-\mu) = 0 \]
Estimator Bias?
SRS
Sample.Mean | Frequency | Relative.Freq | Mean.times.Rel.Freq |
---|---|---|---|
4 | 1 | 0.067 | 0.267 |
5 | 1 | 0.067 | 0.333 |
6 | 2 | 0.133 | 0.800 |
7 | 2 | 0.133 | 0.933 |
8 | 2 | 0.133 | 1.067 |
9 | 3 | 0.200 | 1.800 |
10 | 2 | 0.133 | 1.333 |
11 | 2 | 0.133 | 1.467 |
Sum | 15 | 1.000 | 8.000 |
. . .
\[ E[\mu] = \sum_{q=1}^{Q} p_i \times \hat{\mu}_{i} = 8 \]
\(Q\) = number of unique sample means
\(p_i\) = probability of obtaining a given sample / relative frequency
SRS
Sampling Disribution
- Sample mean formula is an estimator of the population mean (parameter )
- Sample mean is a random variable with a sampling distribution
- sample mean varies from sample-to-sample becasue of the sampling process
- The sampling distribution is specific to an estimator - has known outcomes and relative frequencies of values
Sampling Disribution
- judge an estimator by its sampling distribution
- What properties do we want?
Estimator Properties
- Precise and unbiased estimator
- Is our estimator precise?
Variance of Sampling Distribution
Sample.Number | Sample.Mean | Deviance.Truth |
---|---|---|
1 | 4 | -4 |
2 | 5 | -3 |
3 | 6 | -2 |
4 | 6 | -2 |
5 | 7 | -1 |
6 | 7 | -1 |
7 | 8 | 0 |
8 | 8 | 0 |
9 | 9 | 1 |
10 | 9 | 1 |
11 | 9 | 1 |
12 | 10 | 2 |
13 | 10 | 2 |
14 | 11 | 3 |
15 | 11 | 3 |
var(ponds.simple$Sample.Mean)
[1] 4.571429
Population Variance
The variance of all sample units
\[ \sigma^{2} = \frac{1}{N-1}\sum_{i=1}^{N} \left(y_{i}-\mu\right)^{2} \]
\[ \sigma^{2} = \frac{1}{6-1}\sum_{i=1}^{6} \left(y_{i}-8\right)^{2} \]
(1/(6-1))*sum((ponds$egg.mass-8)^2)
[1] 12.8
Sample Variance
Estimate population variance from each sample
\[ \hat{\sigma}^{2} = \frac{1}{n-1}\sum_{i=1}^{n} \left(y_{i}-\hat{\mu}\right)^{2} \]
Sample Variance
- Unbiased estimate of the population variance.
- Individual values will deviate from the population variance.
Connect these two
- Population variance - variation among sample units; estimate from sample variance
- Var. Sampling Distribution - variance of mean values from each possible sample
Connect these two
Variance of all units vs variance of all sample means
Connect these two
- As the sample size (n) increases the sample variance declines by 1/n
- Finite-population correction factor
\[ \left(\frac{N-n}{N}\right) \]
. . .
\[ E[\text{Sampling Distribution Variance}] = \frac{1}{n} \left(\frac{N-n}{N}\right) \times \text{Population Variance} \]
Connect these two
- Sample size (n) = 2
- Total units (N) = 6
- E[pop var] = 12.8
- E[sample dist. var] = 4.27
- 1/n = 0.5
- Finite correction factor = 0.67
- E[pop var] \(\times\) 1/n \(\times\) FCF = 4.27
Connect these two
\[ E[\text{Sampling Distribution Variance}] = \\\frac{1}{n} \left(\frac{N-n}{N}\right) \times \text{Population Variance} \]
- As n–> N, (N-n)/(N) approaches zero.
- n = N then no sampling distribution variance
- n << N then correction factor ~1 and expected sampling distribution variance is related to the population variance by 1/n
Connect these two
Why does this matter?
Connect these two
If you know the E[pop var] …
- you have a mechanism to understand the variation of sample distribution of the means
- the more variation in \(y\), the more variation in sampling dist. of means.
Sample Size
\[ E[\text{Sampling Distribution Variance}] \]
- n = 2 –> 4.27
- n = 3 –> 2.13
- n = 4 –> 1.06
- n = 5 –> 0.43
- n = 6 –> 0
Sample Size
Not pond example; When N is large enough . . .
Sample Size
- Sampling distribution is less variable
- Sampling distribution centers on population mean
- As n increases, the distribution becomes ‘Normal’
. . .
This is because of the Central limit theorem (CLT)
- CLT relies on random/prob. based sampling
- leads to parameter unbiasedness
- Does not depend on distribution of samples
- allows estimation of precision of parameters
- allows estimation of confidence intervals
The TOTAL
Extrapolate the sample to the total population
\[ \hat{\tau} = N \times \hat{\mu} = \frac{N}{n}\sum_{i=1}^{n}(y_i) \]
The TOTAL
Back to ponds example
The TOTAL
Variance of the total
\[ \text{var}(\hat{\tau}) = N\times(N-1) \times \frac{\hat{\sigma}^2}{n} \]
Probabilty of
Whenever you have the sampling distribution, frame precision in terms of what matters to you.
Probabilty of
\[ P(\hat{\tau}\leq 2 \times \tau) \]
Sample Size
But what \(n\)?
- \(n\) impacts variation in estimates
- variation in \(y\) impacts the influence of \(n\)
- objective probably impacts \(n\) the most
- higher \(n\) is not always better