“era of big data” - what does this mean for our field?
“when \(\rho(R,Y) >0\): larger values of Y are more likely to be in the sample than in the population and vice versa. Where \(\rho(R,Y) =0\), this term cancels the others and there is no [systematic] error”.
Three ways to reduce ‘errors’
\[ \text{errors} = \text{data quality} \times \text{problem difficulty} \times \text{data quantity} \]
\[ \hat{\mu} - \mu = \rho_{R,G} \times \sigma_{G} \times \sqrt{\frac{N - n}{n}} \]
\(\rho_{R,G}\) is the data defect correlation
between the true population values (\(G\)) and whether they are recorded (\(R\)); it measures both the sign and the degree of selection bias caused by the R-mechanism (such as non-response from voters of party X).
\[ \hat{\mu} - \mu = \rho_{R,G} \times \sigma_{G} \times \sqrt{\frac{N - n}{n}} \]
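A minimal Python sketch of the identity above (the `estimation_error` helper and all numbers are illustrative assumptions, not figures from Meng 2018):

```python
import math

def estimation_error(rho_RG: float, sigma_G: float, N: int, n: int) -> float:
    """Meng's identity: error = data quality x problem difficulty x data quantity.

    rho_RG  -- data defect correlation (data quality)
    sigma_G -- population standard deviation of G (problem difficulty)
    N, n    -- population and sample sizes; sqrt((N - n) / n) is the data-quantity term
    """
    return rho_RG * sigma_G * math.sqrt((N - n) / n)

# Made-up illustrative numbers: even a tiny defect correlation matters when N is huge.
print(estimation_error(rho_RG=0.005, sigma_G=0.5, N=1_000_000, n=10_000))  # biased 'big' sample
print(estimation_error(rho_RG=0.0,   sigma_G=0.5, N=1_000_000, n=10_000))  # rho = 0: no systematic error
```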
Often assumed that big data reduces errors through the sheer magnitude of the data-quantity term, \(\sqrt{(N-n)/n}\)
BUT, the nature of big data is such that \(\rho_{R,G} \neq 0\)
\(\rho_{R,G} = 0\) in expectation under simple random sampling (SRS)! (see the simulation sketch below)
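A quick simulation sketch (Python; the selection mechanism and numbers are illustrative assumptions) of the realized \(\rho_{R,G}\) under SRS versus a mechanism in which larger values of \(G\) are more likely to be recorded:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
G = rng.normal(size=N)                       # true population values

# Simple random sampling: a fixed number of units recorded uniformly at random
R_srs = np.zeros(N)
R_srs[rng.choice(N, size=10_000, replace=False)] = 1

# Self-selection: units with larger G are more likely to be recorded
p = 1 / (1 + np.exp(-(G - 1.5)))             # inclusion probability rises with G
R_sel = rng.binomial(1, p)

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print("rho(R, G) under SRS:           ", round(corr(R_srs, G), 4))   # ~0 (fluctuates around zero)
print("rho(R, G) under self-selection:", round(corr(R_sel, G), 4))   # clearly positive
```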
Meng (2018) wanted to know the sample size (n) needed to achieve a certain level of error (mean squared error) with different levels of data quality:
The example was voter survey data for the 2016 U.S. presidential election.
- \(\rho_{R,G}\): the correlation between whether a person responds and what their response is
- \(f = n/N\), the sampling fraction
Implication: When \(N\) is large and \(\rho_{R,G} \neq 0\), your relative ESS (\(n_{eff}/n\)) is tiny; most of your data are effectively worthless.
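As a sketch of where such numbers come from: equating the squared error from the identity above to the SRS variance \(\sigma_{G}^{2}/n_{eff}\) gives \(n_{eff} \approx (f/(1-f))/\rho_{R,G}^{2}\). The Python below uses rounded, approximate figures in the spirit of Meng's 2016 U.S. election example (ballpark values, not exact numbers from the paper):

```python
def effective_sample_size(rho_RG: float, n: int, N: int) -> float:
    """SRS sample size with roughly the same MSE as a biased sample of size n.

    From Meng's identity: squared error = rho^2 * sigma_G^2 * (N - n) / n.
    Setting this equal to the SRS variance sigma_G^2 / n_eff gives
    n_eff = (f / (1 - f)) / rho^2, with sampling fraction f = n / N.
    """
    f = n / N
    return (f / (1 - f)) / rho_RG**2

# Ballpark figures in the spirit of Meng's 2016 election example (rounded):
N = 231_000_000        # eligible voters
n = 2_300_000          # survey responses (~1% of the population)
rho = -0.005           # tiny data defect correlation
n_eff = effective_sample_size(rho, n, N)
print(f"n_eff ~ {n_eff:.0f}   relative ESS n_eff/n ~ {n_eff / n:.6f}")   # a few hundred
```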
“In biodiversity monitoring, N is typically very large, so this reduction can be substantial.”
Look at big data for what it is
We need to think about Mean Squared Error (Proof)
\[ \text{MSE}(\hat{\mu}) \approx \frac{1}{K}\sum_{i=1}^{K}\left(\hat{\mu}_{i}-\mu\right)^{2} \quad \text{(average over } K \text{ repeated estimates)} \]
\[ \text{MSE}(\hat{\mu}) = E\left[(\hat{\mu}-\mu)^{2}\right] \]
\[ \text{MSE}(\hat{\mu}) = \text{Var}(\hat{\mu})+(\text{Bias}(\hat{\mu},\mu))^{2} \]
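The standard expansion behind this decomposition, written out here since the heading above mentions a proof:

\[
\begin{split}
\text{MSE}(\hat{\mu}) &= E\left[\left(\hat{\mu}-E[\hat{\mu}] + E[\hat{\mu}]-\mu\right)^{2}\right] \\
  &= \underbrace{E\left[(\hat{\mu}-E[\hat{\mu}])^{2}\right]}_{\text{Var}(\hat{\mu})}
   + 2\,\underbrace{E\left[\hat{\mu}-E[\hat{\mu}]\right]}_{=0}\left(E[\hat{\mu}]-\mu\right)
   + \underbrace{\left(E[\hat{\mu}]-\mu\right)^{2}}_{\text{Bias}(\hat{\mu},\mu)^{2}}
\end{split}
\]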
Implication: Same MSE can be achieved with different combinations of variance and bias\(^2\).
Take home: Large biased data → highly biased but highly precise (low-variance) estimates
Is there really a choice?
Big Data with sample selection bias
Small simple random sample (no selection bias)
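A minimal simulation sketch (Python; the population and selection parameters are illustrative assumptions) of this comparison: repeated estimates of a population mean from a large self-selected sample versus a small SRS.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200_000
G = rng.normal(loc=10.0, scale=2.0, size=N)    # true population values
mu = G.mean()                                  # population mean (the target)

# Big-data inclusion probabilities: tilted toward larger G (selection bias), ~50% recorded
p = 1 / (1 + np.exp(-(G - mu)))

big_biased, small_srs = [], []
for _ in range(500):                           # repeated 'surveys'
    R = rng.binomial(1, p).astype(bool)        # big data with sample selection bias
    big_biased.append(G[R].mean())

    idx = rng.choice(N, size=500, replace=False)   # small simple random sample
    small_srs.append(G[idx].mean())

for name, est in [("big biased sample", big_biased), ("small SRS", small_srs)]:
    est = np.asarray(est)
    print(f"{name:18s} bias={est.mean() - mu:+.3f}  sd={est.std():.3f}  "
          f"MSE={((est - mu) ** 2).mean():.4f}")
# Expected pattern: the big biased sample is extremely precise (tiny sd) but systematically
# off, so its MSE is dominated by bias^2; the small SRS is noisier but centred on the truth.
```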