“era of big data” - what does this mean for our field?
“when \(\rho(R,Y) >0\) larger values of Y are more likely to be in the sample than in the population and vice versa. Where \(\rho(R,Y) =0\), this term cancels the others and there is no [systematic] error”.
Three ways to reduce ‘errors’
\[ \text{errors} = \text{data quality} \times \text{problem difficulty} \times \text{data quantity} \]
\[ \hat{\mu} - \mu = \rho_{R,G} \times \sigma_{G} \times \sqrt{\frac{N - n}{n}} \]
\(\rho_{R,G}\) is the data defect correlation between the response/recording indicator (R) and the true population values (G); it measures both the sign and the degree of the selection bias caused by the R-mechanism (for example, non-response by voters for party X).
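A minimal numerical check of this identity (the population size and response probabilities below are made up for illustration): build a finite population, impose a response mechanism that depends on the true values, and confirm that the naive sample-mean error equals \(\rho_{R,G} \times \sigma_{G} \times \sqrt{(N-n)/n}\) exactly.

```python
# Numerical check of the identity: sample-mean error = rho_{R,G} * sigma_G * sqrt((N-n)/n),
# which holds exactly for any recording mechanism R when rho and sigma are the
# finite-population quantities. All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

N = 100_000                          # population size (assumed)
G = rng.binomial(1, 0.52, size=N)    # true values, e.g. 1 = votes for candidate X

# Recording/response indicator R: supporters of X respond slightly less often,
# which induces a negative data defect correlation rho_{R,G}.
p_respond = np.where(G == 1, 0.08, 0.10)
R = rng.binomial(1, p_respond)

n = R.sum()                          # number of recorded units
mu = G.mean()                        # true population mean
mu_hat = G[R == 1].mean()            # naive mean of the recorded values

rho = np.corrcoef(R, G)[0, 1]        # data defect correlation
sigma_G = G.std()                    # population SD (ddof = 0)

lhs = mu_hat - mu
rhs = rho * sigma_G * np.sqrt((N - n) / n)
print(f"error = {lhs:+.6f}, rho*sigma*sqrt((N-n)/n) = {rhs:+.6f}")  # identical up to rounding
```

The equality holds for every realisation, not just on average; the only thing that changes across realisations is the size of \(\rho_{R,G}\).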
\[ \hat{\mu} - \mu = \rho_{R,G} \times \sigma_{G} \times \sqrt{\frac{N - n}{n}} \]
It is often assumed that big data reduces errors simply through the data quantity term (large n).
BUT the nature of big data (self-selected, administratively collected) is such that \(\rho_{R,G} \neq 0\); this correlation is zero (in expectation) only under simple random sampling (SRS).
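A short sketch of why SRS is special here (the response probabilities are assumptions for illustration): under SRS the realised \(\rho_{R,G}\) hovers around zero, while a self-selected mechanism produces a systematic, non-vanishing correlation.

```python
# Sketch (assumed mechanism): under SRS the realised rho_{R,G} is ~0
# (order 1/sqrt(N)); under self-selection it stays systematically non-zero.
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
G = rng.binomial(1, 0.52, size=N)          # true values

# Simple random sample of 10,000 units: R assigned without regard to G
R_srs = np.zeros(N, dtype=int)
R_srs[rng.choice(N, size=10_000, replace=False)] = 1

# Self-selected sample: units with G = 1 respond less often
R_self = rng.binomial(1, np.where(G == 1, 0.08, 0.12))

print("rho under SRS:           ", round(np.corrcoef(R_srs, G)[0, 1], 4))   # ~ 0.00
print("rho under self-selection:", round(np.corrcoef(R_self, G)[0, 1], 4))  # clearly < 0
```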
Meng (2018) asked what sample size (n) is needed to achieve a given level of error (mean squared error) at different levels of data quality, where data quality is the correlation between whether a person responds and what their response is.
The example was voter survey data for the 2016 U.S. presidential election.
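One way to make this concrete is the effective-sample-size calculation: find the SRS size whose MSE matches a biased sample of size n with defect correlation \(\rho\). The figures below are round numbers in the spirit of Meng's 2016 example; treat them as illustrative, not as the exact values in the paper.

```python
# Back-of-the-envelope effective sample size: the SRS size n_eff whose MSE matches
# a biased sample of size n with defect correlation rho. Setting
#   rho^2 * sigma^2 * (N - n)/n  =  sigma^2 / n_eff
# gives  n_eff = n / (rho^2 * (N - n)).
# Numbers below are round, illustrative figures, not the exact ones in the paper.
def effective_sample_size(n, N, rho):
    return n / (rho**2 * (N - n))

N = 231_000_000      # ~ US eligible voters in 2016 (illustrative)
n = 2_300_000        # ~ 1% of the population answered surveys (illustrative)
rho = -0.005         # tiny data defect correlation (illustrative)

print(round(effective_sample_size(n, N, rho)))   # ~ 400
```

With these numbers, a couple of million self-selected responses carry roughly the information of a few hundred truly random ones: even a tiny \(|\rho_{R,G}|\) is devastating when N is huge.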
Implication: each additional data point is worth less than the previous one.
Look at big data for what it is
Mean Squared Error
\[ \text{MSE}(\hat{\mu}) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{\mu}_{i}-\mu\right)^{2} \]
\[ \text{MSE}(\hat{\mu}) = E\left[(\hat{\mu}-\mu)^{2}\right] \]
\[ \text{MSE}(\hat{\mu}) = \text{Var}(\hat{\mu})+(\text{Bias}(\hat{\mu},\mu))^{2} \]
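The three expressions agree, which a quick simulation can confirm (the biased estimator below is an arbitrary toy example, not one from the lecture): average the squared errors of many replicated estimates and compare with variance plus squared bias.

```python
# Check that the empirical MSE equals Var + Bias^2. Toy setup (assumed):
# mu_hat is the mean of 50 draws centred at mu + 0.1, so it is biased.
import numpy as np

rng = np.random.default_rng(3)
mu = 0.5                                   # true value
reps = 100_000                             # number of replicated estimates

mu_hats = rng.normal(mu + 0.1, 1.0, size=(reps, 50)).mean(axis=1)

mse_direct = np.mean((mu_hats - mu) ** 2)                      # (1/n) * sum (mu_hat_i - mu)^2
mse_decomp = np.var(mu_hats) + (np.mean(mu_hats) - mu) ** 2    # Var + Bias^2

print(round(mse_direct, 5), round(mse_decomp, 5))              # agree exactly (ddof = 0)
```

With population-style variance (ddof = 0) the two numbers match exactly, because the decomposition is an algebraic identity, not just an expectation result.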
‘Large biased’ means \(\rho_{R,G} = -0.058\) and \(n = 1000\).
The vertical line marks the true proportional occupancy.
The MSE is the same for the two samples,
BUT the 95% confidence intervals include the truth 0% of the time.
Take home: large biased data give highly biased yet highly precise estimates.
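A sketch of the kind of simulation behind this comparison (the population size, response probabilities, and SRS size are assumptions, not the values behind the original figure): repeatedly draw a large self-selected sample and a small SRS from the same population, then compare bias, MSE, and 95% CI coverage.

```python
# Illustrative simulation: large self-selected sample (rho_{R,G} ~ -0.06, n ~ 1000)
# versus a small simple random sample. All parameters are assumptions.
import numpy as np

rng = np.random.default_rng(4)

N, p_true = 20_000, 0.40          # population size and true proportional occupancy (assumed)
n_srs, reps = 50, 2_000           # SRS size (assumed) and number of replications
G = rng.binomial(1, p_true, size=N)
mu = G.mean()

def ci_covers(p_hat, n):
    half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # Wald 95% CI
    return (p_hat - half <= mu) and (mu <= p_hat + half)

results = {"large biased": [], "small SRS": []}
coverage = {"large biased": 0, "small SRS": 0}

for _ in range(reps):
    # (a) self-selection: occupied sites under-reported, giving rho_{R,G} ~ -0.06
    R = rng.binomial(1, np.where(G == 1, 0.035, 0.060))
    n_big = R.sum()                                   # ~ 1000 records
    p_big = G[R == 1].mean()
    results["large biased"].append(p_big)
    coverage["large biased"] += ci_covers(p_big, n_big)

    # (b) small simple random sample
    idx = rng.choice(N, size=n_srs, replace=False)
    p_srs = G[idx].mean()
    results["small SRS"].append(p_srs)
    coverage["small SRS"] += ci_covers(p_srs, n_srs)

for name, ests in results.items():
    ests = np.array(ests)
    print(f"{name:>12}: bias={ests.mean() - mu:+.3f}  "
          f"MSE={np.mean((ests - mu) ** 2):.4f}  "
          f"95% CI coverage={coverage[name] / reps:.0%}")
```

The large biased sample produces tight intervals around the wrong value (coverage near 0%), while the small SRS is noisier but honest about its uncertainty.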
Do ‘data analysts’ think about sample selection bias?
Is there really a choice?
Big Data with sample selection bias
A small sample drawn by simple random sampling (no selection bias)