Linear Regression

Objectives

  • fundamentals
  • assumptions
  • lm / glm functions
  • confidence intervals
  • case study






Fundamental Idea

Why model data?

In ecology, we use models to:

  • describe relationships among outcomes and processes
  • estimate hidden (latent) processes
  • predict unobserved values
  • forecast future outcomes, such as responses to management

Linear Regression (motivation)

Linear Regression (Equation)

Linear Regression (Equation 2)


\[ \begin{align*} y_{i} \sim& \text{Normal}(\mu_{i}, \sigma) \\ \mu_{i} =& \beta_0 + \beta_1 \times x_i \end{align*} \]

Linear Regression Line 1

\[ \hat{\mu_{i}} = 9.06 + 2\times x_i \]

Linear Regression Line 2

\[ \hat{\mu_{i}} = 9.06 + 2\times x_i \]

Linear Regression Line 3

\[ \hat{\mu_{i}} = 9.06 + 2\times x_i \]

Residuals

One Sample

\[ \begin{align*} y_{i} \sim& \text{Normal}(\mu_{i},\sigma) \\ \mu_{i} =& \beta_{0} + \beta_{1}x_{i}\\ \mu_{i} =& 1 + 0.5 \times x_{i} \end{align*} \]
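A minimal simulation sketch of drawing one sample from this model (the sample size, x values, and \(\sigma\) below are assumptions chosen for illustration; only the intercept and slope come from the equation above):

# One simulated sample from mu_i = 1 + 0.5 * x_i (n, x, and sigma are assumed)
  set.seed(1)
  n.sim = 30
  x.sim = runif(n.sim, 0, 10)             # hypothetical predictor values
  sigma.sim = 1                           # hypothetical residual std. dev.
  mu.sim = 1 + 0.5 * x.sim                # true means from the equation above
  y.sim = rnorm(n.sim, mean = mu.sim, sd = sigma.sim)

# Plot the sample together with the true regression line
  plot(x.sim, y.sim)
  abline(a = 1, b = 0.5)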

Sampling Distributions of y

\[ \begin{align*} \mu_{i} = 1 + 0.5 \times x_{i} \end{align*} \]

Sampling Distributions of \(\hat{\mu}\)

\[ \begin{align*} \mu_{i} = 1 + 0.5 \times x_{i} \end{align*} \]






Assumptions

Assumptions

Independence of the errors


Correlation(\(\epsilon_{i}\), \(\epsilon_{j}\)) = 0 for all pairs with \(i \neq j\)


This means that knowing how far observation \(i\) will be from the true regression line tells us nothing about how far observation \(j\) will be from the regression line.
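One informal check of this assumption, when the observations have a natural ordering (e.g., time), is to look at the autocorrelation of the residuals. A self-contained sketch using simulated data (the data and model here are assumptions for illustration only):

# Residual autocorrelation as an informal check of independence (simulated example)
  set.seed(1)
  time = 1:50
  y.ind = 2 + 0.3 * time + rnorm(50)   # errors simulated independently here
  fit.ind = lm(y.ind ~ time)
  acf(resid(fit.ind))                  # spikes beyond the bands would suggest correlated errors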

Assumptions

Homogeneity of the variance

var(\(\epsilon_{i}\)) = \(\sigma^2\)

The scatter of observations above and below the line stays roughly constant as you move from left to right.

Assumptions

Heteroskedasticity

Assumptions

Linearity


\[ E[y_i|x_i] = \mu_i = \beta_0 + \beta_1 \times x_i \]

Our hypothesis about which variables to include in the model (e.g., \(x_i\)) characterizes the mean well.

Assumptions

Normality


\[ \epsilon_i \sim \text{Normal}(0,\sigma) \]

Each \(i^{th}\) residual

  • comes from a Normal distribution with a mean of zero
  • is symmetrically distributed around zero
  • varies around zero by \(\sigma\), which is the same for each residual.

Assumption Violations

Robustness


Linearity and constant variance are often more important than the assumption of normality (see, e.g., Knief & Forstmeier, 2021, and references therein).


This is especially true for large sample sizes.






lm and glm

Intercept-Only Model

\[ \begin{align*} y_{i} \sim& \text{Normal}(\mu_{i}, \sigma) \\ \mu_{i} =& \beta_0 \end{align*} \]

Generate Data

#Setup parameters
  n = 100 # sample size
  beta0 = 10 # true mean
  sigma = 2 # true std.dev

# Simulate a data set of observations
  set.seed(43243)
  y = rnorm(n, mean = beta0, sd = sigma)

Visualize Intercept-Only Model

  hist(y)

Fit Intercept-Only Model

# Fit model/hypothesis using maximum likelihood
  model1.0 = lm(y~1)

  model1.1 = glm(y~1)

  model1.2 = glm(y~1, family=gaussian(link = identity))

  
  
# Compare Results  
  data.frame(intercept=c(model1.0$coefficients,model1.1$coefficients, model1.2$coefficients),
             SE = c(summary(model1.0)$coefficients[, 2], summary(model1.1)$coefficients[, 2],summary(model1.2)$coefficients[, 2])
            )
  intercept        SE
1  9.913821 0.1765343
2  9.913821 0.1765343
3  9.913821 0.1765343

Fit Intercept-Only Model

# Summary of model results  
  summary(model1.0)

Call:
lm(formula = y ~ 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7815 -1.2733 -0.0581  1.1558  4.4979 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.9138     0.1765   56.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.765 on 99 degrees of freedom

Fitted-values

#Predict response for all data
  preds = predict(model1.0, se.fit = TRUE)
  preds
$fit
       1        2        3        4        5        6        7        8 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
       9       10       11       12       13       14       15       16 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      17       18       19       20       21       22       23       24 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      25       26       27       28       29       30       31       32 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      33       34       35       36       37       38       39       40 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      41       42       43       44       45       46       47       48 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      49       50       51       52       53       54       55       56 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      57       58       59       60       61       62       63       64 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      65       66       67       68       69       70       71       72 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      73       74       75       76       77       78       79       80 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      81       82       83       84       85       86       87       88 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      89       90       91       92       93       94       95       96 
9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 9.913821 
      97       98       99      100 
9.913821 9.913821 9.913821 9.913821 

$se.fit
  [1] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
  [8] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [15] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [22] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [29] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [36] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [43] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [50] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [57] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [64] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [71] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [78] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [85] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [92] 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343 0.1765343
 [99] 0.1765343 0.1765343

$df
[1] 99

$residual.scale
[1] 1.765343






Confidence Intervals

What is a CI?

“A confidence interval for a parameter is an interval computed using sample data by a method that will contain the parameter for a specified proportion of all samples. The success rate (proportion of all samples whose intervals contain the parameter) is known as the confidence level.” (R. H. Lock et al. 2020)

What is a CI?

Key

  • the parameter we are trying to estimate is a fixed unknown (i.e., it is not varying across samples)
  • the endpoints of our confidence interval are random and will change every time we collect a new data set (the endpoints themselves actually have a sampling distribution!); the simulation sketch below illustrates this
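A small coverage simulation makes this concrete (a sketch; the true mean, \(\sigma\), sample size, and number of replicate samples are assumptions chosen for illustration):

# Coverage sketch: how often does a 90% CI for the mean contain the true mean?
  true.mean = 10; sigma.true = 2; n.rep = 25; nsim = 1000
  covered = logical(nsim)
  set.seed(1)
  for(i in 1:nsim){
    y.rep = rnorm(n.rep, true.mean, sigma.true)   # a new sample each replicate
    ci = confint(lm(y.rep ~ 1), level = 0.9)      # the endpoints change every time
    covered[i] = ci[1] <= true.mean & true.mean <= ci[2]
  }
  mean(covered)   # should be close to 0.90, the stated confidence level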

More at Stats4Ecologists

What is a CI?

Confidence Intervals

Normal Approximation

# Get 90% confidence intervals (Type I error = 0.1)
  c(
    (preds$fit+preds$se.fit*qnorm(0.05))[1],
    (preds$fit+preds$se.fit*qnorm(0.95))[1]
   )
        1         1 
 9.623448 10.204195 
  CI.Normal=confint(model1.0, level=0.9)
  CI.Normal
                 5 %     95 %
(Intercept) 9.620705 10.20694

Bootstrapping

Instead of relying on intervals derived from an assumed normal distribution, we can build a sampling distribution by resampling our data.


See Stats4Ecologists

Bootstrapping (idea)

Bootstrapping (idea)

Bootstrapping (idea)

Bootstrapping (code)

# Setup
  nboot <- 1000 # number of bootstrap samples
  nobs <- length(y)
  bootcoefs <- rep(NA, nboot)
# Start loop  
for(i in 1:nboot){
  set.seed(43243+i)
  # Create bootstrap data set by sampling original observations w/ replacement  
  bootdat <- y[sample(1:nobs, nobs, replace=TRUE)] 
  # Calculate bootstrap statistic
  glmboot <- glm(bootdat ~ 1)
  bootcoefs[i] <- coef(glmboot)
}

Bootstrapping (code)

Bootstrapping (code)

# Calculate bootstrap standard errors
  boot.se = sd(bootcoefs)

# bootstrap-normal CI
  boot.normal = c(
        (preds$fit+boot.se*qnorm(0.05))[1],
        (preds$fit+boot.se*qnorm(0.95))[1]
        )

# bootstrap percentile
confdat.boot.pct <- quantile(bootcoefs, probs = c(0.05, 0.95))
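To line the three approaches up for the Comparison slide, one option is simply to stack the intervals computed above into a single table (a sketch that only reuses objects already created):

# Compare the normal-approximation, bootstrap-normal, and bootstrap-percentile CIs
  ci.compare = rbind(normal.approx   = as.numeric(CI.Normal),
                     boot.normal     = as.numeric(boot.normal),
                     boot.percentile = as.numeric(confdat.boot.pct))
  colnames(ci.compare) = c("5%", "95%")
  ci.compare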

Comparison






Case Study /
Independent Variables

Prairie Dog Calling

y = # of calls per 5 minutes at a prairie dog colony

temp = temperature, degrees F

dist.human = distance from the colony to the nearest human activity

  head(dat)
           y     temp dist.human
1 0.38129119 39.96298   11.75578
2 0.57770546 47.95317   14.63583
3 0.00347806 56.13182   22.89228
4 0.00000000 29.76491   21.14168
5 2.67643282 40.25998   32.41442
6 1.71519689 38.31469   35.99573

Exploratory
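Before fitting anything, it helps to look at the raw relationships. A minimal sketch (the original slide shows a figure; this is just one stand-in for it):

# Pairwise scatterplots of the response and both predictors
  pairs(dat[, c("y", "temp", "dist.human")])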

Model Fitting 1

  model = lm(y ~ temp+dist.human, data=dat)
  summary(model)

Call:
lm(formula = y ~ temp + dist.human, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6714 -1.6690 -0.1506  1.4910  5.0041 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.312197   1.074281  -7.737 9.65e-12 ***
temp         0.156393   0.024605   6.356 6.75e-09 ***
dist.human   0.036010   0.003388  10.629  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.338 on 97 degrees of freedom
Multiple R-squared:  0.9233,    Adjusted R-squared:  0.9217 
F-statistic: 583.8 on 2 and 97 DF,  p-value: < 2.2e-16

Model Notation 1

equatiomatic::extract_eq(model)

\[ \operatorname{y} = \alpha + \beta_{1}(\operatorname{temp}) + \beta_{2}(\operatorname{dist.human}) + \epsilon \]

equatiomatic::extract_eq(model, use_coefs = TRUE)

\[ \operatorname{\widehat{y}} = -8.31 + 0.16(\operatorname{temp}) + 0.04(\operatorname{dist.human}) \]

Model Fitting 2

Mean-center temp. The intercept changes, but the slopes do not.

  model = lm(y ~ I(temp - mean(temp)) + dist.human, data=dat)
  summary(model)

Call:
lm(formula = y ~ I(temp - mean(temp)) + dist.human, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6714 -1.6690 -0.1506  1.4910  5.0041 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.614311   0.895284   2.920  0.00435 ** 
I(temp - mean(temp)) 0.156393   0.024605   6.356 6.75e-09 ***
dist.human           0.036010   0.003388  10.629  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.338 on 97 degrees of freedom
Multiple R-squared:  0.9233,    Adjusted R-squared:  0.9217 
F-statistic: 583.8 on 2 and 97 DF,  p-value: < 2.2e-16

Normalizing x

Mean-center each predictor and scale it by its standard deviation.

  dat$temp.sc = scale(dat$temp, center=TRUE, scale = TRUE)
  dat$dist.sc = scale(dat$dist.human, center=TRUE, scale = TRUE)
  mean(dat$temp.sc)
[1] 1.40337e-16
  sd(dat$temp.sc)
[1] 1

Comparison

Model Fitting 3

What does the slope mean now?

\[ \mu_{i} = \beta_0 + \beta_1 \times 1 + \beta_2 \times 0 \]

With dist.sc held at its mean (0), a one standard-deviation increase in temp shifts the expected number of calls by \(\beta_1\).

  model = lm(y ~ temp.sc + dist.sc, data=dat)
  summary(model)

Call:
lm(formula = y ~ temp.sc + dist.sc, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6714 -1.6690 -0.1506  1.4910  5.0041 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.8002     0.2338  50.471  < 2e-16 ***
temp.sc       3.0958     0.4871   6.356 6.75e-09 ***
dist.sc       5.1771     0.4871  10.629  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.338 on 97 degrees of freedom
Multiple R-squared:  0.9233,    Adjusted R-squared:  0.9217 
F-statistic: 583.8 on 2 and 97 DF,  p-value: < 2.2e-16
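A quick sanity check (a sketch): for a predictor standardized with scale(), the slope equals the original-scale slope multiplied by that predictor's standard deviation, so the standardized coefficients above can be recovered from the slopes in Model Fitting 1.

# Standardized slope = original-scale slope * sd(x); compare with temp.sc and dist.sc above
  0.156393 * sd(dat$temp)
  0.036010 * sd(dat$dist.human)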

Marginal Predictions

\[ \hat{\mu_{i}} = \hat{\beta_{0}} + \hat{\beta_{1}}\times 0 + \hat{\beta_{2}} \times \text{dist.human}_i \]

# Plot marginal predictions of dist.human
  marginaleffects::plot_predictions(model, condition=c("dist.human"))

Add data points

# Plot marginal predictions of dist.human, then overlay the raw data
  library(ggplot2)  # needed for geom_point()
  plot1 = marginaleffects::plot_predictions(model, condition=c("dist.human"))
  plot1+geom_point(data=dat, aes(x=dist.human,y=y))

Combined Predictions

marginaleffects::plot_predictions(model, condition=list("temp","dist.human"))

Combined Predictions

marginaleffects::plot_predictions(model, condition=list("temp","dist.human" = 0:5))

Taking Control

# Build a prediction grid: every observed temp crossed with dist.human = 0 to 5
newdata = expand.grid(dat$temp,0:5)
colnames(newdata)=c("temp","dist.human")

# Predict the mean response with 95% confidence intervals
preds = predict(model,newdata = newdata,interval="confidence", level=0.95)

# Combine the grid and predictions for plotting
pred.plot = data.frame(newdata,preds)
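One way to draw the Combined Predictions 2 figure from this table (a sketch; it assumes pred.plot contains the usual fit, lwr, and upr columns returned by predict(..., interval = "confidence")):

# Plot predicted calls vs. temp, one line and ribbon per dist.human value
  library(ggplot2)
  ggplot(pred.plot, aes(x = temp, y = fit, color = factor(dist.human))) +
    geom_ribbon(aes(ymin = lwr, ymax = upr, fill = factor(dist.human)),
                alpha = 0.2, color = NA) +
    geom_line() +
    labs(y = "Predicted calls per 5 minutes",
         color = "dist.human", fill = "dist.human")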

Combined Predictions 2

Evaluating Assumptions

Largely done based on the residuals

\(y_{i} - \hat{y}_{i}\)

hist(model$residuals)

Linearity Assumption

plot(model,1)

Ideally, there will be no pattern and the red line should be roughly horizontal near zero.

Homogeneity of Variance

plot(model,3)

Residuals should be spread equally across the range of the predictors. We want a roughly horizontal red line; otherwise, the plot suggests non-constant variance in the residuals (i.e., heteroscedasticity).

Normality of Residuals

plot(model,2)

Shows the theoretical quantiles versus the empirical quantiles of the residuals. We want the points (circles) to fall along the dotted line.

Outliers

par(mfrow=c(1,2))
plot(model,4)
plot(model,5)

An outlier is an extreme value that can strongly influence the \(\beta\) estimates. In the leverage plot, watch for points in the upper-right or lower-right corners.
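A quick numeric companion to these plots (a sketch; the 4/n cutoff for Cook's distance is a common rule of thumb, not a formal test):

# Flag observations with Cook's distance above 4/n for a closer look
  cooks = cooks.distance(model)
  which(cooks > 4/length(cooks))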

Exploring Assumptions

Nicer looking Plots

library(ggResidpanel)
resid_panel(model)

Go to Lab