SLR: Mathematical models for inference

Prof. Maria Tackett

Sep 19, 2022

Announcements

  • Lab 02 due

    • Today at 11:59pm (Thursday labs)

    • Tue, Sep 20 at 11:59pm (Friday labs)

  • HW 01: due Wed, Sep 21 at 11:59pm

  • Statistics experience - due Fri, Dec 09 at 11:59pm

  • Lab 01 solutions posted in Resources folder in Sakai

  • See Week 04 for this week’s activities.

Topics

  • Define mathematical models to conduct inference for the slope

  • Use mathematical models to

    • calculate a confidence interval for the slope

    • conduct a hypothesis test for the slope

Computational setup

# load packages
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for modeling
library(openintro)   # for the duke_forest dataset
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

The regression model, revisited

df_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(price ~ area, data = duke_forest)

tidy(df_fit) |>
  kable(digits = 3)
term           estimate  std.error  statistic  p.value
(Intercept)  116652.325  53302.463      2.188    0.031
area            159.483     18.171      8.777    0.000

Inference, revisited

  • Earlier we computed a confidence interval and conducted a hypothesis test via simulation:
    • CI: Bootstrap the observed sample to simulate the distribution of the slope
    • HT: Permute the observed sample to simulate the distribution of the slope under the assumption that the null hypothesis is true
  • Now we’ll do these based on theoretical results: we use the Central Limit Theorem to define the distribution of the slope, then use features of this distribution (shape, center, spread) to compute the bounds of the confidence interval and the p-value for the hypothesis test

Mathematical representation of the model

\[Y = \text{Model} + \text{Error} = f(X) + \epsilon = \mu_{Y|X} + \epsilon = \beta_0 + \beta_1 X + \epsilon\]

where the errors are independent and normally distributed:

  • independent: Knowing the error term for one observation doesn’t tell you anything about the error term for another observation
  • normally distributed: \(\epsilon \sim N(0, \sigma_\epsilon^2)\)
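To make the model concrete, here is a minimal simulation sketch in base R. The parameter values (beta0, beta1, sigma_eps), the sample size, and the range of x are arbitrary choices made for illustration; they are not estimates from the Duke Forest data.

```r
# Simulate data from Y = beta0 + beta1 * X + epsilon, epsilon ~ N(0, sigma_eps^2).
# All numeric values below are hypothetical, chosen only for illustration.
set.seed(1)

n         <- 1000
beta0     <- 5
beta1     <- 2
sigma_eps <- 1

x   <- runif(n, min = 0, max = 10)          # predictor values
eps <- rnorm(n, mean = 0, sd = sigma_eps)   # independent, normally distributed errors
y   <- beta0 + beta1 * x + eps              # response generated by the model

# Fitting a line to the simulated data should give estimates close to
# the beta0 and beta1 used to generate it
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Because the errors are drawn independently from a single normal distribution, the spread of y around the line is the same at every value of x.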

Mathematical representation, visualized

\[Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2)\]

  • Mean: \(\beta_0 + \beta_1 X\), the predicted value based on the regression model
  • Variance: \(\sigma_\epsilon^2\), constant across the range of \(X\)
    • How do we estimate \(\sigma_\epsilon^2\)?

Regression standard error

Once we fit the model, we can use the residuals to estimate the regression standard error (the spread of the distribution of the response, for a given value of the predictor variable):

\[\hat{\sigma}_\epsilon = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{\sum_{i=1}^n e_i^2}{n - 2}}\]

  1. Why divide by n−2?
  2. Why do we care about the value of the regression standard error?
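The formula can be checked by hand. The sketch below assumes the openintro package is installed and uses a plain lm() fit of the same model; the names fit_lm, e, and sigma_hat are introduced here for illustration and do not appear elsewhere in the slides.

```r
library(openintro)  # for the duke_forest dataset

# Same model as df_fit, fit directly with lm()
fit_lm <- lm(price ~ area, data = duke_forest)

n <- nobs(fit_lm)        # number of observations used in the fit
e <- residuals(fit_lm)   # e_i = y_i - y_hat_i

# Regression standard error: sqrt(sum of squared residuals / (n - 2))
sigma_hat <- sqrt(sum(e^2) / (n - 2))

sigma_hat
sigma(fit_lm)  # base R extractor for the same quantity; the two should agree
```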

Standard error of \(\hat{\beta}_1\)

\[SE_{\hat{\beta}_1} = \hat{\sigma}_\epsilon \sqrt{\frac{1}{(n - 1) s_X^2}}\]

or…

term          estimate  std.error  statistic  p.value
(Intercept)  116652.33   53302.46       2.19     0.03
area            159.48      18.17       8.78     0.00
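As a check that the formula and the regression output agree, the following sketch computes the standard error of the slope by hand. It assumes the openintro package is installed; fit_lm, s2_x, and se_beta1 are names introduced here for illustration.

```r
library(openintro)  # for the duke_forest dataset

fit_lm <- lm(price ~ area, data = duke_forest)

n         <- nobs(fit_lm)
sigma_hat <- sigma(fit_lm)            # regression standard error
s2_x      <- var(duke_forest$area)    # sample variance of the predictor

# SE of the slope: sigma_hat * sqrt(1 / ((n - 1) * s_X^2))
se_beta1 <- sigma_hat * sqrt(1 / ((n - 1) * s2_x))

se_beta1
summary(fit_lm)$coefficients["area", "Std. Error"]  # value reported by lm()
```

Both lines should print the std.error shown for area in the table.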

Mathematical models for inference for β1

Hypothesis test for the slope

Hypotheses: \(H_0: \beta_1 = 0\) vs. \(H_A: \beta_1 \neq 0\)

Test statistic: Number of standard errors the estimate is away from the null

\[T = \frac{\text{Estimate} - \text{Null}}{\text{Standard error}}\]

p-value: Probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme (in the direction of the alternative hypothesis) as the one observed

\[\text{p-value} = P(|t| > |\text{test statistic}|),\]

calculated from a t distribution with n−2 degrees of freedom

Hypothesis test: Test statistic

term          estimate  std.error  statistic  p.value
(Intercept)  116652.33   53302.46       2.19     0.03
area            159.48      18.17       8.78     0.00

\[T = \frac{\hat{\beta}_1 - 0}{SE_{\hat{\beta}_1}} = \frac{159.48 - 0}{18.17} = 8.78\]

Hypothesis test: p-value

term          estimate  std.error  statistic  p.value
(Intercept)  116652.33   53302.46       2.19     0.03
area            159.48      18.17       8.78     0.00
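The p-value in the table is rounded to 0.00. It can be recovered from the test statistic with base R, as in the sketch below. The estimate and standard error are copied from the regression output; n = 98 is assumed to be the number of rows in duke_forest.

```r
# Values copied from the regression output for area
estimate <- 159.483
se       <- 18.171
n        <- 98   # assumed number of observations in duke_forest

# Test statistic: number of standard errors the estimate is from the null value 0
t_stat <- (estimate - 0) / se

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

t_stat
p_value
```

The p-value is far below 0.001, which is why it prints as 0.00 after rounding.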

Understanding the p-value

Magnitude of p-value      Interpretation
p-value < 0.01            strong evidence against H0
0.01 < p-value < 0.05     moderate evidence against H0
0.05 < p-value < 0.1      weak evidence against H0
p-value > 0.1             effectively no evidence against H0

Important

These are general guidelines. The strength of evidence depends on the context of the problem.

Hypothesis test: Conclusion, in context

term          estimate  std.error  statistic  p.value
(Intercept)  116652.33   53302.46       2.19     0.03
area            159.48      18.17       8.78     0.00
  • The data provide convincing evidence that the population slope β1 is different from 0.
  • The data provide convincing evidence of a linear relationship between area and price of houses in Duke Forest.

Confidence interval for the slope

\[\text{Estimate} \pm \text{(critical value)} \times SE\]

\[\hat{\beta}_1 \pm t^* \times SE_{\hat{\beta}_1}\]

where \(t^*\) is calculated from a \(t\) distribution with \(n - 2\) degrees of freedom

Confidence interval: Critical value

# confidence level: 95%
qt(0.975, df = nrow(duke_forest) - 2)
[1] 1.984984
# confidence level: 90%
qt(0.95, df = nrow(duke_forest) - 2)
[1] 1.660881
# confidence level: 99%
qt(0.995, df = nrow(duke_forest) - 2)
[1] 2.628016

95% CI for the slope: Calculation

term          estimate  std.error  statistic  p.value
(Intercept)  116652.33   53302.46       2.19     0.03
area            159.48      18.17       8.78     0.00

\[\hat{\beta}_1 = 159.48 \qquad t^* = 1.98 \qquad SE_{\hat{\beta}_1} = 18.17\]

\[159.48 \pm 1.98 \times 18.17 = (123.50, 195.46)\]
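The hand calculation uses rounded inputs. A minimal base R sketch of the same interval with less rounding is below; the estimate and standard error are copied to three decimal places from the regression output, and df = 96 assumes n = 98 observations in duke_forest.

```r
# Values copied from the regression output for area
estimate <- 159.483
se       <- 18.171
df       <- 96   # n - 2, assuming n = 98 observations

# Critical value for a 95% confidence level
t_star <- qt(0.975, df = df)

# Estimate +/- critical value * SE
ci <- estimate + c(-1, 1) * t_star * se
round(ci, 2)
```

This gives approximately (123.41, 195.55), the bounds tidy() reports. The small difference from (123.50, 195.46) comes from rounding the estimate, the standard error, and \(t^*\) to two decimal places.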

95% CI for the slope: Computation

tidy(df_fit, conf.int = TRUE, conf.level = 0.95) |> 
  kable(digits = 2)
term          estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)  116652.33   53302.46       2.19     0.03  10847.77  222456.88
area            159.48      18.17       8.78     0.00    123.41     195.55

Intervals for predictions

  • Suppose we want to answer the question “What is the predicted sale price of a Duke Forest house that is 2,800 square feet?”
  • We said reporting a single estimate for the slope is not wise, and we should report a plausible range instead
  • Similarly, reporting a single prediction for a new value is not wise, and we should report a plausible range instead

Two types of predictions

  1. Prediction for the mean: “What is the average predicted sale price of Duke Forest houses that are 2,800 square feet?”

  2. Prediction for an individual observation: “What is the predicted sale price of a Duke Forest house that is 2,800 square feet?”

Which would you expect to be more variable? The average prediction or the prediction for an individual observation? Based on your answer, how would you expect the widths of plausible ranges for these two predictions to compare?

Uncertainty in predictions

Confidence interval for the mean outcome: \(\hat{y} \pm t^*_{n-2} \times SE_{\hat{\mu}}\)

Prediction interval for an individual observation: \(\hat{y} \pm t^*_{n-2} \times SE_{\hat{y}}\)

Standard errors

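Standard error of the mean outcome:

\[SE_{\hat{\mu}} = \hat{\sigma}_\epsilon \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}\]

Standard error of an individual outcome:

\[SE_{\hat{y}} = \hat{\sigma}_\epsilon \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}\]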
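The two standard errors differ only by the extra 1 under the square root. The sketch below computes both by hand for a 2,800 square foot house and cross-checks the first against predict(). It assumes the openintro package is installed; fit_lm, x_new, dist_term, se_mean, and se_indiv are names introduced here for illustration.

```r
library(openintro)  # for the duke_forest dataset

fit_lm <- lm(price ~ area, data = duke_forest)

x         <- duke_forest$area
n         <- nobs(fit_lm)
sigma_hat <- sigma(fit_lm)   # regression standard error
x_new     <- 2800            # area at which we predict

# Term shared by both formulas: squared distance of x_new from the mean
# area, relative to the total spread of the observed areas
dist_term <- (x_new - mean(x))^2 / sum((x - mean(x))^2)

se_mean  <- sigma_hat * sqrt(1 / n + dist_term)       # mean outcome
se_indiv <- sigma_hat * sqrt(1 + 1 / n + dist_term)   # individual outcome

# Cross-check the mean-outcome standard error against predict.lm()
pred <- predict(fit_lm, newdata = data.frame(area = x_new), se.fit = TRUE)
c(by_hand = se_mean, predict_lm = as.numeric(pred$se.fit))

# The individual-outcome standard error is always the larger of the two
c(se_mean = se_mean, se_indiv = se_indiv)
```

The extra 1 contributes the variability of a single house around the regression line, so se_indiv can never be smaller than sigma_hat, no matter how large n is.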

Confidence interval

The 95% confidence interval for the mean outcome:

new_house <- tibble(area = 2800)

predict(df_fit, new_data = new_house, type = "conf_int", level = 0.95) |>
  kable()
.pred_lower .pred_upper
529351 597060.1

We are 95% confident that the mean sale price of Duke Forest houses that are 2,800 square feet is between $529,351 and $597,060.

Prediction interval

The 95% prediction interval for an individual outcome:

predict(df_fit, new_data = new_house, type = "pred_int", level = 0.95) |>
  kable()
.pred_lower .pred_upper
226438.3 899972.7

We are 95% confident that the predicted sale price of a Duke Forest house that is 2,800 square feet is between $226,438 and $899,973.

Comparing intervals
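One way to compare the two intervals is to look at their midpoints and widths. The sketch below uses only the bounds reported on the previous two slides; conf_int and pred_int are names introduced here for illustration.

```r
# Bounds copied from the predict() output for a 2,800 square foot house
conf_int <- c(lower = 529351.0, upper = 597060.1)  # 95% CI for the mean outcome
pred_int <- c(lower = 226438.3, upper = 899972.7)  # 95% PI for an individual outcome

# Both intervals are centered on the same predicted value, y_hat
midpoints <- c(conf_int = mean(conf_int), pred_int = mean(pred_int))
midpoints

# The prediction interval is much wider than the confidence interval
widths <- c(conf_int = unname(diff(conf_int)), pred_int = unname(diff(pred_int)))
widths
widths[["pred_int"]] / widths[["conf_int"]]
```

For this house the prediction interval is roughly ten times as wide as the confidence interval, even though both share the same center.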

Extrapolation

Calculate the prediction interval for the sale price of a “tiny house” in Duke Forest that is 225 square feet.

[Image: black tiny house on wheels]

No, thanks!
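R will return a prediction interval for any area we give it, so the judgment about extrapolation has to come from us. A minimal sketch of the check, assuming the openintro package is installed (fit_lm, tiny_house, and pi_tiny are names introduced here for illustration):

```r
library(openintro)  # for the duke_forest dataset

fit_lm <- lm(price ~ area, data = duke_forest)

# Range of areas the model was actually fit on
range(duke_forest$area)

# predict() still produces an interval for 225 square feet, even though
# that area lies outside the observed range
tiny_house <- data.frame(area = 225)
pi_tiny <- predict(fit_lm, newdata = tiny_house,
                   interval = "prediction", level = 0.95)
pi_tiny
```

Since 225 square feet is well below the smallest house in the data, the interval rests entirely on the assumption that the same linear relationship holds there, which the data cannot confirm.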

🔗 Week 04
