MLR Inference

Prof. Maria Tackett

Oct 26, 2022

Announcements

  • See Gradescope for feedback on project topic ideas.

    • Read comments carefully. Even if a data set is marked “usable”, there may be suggestions about extensive data cleaning required to make it appropriate for the project.
    • Attend office hours or talk with TAs in lab if you have questions.
  • See Week 09 activities.

Topics

  • Conduct a hypothesis test for \(\beta_j\)

  • Calculate a confidence interval for \(\beta_j\)

  • Inference pitfalls

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(knitr)      # for tables
library(patchwork)  # for laying out plots
library(rms)        # for vif

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Modeling workflow

  • Split data into training and test sets.

  • Use cross validation on the training set to fit, evaluate, and compare candidate models. Choose a final model based on a summary of the cross validation results.

  • Refit the model using the entire training set and do “final” evaluation on the test set (make sure you have not overfit the model).

    • Adjust as needed if there is evidence of overfitting.
  • Use the model fit on the entire training set for inference and prediction.
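As a sketch, the workflow above could look like the following in base R on simulated data. The column names, the 75/25 split, and the choice of 5 folds are illustrative assumptions, not taken from the slides (the course itself uses tidymodels, but the steps are the same):

```r
set.seed(123)

# simulated stand-in for a real data set (hypothetical columns)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 3 + 2 * df$x1 + rnorm(n)

# 1. split into training and test sets (75% / 25%)
train_idx <- sample(n, size = 0.75 * n)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# 2. 5-fold cross validation on the training set to compare candidates
folds <- sample(rep(1:5, length.out = nrow(train)))
cv_rmse <- function(formula) {
  errs <- sapply(1:5, function(k) {
    fit  <- lm(formula, data = train[folds != k, ])
    pred <- predict(fit, newdata = train[folds == k, ])
    sqrt(mean((train$y[folds == k] - pred)^2))
  })
  mean(errs)
}
rmse_small <- cv_rmse(y ~ x1)
rmse_big   <- cv_rmse(y ~ x1 + x2)

# 3. refit the chosen model on the full training set, then check
#    test-set RMSE for evidence of overfitting
final_fit <- lm(y ~ x1, data = train)
test_rmse <- sqrt(mean((test$y - predict(final_fit, newdata = test))^2))
```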

Data: rail_trail

  • The Pioneer Valley Planning Commission (PVPC) collected data for ninety days from April 5, 2005 to November 15, 2005.
  • Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.
# A tibble: 90 × 7
   volume hightemp avgtemp season cloudcover precip day_type
    <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>   
 1    501       83    66.5 Summer       7.60 0      Weekday 
 2    419       73    61   Summer       6.30 0.290  Weekday 
 3    397       74    63   Spring       7.5  0.320  Weekday 
 4    385       95    78   Summer       2.60 0      Weekend 
 5    200       44    48   Spring      10    0.140  Weekday 
 6    375       69    61.5 Spring       6.60 0.0200 Weekday 
 7    417       66    52.5 Spring       2.40 0      Weekday 
 8    629       66    52   Spring       0    0      Weekend 
 9    533       80    67.5 Summer       3.80 0      Weekend 
10    547       79    62   Summer       4.10 0      Weekday 
# … with 80 more rows

Source: Pioneer Valley Planning Commission via the mosaicData package.

Variables

Outcome:

volume estimated number of trail users that day (number of breaks recorded)

Predictors

  • hightemp daily high temperature (in degrees Fahrenheit)
  • avgtemp average of daily low and daily high temperature (in degrees Fahrenheit)
  • season one of “Fall”, “Spring”, or “Summer”
  • cloudcover measure of cloud cover (in oktas)
  • precip measure of precipitation (in inches)
  • day_type one of “Weekday” or “Weekend”

Conduct a hypothesis test for \(\beta_j\)

Review: Simple linear regression (SLR)

ggplot(rail_trail, aes(x = hightemp, y = volume)) + 
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "High temp (F)", y = "Number of riders")

SLR model summary

rt_slr_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(volume ~ hightemp, data = rail_trail)

tidy(rt_slr_fit) |> kable()
term estimate std.error statistic p.value
(Intercept) -17.079280 59.3953040 -0.2875527 0.7743652
hightemp 5.701878 0.8480074 6.7238541 0.0000000

SLR hypothesis test

term estimate std.error statistic p.value
(Intercept) -17.08 59.40 -0.29 0.77
hightemp 5.70 0.85 6.72 0.00
  1. Set hypotheses: \(H_0: \beta_1 = 0\) vs. \(H_A: \beta_1 \ne 0\)
  2. Calculate test statistic and p-value: The test statistic is \(t = 6.72\). The p-value is calculated using a \(t\) distribution with 88 degrees of freedom. The p-value is \(\approx 0\).
  3. State the conclusion: The p-value is small, so we reject \(H_0\). The data provide strong evidence that high temperature is a helpful predictor for the number of daily riders, i.e., there is a linear relationship between high temperature and number of daily riders.
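The arithmetic in step 2 can be reproduced directly from the summary output (a base R sketch; the estimate and standard error come from the unrounded table above):

```r
# t statistic and p-value for the hightemp slope
estimate <- 5.701878
se       <- 0.8480074
t_stat   <- estimate / se             # ~6.72
df       <- 90 - 1 - 1                # n - p - 1 = 88
p_value  <- 2 * pt(-abs(t_stat), df)  # two-sided p-value, ~0
```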

Multiple linear regression

rt_mlr_main_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(volume ~ hightemp + season, data = rail_trail)

tidy(rt_mlr_main_fit) |> kable(digits = 2)
term estimate std.error statistic p.value
(Intercept) -125.23 71.66 -1.75 0.08
hightemp 7.54 1.17 6.43 0.00
seasonSpring 5.13 34.32 0.15 0.88
seasonSummer -76.84 47.71 -1.61 0.11

MLR hypothesis test: hightemp

  1. Set hypotheses: \(H_0: \beta_{hightemp} = 0\) vs. \(H_A: \beta_{hightemp} \ne 0\), given season is in the model
  2. Calculate test statistic and p-value: The test statistic is \(t = 6.43\). The p-value is calculated using a \(t\) distribution with 86 (\(n - p - 1\)) degrees of freedom. The p-value is \(\approx 0\).
  3. State the conclusion: The p-value is small, so we reject \(H_0\). The data provide strong evidence that the high temperature for the day is a useful predictor in a model that already contains season as a predictor for the number of daily riders.

The model for season = Spring

term estimate std.error statistic p.value
(Intercept) -125.23 71.66 -1.75 0.08
hightemp 7.54 1.17 6.43 0.00
seasonSpring 5.13 34.32 0.15 0.88
seasonSummer -76.84 47.71 -1.61 0.11


\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 1 - 76.84 \times 0 \\ &= -120.10 + 7.54 \times \texttt{hightemp} \end{aligned} \]

The model for season = Summer

term estimate std.error statistic p.value
(Intercept) -125.23 71.66 -1.75 0.08
hightemp 7.54 1.17 6.43 0.00
seasonSpring 5.13 34.32 0.15 0.88
seasonSummer -76.84 47.71 -1.61 0.11


\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 1 \\ &= -202.07 + 7.54 \times \texttt{hightemp} \end{aligned} \]

The model for season = Fall

term estimate std.error statistic p.value
(Intercept) -125.23 71.66 -1.75 0.08
hightemp 7.54 1.17 6.43 0.00
seasonSpring 5.13 34.32 0.15 0.88
seasonSummer -76.84 47.71 -1.61 0.11


\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 0 \\ &= -125.23 + 7.54 \times \texttt{hightemp} \end{aligned} \]

The models

Same slope, different intercepts

  • season = Spring: \(-120.10 + 7.54 \times \texttt{hightemp}\)
  • season = Summer: \(-202.07 + 7.54 \times \texttt{hightemp}\)
  • season = Fall: \(-125.23 + 7.54 \times \texttt{hightemp}\)
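The season-specific intercepts above are just sums of the fitted coefficients; a quick arithmetic check:

```r
# coefficients from the fitted MLR model (Fall is the baseline season)
b <- c(intercept = -125.23, hightemp = 7.54,
       seasonSpring = 5.13, seasonSummer = -76.84)

spring_intercept <- unname(b["intercept"] + b["seasonSpring"])  # -120.10
summer_intercept <- unname(b["intercept"] + b["seasonSummer"])  # -202.07
fall_intercept   <- unname(b["intercept"])                      # -125.23
```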

Application exercise

Ex 1. Add an interaction effect between hightemp and season to the model. Do the data provide evidence of a significant interaction effect? Comment on the significance of the interaction terms.


Confidence interval for \(\beta_j\)

Confidence interval for \(\beta_j\)

  • The \(C\%\) confidence interval for \(\beta_j\) is \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) follows a \(t\) distribution with \(n - p - 1\) degrees of freedom.

  • Generically, we are \(C\%\) confident that the interval LB to UB contains the population coefficient of \(x_j\).

  • In context, we are \(C\%\) confident that for every one unit increase in \(x_j\), we expect \(y\) to change by LB to UB units, holding all else constant.
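As a sketch, the interval for the hightemp coefficient can be reconstructed from the formula above, using the rounded values from the regression table and \(n - p - 1 = 90 - 3 - 1\) degrees of freedom:

```r
# 95% CI for the hightemp coefficient: estimate +/- t* x SE
estimate <- 7.54
se       <- 1.17
df       <- 90 - 3 - 1        # n - p - 1 = 86
t_star   <- qt(0.975, df)     # critical value, ~1.99
ci <- estimate + c(-1, 1) * t_star * se  # approximately (5.21, 9.87)
```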

Confidence interval for \(\beta_j\)

tidy(rt_mlr_main_fit, conf.int = TRUE) |>
  kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -125.23 71.66 -1.75 0.08 -267.68 17.22
hightemp 7.54 1.17 6.43 0.00 5.21 9.87
seasonSpring 5.13 34.32 0.15 0.88 -63.10 73.36
seasonSummer -76.84 47.71 -1.61 0.11 -171.68 18.00

CI for hightemp

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -125.23 71.66 -1.75 0.08 -267.68 17.22
hightemp 7.54 1.17 6.43 0.00 5.21 9.87
seasonSpring 5.13 34.32 0.15 0.88 -63.10 73.36
seasonSummer -76.84 47.71 -1.61 0.11 -171.68 18.00


We are 95% confident that for every degree Fahrenheit the day is warmer, the number of riders increases by 5.21 to 9.87, on average, holding season constant.

CI for seasonSpring

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -125.23 71.66 -1.75 0.08 -267.68 17.22
hightemp 7.54 1.17 6.43 0.00 5.21 9.87
seasonSpring 5.13 34.32 0.15 0.88 -63.10 73.36
seasonSummer -76.84 47.71 -1.61 0.11 -171.68 18.00


We are 95% confident that, on average, the number of riders on a Spring day is between 63.1 lower and 73.4 higher than on a Fall day, holding the high temperature for the day constant.

Is season a significant predictor of the number of riders, after accounting for high temperature?
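One standard way to answer this is a nested (partial) F-test comparing the models with and without season, since season enters through two dummy variables and should be tested as a group rather than one coefficient at a time. A sketch on simulated data (the data-generating values below are hypothetical, loosely mimicking the fitted coefficients; the real data would come from mosaicData):

```r
set.seed(42)
n <- 90
season   <- factor(sample(c("Fall", "Spring", "Summer"), n, replace = TRUE))
hightemp <- rnorm(n, mean = 70, sd = 10)
volume   <- -125 + 7.5 * hightemp +
  c(Fall = 0, Spring = 5, Summer = -75)[as.character(season)] +
  rnorm(n, sd = 100)

reduced <- lm(volume ~ hightemp)           # model without season
full    <- lm(volume ~ hightemp + season)  # model with season
anova(reduced, full)  # F-test for adding season, given hightemp
```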

Inference pitfalls

Large sample sizes

Danger

If the sample size is large enough, the test will likely result in rejecting \(H_0: \beta_j = 0\) even if \(x_j\) has a very small effect on \(y\).

  • Consider the practical significance of the result, not just the statistical significance.

  • Use the confidence interval to draw conclusions instead of relying only on p-values.
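A small simulation illustrates the danger; the effect size (0.03) and sample size here are arbitrary choices for the sketch:

```r
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- 0.03 * x + rnorm(n)  # true effect is practically negligible

m <- lm(y ~ x)
summary(m)$coefficients["x", ]  # tiny estimate, yet tiny p-value
confint(m)["x", ]               # the CI makes the small effect size visible
```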

Small sample sizes

Danger

If the sample size is small, there may not be enough evidence to reject \(H_0: \beta_j=0\).

  • When you fail to reject the null hypothesis, DON’T immediately conclude that the variable has no association with the response.

  • There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.
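A simulated example of the last point: a strong but purely quadratic association, which the t-test on a linear term can miss entirely:

```r
set.seed(7)
n <- 200
x <- runif(n, -1, 1)
y <- x^2 + rnorm(n, sd = 0.1)  # strong, but non-linear, association

linear    <- lm(y ~ x)          # linear slope is typically near zero here
quadratic <- lm(y ~ x + I(x^2)) # the quadratic term captures the pattern

summary(linear)$coefficients["x", "Pr(>|t|)"]          # often large
summary(quadratic)$coefficients["I(x^2)", "Pr(>|t|)"]  # very small
```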