Prof. Maria Tackett
Oct 26, 2022
See Gradescope for feedback on project topic ideas.
See Week 09 activities.
Conduct a hypothesis test for \(\beta_j\)
Calculate a confidence interval for \(\beta_j\)
Inference pitfalls
Split data into training and test sets.
Use cross validation on the training set to fit, evaluate, and compare candidate models. Choose a final model based on summary of cross validation results.
Refit the model using the entire training set and do “final” evaluation on the test set (make sure you have not overfit the model).
Use model fit on training set for inference and prediction.
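The workflow above can be sketched end to end with tidymodels. This is only an illustration under stated assumptions: `toy_trail` is a hypothetical, simulated stand-in (not the rail trail data), and the 75/25 split, the 5 folds, and the two candidate formulas are arbitrary choices made for the example.

```r
library(tidymodels)

set.seed(210)

# Hypothetical stand-in data, simulated only so the sketch runs on its own
toy_trail <- tibble(
  hightemp = runif(90, min = 40, max = 95),
  season   = sample(c("Fall", "Spring", "Summer"), size = 90, replace = TRUE),
  volume   = 7 * hightemp + rnorm(90, sd = 100)
)

# Step 1: split the data into training and test sets
toy_split <- initial_split(toy_trail, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Step 2: cross validation on the training set to compare candidate models
folds   <- vfold_cv(toy_train, v = 5)
lm_spec <- linear_reg() |> set_engine("lm")

cv_simple <- fit_resamples(lm_spec, volume ~ hightemp, resamples = folds)
cv_main   <- fit_resamples(lm_spec, volume ~ hightemp + season, resamples = folds)

collect_metrics(cv_simple)
collect_metrics(cv_main)

# Step 3: refit the chosen model on the entire training set,
# then do the "final" evaluation on the test set
final_fit  <- fit(lm_spec, volume ~ hightemp + season, data = toy_train)
test_preds <- predict(final_fit, new_data = toy_test) |> bind_cols(toy_test)
test_rmse  <- rmse(test_preds, truth = volume, estimate = .pred)
test_rmse

# Step 4: use the model fit on the training set for inference
tidy(final_fit)
```

Comparing the cross validation summaries before touching the test set keeps the test set as a one-time check for overfitting.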
rail_trail
# A tibble: 90 × 7
volume hightemp avgtemp season cloudcover precip day_type
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 501 83 66.5 Summer 7.60 0 Weekday
2 419 73 61 Summer 6.30 0.290 Weekday
3 397 74 63 Spring 7.5 0.320 Weekday
4 385 95 78 Summer 2.60 0 Weekend
5 200 44 48 Spring 10 0.140 Weekday
6 375 69 61.5 Spring 6.60 0.0200 Weekday
7 417 66 52.5 Spring 2.40 0 Weekday
8 629 66 52 Spring 0 0 Weekend
9 533 80 67.5 Summer 3.80 0 Weekend
10 547 79 62 Summer 4.10 0 Weekday
# … with 80 more rows
Source: Pioneer Valley Planning Commission via the mosaicData package.
Outcome:
volume: estimated number of trail users that day (number of breaks recorded)

Predictors:
hightemp: daily high temperature (in degrees Fahrenheit)
avgtemp: average of daily low and daily high temperature (in degrees Fahrenheit)
season: one of “Fall”, “Spring”, or “Summer”
cloudcover: measure of cloud cover (in oktas)
precip: measure of precipitation (in inches)
day_type: one of “Weekday” or “Weekend”

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -17.08 | 59.40 | -0.29 | 0.77 |
hightemp | 5.70 | 0.85 | 6.72 | 0.00 |
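# Fit the main effects model: volume as a function of high temperature and season
rt_mlr_main_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(volume ~ hightemp + season, data = rail_trail)

# Display the coefficient table, rounded to 2 decimal places
tidy(rt_mlr_main_fit) |> kable(digits = 2)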
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 |
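The `statistic` and `p.value` columns correspond to the test of \(H_0: \beta_j = 0\) versus \(H_a: \beta_j \neq 0\). As a rough check, they can be approximated from the rounded estimates and standard errors in the table. The sketch assumes \(n = 90\) observations and \(p = 3\) predictor terms, so \(90 - 3 - 1 = 86\) degrees of freedom. Because the inputs are rounded, the results can differ from the table in the last digit (for example, 7.54 / 1.17 gives about 6.44 rather than 6.43).

```r
# Estimates and standard errors copied from the table above (rounded values)
estimate  <- c("(Intercept)" = -125.23, hightemp = 7.54,
               seasonSpring = 5.13, seasonSummer = -76.84)
std_error <- c("(Intercept)" = 71.66, hightemp = 1.17,
               seasonSpring = 34.32, seasonSummer = 47.71)

# Degrees of freedom: n - p - 1, assuming n = 90 and p = 3
df_resid <- 90 - 3 - 1

# Test statistic: estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

round(cbind(t_stat, p_value), 2)
```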
season is in the model
season = Spring
\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 1 - 76.84 \times 0 \\ &= -120.10 + 7.54 \times \texttt{hightemp} \end{aligned} \]
season = Summer
\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 1 \\ &= -202.07 + 7.54 \times \texttt{hightemp} \end{aligned} \]
season = Fall
\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 0 \\ &= -125.23 + 7.54 \times \texttt{hightemp} \end{aligned} \]
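The three substitutions above amount to adding the relevant season coefficient to the baseline intercept. A minimal sketch of that arithmetic, using the rounded estimates from the table (the object names are just illustrative):

```r
# Coefficient estimates copied from the tidy() output (rounded to 2 decimals)
coefs <- c(
  "(Intercept)"  = -125.23,
  "hightemp"     = 7.54,
  "seasonSpring" = 5.13,
  "seasonSummer" = -76.84
)

# Fall is the baseline level, so both season indicators are 0 for Fall
season_intercepts <- c(
  Fall   = unname(coefs["(Intercept)"]),
  Spring = unname(coefs["(Intercept)"] + coefs["seasonSpring"]),
  Summer = unname(coefs["(Intercept)"] + coefs["seasonSummer"])
)

round(season_intercepts, 2)
```

The slope on `hightemp` is not touched by this calculation, which is why it is the same in all three equations.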
Same slope, different intercepts
season = Spring: \(-120.10 + 7.54 \times \texttt{hightemp}\)
season = Summer: \(-202.07 + 7.54 \times \texttt{hightemp}\)
season = Fall: \(-125.23 + 7.54 \times \texttt{hightemp}\)

Ex 1. Add an interaction effect between hightemp and season to the model. Do the data provide evidence of a significant interaction effect? Comment on the significance of the interaction terms.
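As a sketch of what the exercise's model looks like, with Fall as the baseline as in the main effects model (the coefficient labels \(\hat{\beta}_0, \ldots, \hat{\beta}_5\) are notation assumed here, not estimates):

\[ \begin{aligned} \widehat{volume} = \hat{\beta}_0 &+ \hat{\beta}_1 \times \texttt{hightemp} + \hat{\beta}_2 \times \texttt{seasonSpring} + \hat{\beta}_3 \times \texttt{seasonSummer} \\ &+ \hat{\beta}_4 \times \texttt{hightemp} \times \texttt{seasonSpring} + \hat{\beta}_5 \times \texttt{hightemp} \times \texttt{seasonSummer} \end{aligned} \]

Under this form the slope of hightemp is \(\hat{\beta}_1\) in Fall, \(\hat{\beta}_1 + \hat{\beta}_4\) in Spring, and \(\hat{\beta}_1 + \hat{\beta}_5\) in Summer, so the interaction terms let both the intercept and the slope differ by season.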
The \(C\%\) confidence interval for \(\beta_j\) is \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) is the critical value from the \(t\) distribution with \(n - p - 1\) degrees of freedom.
Generically, we are \(C\%\) confident that the interval LB to UB contains the population coefficient of \(x_j\).
In context, we are \(C\%\) confident that for every one unit increase in \(x_j\), we expect \(y\) to change by LB to UB units, on average, holding all else constant.
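The interval formula can be checked by hand for `hightemp` in the main effects model. This sketch assumes \(n = 90\) observations and \(p = 3\) predictor terms, and uses the rounded estimate (7.54) and standard error (1.17) reported for that model.

```r
# Rounded estimate and standard error for hightemp in the main effects model
est <- 7.54
se  <- 1.17

# Degrees of freedom: n - p - 1, assuming n = 90 and p = 3
n <- 90
p <- 3
df_resid <- n - p - 1

# Critical value for a 95% interval: 97.5th percentile of the t distribution
t_star <- qt(0.975, df = df_resid)

# Lower and upper bounds: estimate -/+ t* x SE
ci <- est + c(-1, 1) * t_star * se
round(ci, 2)
```

With the fitted model object available, `tidy(rt_mlr_main_fit, conf.int = TRUE)` is one way to get the `conf.low` and `conf.high` columns directly; the default confidence level there is 95%.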
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 | -267.68 | 17.22 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 | 5.21 | 9.87 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 | -63.10 | 73.36 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 | -171.68 | 18.00 |
hightemp
We are 95% confident that for every degree Fahrenheit the day is warmer, the number of riders increases by 5.21 to 9.87, on average, holding season constant.
seasonSpring
We are 95% confident that the number of riders on a Spring day is lower by 63.1 to higher by 73.4 compared to a Fall day, on average, holding high temperature for the day constant.
Is season a significant predictor of the number of riders, after accounting for high temperature?