Introduction
Prof. Maria Tackett
Nov 02, 2022
Project proposal due Fri, Nov 04 at 11:59pm
HW 03 due Mon, Nov 07 at 11:59pm
Team Feedback #1 due Tue, Nov 08 at 11:59pm
See Week 10 activities
Logistic regression for binary response variable
Relationship between odds and probabilities
Use logistic regression model to calculate predicted odds and probabilities
Quantitative outcome variable: linear regression
Categorical outcome variable: logistic regression
Method | Outcomes | Example |
---|---|---|
Logistic regression | 2 outcomes | 1: Yes, 0: No |
Multinomial logistic regression | 3+ outcomes | 1: Democrat, 2: Republican, 3: Independent |
Students in grades 9 - 12 surveyed about health risk behaviors including whether they usually get 7 or more hours of sleep.
Sleep7
: 1: yes, 0: no
library(tidyverse)  # for as_tibble() and filter()
library(Stat2Data)  # contains the YouthRisk2009 data

data(YouthRisk2009)
sleep <- YouthRisk2009 |>
as_tibble() |>
filter(!is.na(Age), !is.na(Sleep7))
sleep |>
relocate(Age, Sleep7)
# A tibble: 446 × 6
Age Sleep7 Sleep SmokeLife SmokeDaily MarijuaEver
<int> <int> <fct> <fct> <fct> <int>
1 16 1 8 hours Yes Yes 1
2 17 0 5 hours Yes Yes 1
3 18 0 5 hours Yes Yes 1
4 17 1 7 hours Yes No 1
5 15 0 4 or less hours No No 0
6 17 0 6 hours No No 0
7 17 1 7 hours No No 0
8 16 1 8 hours Yes No 0
9 16 1 8 hours No No 0
10 18 0 4 or less hours Yes Yes 1
# … with 436 more rows
Outcome: \(Y\) = 1: yes, 0: no
Outcome: Probability of getting 7+ hours of sleep
Outcome: Probability of getting 7+ hours of sleep
🛑 This model can produce predictions below 0 or above 1.
✅ This model (called a logistic regression model) only produces predictions between 0 and 1.
Method | Outcome | Model |
---|---|---|
Linear regression | Quantitative | \(Y = \beta_0 + \beta_1~ X\) |
Linear regression (transform Y) | Quantitative | \(\log(Y) = \beta_0 + \beta_1~ X\) |
Logistic regression | Binary | \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1 ~ X\) |
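For instance, the logistic model in the last row could be fit to the `sleep` data with tidymodels (a sketch; it assumes the tidymodels package is installed and that the outcome has been converted to a factor):

```r
library(tidymodels)

# classification models need a factor outcome
sleep <- sleep |>
  mutate(Sleep7 = factor(Sleep7))

# fit log(pi / (1 - pi)) = beta0 + beta1 * Age
sleep_fit <- logistic_reg() |>
  fit(Sleep7 ~ Age, data = sleep)

tidy(sleep_fit)
```

`logistic_reg()` uses the `"glm"` engine by default, so this is equivalent to `glm(Sleep7 ~ Age, data = sleep, family = binomial)`.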
📋 AE 13: Logistic Regression Intro
Linear Regression vs. Logistic Regression
Suppose there is a 70% chance it will rain tomorrow
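To make the rain example concrete, the odds follow directly from the probability:

```r
p_rain <- 0.7                      # P(rain tomorrow)
odds_rain <- p_rain / (1 - p_rain)
odds_rain                          # about 2.33: rain is 2.33 times as likely as no rain
```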
# A tibble: 2 × 3
Sleep7 n p
<int> <int> <dbl>
1 0 150 0.336
2 1 296 0.664
\(P(\text{7+ hours of sleep}) = P(Y = 1) = p = 0.664\)
\(P(\text{< 7 hours of sleep}) = P(Y = 0) = 1 - p = 0.336\)
\(\text{odds of 7+ hours of sleep} = \frac{0.664}{0.336} = 1.976\)
odds
\[\omega = \frac{\pi}{1-\pi}\]
probability
\[\pi = \frac{\omega}{1 + \omega}\]
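These two conversions are easy to express as small helper functions (the function names are my own, for illustration):

```r
# omega = pi / (1 - pi)
odds_from_prob <- function(p) p / (1 - p)

# pi = omega / (1 + omega)
prob_from_odds <- function(omega) omega / (1 + omega)

odds_from_prob(0.664)   # about 1.976, the odds of 7+ hours of sleep
prob_from_odds(1.976)   # about 0.664, back to the probability
```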
\[\text{probability} = \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}}\]
Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]
Probability form:
\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]
This dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to use age to predict whether a randomly selected adult is at high risk of having coronary heart disease in the next 10 years.

high_risk
: 1: High risk of having heart disease in the next 10 years, 0: Not high risk

age
: Age at exam time (in years)

The data are in the `heart_disease` data frame:
# A tibble: 4,240 × 2
age high_risk
<dbl> <fct>
1 39 0
2 46 0
3 48 0
4 61 1
5 46 0
6 43 0
7 63 1
8 45 0
9 52 0
10 43 0
# … with 4,230 more rows
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -5.561 | 0.284 | -19.599 | 0 |
age | 0.075 | 0.005 | 14.178 | 0 |
\[\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -5.561 + 0.075 \times \text{age}\] where \(\hat{\pi}\) is the predicted probability of being high risk of heart disease
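As a quick sanity check of the fitted equation, we can plug in an age by hand (the age of 60 here is an illustrative value, not an observation from the data):

```r
b0 <- -5.561
b1 <- 0.075

# predicted log-odds of being high risk for a hypothetical 60-year-old
log_odds <- b0 + b1 * 60              # -1.061

# convert log-odds to a predicted probability
exp(log_odds) / (1 + exp(log_odds))   # about 0.257
```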
# A tibble: 4,240 × 2
.fitted .resid
<dbl> <dbl>
1 -2.65 -0.370
2 -2.13 -0.475
3 -1.98 -0.509
4 -1.01 1.62
5 -2.13 -0.475
6 -2.35 -0.427
7 -0.858 1.56
8 -2.20 -0.458
9 -1.68 -0.585
10 -2.35 -0.427
# … with 4,230 more rows
For observation 1
\[\text{predicted odds} = \hat{\omega} = \frac{\hat{\pi}}{1-\hat{\pi}} = \exp\{-2.650\} = 0.071\]
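The `.fitted` column above is on the log-odds scale, so exponentiating recovers the predicted odds:

```r
# .fitted is log-odds, so exp() gives the predicted odds
exp(-2.650)   # about 0.071, the predicted odds for observation 1
```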
# A tibble: 4,240 × 2
.pred_0 .pred_1
<dbl> <dbl>
1 0.934 0.0660
2 0.894 0.106
3 0.878 0.122
4 0.733 0.267
5 0.894 0.106
6 0.913 0.0870
7 0.702 0.298
8 0.900 0.0996
9 0.843 0.157
10 0.913 0.0870
# … with 4,230 more rows
For observation 1

\[\text{predicted probability} = \hat{\pi} = \frac{\exp\{-2.650\}}{1 + \exp\{-2.650\}} = 0.066\]
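The same arithmetic in R, using either the probability form directly or R's built-in inverse logit `plogis()`:

```r
# probability form: pi = exp(eta) / (1 + exp(eta)), with eta = -2.650
exp(-2.650) / (1 + exp(-2.650))   # about 0.066

# equivalent built-in inverse logit
plogis(-2.650)
```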
For a logistic regression, the default prediction is the class (`.pred_class`).
What does the following table show?
predict(heart_disease_fit, new_data = heart_disease) |>
bind_cols(heart_disease) |>
count(high_risk, .pred_class)
# A tibble: 2 × 3
high_risk .pred_class n
<fct> <fct> <int>
1 0 0 3596
2 1 0 644
The .pred_class
is the class with the highest predicted probability. What is a limitation to using this method to determine the predicted class?
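One possible limitation: `.pred_class` effectively uses a fixed 0.5 cutoff, and since few observations here have predicted probability above 0.5, everyone is classified as not high risk. A sketch of classifying with a different cutoff, assuming the `heart_disease_fit` and `heart_disease` objects from above (the 0.2 cutoff is an arbitrary illustrative choice):

```r
predict(heart_disease_fit, new_data = heart_disease, type = "prob") |>
  bind_cols(heart_disease) |>
  # classify as high risk when predicted probability exceeds 0.2 (not 0.5)
  mutate(pred_class_0.2 = if_else(.pred_1 > 0.2, "1", "0")) |>
  count(high_risk, pred_class_0.2)
```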
Logistic regression for binary response variable
Relationship between odds and probabilities
Used logistic regression model to calculate predicted odds and probabilities