Logistic regression

Introduction

Prof. Maria Tackett

Nov 02, 2022

Announcements

  • Project proposal due Fri, Nov 04 at 11:59pm

  • HW 03 due Mon, Nov 07 at 11:59pm

  • Team Feedback #1 due Tue, Nov 08 at 11:59pm

    • Received email from Teammates around 10am
    • Check your junk/spam folder if you do not see the email.
    • Email me if you still don’t see it.
  • See Week 10 activities

Topics

  • Logistic regression for binary response variable

  • Relationship between odds and probabilities

  • Use logistic regression model to calculate predicted odds and probabilities

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(knitr)
library(Stat2Data)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Predicting categorical outcomes

Types of outcome variables

Quantitative outcome variable:

  • Sales price of a house in Levittown, NY
  • Model: Expected sales price given the number of bedrooms, lot size, etc.

Categorical outcome variable:

  • High risk of coronary heart disease
  • Model: Probability an adult is high risk of heart disease given their age, total cholesterol, etc.

Models for categorical outcomes

Logistic regression

2 Outcomes

1: Yes, 0: No

Multinomial logistic regression

3+ Outcomes

1: Democrat, 2: Republican, 3: Independent

2022 election forecasts

Source: FiveThirtyEight 2022 Election Forecasts

2020 NBA finals predictions

Source: FiveThirtyEight 2019-20 NBA Predictions

Do teenagers get 7+ hours of sleep?

Students in grades 9 - 12 surveyed about health risk behaviors including whether they usually get 7 or more hours of sleep.

Sleep7

1: yes

0: no

data(YouthRisk2009) #from Stat2Data package
sleep <- YouthRisk2009 |>
  as_tibble() |>
  filter(!is.na(Age), !is.na(Sleep7))
sleep |>
  relocate(Age, Sleep7)
# A tibble: 446 × 6
     Age Sleep7 Sleep           SmokeLife SmokeDaily MarijuaEver
   <int>  <int> <fct>           <fct>     <fct>            <int>
 1    16      1 8 hours         Yes       Yes                  1
 2    17      0 5 hours         Yes       Yes                  1
 3    18      0 5 hours         Yes       Yes                  1
 4    17      1 7 hours         Yes       No                   1
 5    15      0 4 or less hours No        No                   0
 6    17      0 6 hours         No        No                   0
 7    17      1 7 hours         No        No                   0
 8    16      1 8 hours         Yes       No                   0
 9    16      1 8 hours         No        No                   0
10    18      0 4 or less hours Yes       Yes                  1
# … with 436 more rows

Plot the data

ggplot(sleep, aes(x = Age, y = Sleep7)) +
  geom_point() + 
  labs(y = "Getting 7+ hours of sleep")

Let’s fit a linear regression model

Outcome: Y = 1: yes, 0: no

Let’s use proportions

Outcome: Probability of getting 7+ hours of sleep

What happens if we zoom out?

Outcome: Probability of getting 7+ hours of sleep

🛑 This model produces predictions outside of 0 and 1.

Let’s try another model

✅ This model (called a logistic regression model) only produces predictions between 0 and 1.

The code

ggplot(sleep_age, aes(x = Age, y = prop)) +
  geom_point() + 
  geom_hline(yintercept = c(0,1), lty = 2) + 
  stat_smooth(method ="glm", method.args = list(family = "binomial"), 
              fullrange = TRUE, se = FALSE) +
  labs(y = "P(7+ hours of sleep)") +
  xlim(1, 40) +
  ylim(-0.5, 1.5)

Different types of models

Method Outcome Model
Linear regression Quantitative Y=β0+β1 X
Linear regression (transform Y) Quantitative log(Y)=β0+β1 X
Logistic regression Binary log(π1−π)=β0+β1 X

Application exercise

📋 AE 13: Logistic Regression Intro

Linear Regression vs. Logistic Regression

Odds and probabilities

Binary response variable

  • Y=1: yes,0: no
  • π: probability that Y=1, i.e., P(Y=1)
  • π1−π: odds that Y=1
  • log(π1−π): log odds
  • Go from π to log(π1−π) using the logit transformation

Odds

Suppose there is a 70% chance it will rain tomorrow

  • Probability it will rain is p=0.7
  • Probability it won’t rain is 1−p=0.3
  • Odds it will rain are 7 to 3, 7:3, 0.70.3≈2.33

Are teenagers getting enough sleep?

sleep |>
  count(Sleep7) |>
  mutate(p = round(n / sum(n), 3))
# A tibble: 2 × 3
  Sleep7     n     p
   <int> <int> <dbl>
1      0   150 0.336
2      1   296 0.664

P(7+ hours of sleep)=P(Y=1)=p=0.664

P(< 7 hours of sleep)=P(Y=0)=1−p=0.336

P(odds of 7+ hours of sleep)=0.6640.336=1.976

From odds to probabilities

odds

ω=π1−π

probability

π=ω1+ω

Logistic regression

From odds to probabilities

  1. Logistic model: log odds = log(π1−π)=β0+β1 X
  2. Odds = exp{log(π1−π)}=π1−π
  3. Combining (1) and (2) with what we saw earlier

probability=π=exp{β0+β1 X}1+exp{β0+β1 X}

Logistic regression model

Logit form: log(π1−π)=β0+β1 X

Probability form:

π=exp{β0+β1 X}1+exp{β0+β1 X}

Risk of coronary heart disease

This dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to use age to predict if a randomly selected adult is high risk of having coronary heart disease in the next 10 years.


high_risk:

  • 1: High risk of having heart disease in next 10 years
  • 0: Not high risk of having heart disease in next 10 years

age: Age at exam time (in years)

Data: heart_disease

# A tibble: 4,240 × 2
     age high_risk
   <dbl> <fct>    
 1    39 0        
 2    46 0        
 3    48 0        
 4    61 1        
 5    46 0        
 6    43 0        
 7    63 1        
 8    45 0        
 9    52 0        
10    43 0        
# … with 4,230 more rows

High risk vs. age

ggplot(heart_disease, aes(x = high_risk, y = age)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "High risk - 1: yes, 0: no",
       y = "Age", 
       title = "Age vs. High risk of heart disease")

Let’s fit the model

heart_disease_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(high_risk ~ age, data = heart_disease, family = "binomial")

tidy(heart_disease_fit) |> kable(digits = 3)heart_disease_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(high_risk ~ age, data = heart_disease, family = "binomial")

tidy(heart_disease_fit) |> kable(digits = 3)heart_disease_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(high_risk ~ age, data = heart_disease, family = "binomial")

tidy(heart_disease_fit) |> kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) -5.561 0.284 -19.599 0
age 0.075 0.005 14.178 0

The model

tidy(heart_disease_fit) |> kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) -5.561 0.284 -19.599 0
age 0.075 0.005 14.178 0


log(ˆπ1−ˆπ)=−5.561+0.075×age where ˆπ is the predicted probability of being high risk of heart disease

Predicted log odds

augment(heart_disease_fit$fit) |> select(.fitted, .resid)
# A tibble: 4,240 × 2
   .fitted .resid
     <dbl>  <dbl>
 1  -2.65  -0.370
 2  -2.13  -0.475
 3  -1.98  -0.509
 4  -1.01   1.62 
 5  -2.13  -0.475
 6  -2.35  -0.427
 7  -0.858  1.56 
 8  -2.20  -0.458
 9  -1.68  -0.585
10  -2.35  -0.427
# … with 4,230 more rows

For observation 1

predicted odds=ˆω=ˆπ1−ˆπ=exp{−2.650}=0.071

Predicted probabilities

predict(heart_disease_fit, new_data = heart_disease, type = "prob")
# A tibble: 4,240 × 2
   .pred_0 .pred_1
     <dbl>   <dbl>
 1   0.934  0.0660
 2   0.894  0.106 
 3   0.878  0.122 
 4   0.733  0.267 
 5   0.894  0.106 
 6   0.913  0.0870
 7   0.702  0.298 
 8   0.900  0.0996
 9   0.843  0.157 
10   0.913  0.0870
# … with 4,230 more rows

predicted probabilities=ˆπ=exp{−2.650}1+exp{−2.650}=0.066

Predicted classes

predict(heart_disease_fit, new_data = heart_disease, type = "class")
# A tibble: 4,240 × 1
   .pred_class
   <fct>      
 1 0          
 2 0          
 3 0          
 4 0          
 5 0          
 6 0          
 7 0          
 8 0          
 9 0          
10 0          
# … with 4,230 more rows

Default prediction

For a logistic regression, the default prediction is the class.

predict(heart_disease_fit, new_data = heart_disease)
# A tibble: 4,240 × 1
   .pred_class
   <fct>      
 1 0          
 2 0          
 3 0          
 4 0          
 5 0          
 6 0          
 7 0          
 8 0          
 9 0          
10 0          
# … with 4,230 more rows

Observed vs. predicted

What does the following table show?

predict(heart_disease_fit, new_data = heart_disease) |>
  bind_cols(heart_disease) |>
  count(high_risk, .pred_class)
# A tibble: 2 × 3
  high_risk .pred_class     n
  <fct>     <fct>       <int>
1 0         0            3596
2 1         0             644

The .pred_class is the class with the highest predicted probability. What is a limitation to using this method to determine the predicted class?

Recap

  • Logistic regression for binary response variable

  • Relationship between odds and probabilities

  • Used logistic regression model to calculate predicted odds and probabilities

Application exercise

📋 AE 13: Logistic Regression Intro

🔗 Week 10

1 / 38
Logistic regression Introduction Prof. Maria Tackett Nov 02, 2022

  1. Slides

  2. Tools

  3. Close
  • Logistic regression
  • Announcements
  • Topics
  • Computational setup
  • Predicting categorical outcomes
  • Types of outcome variables
  • Models for categorical outcomes
  • 2022 election forecasts
  • 2020 NBA finals predictions
  • Do teenagers get 7+ hours of sleep?
  • Plot the data
  • Let’s fit a linear regression model
  • Let’s use proportions
  • What happens if we zoom out?
  • Let’s try another model
  • The code
  • Different types of models
  • Application exercise
  • Odds and probabilities
  • Binary response variable
  • Odds
  • Are teenagers getting enough sleep?
  • From odds to probabilities
  • Logistic regression
  • From odds to probabilities
  • Logistic regression model
  • Risk of coronary heart disease
  • Data: heart_disease
  • High risk vs. age
  • Let’s fit the model
  • The model
  • Predicted log odds
  • Predicted probabilities
  • Predicted classes
  • Default prediction
  • Observed vs. predicted
  • Recap
  • Application exercise
  • f Fullscreen
  • s Speaker View
  • o Slide Overview
  • e PDF Export Mode
  • b Toggle Chalkboard
  • c Toggle Notes Canvas
  • d Download Drawings
  • ? Keyboard Help