MLR: Types of predictors

Prof. Maria Tackett

Sep 26, 2022

Announcements

Lab 03 due
- Today at 11:59pm (Thursday labs)
- Tue, Sep 27 at 11:59pm (Friday labs)
Exam 01: Sep 28 - 30
- Exam 01 review on Sep 28
- Videos for Weeks 01 - 05 available until Sep 28 at 11:59pm
See Week 05 for this week’s activities.

Topics

Prediction for multiple linear regression
Types of predictors for multiple linear regression
Mean-centering quantitative predictors
Using indicator variables for categorical predictors
Using interaction terms

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)
library(colorblindr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Prediction

The data

levittown <- read_csv(here::here("slides/data/homeprices.csv"))
levittown

# A tibble: 85 × 7
   bedrooms bathrooms living_area lot_size year_built property_tax sale_price
      <dbl>     <dbl>       <dbl>    <dbl>      <dbl>        <dbl>      <dbl>
 1        4       1          1380     6000       1948         8360     350000
 2        4       2          1761     7400       1951         5754     360000
 3        4       2          1564     6000       1948         8982     350000
 4        5       2          2904     9898       1949        11664     375000
 5        5       2.5        1942     7788       1948         8120     370000
 6        4       2          1830     6000       1948         8197     335000
 7        4       1          1585     6000       1948         6223     295000
 8        4       1           941     6800       1951         2448     250000
 9        4       1.5        1481     6000       1948         9087     299990
10        3       2          1630     5998       1948         9430     375000
# … with 75 more rows

Variables

Predictors:

bedrooms: Number of bedrooms
bathrooms: Number of bathrooms
living_area: Total living area of the house (in square feet)
lot_size: Total area of the lot (in square feet)
year_built: Year the house was built
property_tax: Annual property taxes (in USD)

Response: sale_price: Sales price (in USD)

Model fit

term	estimate	std.error	statistic	p.value
(Intercept)	-7148818.957	3820093.694	-1.871	0.065
bedrooms	-12291.011	9346.727	-1.315	0.192
bathrooms	51699.236	13094.170	3.948	0.000
living_area	65.903	15.979	4.124	0.000
lot_size	-0.897	4.194	-0.214	0.831
year_built	3760.898	1962.504	1.916	0.059
property_tax	1.476	2.832	0.521	0.604

Prediction

What is the predicted sale price for a house in Levittown, NY with 3 bedrooms, 1 bathroom, 1,050 square feet of living area, 6,000 square foot lot size, built in 1948 with $6,306 in property taxes?

-7148818.957 - 12291.011 * 3 + 51699.236 * 1 + 
  65.903 * 1050 - 0.897 * 6000 + 3760.898 * 1948 + 
  1.476 * 6306

[1] 265360.4

The predicted sale price for a house in Levittown, NY with 3 bedrooms, 1 bathroom, 1050 square feet of living area, 6000 square foot lot size, built in 1948 with $6306 in property taxes is $265,360.
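
The same arithmetic can be written as a sum of coefficient-times-value products, which is easier to reuse for other houses. This is a minimal sketch: the names coefs, new_x, and pred_price are made up for illustration, and the numbers are copied from the model fit table above.

```r
# Coefficient estimates copied from the "Model fit" table
coefs <- c(
  intercept    = -7148818.957,
  bedrooms     = -12291.011,
  bathrooms    = 51699.236,
  living_area  = 65.903,
  lot_size     = -0.897,
  year_built   = 3760.898,
  property_tax = 1.476
)

# Predictor values for the new house, in the same order as coefs;
# the leading 1 multiplies the intercept
new_x <- c(1, 3, 1, 1050, 6000, 1948, 6306)

# Predicted sale price = sum of (coefficient * predictor value)
pred_price <- sum(coefs * new_x)
pred_price
```

Because the coefficients in the table are rounded, a hand calculation like this may differ slightly from what predict() returns from the unrounded fit.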

Prediction, revisit

Just like with simple linear regression, we can use the predict() function in R to calculate predicted values and the appropriate intervals for them. Here price_fit is the fitted model summarized on the "Model fit" slide:

new_house <- tibble(
  bedrooms = 3, bathrooms = 1, 
  living_area = 1050, lot_size = 6000, 
  year_built = 1948, property_tax = 6306
  )

predict(price_fit, new_house)

# A tibble: 1 × 1
    .pred
    <dbl>
1 265360.

Confidence interval for $\hat{\mu}_y$

Calculate a 95% confidence interval for the estimated mean price of houses in Levittown, NY with 3 bedrooms, 1 bathroom, 1050 square feet of living area, 6000 square foot lot size, built in 1948 with $6306 in property taxes.

predict(price_fit, new_house, type = "conf_int", level = 0.95)

# A tibble: 1 × 2
  .pred_lower .pred_upper
        <dbl>       <dbl>
1     238482.     292239.

Prediction interval for $\hat{y}$

Calculate a 95% prediction interval for an individual house in Levittown, NY with 3 bedrooms, 1 bathroom, 1050 square feet of living area, 6000 square foot lot size, built in 1948 with $6306 in property taxes.

predict(price_fit, new_house, type = "pred_int", level = 0.95)

# A tibble: 1 × 2
  .pred_lower .pred_upper
        <dbl>       <dbl>
1     167277.     363444.
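
The prediction interval is wider than the confidence interval because it accounts for the variability of an individual house around the mean, not only the uncertainty in the estimated mean. The sketch below illustrates this with a swapped-in example: base R lm() on the built-in mtcars data rather than the tidymodels fit and Levittown data used in these slides. The object names (toy_fit, new_car, conf_int, pred_int) are hypothetical.

```r
# Swapped-in example: base R lm() on the built-in mtcars data,
# not the Levittown model from the slides
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

# A hypothetical new car: wt = 3 (weight in 1000 lbs), 120 horsepower
new_car <- data.frame(wt = 3, hp = 120)

# Interval for the mean response at these predictor values
conf_int <- predict(toy_fit, new_car, interval = "confidence", level = 0.95)

# Interval for a single new observation at the same predictor values
pred_int <- predict(toy_fit, new_car, interval = "prediction", level = 0.95)

# Width of each interval (upper bound minus lower bound)
conf_width <- unname(conf_int[1, "upr"] - conf_int[1, "lwr"])
pred_width <- unname(pred_int[1, "upr"] - pred_int[1, "lwr"])
c(confidence = conf_width, prediction = pred_width)
```

Both intervals are centered at the same predicted value; only the width differs.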

Cautions

Do not extrapolate! Because there are multiple predictor variables, there is the potential to extrapolate in many directions
The multiple regression model only shows association, not causality
- To show causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study

Application exercise

AE 06: Prediction for MLR

Types of predictors

Data: Peer-to-peer lender

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.

# A tibble: 50 × 4
   annual_income debt_to_income verified_income interest_rate
           <dbl>          <dbl> <fct>                   <dbl>
 1         59000         0.558  Not Verified            10.9 
 2         60000         1.31   Not Verified             9.92
 3         75000         1.06   Verified                26.3 
 4         75000         0.574  Not Verified             9.92
 5        254000         0.238  Not Verified             9.43
 6         67000         1.08   Source Verified          9.92
 7         28800         0.0997 Source Verified         17.1 
 8         80000         0.351  Not Verified             6.08
 9         34000         0.698  Not Verified             7.97
10         80000         0.167  Source Verified         12.6 
# … with 40 more rows

Variables

Predictors:

annual_income: Annual income
debt_to_income: Debt-to-income ratio, i.e., a borrower's total debt divided by their total income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Outcome: interest_rate: Interest rate for the loan

Outcome: `interest_rate`

min	median	max	iqr
5.31	9.93	26.3	5.755

Predictors

Data manipulation 1: Rescale income

loan50 <- loan50 |>
  mutate(annual_income_th = annual_income / 1000)

ggplot(loan50, aes(x = annual_income_th)) +
  geom_histogram(binwidth = 20) +
  labs(title = "Annual income (in $1000s)", 
       x = "")

Outcome vs. predictors

Fit regression model

int_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(interest_rate ~ debt_to_income + verified_income  + annual_income_th,
      data = loan50)

Summarize model results

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	10.726	1.507	7.116	0.000	7.690	13.762
debt_to_income	0.671	0.676	0.993	0.326	-0.690	2.033
verified_incomeSource Verified	2.211	1.399	1.581	0.121	-0.606	5.028
verified_incomeVerified	6.880	1.801	3.820	0.000	3.253	10.508
annual_income_th	-0.021	0.011	-1.804	0.078	-0.043	0.002

Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?

Mean-centered variables

Mean-centering

If we are interested in interpreting the intercept, we can mean-center the quantitative predictors in the model.

We can mean-center a quantitative predictor $X_j$ using the following:

\[X_{j_{Cent}} = X_{j}- \bar{X}_{j}\]

If we mean-center all quantitative variables, then the intercept is interpreted as the expected value of the response variable when all quantitative variables are at their mean value.

Data manipulation 2: Mean-center numeric predictors

loan50 <- loan50 |>
  mutate(
    debt_inc_cent = debt_to_income - mean(debt_to_income), 
    annual_income_th_cent = annual_income_th - mean(annual_income_th)
    )

Visualize mean-centered predictors

Using mean-centered variables in the model

How do you expect the model to change if we use debt_inc_cent and annual_income_th_cent in the model?

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	9.444	0.977	9.663	0.000	7.476	11.413
debt_inc_cent	0.671	0.676	0.993	0.326	-0.690	2.033
verified_incomeSource Verified	2.211	1.399	1.581	0.121	-0.606	5.028
verified_incomeVerified	6.880	1.801	3.820	0.000	3.253	10.508
annual_income_th_cent	-0.021	0.011	-1.804	0.078	-0.043	0.002

Original vs. mean-centered model

term	estimate
(Intercept)	10.726
debt_to_income	0.671
verified_incomeSource Verified	2.211
verified_incomeVerified	6.880
annual_income_th	-0.021

term	estimate
(Intercept)	9.444
debt_inc_cent	0.671
verified_incomeSource Verified	2.211
verified_incomeVerified	6.880
annual_income_th_cent	-0.021
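
Only the intercept changes because centering shifts each quantitative predictor by a constant. A short derivation, shown for two quantitative predictors for brevity, where $\hat{\beta}_0^{Cent}$ is notation introduced here for the intercept of the mean-centered model:

\[
\begin{aligned}
\hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 \\
&= \hat{\beta}_0 + \hat{\beta}_1 (X_{1_{Cent}} + \bar{X}_1) + \hat{\beta}_2 (X_{2_{Cent}} + \bar{X}_2) \\
&= \underbrace{(\hat{\beta}_0 + \hat{\beta}_1 \bar{X}_1 + \hat{\beta}_2 \bar{X}_2)}_{\hat{\beta}_0^{Cent}} + \hat{\beta}_1 X_{1_{Cent}} + \hat{\beta}_2 X_{2_{Cent}}
\end{aligned}
\]

The slopes are the same in both models, and the new intercept absorbs the slope-times-mean terms. The indicator terms for verified_income are not centered, so their coefficients are unaffected.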

Indicator variables

Suppose there is a categorical variable with $K$ categories (levels)
We can make $K$ indicator variables - one indicator for each category
An indicator variable takes values 1 or 0
- 1 if the observation belongs to that category
- 0 if the observation does not belong to that category

Data manipulation 3: Create indicator variables for `verified_income`

loan50 <- loan50 |>
  mutate(
    not_verified = if_else(verified_income == "Not Verified", 1, 0),
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0)
  )

# A tibble: 3 × 4
  verified_income not_verified source_verified verified
  <fct>                  <dbl>           <dbl>    <dbl>
1 Not Verified               1               0        0
2 Verified                   0               0        1
3 Source Verified            0               1        0

Indicators in the model

We will use $K-1$ of the indicator variables in the model.
The baseline is the category that doesn’t have a term in the model.
The coefficients of the indicator variables in the model are interpreted as the expected change in the response compared to the baseline, holding all other variables constant.
This approach is also called dummy coding.

loan50 |>
  select(verified_income, source_verified, verified) |>
  slice(1, 3, 6)

# A tibble: 3 × 3
  verified_income source_verified verified
  <fct>                     <dbl>    <dbl>
1 Not Verified                  0        0
2 Verified                      0        1
3 Source Verified               1        0
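
When a factor appears in a model formula, R creates the $K-1$ indicator columns automatically. A minimal sketch of that behavior, using a small made-up data frame (toy) with the same three levels as verified_income, and assuming R's default treatment contrasts:

```r
# A small made-up factor with the same three levels as verified_income;
# the first level listed ("Not Verified") becomes the baseline
toy <- data.frame(
  verified_income = factor(
    c("Not Verified", "Verified", "Source Verified"),
    levels = c("Not Verified", "Source Verified", "Verified")
  )
)

# model.matrix() builds the design matrix lm() would use:
# an intercept column plus K - 1 = 2 indicator columns
design <- model.matrix(~ verified_income, data = toy)
colnames(design)
design
```

The baseline level has no column of its own; a row belonging to the baseline has 0 in every indicator column.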

Interpreting `verified_income`

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	9.444	0.977	9.663	0.000	7.476	11.413
debt_inc_cent	0.671	0.676	0.993	0.326	-0.690	2.033
verified_incomeSource Verified	2.211	1.399	1.581	0.121	-0.606	5.028
verified_incomeVerified	6.880	1.801	3.820	0.000	3.253	10.508
annual_income_th_cent	-0.021	0.011	-1.804	0.078	-0.043	0.002

The baseline category is Not Verified.
People with source verified income are expected to take a loan with an interest rate that is 2.211 percentage points higher, on average, than the rate on loans to those whose income is not verified, holding all else constant.
People with verified income are expected to take a loan with an interest rate that is 6.880 percentage points higher, on average, than the rate on loans to those whose income is not verified, holding all else constant.

Interaction terms

Sometimes the relationship between a predictor variable and the response depends on the value of another predictor variable.
This is an interaction effect.
To account for this, we can include interaction terms in the model.

Interest rate vs. annual income

The lines are not parallel, which suggests there is an interaction effect: the slope of annual income differs based on the income verification status.

Interaction term in model

int_cent_int_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(interest_rate ~ debt_inc_cent + verified_income + annual_income_th_cent + verified_income * annual_income_th_cent,
      data = loan50)

term	estimate	std.error	statistic	p.value
(Intercept)	9.484	0.989	9.586	0.000
debt_inc_cent	0.691	0.685	1.009	0.319
verified_incomeSource Verified	2.157	1.418	1.522	0.135
verified_incomeVerified	7.181	1.870	3.840	0.000
annual_income_th_cent	-0.007	0.020	-0.341	0.735
verified_incomeSource Verified:annual_income_th_cent	-0.016	0.026	-0.643	0.523
verified_incomeVerified:annual_income_th_cent	-0.032	0.033	-0.979	0.333

Interpreting interaction terms

What the interaction means: The slope of annual income (in $1000s) is expected to be 0.016 lower when the income is source verified than when it is not verified, holding all else constant.
Interpreting annual_income for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023 percentage points (-0.007 + -0.016 = -0.023) for each additional thousand dollars in annual income, holding all else constant.
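
The slope of annual income for each verification group is the baseline slope plus that group's interaction coefficient. A minimal sketch using the rounded estimates from the table above (the variable names are made up for illustration):

```r
# Estimates copied from the interaction model table
slope_baseline <- -0.007  # annual_income_th_cent (Not Verified is the baseline)
int_source_ver <- -0.016  # verified_incomeSource Verified:annual_income_th_cent
int_verified   <- -0.032  # verified_incomeVerified:annual_income_th_cent

# Slope of annual income (per $1000) within each verification group
income_slopes <- c(
  not_verified    = slope_baseline,
  source_verified = slope_baseline + int_source_ver,
  verified        = slope_baseline + int_verified
)
income_slopes
```

These are point estimates only; the p-values for both interaction terms in the table are large, so the differences in slope are not estimated precisely.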

Data manipulation 4: Create interaction variables

Defining the interaction variable in the model formula as verified_income * annual_income_th_cent is an implicit data manipulation step as well:

Rows: 50
Columns: 9
$ `(Intercept)`                                          <dbl> 1, 1, 1, 1, 1, …
$ debt_inc_cent                                          <dbl> -0.16511719, 0.…
$ annual_income_th_cent                                  <dbl> -27.17, -26.17,…
$ `verified_incomeNot Verified`                          <dbl> 1, 1, 0, 1, 1, …
$ `verified_incomeSource Verified`                       <dbl> 0, 0, 0, 0, 0, …
$ verified_incomeVerified                                <dbl> 0, 0, 1, 0, 0, …
$ `annual_income_th_cent:verified_incomeNot Verified`    <dbl> -27.17, -26.17,…
$ `annual_income_th_cent:verified_incomeSource Verified` <dbl> 0.00, 0.00, 0.0…
$ `annual_income_th_cent:verified_incomeVerified`        <dbl> 0.00, 0.00, -11…

Wrap up

Recap

Mean-centering quantitative predictors
Using indicator variables for categorical predictors
Using interaction terms

Looking backward

Data manipulation, with dplyr (from tidyverse):

loan50 |>
  select(interest_rate, annual_income, debt_to_income, verified_income) |>
  mutate(
    # 1. rescale income
    annual_income_th = annual_income / 1000,
    # 2. mean-center quantitative predictors
    debt_inc_cent = debt_to_income - mean(debt_to_income),
    annual_income_th_cent = annual_income_th - mean(annual_income_th),
    # 3. create dummy variables for verified_income
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0),
    # 4. create interaction variables
    `annual_income_th_cent:verified_incomeSource Verified` = annual_income_th_cent * source_verified,
    `annual_income_th_cent:verified_incomeVerified` = annual_income_th_cent * verified
  )

Looking forward

Feature engineering, with recipes (from tidymodels):

loan_rec <- recipe( ~ ., data = loan50) |>
  # 1. rescale income
  step_mutate(annual_income_th = annual_income / 1000) |>
  # 2. mean-center quantitative predictors
  step_center(all_numeric_predictors()) |>
  # 3. create dummy variables for verified_income
  step_dummy(verified_income) |>
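  # 4. create interaction variables; step_dummy() replaces verified_income
  #    with indicator columns, so select those columns by prefix
  step_interact(terms = ~ annual_income_th:starts_with("verified_income"))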

Recipe

loan_rec

Recipe

Inputs:

      role #variables
 predictor         24

Operations:

Variable mutation for annual_income / 1000
Centering for all_numeric_predictors()
Dummy variables from verified_income
Interactions with annual_income_th:verified_income
