Prof. Maria Tackett
Sep 26, 2022
Lab 03 due
Today at 11:59pm (Thursday labs)
Tue, Sep 27 at 11:59pm (Friday labs)
Exam 01: Sep 28 - 30
Exam 01 review on Sep 28
Videos for Weeks 01 - 05 available until Sep 28 at 11:59pm
See Week 05 for this week’s activities.
Prediction for multiple linear regression
Types of predictors for multiple linear regression
Mean-centering quantitative predictors
Using indicator variables for categorical predictors
Using interaction terms
# A tibble: 85 × 7
bedrooms bathrooms living_area lot_size year_built property_tax sale_price
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 1 1380 6000 1948 8360 350000
2 4 2 1761 7400 1951 5754 360000
3 4 2 1564 6000 1948 8982 350000
4 5 2 2904 9898 1949 11664 375000
5 5 2.5 1942 7788 1948 8120 370000
6 4 2 1830 6000 1948 8197 335000
7 4 1 1585 6000 1948 6223 295000
8 4 1 941 6800 1951 2448 250000
9 4 1.5 1481 6000 1948 9087 299990
10 3 2 1630 5998 1948 9430 375000
# … with 75 more rows
Predictors:
bedrooms: Number of bedrooms
bathrooms: Number of bathrooms
living_area: Total living area of the house (in square feet)
lot_size: Total area of the lot (in square feet)
year_built: Year the house was built
property_tax: Annual property taxes (in USD)
Response:
sale_price: Sale price (in USD)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -7148818.957 | 3820093.694 | -1.871 | 0.065 |
bedrooms | -12291.011 | 9346.727 | -1.315 | 0.192 |
bathrooms | 51699.236 | 13094.170 | 3.948 | 0.000 |
living_area | 65.903 | 15.979 | 4.124 | 0.000 |
lot_size | -0.897 | 4.194 | -0.214 | 0.831 |
year_built | 3760.898 | 1962.504 | 1.916 | 0.059 |
property_tax | 1.476 | 2.832 | 0.521 | 0.604 |
What is the predicted sale price for a house in Levittown, NY with 3 bedrooms, 1 bathroom, 1,050 square feet of living area, 6,000 square foot lot size, built in 1948 with $6,306 in property taxes?
-7148818.957 - 12291.011 * 3 + 51699.236 * 1 +
65.903 * 1050 - 0.897 * 6000 + 3760.898 * 1948 +
1.476 * 6306
[1] 265360.4
The predicted sale price for a house in Levittown, NY with 3 bedrooms, 1 bathroom, 1050 square feet of living area, 6000 square foot lot size, built in 1948 with $6306 in property taxes is $265,360.
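The same calculation can be written as a sum of coefficient-times-value products. This is a minimal sketch that copies the rounded estimates from the table above; the names `coefs`, `house`, and `predicted_price` are illustrative and not part of the original slides.

```r
# Rounded coefficient estimates copied from the model output above
coefs <- c(
  intercept    = -7148818.957,
  bedrooms     = -12291.011,
  bathrooms    = 51699.236,
  living_area  = 65.903,
  lot_size     = -0.897,
  year_built   = 3760.898,
  property_tax = 1.476
)

# Predictor values for the house in the question, in the same order;
# the leading 1 multiplies the intercept
house <- c(1, 3, 1, 1050, 6000, 1948, 6306)

# Predicted sale price = sum of coefficient * value
predicted_price <- sum(coefs * house)
round(predicted_price, 1)
```

Because the estimates are rounded to three decimals, a prediction taken directly from the fitted model object may differ slightly in the last digits.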
Just like with simple linear regression, we can use the predict() function in R to calculate the appropriate intervals for our predicted values:
Calculate a 95% confidence interval for the estimated mean price of houses in Levittown, NY with 3 bedrooms, 1 bathroom, 1050 square feet of living area, 6000 square foot lot size, built in 1948 with $6306 in property taxes.
Calculate a 95% prediction interval for an individual house in Levittown, NY with 3 bedrooms, 1 bathroom, 1050 square feet of living area, 6000 square foot lot size, built in 1948 with $6306 in property taxes.
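A sketch of how both intervals could be requested with `predict()`. To keep the example self-contained, it uses only the ten rows printed earlier and a reduced model with two predictors, so its numbers will not match the full 85-row fit. The names `homes10`, `fit10`, `new_house`, `conf_int`, and `pred_int` are assumptions made for illustration.

```r
# First ten rows of the housing data as printed above (illustration only)
homes10 <- data.frame(
  bathrooms   = c(1, 2, 2, 2, 2.5, 2, 1, 1, 1.5, 2),
  living_area = c(1380, 1761, 1564, 2904, 1942, 1830, 1585, 941, 1481, 1630),
  sale_price  = c(350000, 360000, 350000, 375000, 370000,
                  335000, 295000, 250000, 299990, 375000)
)

# Reduced model: two predictors, so ten rows are enough to fit it
fit10 <- lm(sale_price ~ bathrooms + living_area, data = homes10)

# The house of interest, restricted to the predictors in the reduced model
new_house <- data.frame(bathrooms = 1, living_area = 1050)

# 95% confidence interval for the mean price of all such houses
conf_int <- predict(fit10, newdata = new_house,
                    interval = "confidence", level = 0.95)

# 95% prediction interval for the price of one individual house
pred_int <- predict(fit10, newdata = new_house,
                    interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both calls return the same point prediction in the `fit` column. The prediction interval is wider because it also accounts for the variability of an individual house around the mean.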
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.
# A tibble: 50 × 4
annual_income debt_to_income verified_income interest_rate
<dbl> <dbl> <fct> <dbl>
1 59000 0.558 Not Verified 10.9
2 60000 1.31 Not Verified 9.92
3 75000 1.06 Verified 26.3
4 75000 0.574 Not Verified 9.92
5 254000 0.238 Not Verified 9.43
6 67000 1.08 Source Verified 9.92
7 28800 0.0997 Source Verified 17.1
8 80000 0.351 Not Verified 6.08
9 34000 0.698 Not Verified 7.97
10 80000 0.167 Source Verified 12.6
# … with 40 more rows
Predictors:
annual_income: Annual income
debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)
Outcome:
interest_rate: Interest rate for the loan
interest_rate
min | median | max | iqr |
---|---|---|---|
5.31 | 9.93 | 26.3 | 5.755 |
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 10.726 | 1.507 | 7.116 | 0.000 | 7.690 | 13.762 |
debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |
Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?
If we are interested in interpreting the intercept, we can mean-center the quantitative predictors in the model.
We can mean-center a quantitative predictor \(X_j\) using the following:
\[X_{j_{Cent}} = X_{j}- \bar{X}_{j}\]
If we mean-center all quantitative variables, then the intercept is interpreted as the expected value of the response variable when all quantitative variables are at their mean value.
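A small sketch of the effect of mean-centering, using only the ten loan50 rows printed above and only the two quantitative predictors. The data frame name `loans10` and the model names `fit_raw` and `fit_cent` are assumptions for illustration, and the estimates will not match the 50-row model.

```r
# First ten rows of loan50 as printed above (illustration only)
loans10 <- data.frame(
  annual_income  = c(59000, 60000, 75000, 75000, 254000,
                     67000, 28800, 80000, 34000, 80000),
  debt_to_income = c(0.558, 1.31, 1.06, 0.574, 0.238,
                     1.08, 0.0997, 0.351, 0.698, 0.167),
  interest_rate  = c(10.9, 9.92, 26.3, 9.92, 9.43,
                     9.92, 17.1, 6.08, 7.97, 12.6)
)

# Rescale income to thousands of dollars, then mean-center both predictors
loans10$annual_income_th      <- loans10$annual_income / 1000
loans10$debt_inc_cent         <- loans10$debt_to_income - mean(loans10$debt_to_income)
loans10$annual_income_th_cent <- loans10$annual_income_th - mean(loans10$annual_income_th)

# Same model, fit with raw and with centered predictors
fit_raw  <- lm(interest_rate ~ debt_to_income + annual_income_th, data = loans10)
fit_cent <- lm(interest_rate ~ debt_inc_cent + annual_income_th_cent, data = loans10)

coef(fit_raw)   # intercept refers to debt_to_income = 0 and income = 0
coef(fit_cent)  # slopes unchanged; intercept refers to average predictor values
mean(loans10$interest_rate)
```

With only quantitative predictors, the centered intercept equals the mean of the response. When a categorical predictor such as verified_income is also in the model, the intercept instead describes the baseline category at the mean values of the quantitative predictors.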
How do you expect the model to change if we use debt_inc_cent and annual_income_th_cent in the model?
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 9.444 | 0.977 | 9.663 | 0.000 | 7.476 | 11.413 |
debt_inc_cent | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th_cent | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |
term | estimate |
---|---|
(Intercept) | 10.726 |
debt_to_income | 0.671 |
verified_incomeSource Verified | 2.211 |
verified_incomeVerified | 6.880 |
annual_income_th | -0.021 |
term | estimate |
---|---|
(Intercept) | 9.444 |
debt_inc_cent | 0.671 |
verified_incomeSource Verified | 2.211 |
verified_incomeVerified | 6.880 |
annual_income_th_cent | -0.021 |
Suppose there is a categorical variable with \(K\) categories (levels)
We can make \(K\) indicator variables - one indicator for each category
An indicator variable takes values 1 or 0
verified_income
# A tibble: 3 × 4
verified_income not_verified source_verified verified
<fct> <dbl> <dbl> <dbl>
1 Not Verified 1 0 0
2 Verified 0 0 1
3 Source Verified 0 1 0
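A minimal base R sketch of how such indicator columns could be built by hand, and of what `model.matrix()` produces for the same factor. The object names `verified_income`, `indicators`, and `mm` are illustrative assumptions.

```r
# One observation per level, in the same order as the table above
verified_income <- factor(
  c("Not Verified", "Verified", "Source Verified"),
  levels = c("Not Verified", "Source Verified", "Verified")
)

# K = 3 indicator variables, one per level, each taking the value 1 or 0
indicators <- data.frame(
  verified_income,
  not_verified    = as.numeric(verified_income == "Not Verified"),
  source_verified = as.numeric(verified_income == "Source Verified"),
  verified        = as.numeric(verified_income == "Verified")
)
indicators

# With R's default treatment contrasts, model.matrix() keeps an intercept
# and K - 1 indicators; the first factor level has no column of its own
mm <- model.matrix(~ verified_income)
colnames(mm)
```

This is why the model output lists terms for Source Verified and Verified only: the remaining level is absorbed into the intercept.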
verified_income
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 9.444 | 0.977 | 9.663 | 0.000 | 7.476 | 11.413 |
debt_inc_cent | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th_cent | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |
The baseline category is Not Verified.
The lines are not parallel, indicating there is an interaction effect. The slope of annual income differs based on the income verification.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.484 | 0.989 | 9.586 | 0.000 |
debt_inc_cent | 0.691 | 0.685 | 1.009 | 0.319 |
verified_incomeSource Verified | 2.157 | 1.418 | 1.522 | 0.135 |
verified_incomeVerified | 7.181 | 1.870 | 3.840 | 0.000 |
annual_income_th_cent | -0.007 | 0.020 | -0.341 | 0.735 |
verified_incomeSource Verified:annual_income_th_cent | -0.016 | 0.026 | -0.643 | 0.523 |
verified_incomeVerified:annual_income_th_cent | -0.032 | 0.033 | -0.979 | 0.333 |
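Each group's income slope is the baseline slope plus that group's interaction coefficient. A short sketch using the rounded estimates from the table above; the variable names are illustrative only.

```r
# Rounded estimates copied from the interaction model output above
slope_baseline     <- -0.007  # annual_income_th_cent (Not Verified)
interaction_source <- -0.016  # Source Verified : annual_income_th_cent
interaction_verif  <- -0.032  # Verified : annual_income_th_cent

# Slope of annual income (per additional $1,000) within each verification group
income_slopes <- c(
  "Not Verified"    = slope_baseline,
  "Source Verified" = slope_baseline + interaction_source,
  "Verified"        = slope_baseline + interaction_verif
)
income_slopes
```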
annual_income for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023 percentage points (-0.007 + -0.016) for each additional thousand dollars in annual income, holding all else constant.
Defining the interaction variable in the model formula as verified_income * annual_income_th_cent is an implicit data manipulation step as well.
Rows: 50
Columns: 9
$ `(Intercept)` <dbl> 1, 1, 1, 1, 1, …
$ debt_inc_cent <dbl> -0.16511719, 0.…
$ annual_income_th_cent <dbl> -27.17, -26.17,…
$ `verified_incomeNot Verified` <dbl> 1, 1, 0, 1, 1, …
$ `verified_incomeSource Verified` <dbl> 0, 0, 0, 0, 0, …
$ verified_incomeVerified <dbl> 0, 0, 1, 0, 0, …
$ `annual_income_th_cent:verified_incomeNot Verified` <dbl> -27.17, -26.17,…
$ `annual_income_th_cent:verified_incomeSource Verified` <dbl> 0.00, 0.00, 0.0…
$ `annual_income_th_cent:verified_incomeVerified` <dbl> 0.00, 0.00, -11…
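To see the implicit step in base R, `model.matrix()` can expand a formula with `*` into its columns. This sketch uses the ten loan50 rows printed earlier, so the centered values differ from the 50-row display above. Unlike that display, which lists an indicator for every level, this call drops the baseline level. The names `loans10` and `X` are assumptions for illustration.

```r
# First ten rows of loan50 as printed earlier (illustration only)
loans10 <- data.frame(
  annual_income   = c(59000, 60000, 75000, 75000, 254000,
                      67000, 28800, 80000, 34000, 80000),
  debt_to_income  = c(0.558, 1.31, 1.06, 0.574, 0.238,
                      1.08, 0.0997, 0.351, 0.698, 0.167),
  verified_income = factor(
    c("Not Verified", "Not Verified", "Verified", "Not Verified", "Not Verified",
      "Source Verified", "Source Verified", "Not Verified", "Not Verified",
      "Source Verified"),
    levels = c("Not Verified", "Source Verified", "Verified")
  )
)

# Rescale and mean-center the quantitative predictors
loans10$annual_income_th_cent <- loans10$annual_income / 1000 -
  mean(loans10$annual_income / 1000)
loans10$debt_inc_cent <- loans10$debt_to_income - mean(loans10$debt_to_income)

# The * operator expands to both main effects plus their interaction
X <- model.matrix(
  ~ debt_inc_cent + verified_income * annual_income_th_cent,
  data = loans10
)
colnames(X)

# Each interaction column is the indicator multiplied by centered income
head(X[, "verified_incomeVerified:annual_income_th_cent"])
```

The seven column names line up with the seven terms in the interaction model output.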
Mean-centering quantitative predictors
Using indicator variables for categorical predictors
Using interaction terms
Data manipulation, with dplyr (from tidyverse):
library(tidyverse)
library(openintro)  # provides the loan50 data frame

loan50 |>
  select(interest_rate, annual_income, debt_to_income, verified_income) |>
  mutate(
    # 1. rescale income
    annual_income_th = annual_income / 1000,
    # 2. mean-center quantitative predictors
    debt_inc_cent = debt_to_income - mean(debt_to_income),
    annual_income_th_cent = annual_income_th - mean(annual_income_th),
    # 3. create dummy variables for verified_income
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0),
    # 4. create interaction variables
    `annual_income_th_cent:verified_incomeSource Verified` = annual_income_th_cent * source_verified,
    `annual_income_th_cent:verified_incomeVerified` = annual_income_th_cent * verified
  )
Feature engineering, with recipes (from tidymodels):
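library(tidymodels)
library(openintro)  # provides the loan50 data frame

loan_rec <- recipe(~ ., data = loan50) |>
  # 1. rescale income
  step_mutate(annual_income_th = annual_income / 1000) |>
  # 2. mean-center quantitative predictors
  step_center(all_numeric_predictors()) |>
  # 3. create dummy variables for verified_income
  step_dummy(verified_income) |>
  # 4. create interaction variables
  # (after step_dummy() the original column is replaced by dummy columns, so
  # the term may need starts_with("verified_income") when the recipe is prepped)
  step_interact(terms = ~ annual_income_th:verified_income)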
Recipe
Inputs:
role #variables
predictor 24
Operations:
Variable mutation for annual_income / 1000
Centering for all_numeric_predictors()
Dummy variables from verified_income
Interactions with annual_income_th:verified_income