Multiple linear regression (MLR)

Prof. Maria Tackett

Sep 21, 2022

Computational setup

# load packages
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for modeling
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(patchwork)   # for laying out plots
library(GGally)      # for pairwise plots

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 20))

Considering multiple variables

House prices in Levittown

  • The data set contains the sales price and characteristics of 85 homes in Levittown, NY that sold between June 2010 and May 2011.
  • Levittown was built right after WWII and was the first planned suburban community built using mass production techniques.
  • The article “Levittown, the prototypical American suburb – a history of cities in 50 buildings, day 25” gives an overview of Levittown’s controversial history.

Analysis goals

  • We would like to use the characteristics of a house to understand variability in the sales price.

  • To do so, we will fit a multiple linear regression model.

  • Using our model, we can answer questions such as:

    • What is the relationship between the characteristics of a house in Levittown and its sale price?
    • Given its characteristics, what is the expected sale price of a house in Levittown?

The data

levittown <- read_csv(here::here("slides/data/homeprices.csv"))
levittown
# A tibble: 85 × 7
   bedrooms bathrooms living_area lot_size year_built property_tax sale_price
      <dbl>     <dbl>       <dbl>    <dbl>      <dbl>        <dbl>      <dbl>
 1        4       1          1380     6000       1948         8360     350000
 2        4       2          1761     7400       1951         5754     360000
 3        4       2          1564     6000       1948         8982     350000
 4        5       2          2904     9898       1949        11664     375000
 5        5       2.5        1942     7788       1948         8120     370000
 6        4       2          1830     6000       1948         8197     335000
 7        4       1          1585     6000       1948         6223     295000
 8        4       1           941     6800       1951         2448     250000
 9        4       1.5        1481     6000       1948         9087     299990
10        3       2          1630     5998       1948         9430     375000
# … with 75 more rows

Variables

Predictors:

  • bedrooms: Number of bedrooms
  • bathrooms: Number of bathrooms
  • living_area: Total living area of the house (in square feet)
  • lot_size: Total area of the lot (in square feet)
  • year_built: Year the house was built
  • property_tax: Annual property taxes (in USD)

Response:

  • sale_price: Sale price (in USD)

EDA: Response variable

EDA: Predictor variables

EDA: Response vs. Predictors

EDA: All variables

ggpairs(levittown) +
  theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(angle = 45, size = 10),
    strip.text.y = element_text(angle = 0, hjust = 0)
  )

Single vs. multiple predictors

So far we’ve used a single predictor variable to understand variation in a quantitative response variable.

Now we want to use multiple predictor variables to understand variation in a quantitative response variable.

Multiple linear regression

Multiple linear regression (MLR)

Based on the analysis goals, we will use a multiple linear regression model of the following form:

\[
\widehat{\text{sale\_price}} = \hat{\beta}_0 + \hat{\beta}_1\,\text{bedrooms} + \hat{\beta}_2\,\text{bathrooms} + \hat{\beta}_3\,\text{living\_area} + \hat{\beta}_4\,\text{lot\_size} + \hat{\beta}_5\,\text{year\_built} + \hat{\beta}_6\,\text{property\_tax}
\]

Similar to simple linear regression, this model assumes that at each combination of the predictor variables, the values of sale_price follow a Normal distribution.

Regression Model

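Recall: The simple linear regression model assumes

\[
Y \mid X \sim N(\beta_0 + \beta_1 X,\ \sigma_\epsilon^2)
\]

Similarly: The multiple linear regression model assumes

\[
Y \mid X_1, X_2, \dots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p,\ \sigma_\epsilon^2)
\]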

The MLR model

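For a given observation \((x_{i1}, x_{i2}, \dots, x_{ip}, y_i)\)

\[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_\epsilon^2)
\]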
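One way to see what the model above assumes is to simulate from it and check that least squares recovers the coefficients. The sketch below is illustrative only: the two predictors, the coefficient values, the error standard deviation, and the sample size are all hypothetical choices, not quantities from the Levittown data.

```r
# Simulate data from an MLR model with two predictors (all values hypothetical)
set.seed(210)
n <- 500
x1 <- runif(n, min = 0, max = 10)
x2 <- runif(n, min = 0, max = 5)

beta <- c(b0 = 2, b1 = 1.5, b2 = -0.8)  # assumed "true" coefficients
sigma <- 1                              # assumed error standard deviation

# epsilon_i ~ N(0, sigma^2), independent of the predictors
epsilon <- rnorm(n, mean = 0, sd = sigma)
y <- beta[["b0"]] + beta[["b1"]] * x1 + beta[["b2"]] * x2 + epsilon

# Least squares estimates should land close to the assumed coefficients
sim_fit <- lm(y ~ x1 + x2)
estimates <- coef(sim_fit)
round(estimates, 2)
```

With a sample this large the estimates should sit close to 2, 1.5, and -0.8, though the exact values depend on the seed.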

Prediction

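At any combination of the predictors, the mean value of the response \(Y\) is

\[
\mu_{Y \mid X_1, \dots, X_p} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p
\]

Using multiple linear regression, we can estimate the mean response for any combination of predictors:

\[
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_p
\]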

Model fit

term              estimate    std.error  statistic  p.value
(Intercept)   -7148818.957  3820093.694     -1.871    0.065
bedrooms        -12291.011     9346.727     -1.315    0.192
bathrooms        51699.236    13094.170      3.948    0.000
living_area         65.903       15.979      4.124    0.000
lot_size            -0.897        4.194     -0.214    0.831
year_built        3760.898     1962.504      1.916    0.059
property_tax         1.476        2.832      0.521    0.604
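The slides do not show the code that produced this table or the price_fit object used later. One plausible way to fit such a model with tidymodels is sketched below. To stay self-contained it uses only the ten rows printed in "The data", so its estimates will not match the table, and with so few rows they are not meaningful in themselves. The names levittown_head, price_fit_demo, and coef_table are made up for this illustration.

```r
library(tidymodels)  # provides linear_reg(), set_engine(), fit(), tidy(), tribble()

# First ten rows of the data, copied from the printed tibble (not the full 85 homes)
levittown_head <- tribble(
  ~bedrooms, ~bathrooms, ~living_area, ~lot_size, ~year_built, ~property_tax, ~sale_price,
  4, 1,   1380, 6000, 1948,  8360, 350000,
  4, 2,   1761, 7400, 1951,  5754, 360000,
  4, 2,   1564, 6000, 1948,  8982, 350000,
  5, 2,   2904, 9898, 1949, 11664, 375000,
  5, 2.5, 1942, 7788, 1948,  8120, 370000,
  4, 2,   1830, 6000, 1948,  8197, 335000,
  4, 1,   1585, 6000, 1948,  6223, 295000,
  4, 1,    941, 6800, 1951,  2448, 250000,
  4, 1.5, 1481, 6000, 1948,  9087, 299990,
  3, 2,   1630, 5998, 1948,  9430, 375000
)

# Regress sale_price on every other column using the "lm" engine
price_fit_demo <- linear_reg() |>
  set_engine("lm") |>
  fit(sale_price ~ ., data = levittown_head)

# One row per term: estimate, std.error, statistic, p.value
coef_table <- tidy(price_fit_demo)
coef_table
```

Running the same call on the full levittown data would presumably give estimates like those in the table, and piping the result through kable(digits = 3) would give the three-decimal formatting.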

Model equation

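\[
\begin{aligned}
\widehat{\text{sale\_price}} = {} & -7148818.957 - 12291.011 \times \text{bedrooms} + 51699.236 \times \text{bathrooms} \\
& + 65.903 \times \text{living\_area} - 0.897 \times \text{lot\_size} \\
& + 3760.898 \times \text{year\_built} + 1.476 \times \text{property\_tax}
\end{aligned}
\]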

Interpreting \(\hat{\beta}_j\)

  • The estimated coefficient \(\hat{\beta}_j\) is the expected change in the mean of \(y\) when \(x_j\) increases by one unit, holding the values of all other predictor variables constant.
  • Example: The estimated coefficient for living_area is 65.90. This means for each additional square foot of living area, we expect the sale price of a house in Levittown, NY to increase by $65.90, on average, holding all other predictor variables constant.
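The "holding all other predictors constant" reading can be checked numerically: compare the predictions for two houses that differ only by one square foot of living area. This sketch uses the estimates from the Model fit table and the house from the Prediction slide. The helper predict_price() and the object names are hypothetical, introduced only for this example.

```r
# Coefficient estimates copied from the Model fit table
coefs <- c(
  intercept    = -7148818.957,
  bedrooms     = -12291.011,
  bathrooms    = 51699.236,
  living_area  = 65.903,
  lot_size     = -0.897,
  year_built   = 3760.898,
  property_tax = 1.476
)

# Hypothetical helper: evaluate the fitted equation for one house,
# given as a named numeric vector of predictor values
predict_price <- function(house) {
  unname(coefs["intercept"] + sum(coefs[names(house)] * house))
}

house <- c(
  bedrooms = 3, bathrooms = 1, living_area = 1050,
  lot_size = 6000, year_built = 1948, property_tax = 6306
)

# Same house with one more square foot of living area, all else unchanged
bigger <- house
bigger["living_area"] <- house["living_area"] + 1

diff_one_sqft <- predict_price(bigger) - predict_price(house)
round(diff_one_sqft, 3)  # 65.903, the living_area coefficient
```

All the other terms cancel in the difference, so the result equals the living_area coefficient whatever values the other predictors take.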

Application exercise

AE 05: Multiple linear regression

Prediction

What is the predicted sale price for a house in Levittown, NY with 3 bedrooms, 1 bathroom, 1,050 square feet of living area, 6,000 square foot lot size, built in 1948 with $6,306 in property taxes?


-7148818.957 - 12291.011 * 3 + 51699.236 * 1 + 
  65.903 * 1050 - 0.897 * 6000 + 3760.898 * 1948 + 
  1.476 * 6306
[1] 265360.4

The predicted sale price for a house in Levittown, NY with 3 bedrooms, 1 bathroom, 1050 square feet of living area, 6000 square foot lot size, built in 1948 with $6306 in property taxes is $265,360.

Prediction, revisit

Just like with simple linear regression, we can use the predict() function in R to calculate the appropriate intervals for our predicted values:

new_house <- tibble(
  bedrooms = 3, bathrooms = 1, 
  living_area = 1050, lot_size = 6000, 
  year_built = 1948, property_tax = 6306
  )

predict(price_fit, new_house)
# A tibble: 1 × 1
    .pred
    <dbl>
1 265360.

Confidence interval for \(\hat{\mu}_y\)

Calculate a 95% confidence interval for the estimated mean price of houses in Levittown, NY with 3 bedrooms, 1 bathroom, 1050 square feet of living area, 6000 square foot lot size, built in 1948 with $6306 in property taxes.


predict(price_fit, new_house, type = "conf_int", level = 0.95)
# A tibble: 1 × 2
  .pred_lower .pred_upper
        <dbl>       <dbl>
1     238482.     292239.

Prediction interval for \(\hat{y}\)

Calculate a 95% prediction interval for an individual house in Levittown, NY with 3 bedrooms, 1 bathroom, 1050 square feet of living area, 6000 square foot lot size, built in 1948 with $6306 in property taxes.


predict(price_fit, new_house, type = "pred_int", level = 0.95)
# A tibble: 1 × 2
  .pred_lower .pred_upper
        <dbl>       <dbl>
1     167277.     363444.
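The two intervals share the same center but use different standard errors. As a sketch in standard matrix notation (the symbols \(X\) for the design matrix, \(x_0\) for the new house's predictor vector including a leading 1, \(\hat{\sigma}_\epsilon\) for the estimated error standard deviation, and \(t^{*}\) for the critical value are introduced here and do not appear in the slides):

```latex
\[
\hat{y}_0 \pm t^{*}_{n-p-1}\,\hat{\sigma}_\epsilon \sqrt{x_0^{\top}(X^{\top}X)^{-1}x_0}
\qquad \text{(confidence interval for the mean response)}
\]
\[
\hat{y}_0 \pm t^{*}_{n-p-1}\,\hat{\sigma}_\epsilon \sqrt{1 + x_0^{\top}(X^{\top}X)^{-1}x_0}
\qquad \text{(prediction interval for an individual response)}
\]
```

With \(n = 85\) homes and \(p = 6\) predictors, the critical value comes from a \(t\) distribution with 78 degrees of freedom. The extra 1 under the square root accounts for the variability of an individual house around the mean, which is why the prediction interval (167,277 to 363,444) is much wider than the confidence interval (238,482 to 292,239).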

Cautions

  • Do not extrapolate! Because there are multiple predictor variables, there is the potential to extrapolate in many directions.
  • The multiple regression model only shows association, not causality.
    • To show causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study.
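A simple first guard against extrapolation is to check each new predictor value against the range seen in the data. The sketch below is a minimal version of that check. The helper outside_range() is hypothetical, and to stay self-contained the ranges come from only the ten rows printed in "The data", so in practice the full levittown data should be used. Passing this check is necessary but not sufficient: a house can be inside every individual range and still be an unusual combination of values.

```r
# Predictor values from the ten printed rows (not the full 85 homes)
observed <- data.frame(
  bedrooms     = c(4, 4, 4, 5, 5, 4, 4, 4, 4, 3),
  bathrooms    = c(1, 2, 2, 2, 2.5, 2, 1, 1, 1.5, 2),
  living_area  = c(1380, 1761, 1564, 2904, 1942, 1830, 1585, 941, 1481, 1630),
  lot_size     = c(6000, 7400, 6000, 9898, 7788, 6000, 6000, 6800, 6000, 5998),
  year_built   = c(1948, 1951, 1948, 1949, 1948, 1948, 1948, 1951, 1948, 1948),
  property_tax = c(8360, 5754, 8982, 11664, 8120, 8197, 6223, 2448, 9087, 9430)
)

# Hypothetical helper: names of predictors whose new value falls
# outside the observed [min, max] range
outside_range <- function(new_obs, data) {
  flagged <- vapply(names(new_obs), function(v) {
    new_obs[[v]] < min(data[[v]]) || new_obs[[v]] > max(data[[v]])
  }, logical(1))
  names(flagged)[flagged]
}

# The house from the Prediction slide: nothing is flagged
new_house <- list(
  bedrooms = 3, bathrooms = 1, living_area = 1050,
  lot_size = 6000, year_built = 1948, property_tax = 6306
)
outside_range(new_house, observed)

# A made-up 5,000 square foot house: living_area is flagged
big_house <- modifyList(new_house, list(living_area = 5000))
outside_range(big_house, observed)
```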

Recap

  • Introduced multiple linear regression

  • Interpreted a coefficient ˆβj

  • Used the model to calculate predicted values and the corresponding intervals

🔗 Week 04
