library(tidyverse)
library(tidymodels)
library(knitr)
AE 16: Exam 02 Review
Packages
Part 1: Multiple linear regression
The goal of this analysis is to use characteristics of loan applications to predict the interest rate on the loan. We will use a random sample of loans given by Lending Club, a peer-to-peer lending service. The random sample was drawn from loans_full_schema
in the openintro R package. Click here for the codebook.
<- read_csv("data/loans-sample.csv") loans
Exercise 1
Split the data into training (75%) and testing (25%) sets. Use seed 1205
.
# add code here
Exercise 2
Write the equation of the statistical model for predicting interest rate (interest_rate
) from debt to income ratio (debt_to_income
), the term of loan (term
), the number of inquiries (credit checks) into the applicant’s credit during the last 12 months (inquiries_last_12m
), whether there are any bankruptcies listed in the public record for this applicant (bankrupt
), and the type of application (application_type
). The model should allow for the effect of debt to income ratio on interest rate to vary by application type.
[Add model here]
Exercise 3
Specify a linear regression model. Call it loans_spec
.
# add code here
Exercise 4
Use the training data to build the following recipe:
- Predict
interest_rate
fromdebt_to_income
,term
,inquiries_last_12m
,public_record_bankrupt
, andapplication_type
. - Mean center
debt_to_income
. - Make
term
a factor. - Create a new variable:
bankrupt
that takes on the value “no” ifpublic_record_bankrupt
is 0 and the value “yes” ifpublic_record_bankrupt
is 1 or higher. Then, removepublic_record_bankrupt
. - Interact
application_type
withdebt_to_income
. - Create dummy variables where needed and drop any zero variance variables.
# add code here
Exercise 5
Create the workflow that brings together the model specification and recipe.
# add code here
Exercise 6
Conduct 10-fold cross validation. Use the seed 1205
. You will only collect the default metrics, \(R^2\) and RMSE. You do not need to collect AIC, BIC or Adj. \(R^2\).
# add code here
Exercise 7
Collect and summarize \(R^2\) and RMSE metrics from your CV resamples.
# add code here
Why are we focusing on \(R^2\) and RMSE instead of adjusted \(R^2\), AIC, BIC?
[Add response here]
Exercise 8
Refit the model on the entire training data.
# add code here
Then, interpret the following in the context of the data:
Intercept
debt_to_income
for joint applicationsterm
Part 2: Logistic regression
Data
As part of a study of the effects of predatory invasive crab species on snail populations, researchers measured the mean closing forces and the propodus heights of the claws on several crabs of three species.
<- read_csv("data/claws.csv") |>
claws mutate(lb = as_factor(lb))
The data set contains following variables:
force
: Closing force of claw (newtons)height
: Propodus height (mm)species
: Crab species - Cp(Cancer productus), Hn (Hemigrapsus nudus), Lb(Lophopanopeus bellus)lb
: 1 if Lophopanopeus bellus species, 0 otherwisehn
: 1 if Hemigrapsus nudus species, 0 otherwisecp
: 1 if Cancer productus species, 0 otherwiseforce_cent
: mean centered forceheight_cent
: mean centered height
Getting started
- Why do we use the log-odds as the response variable?
[Add response here]
Fill in the blanks:
Use log-odds to …
Use odds to …
Use probabilities to …
Suppose we want to use force to determine whether or not a crab is from the Lophopanopeus bellus (Lb) species. Why should we use a logistic regression model for this analysis?
Exercise 9
We will use force_cent
, the mean-centered variable for force in the model. The model output is below. Write the equation of the model produced by R. Don’t forget to fill in the blanks for …
.
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -0.798 | 0.358 | -2.233 | 0.026 | -1.542 | -0.123 |
force_cent | 0.043 | 0.039 | 1.090 | 0.276 | -0.034 | 0.123 |
Let \(\pi\) be…
\[\log\Big(\frac{\hat{\pi}}{1 - \hat{\pi}}\Big) = \]
Exercise 10
Interpret the intercept in the context of the data.
Exercise 11
Interpret the effect of force in the context of the data.
Exercise 12
Now let’s consider adding height to the model. Fit the model that includes height_cent
. Then use AIC to choose the model that best fits the data.
<- logistic_reg() |>
lb_fit_2 set_engine("glm") |>
fit(lb ~ force_cent + height_cent, data = claws)
tidy(lb_fit_2, conf.int = TRUE) |>
kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -1.130 | 0.463 | -2.443 | 0.015 | -2.167 | -0.306 |
force_cent | 0.211 | 0.092 | 2.279 | 0.023 | 0.056 | 0.424 |
height_cent | -0.895 | 0.398 | -2.249 | 0.025 | -1.815 | -0.234 |
Exercise 13
What do the following mean in the context of this data. Explain and calculate them.
Sensitivity: …
Specificity: …
Negative predictive power: …