Describe how \(R^2\) and RMSE are used to evaluate models
Assess model’s predictive importance using data splitting and bootstrapping
Computational setup
# load packages
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for modeling
library(usdata)      # for the county_2019 dataset
library(scales)      # for pretty axis labels
library(glue)        # for constructing character strings

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))
These data have been compiled from the 2019 American Community Survey
Uninsurance rate
High school graduation rate
Examining the relationship
The NC Labor and Economic Analysis Division (LEAD) “collects data, conducts research and analysis and publishes reports about the state’s economy and labor market. Information and data produced by LEAD help stakeholders make more informed decisions on business recruitment, education and workforce policies and career development, as well as gain a more extensive view of North Carolina’s economy.”
Suppose that an analyst working for LEAD is interested in the relationship between uninsurance and high school graduation rates in NC counties.
What type of visualization should the analyst make to examine the relationship between these two variables?
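The printout that follows comes from a data frame called county_2019_nc, whose construction is not shown here. A minimal sketch of one way to build it, assuming the county_2019 data frame from the usdata package has state, name, hs_grad, and uninsured columns:

```r
library(tidyverse)
library(usdata)

# Assumption: county_2019 has one row per county and stores full state names
county_2019_nc <- county_2019 |>
  as_tibble() |>
  filter(state == "North Carolina") |>   # keep NC counties only
  select(name, hs_grad, uninsured)       # keep the three columns used here

county_2019_nc
```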
# A tibble: 100 × 3
name hs_grad uninsured
<chr> <dbl> <dbl>
1 Alamance County 86.3 11.2
2 Alexander County 82.4 8.9
3 Alleghany County 77.5 11.3
4 Anson County 80.7 11.1
5 Ashe County 85.1 12.6
6 Avery County 83.6 15.9
7 Beaufort County 87.7 12
8 Bertie County 78.4 11.9
9 Bladen County 81.3 12.9
10 Brunswick County 91.3 9.8
# … with 90 more rows
Uninsurance vs. HS graduation rates
ggplot(county_2019_nc, aes(x = hs_grad, y = uninsured)) +
  geom_point() +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  labs(
    x = "High school graduate", y = "Uninsured",
    title = "Uninsurance vs. HS graduation rates",
    subtitle = "North Carolina counties, 2015 - 2019"
  ) +
  # highlight Durham County with an open circle and a label
  geom_point(
    data = county_2019_nc |> filter(name == "Durham County"),
    aes(x = hs_grad, y = uninsured),
    shape = "circle open", color = "#8F2D56", size = 4, stroke = 2
  ) +
  geom_text(
    data = county_2019_nc |> filter(name == "Durham County"),
    aes(x = hs_grad, y = uninsured, label = name),
    color = "#8F2D56", fontface = "bold", nudge_y = 3, nudge_x = 2
  )
Modeling the relationship
ggplot(county_2019_nc, aes(x = hs_grad, y = uninsured)) +
  geom_point() +
  # add the least squares line, without a confidence band
  geom_smooth(method = "lm", se = FALSE, color = "#8F2D56") +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  labs(
    x = "High school graduate", y = "Uninsured",
    title = "Uninsurance vs. HS graduation rates",
    subtitle = "North Carolina counties, 2015 - 2019"
  )
Fitting the model
With fit():
nc_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(uninsured ~ hs_grad, data = county_2019_nc)

tidy(nc_fit)
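The metric calculations later in these slides use a data frame named nc_aug with a .fitted column, but the step that creates it is not shown. A minimal sketch, assuming it comes from applying augment() to the underlying lm object stored in nc_fit$fit (the data preparation is repeated so the block runs on its own):

```r
library(tidymodels)
library(usdata)

# Assumption: same NC subset of usdata::county_2019 as used in the plots
county_2019_nc <- county_2019 |>
  as_tibble() |>
  filter(state == "North Carolina") |>
  select(name, hs_grad, uninsured)

nc_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(uninsured ~ hs_grad, data = county_2019_nc)

# augment() on the lm object adds .fitted, .resid, and other columns
# to the data used to fit the model
nc_aug <- augment(nc_fit$fit)

nc_aug
```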
What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?
\(R^2\)
Ranges between 0 (terrible predictor) and 1 (perfect predictor)
Has no units
Calculate with rsq() using the augmented data:
rsq(nc_aug, truth = uninsured, estimate = .fitted)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.243
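To see what this number measures, \(R^2\) can also be computed by hand as one minus the ratio of residual to total sum of squares. The sketch below assumes the same NC subset of county_2019 and a plain lm() fit. For a least squares line with an intercept this agrees with the squared correlation between observed and fitted values, which is what yardstick's rsq() reports.

```r
library(tidymodels)
library(usdata)

# Assumption: same NC subset of usdata::county_2019 as above
county_2019_nc <- county_2019 |>
  as_tibble() |>
  filter(state == "North Carolina") |>
  select(name, hs_grad, uninsured)

nc_aug <- augment(lm(uninsured ~ hs_grad, data = county_2019_nc))

# variability left over after fitting the line
ss_res <- sum((nc_aug$uninsured - nc_aug$.fitted)^2)
# total variability in the response around its mean
ss_tot <- sum((nc_aug$uninsured - mean(nc_aug$uninsured))^2)

r_squared     <- 1 - ss_res / ss_tot
r_squared_cor <- cor(nc_aug$uninsured, nc_aug$.fitted)^2

c(by_hand = r_squared, squared_cor = r_squared_cor)
```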
Interpreting \(R^2\)
🗳️ Vote on Ed Discussion
The \(R^2\) of the model for predicting uninsurance rate from high school graduation rate for NC counties is 24.3%. Which of the following is the correct interpretation of this value?
High school graduation rates correctly predict 24.3% of uninsurance rates in NC counties.
24.3% of the variability in uninsurance rates in NC counties can be explained by high school graduation rates.
24.3% of the variability in high school graduation rates in NC counties can be explained by uninsurance rates.
24.3% of the time uninsurance rates in NC counties can be predicted by high school graduation rates.
RMSE
Ranges between 0 (perfect predictor) and infinity (terrible predictor)
Same units as the response variable
Calculate with rmse() using the augmented data:
rmse(nc_aug, truth = uninsured, estimate = .fitted)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 2.07
The value of RMSE is not very meaningful on its own, but it’s useful for comparing across models (more on this when we get to regression with multiple predictors)
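RMSE is the square root of the average squared prediction error, which is why it carries the units of the response. A small sketch with made-up numbers (the truth and estimate vectors below are illustrative only, not taken from the county data):

```r
library(yardstick)

# Made-up observed and predicted uninsurance rates (in percent), for illustration only
truth    <- c(10, 12, 14)
estimate <- c(11, 12, 12)

# square the prediction errors, average them, then take the square root
rmse_by_hand <- sqrt(mean((truth - estimate)^2))
rmse_by_hand  # sqrt(5 / 3), about 1.29 percentage points

# yardstick's vector interface should give the same value
rmse_vec(truth, estimate)
```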
Obtaining \(R^2\) and RMSE
Use rsq() and rmse(), respectively
rsq(nc_aug, truth = uninsured, estimate = .fitted)
rmse(nc_aug, truth = uninsured, estimate = .fitted)
First argument: data frame containing truth and estimate columns
Second argument: name of the column containing truth (observed outcome)
Third argument: name of the column containing estimate (predicted outcome)
Purpose of model evaluation
\(R^2\) tells us how well our model predicts the data we already have
But generally we are interested in prediction for a new observation, not for one that is already in our sample, i.e. out-of-sample prediction
We have a couple of ways of simulating out-of-sample prediction, which let us evaluate the performance of our models before actually getting new data
Splitting data
Spending our data
There are several steps to create a useful model: parameter estimation, model selection, performance assessment, etc.
Doing all of this on all of the data we have available leaves us with no other data to assess our choices
We can allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we’ve done so far)
Simulation: data splitting
Take a random sample of 10% of the data and set aside (testing data)
Fit a model on the remaining 90% of the data (training data)
Use the coefficients from this model to make predictions for the testing data
Repeat 10 times
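The steps above can be sketched with rsample's initial_split(). This is one possible implementation, not necessarily the code behind the slides: the seed, the helper name split_rmse, and the choice to summarize each split by its testing RMSE are all assumptions made for illustration.

```r
library(tidymodels)
library(usdata)

# Assumption: same NC subset of usdata::county_2019 as above
county_2019_nc <- county_2019 |>
  as_tibble() |>
  filter(state == "North Carolina") |>
  select(name, hs_grad, uninsured)

set.seed(345)  # arbitrary seed, for reproducibility only

# Hypothetical helper: one 90/10 split, fit on training, evaluate on testing
split_rmse <- function(i) {
  nc_split <- initial_split(county_2019_nc, prop = 0.9)
  nc_train <- training(nc_split)
  nc_test  <- testing(nc_split)

  train_fit <- linear_reg() |>
    set_engine("lm") |>
    fit(uninsured ~ hs_grad, data = nc_train)

  # predict for the held-out counties and compute out-of-sample RMSE
  predict(train_fit, new_data = nc_test) |>
    bind_cols(nc_test) |>
    rmse(truth = uninsured, estimate = .pred) |>
    mutate(iteration = i)
}

# repeat 10 times and stack the results
split_results <- map(1:10, split_rmse) |> bind_rows()

split_results
```

Comparing the ten .estimate values gives a rough sense of how much out-of-sample performance varies with the particular counties held out.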
Predictive performance
How consistent are the predictions for different testing datasets?
How consistent are the predictions for counties with high school graduation rates in the middle of the plot vs. in the edges?
Bootstrapping
Bootstrapping our data
The idea behind bootstrapping is that if a given observation exists in a sample, there may be more like it in the population
With bootstrapping, we simulate resampling from the population by resampling from the sample we observed
Bootstrap samples are sampled with replacement from the original sample and are the same size as the original sample
For example, if our sample consists of the observations {A, B, C}, bootstrap samples could be {A, A, B}, {A, C, A}, {B, C, C}, {A, B, C}, etc.
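In base R, sampling with replacement is what sample() does when replace = TRUE. A small sketch using the {A, B, C} example (the seed and the number of draws are arbitrary):

```r
set.seed(210)  # arbitrary seed, for reproducibility only

original_sample <- c("A", "B", "C")

# draw four bootstrap samples, each the same size as the original
boot_samples <- replicate(
  4,
  sample(original_sample, size = length(original_sample), replace = TRUE),
  simplify = FALSE
)

boot_samples
```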
Simulation: bootstrapping
Take a bootstrap sample – sample with replacement from the original data, same size as the original data
Fit model to the sample and make predictions for that sample
Repeat many times
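A sketch of these steps using rsample's bootstraps(). As with the data splitting example, the seed, the number of resamples (100), the helper name boot_rmse, and the use of RMSE as the summary are assumptions for illustration rather than the code behind the slides.

```r
library(tidymodels)
library(usdata)

# Assumption: same NC subset of usdata::county_2019 as above
county_2019_nc <- county_2019 |>
  as_tibble() |>
  filter(state == "North Carolina") |>
  select(name, hs_grad, uninsured)

set.seed(210)  # arbitrary seed, for reproducibility only

# each split holds one bootstrap sample: drawn with replacement,
# same number of rows as the original data
nc_boot <- bootstraps(county_2019_nc, times = 100)

# Hypothetical helper: fit to one bootstrap sample and predict for that sample
boot_rmse <- function(split) {
  boot_sample <- analysis(split)

  boot_fit <- linear_reg() |>
    set_engine("lm") |>
    fit(uninsured ~ hs_grad, data = boot_sample)

  predict(boot_fit, new_data = boot_sample) |>
    bind_cols(boot_sample) |>
    rmse(truth = uninsured, estimate = .pred)
}

boot_results <- map(nc_boot$splits, boot_rmse) |> bind_rows()

boot_results
```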
Predictive performance
How consistent are the predictions for different bootstrap datasets?
How consistent are the predictions for counties with high school graduation rates in the middle of the plot vs. in the edges?
Recap
Motivated the importance of model evaluation
Described how \(R^2\) and RMSE are used to evaluate models
Assessed model’s predictive importance using data splitting and bootstrapping