Model workflow
Prof. Maria Tackett
Oct 05, 2022
UConn Sports Analytics Symposium: October 08 (online options available)
Code for Good hosted by Hack Duke: October 22 - 23
Data: The data come from data.world, by way of TidyTuesday
Goal: Predict imdb_rating
from other variables in the dataset
# A tibble: 188 × 6
season episode title imdb_rating total_votes air_date
<dbl> <dbl> <chr> <dbl> <dbl> <date>
1 1 1 Pilot 7.6 3706 2005-03-24
2 1 2 Diversity Day 8.3 3566 2005-03-29
3 1 3 Health Care 7.9 2983 2005-04-05
4 1 4 The Alliance 8.1 2886 2005-04-12
5 1 5 Basketball 8.4 3179 2005-04-19
6 1 6 Hot Girl 7.8 2852 2005-04-26
7 2 1 The Dundies 8.7 3213 2005-09-20
8 2 2 Sexual Harassment 8.2 2736 2005-09-27
9 2 3 Office Olympics 8.4 2742 2005-10-04
10 2 4 The Fire 8.4 2713 2005-10-11
# … with 178 more rows
Step 1: Create an initial split:
Step 2: Save training data
Step 3: Save testing data
# A tibble: 141 × 6
season episode title imdb_rating total_votes air_date
<dbl> <dbl> <chr> <dbl> <dbl> <date>
1 8 18 Last Day in Florida 7.8 1429 2012-03-08
2 9 14 Vandalism 7.6 1402 2013-01-31
3 2 8 Performance Review 8.2 2416 2005-11-15
4 9 5 Here Comes Treble 7.1 1515 2012-10-25
5 3 22 Beach Games 9.1 2783 2007-05-10
6 7 1 Nepotism 8.4 1897 2010-09-23
7 3 15 Phyllis' Wedding 8.3 2283 2007-02-08
8 9 21 Livin' the Dream 8.9 2041 2013-05-02
9 9 18 Promos 8 1445 2013-04-04
10 8 12 Pool Party 8 1612 2012-01-19
# … with 131 more rows
We prefer simple (parsimonious) models when possible, but parsimony does not mean sacrificing accuracy (or predictive performance) in the interest of simplicity
Variables that go into the model and how they are represented are just as critical to success of the model
Feature engineering allows us to get creative with our predictors in an effort to make them more useful for our model (to increase its predictive performance)
Create a recipe for feature engineering steps to be applied to the training data
Fit the model to the training data after these steps have been applied
Using the model estimates from the training data, predict outcomes for the test data
Evaluate the performance of the model on the test data
title
isn’t a predictor, but we might want to keep it around as an ID
New features for day of week and month
prep()
to train the recipe and bake()
to apply it to your data# determine required parameters to be estimated
office_rec_trained <- prep(office_rec)
# apply recipe computations to data
bake(office_rec_trained, office_train) |>
glimpse()
Rows: 141
Columns: 8
$ season <dbl> 8, 9, 2, 9, 3, 7, 3, 9, 9, 8, 5, 5, 9, 6, 7, 6, 5, 2, 2…
$ episode <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26, 12, 1, 20,…
$ title <fct> "Last Day in Florida", "Vandalism", "Performance Review…
$ total_votes <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2041, 1445, 1…
$ air_date <date> 2012-03-08, 2013-01-31, 2005-11-15, 2012-10-25, 2007-0…
$ imdb_rating <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0, 8.0, 8.7, …
$ air_date_dow <fct> Thu, Thu, Tue, Thu, Thu, Thu, Thu, Thu, Thu, Thu, Thu, …
$ air_date_month <fct> Mar, Jan, Nov, Oct, May, Sep, Feb, May, Apr, Jan, May, …
Identify holidays in air_date
, then remove air_date
office_rec <- office_rec |>
step_holiday(
air_date,
holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"),
keep_original_cols = FALSE
)
office_rec
Recipe
Inputs:
role #variables
ID 1
outcome 1
predictor 4
Operations:
Date features from air_date
Holiday features from air_date
Rows: 141
Columns: 11
$ season <dbl> 8, 9, 2, 9, 3, 7, 3, 9, 9, 8, 5, 5, 9, 6, 7…
$ episode <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26…
$ title <fct> "Last Day in Florida", "Vandalism", "Perfor…
$ total_votes <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2…
$ imdb_rating <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0…
$ air_date_dow <fct> Thu, Thu, Tue, Thu, Thu, Thu, Thu, Thu, Thu…
$ air_date_month <fct> Mar, Jan, Nov, Oct, May, Sep, Feb, May, Apr…
$ air_date_USThanksgivingDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USChristmasDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USNewYearsDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USIndependenceDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Convert season
to factor
Rows: 141
Columns: 11
$ season <fct> 8, 9, 2, 9, 3, 7, 3, 9, 9, 8, 5, 5, 9, 6, 7…
$ episode <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26…
$ title <fct> "Last Day in Florida", "Vandalism", "Perfor…
$ total_votes <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2…
$ imdb_rating <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0…
$ air_date_dow <fct> Thu, Thu, Tue, Thu, Thu, Thu, Thu, Thu, Thu…
$ air_date_month <fct> Mar, Jan, Nov, Oct, May, Sep, Feb, May, Apr…
$ air_date_USThanksgivingDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USChristmasDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USNewYearsDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USIndependenceDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Convert all nominal (categorical) predictors to factors
Rows: 141
Columns: 33
$ episode <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26…
$ title <fct> "Last Day in Florida", "Vandalism", "Perfor…
$ total_votes <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2…
$ imdb_rating <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0…
$ air_date_USThanksgivingDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USChristmasDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USNewYearsDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USIndependenceDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ season_X2 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ season_X3 <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ season_X4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ season_X5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
$ season_X6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ season_X7 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ season_X8 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ season_X9 <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0…
$ air_date_dow_Mon <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_dow_Tue <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_dow_Wed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_dow_Thu <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ air_date_dow_Fri <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_dow_Sat <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Feb <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Mar <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Apr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1…
$ air_date_month_May <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0…
$ air_date_month_Jun <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Jul <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Aug <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Sep <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ air_date_month_Oct <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Nov <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Dec <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Remove all predictors that contain only a single value
Recipe
Inputs:
role #variables
ID 1
outcome 1
predictor 4
Operations:
Date features from air_date
Holiday features from air_date
Factor variables from season
Dummy variables from all_nominal_predictors()
Zero variance filter on all_predictors()
Rows: 141
Columns: 22
$ episode <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26, 12, 1,…
$ title <fct> "Last Day in Florida", "Vandalism", "Performance Re…
$ total_votes <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2041, 144…
$ imdb_rating <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0, 8.0, 8…
$ season_X2 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ season_X3 <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ season_X4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ season_X5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, …
$ season_X6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, …
$ season_X7 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ season_X8 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ season_X9 <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, …
$ air_date_dow_Tue <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ air_date_dow_Thu <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ air_date_month_Feb <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ air_date_month_Mar <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ air_date_month_Apr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
$ air_date_month_May <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
$ air_date_month_Sep <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
$ air_date_month_Oct <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ air_date_month_Nov <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ air_date_month_Dec <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
office_rec <- recipe(imdb_rating ~ ., data = office_train) |>
# make title's role ID
update_role(title, new_role = "ID") |>
# extract day of week and month of air_date
step_date(air_date, features = c("dow", "month")) |>
# identify holidays and add indicators
step_holiday(
air_date,
holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"),
keep_original_cols = FALSE
) |>
# turn season into factor
step_num2factor(season, levels = as.character(1:9)) |>
# make dummy variables
step_dummy(all_nominal_predictors()) |>
# remove zero variance predictors
step_zv(all_predictors())
Workflows bring together models and recipes so that they can be easily applied to both the training and test data.
See next slide for workflow…
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_date()
• step_holiday()
• step_num2factor()
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
# A tibble: 21 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.40 0.510 12.5 1.51e-23
2 episode -0.00393 0.0171 -0.230 8.18e- 1
3 total_votes 0.000375 0.0000414 9.07 2.75e-15
4 season_X2 0.811 0.327 2.48 1.44e- 2
5 season_X3 1.04 0.343 3.04 2.91e- 3
6 season_X4 1.09 0.295 3.70 3.32e- 4
7 season_X5 1.08 0.348 3.11 2.34e- 3
8 season_X6 1.00 0.367 2.74 7.18e- 3
9 season_X7 1.02 0.352 2.89 4.52e- 3
10 season_X8 0.497 0.348 1.43 1.55e- 1
# … with 11 more rows
So many predictors!
# A tibble: 21 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.40 0.510 12.5 1.51e-23
2 episode -0.00393 0.0171 -0.230 8.18e- 1
3 total_votes 0.000375 0.0000414 9.07 2.75e-15
4 season_X2 0.811 0.327 2.48 1.44e- 2
5 season_X3 1.04 0.343 3.04 2.91e- 3
6 season_X4 1.09 0.295 3.70 3.32e- 4
7 season_X5 1.08 0.348 3.11 2.34e- 3
8 season_X6 1.00 0.367 2.74 7.18e- 3
9 season_X7 1.02 0.352 2.89 4.52e- 3
10 season_X8 0.497 0.348 1.43 1.55e- 1
11 season_X9 0.621 0.345 1.80 7.41e- 2
12 air_date_dow_Tue 0.382 0.422 0.904 3.68e- 1
13 air_date_dow_Thu 0.284 0.389 0.731 4.66e- 1
14 air_date_month_Feb -0.0597 0.132 -0.452 6.52e- 1
15 air_date_month_Mar -0.0752 0.156 -0.481 6.31e- 1
16 air_date_month_Apr 0.0954 0.177 0.539 5.91e- 1
17 air_date_month_May 0.156 0.213 0.734 4.64e- 1
18 air_date_month_Sep -0.0776 0.223 -0.348 7.28e- 1
19 air_date_month_Oct -0.176 0.174 -1.01 3.13e- 1
20 air_date_month_Nov -0.156 0.149 -1.05 2.98e- 1
21 air_date_month_Dec 0.170 0.149 1.14 2.55e- 1
# A tibble: 141 × 7
.pred season episode title imdb_rating total_votes air_date
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <date>
1 7.57 8 18 Last Day in Florida 7.8 1429 2012-03-08
2 7.77 9 14 Vandalism 7.6 1402 2013-01-31
3 8.31 2 8 Performance Review 8.2 2416 2005-11-15
4 7.67 9 5 Here Comes Treble 7.1 1515 2012-10-25
5 8.84 3 22 Beach Games 9.1 2783 2007-05-10
6 8.33 7 1 Nepotism 8.4 1897 2010-09-23
7 8.46 3 15 Phyllis' Wedding 8.3 2283 2007-02-08
8 8.14 9 21 Livin' the Dream 8.9 2041 2013-05-02
9 7.87 9 18 Promos 8 1445 2013-04-04
10 7.74 8 12 Pool Party 8 1612 2012-01-19
# … with 131 more rows
Percentage of variability in the IMDB ratings explained by the model.
Are models with high or low \(R^2\) more preferable?
An alternative model performance statistic: root mean square error.
\[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]
Are models with high or low RMSE are more preferable?
Is this RMSE considered low or high?
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.302
who cares about predictions on training data?
# A tibble: 47 × 7
.pred season episode title imdb_rating total_votes air_date
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <date>
1 8.03 1 2 Diversity Day 8.3 3566 2005-03-29
2 7.98 1 3 Health Care 7.9 2983 2005-04-05
3 8.41 2 4 The Fire 8.4 2713 2005-10-11
4 8.35 2 5 Halloween 8.2 2561 2005-10-18
5 8.35 2 9 E-Mail Surveillance 8.4 2527 2005-11-22
6 8.68 2 12 The Injury 9 3282 2006-01-12
7 8.32 2 14 The Carpet 7.9 2342 2006-01-26
8 8.93 2 22 Casino Night 9.3 3644 2006-05-11
9 8.80 3 1 Gay Witch Hunt 8.9 3087 2006-09-21
10 8.37 3 5 Initiation 8.2 2254 2006-10-19
# … with 37 more rows
RMSE of model fit to testing data
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.411
R-sq of model fit to testing data
metric | train | test | comparison |
---|---|---|---|
RMSE | 0.302 | 0.411 | RMSE lower for training |
R-squared | 0.67 | 0.468 | R-squared higher for training |
The training set does not have the capacity to be a good arbiter of performance.
It is not an independent piece of information; predicting the training set can only reflect what the model already knows.
Suppose you give a class a test, then give them the answers, then provide the same test. The student scores on the second test do not accurately reflect what they know about the subject; these scores would probably be higher than their results on the first test.