AE 08: Feature Engineering- Model workflow

The Office

Published

October 5, 2022

Important

Go to the course GitHub organization and locate your ae-08- to get started.

The AE is due on GitHub by Saturday, October 08 at 11:59pm.

Packages

library(tidyverse)
library(tidymodels)
library(viridis)
library(knitr)

Load data

office_ratings <- read_csv("data/office_ratings.csv")

Exploratory data analysis

Below are two of the exploratory data analysis plots from lecture.

ggplot(office_ratings, aes(x = imdb_rating)) +
  geom_histogram(binwidth = 0.25) +
  labs(
    title = "The Office ratings",
    x = "IMDB rating"
  )

office_ratings |>
  mutate(season = as_factor(season)) |>
  ggplot(aes(x = season, y = imdb_rating, color = season)) +
  geom_boxplot() +
  geom_jitter() +
  guides(color = "none") +
  labs(
    title = "The Office ratings",
    x = "Season",
    y = "IMDB rating"
  ) +
  scale_color_viridis_d()

Test/train split

set.seed(123)
office_split <- initial_split(office_ratings) # prop = 3/4 by default
office_train <- training(office_split)
office_test  <- testing(office_split)

Build a recipe

office_rec <- recipe(imdb_rating ~ ., data = office_train) |>
  # make title's role ID
  update_role(title, new_role = "ID") |>
  # extract day of week and month of air_date
  step_date(air_date, features = c("dow", "month")) |>
  # identify holidays and add indicators
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  ) |>
  # turn season into factor
  step_num2factor(season, levels = as.character(1:9)) |>
  # make dummy variables
  step_dummy(all_nominal_predictors()) |>
  # remove zero variance predictors
  step_zv(all_predictors())

office_rec

Recipe

Inputs:

      role #variables
        ID          1
   outcome          1
 predictor          4

Operations:

Date features from air_date
Holiday features from air_date
Factor variables from season
Dummy variables from all_nominal_predictors()
Zero variance filter on all_predictors()

Workflows and model fitting

Specify model

office_spec <- linear_reg() |>
  set_engine("lm")

office_spec

Linear Regression Model Specification (regression)

Computational engine: lm

Build workflow

office_wflow <- workflow() |>
  add_model(office_spec) |>
  add_recipe(office_rec)

office_wflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_date()
• step_holiday()
• step_num2factor()
• step_dummy()
• step_zv()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

Fit model to training data

office_fit <- office_wflow |>
  fit(data = office_train)

tidy(office_fit) |>
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	6.396	0.510	12.532	0.000
episode	-0.004	0.017	-0.230	0.818
total_votes	0.000	0.000	9.074	0.000
season_X2	0.811	0.327	2.482	0.014
season_X3	1.042	0.343	3.040	0.003
season_X4	1.090	0.295	3.695	0.000
season_X5	1.082	0.348	3.109	0.002
season_X6	1.004	0.367	2.735	0.007
season_X7	1.018	0.352	2.894	0.005
season_X8	0.497	0.348	1.430	0.155
season_X9	0.621	0.345	1.802	0.074
air_date_dow_Tue	0.382	0.422	0.904	0.368
air_date_dow_Thu	0.284	0.389	0.731	0.466
air_date_month_Feb	-0.060	0.132	-0.452	0.652
air_date_month_Mar	-0.075	0.156	-0.481	0.631
air_date_month_Apr	0.095	0.177	0.539	0.591
air_date_month_May	0.156	0.213	0.734	0.464
air_date_month_Sep	-0.078	0.223	-0.348	0.728
air_date_month_Oct	-0.176	0.174	-1.014	0.313
air_date_month_Nov	-0.156	0.149	-1.046	0.298
air_date_month_Dec	0.170	0.149	1.143	0.255

Evaluate model on training data

Make predictions

Important

Fill in the code and make #| eval: true before rendering the document.

office_train_pred <- predict(office_fit, ______) |>
  bind_cols(_____)

Calculate $R^{2}$

Important

Fill in the code and make #| eval: true before rendering the document.

rsq(office_train_pred, truth = _____, estimate = _____)

What is preferred - high or low values of $R^{2}$ ?

Calculate RMSE

Important

Fill in the code and make #| eval: true before rendering the document.

rmse(______, ________, ________)

What is preferred - high or low values of RMSE?

Is this RMSE considered high or low? Hint: Consider the range of the response variable to answer this question.

office_train |>
  summarise(min = min(imdb_rating), max = max(imdb_rating))

# A tibble: 1 × 2
    min   max
  <dbl> <dbl>
1   6.7   9.7

Evaluate model on testing data

Answer the following before evaluating the model performance on testing data:

Do you expect $R^{2}$ on the testing data to be higher or lower than the $R^{2}$ calculated using training data? Why?
Do you expect RMSE on the testing data to be higher or lower than the $R^{2}$ calculated using training data? Why?

Make predictions

# fill in code to make predictions from testing data

Calculate $R^{2}$

# fill in code to calculate $R^2$ for testing data

Calculate RMSE

# fill in code to calculate RMSE for testing data

Compare training and testing data results

Compare the $R^{2}$ for the training and testing data. Is this what you expected?
Compare the RMSE for the training and testing data. Is this what you expected?

Packages

Load data

Exploratory data analysis

Test/train split

Build a recipe

Workflows and model fitting

Specify model

Build workflow

Fit model to training data

Evaluate model on training data

Make predictions

Calculate R2

Calculate RMSE

Evaluate model on testing data

Make predictions

Calculate R2

Calculate RMSE

Compare training and testing data results

Calculate $R^{2}$

Calculate $R^{2}$