library(tidyverse)
library(tidymodels)
library(schrute) #install.packages("schrute")
library(lubridate)
library(knitr)
Lab 04: The Office
Feature engineering
Introduction
In today’s lab you will analyze data from the schrute package to predict IMDB scores for episodes of The Office.
This is a different data source than the one we’ve used in class last week.
Learning goals
By the end of the lab you will…
- engineer features based on episode scripts
- train a model
- make predictions
- evaluate model performance on training and testing data
- practice collaborating with others using a single Github repo
Meet your team!
Click here to see the team assignments for STA 210. This will be your team for labs and the final project. Before you get started on the lab, complete the following:
✅ Come up with a team name. You can’t use the same name as another team, so I encourage you to be creative! Your TA will get your team name by the end of lab.
✅ Fill out the team agreement. This will help you figure out a plan for communication,and working together during labs and outside of lab times. You can find the team agreement in the GitHub repo team-agreement-[github_team_name].
Have one person from the team clone the repo and start a new RStudio project. This person will type the team’s responses as you discuss the sections of the agreement. No one else in the team should type at this point but should be contributing to the discussion.
Be sure to push the completed agreement to GitHub. Each team member can refer to the document in this repo or download the PDF of the agreement for future reference. You do not need to submit the agreement on Gradescope.
Getting started
- A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
- Go to the sta210-fa22 organization on GitHub. Click on the repo with the prefix lab-04. It contains the starter documents you need to complete the lab.
- Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
- Do not make any changes to the
.qmd
file until the instructions tell you do to so.
Workflow: Using Git and GitHub as a team
⌨️ Team Member 1: Hands on the keyboard.
🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!1
Change the author to your team name and include each team member’s name in the author
field of the YAML in the following format: Team Name: Member 1, Member 2, Member 3, Member 4
.
Team Member 1: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd
and PDF documents, and finally push the changes to GitHub.
Team Members 2, 3, 4: Once Team Member 1 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the updated name in your .qmd
file.
Packages
The following packages are used in the lab.
Data: The Office
The data for this lab comes from the schrute package and it’s in the a data set called theoffice
. This data set contains the entire script transcriptions from The Office.
Let’s start by taking a peek at the data.
glimpse(theoffice)
Rows: 55,130
Columns: 12
$ index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date <chr> "2005-03-24", "2005-03-24", "2005-03-24", "2005-03-24…
There are 55130 observations and 12 columns in this data set. The variable names are as follows.
names(theoffice)
[1] "index" "season" "episode" "episode_name"
[5] "director" "writer" "character" "text"
[9] "text_w_direction" "imdb_rating" "total_votes" "air_date"
Each row in the data set is a line spoken by a character in a given episode of the show. This means some information at the episode level (e.g., imdb_rating
, air_date
, etc. are repeated across the rows that belong to a single episode.
The air_date
variable is coded as a factor, which is undesirable for the analysis. We’ll want to parse that variable later into its components during feature engineering. So, for now, let’s convert it to date.
<- theoffice |>
theoffice mutate(air_date = ymd(as.character(air_date)))
Let’s take a look at the data to confirm we’re happy with how each of the variables are encoded.
glimpse(theoffice)
Rows: 55,130
Columns: 12
$ index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date <chr> "2005-03-24", "2005-03-24", "2005-03-24", "2005-03-24…
Exercises
Data prep
⌨️ Team Member 1: Hands still on the keyboard. Write the answers to Exercises 1 and 2.
🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!
Exercise 1
Identify episodes where Halloween, Valentine’s Day, and Christmas are mentioned.
- First, convert all
text
to lowercase with the functionstr_to_lower()
. - Then, create three new variables (
halloween_mention
,valentine_mention
, andchristmas_mention
) that take the value1
if the character string"halloween"
,"valentine"
, or"christmas"
appears in the text, respectively, and0
otherwise.
Below is code to help you get started.
<- theoffice |>
theoffice mutate(
text = ___(text),
halloween_mention = if_else(str_detect(text, "___"), ___, ___),
valentine_mention = ___,
___ = ___
)
Exercise 2
In this exercise we’ll accomplish two separate tasks. We’re doing both tasks all at once, because we’re going to drastically change our data frame, from one row per line spoken to one row per episode. We’ll call the resulting data frame office_episodes
.
The two tasks are as follows:
- Task 1. Identify episodes where the word “halloween”, “valentine”, or “christmas” were ever mentioned, using variables you created in Exercise 1.
- Task 2. Calculate the percentage of lines spoken by Jim, Pam, Michael, and Dwight for each episode of The Office.
Below are instructions and starter code to get you started with these tasks.
- Start by grouping
theoffice
data byseason
,episode
,episode_name
,imdb_rating
,total_votes
, andair_date
. (These variables, except forseason
have the same value for each given episode, hence grouping by them allows us to make sure they appear in the output of this pipeline.) - Use
summarize()
to calculate the desired features at the season-episode level. - Task 1:
- Calculate the number of lines per season per episode. You can might name this new variable
n_lines
. - Then, calculate the proportion of lines in that episode spoken by each of the four
character
s Jim, Pam, Michael, and Dwight. Name these new variableslines_jim
,lines_pam
,lines_michael
, andlines_dwight
, respectively.
- Calculate the number of lines per season per episode. You can might name this new variable
- Task 2:
- Create a variable called
halloween
that sums up the1
s inhalloween_mention
at the season-episode level and takes on the value"yes"
if the sum is greater than or equal to 1, or"no"
otherwise. - Do something similar for new variables
valentine
andchristmas
as well based on values fromvalentine_mention
andchristmas_mention
.
- Create a variable called
- Finish up your
summarize()
statement by dropping the groups, so the resulting data frame is no longer grouped. Additionally, removen_lines
(we won’t use that variable in our analysis, we only calculated it as an intermediary step).
<- theoffice |>
office_episodes group_by(___) |>
summarize(
n_lines = n(),
lines_jim = sum(character == "___") / n_lines,
lines_pam = ___,
lines_michael = ___,
lines_dwight = ___,
halloween = if_else(sum(___) >= 1, "yes", "no"),
valentine = if_else(___, "___", "___"),
christmas = if_else(___, "___", "___"),
.groups = "drop"
|>
) select(-n_lines)
Team Member 1: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd
and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.
Team Members 2, 3, 4: Once Team Member 1 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the responses to Exercises 1 and 2 in your .qmd
file.
Now it’s time for a hand off…
⌨️ Team Member 2: Hands on the keyboard. Write the answers to Exercises 3 - 5.
🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!
Exercise 3
The Michael Scott character (played by Steve Carrell) left the show at the end of Season 7. Add an indicator variable, michael
, that takes on the value "yes"
if Michael Scott (Steve Carrell) was in the show, and "no"
if not.
<- office_episodes |>
office_episodes mutate(michael = if_else(season > ___, "___", "___"))
Exercise 4
Print out the dimensions (dim()
) of the new data set you created as well as the names()
of the columns in the data set.
Your new data set, office_episodes,
should have 186 rows and 14 columns. The column names should be season
, episode
, episode_name
, imdb_rating
, total_votes
, air_date
, lines_jim
, lines_pam
, lines_michael
, lines_dwight
, halloween
, valentine
, christmas
, and michael
. If you are not matching these numbers or columns, go back and try to figure out where you went wrong. Or ask your TA for help!
Exploratory data analysis
This would be a good place to conduct some exploratory data analysis (EDA). For example, plot the proportion of lines spoken by each character over time. Or calculate the percentage of episodes that mention Halloween, or Valentine’s Day, or Christmas. Given we have limited time in the lab we’re not going to ask you to report EDA results as part of this lab, but we’re noting this here to provide suggestions for how you might go about structuring your project.
Modeling prep
Exercise 5
Split the data into training (75%) and testing (25%). Save the training and testing data as office_train
and office_test
respectively. Use seed 123.
Naming suggestion: Call the initial split office_split
, the training data office_train
, and testing data office_test
.
set.seed(123)
<- ___(office_episodes)
office_split <- ___(office_split)
office_train <- ___(___) office_test
Team Member 2: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd
and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.
Team Members 1, 3, 4: Once Team Member 2 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the responses to Exercises 3 - 5 in your .qmd
file.
Now it’s time for another hand off…
⌨️ Team Member 3: Hands on the keyboard. Write the answers to Exercises 6 - 8.
🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!
Exercise 6
Specify a linear regression model with engine "lm"
and call it office_spec
.
Naming suggestion: Call the model specification office_spec
.
<- ___ office_spec
Exercise 7
Create a recipe that performs feature engineering using the following steps (in the given order):
update_role()
: updates the role ofepisode_name
to not be a predictor (be an ID)step_rm()
: removesair_date
andseason
as predictorsstep_dummy()
: creates dummy variables forall_nominal_predictors()
step_zv()
: removes all zero variance predictors
Naming suggestion: Call the recipe office_rec
.
<- recipe(imdb_rating ~ ., data = office_train) |>
office_rec ___
Exercise 8
Build a model workflow for fitting the model specified earlier and using the recipe you developed to preprocess the data.
Naming suggestion: Call the model workflow office_wflow
.
<- workflow() |>
office_wflow add_model(___) |>
add_recipe(___)
Model fit and evaluation
Team Member 3: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd
and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.
Team Members 1, 2, 4: Once Team Member 3 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the responses to Exercise 6 - 8 in your .qmd
file.
Now it’s time for another hand off…
⌨️ Team Member 4: Hands on the keyboard. Write the answers to Exercises 9 - 11.
🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!
Exercise 9
Fit the model to training data, neatly display the model output, and interpret two of the slope coefficients.
Naming suggestion: Call the model fit office_fit
.
<- office_wflow |>
office_fit fit(data = ___)
___
Exercise 10
Calculate predicted imdb_rating
for the training data using the predict()
function. Then, bind the columns from the training data to this result.
Using this data frame, create a scatterplot of predicted and observed IMDB ratings for the training data.
Naming suggestion: Call the resulting data frame office_train_pred
.
Stretch goal. Add episode names, using geom_text()
, for episodes with much higher and much lower observed IMDB ratings compared to others.
Exercise 11
Calculate the R-squared and RMSE for this model for predictions on the training data.
Team Member 4: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd
and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.
Team Members 1, 2, 3: Once Team Member 4 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the responses to Exercise 9 - 11 in your .qmd
file.
Now it’s time for another hand off…
⌨️ Team Member 2: Hands on the keyboard. Write the answers to Exercises 12 - 14.
🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!
Exercise 12
Repeat Exercise 10, but with testing data.
Naming suggestion: Call the resulting data frame office_test_pred
.
Exercise 13
Based on your visualization on Exercise 12, speculate on whether you expect the R-squared and RMSE for this model to be higher or lower for predictions on the testing data compared to those on the training data, or do you expect them to be the same? Explain your reasoning.
Exercise 14
Check your intuition in Exercise 13 by actually calculating the R-squared and RMSE for this model for predictions on the training data. Comment on whether your intuition is confirmed or not.
Team Member 2: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd
and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.
Team Members 1, 3, 4: Once Team Member 2 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the responses to Exercises 12 - 14 in your .qmd
file.
Wrapping up
⌨️ Team Member 3: Hands on the keyboard. Make any edits as needed.
🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!
Team Member 3: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd
and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.
Team Members 1, 2, 4: Once Team Member 3 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the final version of your .qmd
file.
Submission
One team member submit the assignment:
- Go to http://www.gradescope.com and log in using your NetID credentials.
- Click on your STA 210 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Select all team members’ names, so they receive credit on the assignment. Click here for video on adding team members to assignment on Gradescope.
- Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.
Grading
Total points available: 50 points.
Component | Points |
---|---|
Ex 1 - 14 | 42 |
Workflow & formatting | 52 |
Complete team contract | 3 |
Footnotes
Don’t trust yourself to keep your hands off the keyboard? Put them in your picket or cross your arms. No matter how silly it might feel, resist the urge to touch your keyboard until otherwise instructed!↩︎
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least one meaningful commit from each team member and updating the team name and date in the YAML.↩︎