Lab 04: The Office

Feature engineering

Important

Due:

Thursday, October 13, 11:59pm (Thursday labs)
Friday, October 14, 11:59pm (Friday labs)

Introduction

In today’s lab you will analyze data from the schrute package to predict IMDB scores for episodes of The Office.

This is a different data source from the one we used in class last week.

Learning goals

By the end of the lab you will…

engineer features based on episode scripts
train a model
make predictions
evaluate model performance on training and testing data
practice collaborating with others using a single GitHub repo

Meet your team!

Click here to see the team assignments for STA 210. This will be your team for labs and the final project. Before you get started on the lab, complete the following:

✅ Come up with a team name. You can’t use the same name as another team, so I encourage you to be creative! Your TA will get your team name by the end of lab.

✅ Fill out the team agreement. This will help you figure out a plan for communicating and working together during labs and outside of lab times. You can find the team agreement in the GitHub repo team-agreement-[github_team_name].

Have one person from the team clone the repo and start a new RStudio project. This person will type the team’s responses as you discuss the sections of the agreement. No one else in the team should type at this point but should be contributing to the discussion.
Be sure to push the completed agreement to GitHub. Each team member can refer to the document in this repo or download the PDF of the agreement for future reference. You do not need to submit the agreement on Gradescope.

Getting started

A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta210-fa22 organization on GitHub. Click on the repo with the prefix lab-04. It contains the starter documents you need to complete the lab.
Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
Do not make any changes to the .qmd file until the instructions tell you to do so.

Workflow: Using Git and GitHub as a team

Important

Assign each person on your team a number 1 through 4. For teams of three, Team Member 1 can take on the role of Team Member 4.

The following exercises must be done in order. Only one person should type in the .qmd file, commit, and push updates at a time. When it is not your turn to type, you should still share ideas and contribute to the team’s discussion.

⌨️ Team Member 1: Hands on the keyboard.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!¹

Change the author to your team name and include each team member’s name in the author field of the YAML in the following format: Team Name: Member 1, Member 2, Member 3, Member 4.

Team Member 1: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd and PDF documents, and finally push the changes to GitHub.

Team Members 2, 3, 4: Once Team Member 1 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the updated name in your .qmd file.

Packages

The following packages are used in the lab.

library(tidyverse)
library(tidymodels)
library(schrute) #install.packages("schrute")
library(lubridate)
library(knitr)

Data: The Office

The data for this lab comes from the schrute package, in a data set called theoffice. This data set contains the entire script transcriptions from The Office.

Let’s start by taking a peek at the data.

glimpse(theoffice)

Rows: 55,130
Columns: 12
$ index            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name     <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director         <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer           <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character        <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text             <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating      <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes      <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date         <chr> "2005-03-24", "2005-03-24", "2005-03-24", "2005-03-24…

There are 55130 observations and 12 columns in this data set. The variable names are as follows.

names(theoffice)

 [1] "index"            "season"           "episode"          "episode_name"    
 [5] "director"         "writer"           "character"        "text"            
 [9] "text_w_direction" "imdb_rating"      "total_votes"      "air_date"

Each row in the data set is a line spoken by a character in a given episode of the show. This means some information at the episode level (e.g., imdb_rating, air_date, etc.) is repeated across the rows that belong to a single episode.

The air_date variable is coded as a character string, which is undesirable for the analysis. We’ll want to parse that variable later into its components during feature engineering. So, for now, let’s convert it to a date.

theoffice <- theoffice |>
  mutate(air_date = ymd(as.character(air_date)))
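
As a small illustration of what the conversion does, and of what "parsing a date into its components" can look like, here is a minimal sketch on a made-up character vector rather than the lab data. The names dates_chr and dates are hypothetical and used only for this example.

```r
library(lubridate)

# Hypothetical character dates in year-month-day order (not taken from theoffice)
dates_chr <- c("2005-03-24", "2005-10-11")

# ymd() parses year-month-day strings into Date objects
dates <- ymd(dates_chr)
class(dates)

# Accessor functions pull out individual components of a Date
year(dates)
month(dates)
wday(dates, label = TRUE)
```

Once a variable is stored as a date, components like these can be computed directly instead of being extracted from text.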

Let’s take a look at the data to confirm we’re happy with how each of the variables is encoded.

glimpse(theoffice)

Rows: 55,130
Columns: 12
$ index            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name     <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director         <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer           <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character        <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text             <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating      <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes      <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date         <date> 2005-03-24, 2005-03-24, 2005-03-24, 2005-03-24, 2005…

Exercises

Data prep

⌨️ Team Member 1: Hands still on the keyboard. Write the answers to Exercises 1 and 2.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!

Exercise 1

Identify episodes where Halloween, Valentine’s Day, and Christmas are mentioned.

First, convert all text to lowercase with the function str_to_lower().
Then, create three new variables (halloween_mention, valentine_mention, and christmas_mention) that take the value 1 if the character string "halloween", "valentine", or "christmas" appears in the text, respectively, and 0 otherwise.

Below is code to help you get started.

theoffice <- theoffice |>
  mutate(
    text = ___(text),
    halloween_mention = if_else(str_detect(text, "___"), ___, ___),
    valentine_mention = ___,
    ___ = ___
  )
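
If the pattern in the starter code is unfamiliar, the following sketch shows the same idea on made-up data with a different keyword. The data frame toy_lines and the variable pizza_mention are hypothetical; they are not part of the lab.

```r
library(tidyverse)

# Hypothetical toy data: two made-up lines of dialogue
toy_lines <- tibble(
  text = c("Free PIZZA in the break room!", "Where is my stapler?")
)

toy_lines <- toy_lines |>
  mutate(
    # Lowercase first so that matching does not depend on capitalization
    text = str_to_lower(text),
    # str_detect() returns TRUE/FALSE; if_else() maps that to 1/0
    pizza_mention = if_else(str_detect(text, "pizza"), 1, 0)
  )

toy_lines
```

Note that str_detect() matches substrings, so a pattern such as "valentine" would also match "valentine's".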

Exercise 2

In this exercise we’ll accomplish two separate tasks. We’re doing both tasks all at once, because we’re going to drastically change our data frame, from one row per line spoken to one row per episode. We’ll call the resulting data frame office_episodes.

The two tasks are as follows:

Task 1. Calculate the proportion of lines spoken by Jim, Pam, Michael, and Dwight for each episode of The Office.
Task 2. Identify episodes where the word “halloween”, “valentine”, or “christmas” was ever mentioned, using the variables you created in Exercise 1.

Below are instructions and starter code to get you started with these tasks.

Start by grouping theoffice data by season, episode, episode_name, imdb_rating, total_votes, and air_date. (These variables, except for season, have the same value for each given episode, hence grouping by them allows us to make sure they appear in the output of this pipeline.)
Use summarize() to calculate the desired features at the season-episode level.
Task 1:
- Calculate the number of lines per season per episode. You might name this new variable n_lines.
- Then, calculate the proportion of lines in that episode spoken by each of the four characters Jim, Pam, Michael, and Dwight. Name these new variables lines_jim, lines_pam, lines_michael, and lines_dwight, respectively.
Task 2:
- Create a variable called halloween that sums up the 1s in halloween_mention at the season-episode level and takes on the value "yes" if the sum is greater than or equal to 1, or "no" otherwise.
- Do something similar for new variables valentine and christmas as well based on values from valentine_mention and christmas_mention.
Finish up your summarize() statement by dropping the groups, so the resulting data frame is no longer grouped. Additionally, remove n_lines (we won’t use that variable in our analysis, we only calculated it as an intermediary step).

office_episodes <- theoffice |>
  group_by(___) |>
  summarize(
    n_lines = n(),
    lines_jim = sum(character == "___") / n_lines,
    lines_pam = ___,
    lines_michael = ___,
    lines_dwight = ___,
    halloween = if_else(sum(___) >= 1, "yes", "no"),
    valentine = if_else(___, "___", "___"),
    christmas = if_else(___, "___", "___"),
    .groups = "drop"
  ) |>
  select(-n_lines)

Note

Why summarize() and not mutate()? We use mutate() to add / modify a column of a data frame. The output data frame always has the same number of rows as the input data frame. On the other hand, we use summarize() to reduce the data frame to either a single row (single summary statistic) or one row per each group (summary statistics at the group level).

And what about that .groups argument in summarize()? You could try running your summarize() step without it first. You’ll see that R prints out a message saying “summarize() has grouped output by season, episode. You can override using the .groups argument.” By default, summarize() only drops the last grouping variable. So if you want a data frame that doesn’t have a grouping structure as a result of a summarize(), you can explicitly ask for that with .groups = "drop". Before you proceed, read the documentation for summarize(), and specifically the explanation for the .groups argument, to prepare yourself for future instances where you might see this type of message.
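
To see both points from this note in action, here is a minimal sketch on made-up data. The data frame toy and its columns are hypothetical and much smaller than the lab data.

```r
library(tidyverse)

# Hypothetical toy data: four spoken lines across three season-episode pairs
toy <- tibble(
  season  = c(1, 1, 1, 2),
  episode = c(1, 1, 2, 1),
  speaker = c("A", "B", "A", "B")
)

# mutate() keeps every row: 4 rows in, 4 rows out
toy_mutated <- toy |>
  group_by(season, episode) |>
  mutate(n_lines = n())

# summarize() returns one row per group: 3 rows.
# Without .groups, only the last grouping variable (episode) is dropped
# and a message about the remaining grouping is printed.
toy_default <- toy |>
  group_by(season, episode) |>
  summarize(n_lines = n())

# With .groups = "drop", the result has no grouping structure at all
toy_dropped <- toy |>
  group_by(season, episode) |>
  summarize(n_lines = n(), .groups = "drop")

nrow(toy_mutated)
nrow(toy_default)
group_vars(toy_default)
group_vars(toy_dropped)
```

Comparing group_vars() on the two summarized results is a quick way to check whether a data frame is still grouped.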

Now it’s time for a hand off…

⌨️ Team Member 2: Hands on the keyboard. Write the answers to Exercises 3 - 5.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!

Exercise 3

The Michael Scott character (played by Steve Carell) left the show at the end of Season 7. Add an indicator variable, michael, that takes on the value "yes" if Michael Scott (Steve Carell) was in the show, and "no" if not.

office_episodes <- office_episodes |>
  mutate(michael = if_else(season > ___, "___", "___"))

Exercise 4

Print out the dimensions (dim()) of the new data set you created as well as the names() of the columns in the data set.

Your new data set, office_episodes, should have 186 rows and 14 columns. The column names should be season, episode, episode_name, imdb_rating, total_votes, air_date, lines_jim, lines_pam, lines_michael, lines_dwight, halloween, valentine, christmas, and michael. If you are not matching these numbers or columns, go back and try to figure out where you went wrong. Or ask your TA for help!

Exploratory data analysis

This would be a good place to conduct some exploratory data analysis (EDA). For example, plot the proportion of lines spoken by each character over time. Or calculate the percentage of episodes that mention Halloween, or Valentine’s Day, or Christmas. Given we have limited time in the lab we’re not going to ask you to report EDA results as part of this lab, but we’re noting this here to provide suggestions for how you might go about structuring your project.

Modeling prep

Exercise 5

Split the data into training (75%) and testing (25%). Save the training and testing data as office_train and office_test respectively. Use seed 123.

Naming suggestion: Call the initial split office_split, the training data office_train, and testing data office_test.

set.seed(123)
office_split <- ___(office_episodes)
office_train <- ___(office_split)
office_test <- ___(___)
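
For reference, the general shape of a train/test split with rsample (loaded as part of tidymodels) is sketched below on the built-in mtcars data. The object names cars_split, cars_train, and cars_test, and the seed 42, are choices made for this illustration only.

```r
library(tidymodels)

set.seed(42)  # arbitrary seed for this illustration, not the lab's seed

# initial_split() records which rows go to training and which to testing;
# prop sets the training share (0.75 is also the default)
cars_split <- initial_split(mtcars, prop = 0.75)

# training() and testing() extract the two data frames from the split
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

nrow(cars_train)
nrow(cars_test)
```

Setting the seed immediately before the split is what makes the random assignment of rows reproducible across team members.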

Team Member 2: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

Team Members 1, 3, 4: Once Team Member 2 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the responses to Exercises 3 - 5 in your .qmd file.

Now it’s time for another hand off…

⌨️ Team Member 3: Hands on the keyboard. Write the answers to Exercises 6 - 8.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!

Exercise 6

Specify a linear regression model with engine "lm" and call it office_spec.

Naming suggestion: Call the model specification office_spec.

office_spec <- ___

Exercise 7

Create a recipe that performs feature engineering using the following steps (in the given order):

update_role(): updates the role of episode_name to not be a predictor (be an ID)
step_rm(): removes air_date and season as predictors
step_dummy(): creates dummy variables for all_nominal_predictors()
step_zv(): removes all zero variance predictors

Naming suggestion: Call the recipe office_rec.

office_rec <- recipe(imdb_rating ~ ., data = office_train) |>
  ___
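
To get a feel for what each of these steps does, the sketch below applies the same four functions to a small made-up data frame and then inspects the processed columns. The data frame toy, its columns, and the recipe name toy_rec are all hypothetical.

```r
library(tidymodels)

# Hypothetical toy data, not the lab data
toy <- tibble(
  y      = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.4),
  id     = c("a", "b", "c", "d", "e", "f"),
  x      = c(1, 2, 3, 4, 5, 6),
  unused = c(10, 20, 30, 40, 50, 60),
  group  = factor(c("u", "v", "u", "v", "u", "v")),
  const  = 1  # same value in every row, so it has zero variance
)

toy_rec <- recipe(y ~ ., data = toy) |>
  update_role(id, new_role = "ID") |>      # keep id, but not as a predictor
  step_rm(unused) |>                        # drop a column entirely
  step_dummy(all_nominal_predictors()) |>   # factor -> 0/1 indicator columns
  step_zv(all_predictors())                 # drop predictors with a single value

# prep() estimates the steps; bake(new_data = NULL) returns the processed data
toy_baked <- toy_rec |>
  prep() |>
  bake(new_data = NULL)

names(toy_baked)
```

Inspecting the baked data like this is a useful sanity check before the recipe goes into a workflow.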

Exercise 8

Build a model workflow for fitting the model specified earlier and using the recipe you developed to preprocess the data.

Naming suggestion: Call the model workflow office_wflow.

office_wflow <- workflow() |>
  add_model(___) |>
  add_recipe(___)

Model fit and evaluation

Team Member 3: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

Team Members 1, 2, 4: Once Team Member 3 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the responses to Exercises 6 - 8 in your .qmd file.

Now it’s time for another hand off…

⌨️ Team Member 4: Hands on the keyboard. Write the answers to Exercises 9 - 11.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!

Exercise 9

Fit the model to training data, neatly display the model output, and interpret two of the slope coefficients.

Naming suggestion: Call the model fit office_fit.

office_fit <- office_wflow |>
  fit(data = ___)

___

Exercise 10

Calculate predicted imdb_rating for the training data using the predict() function. Then, bind the columns from the training data to this result.

Using this data frame, create a scatterplot of predicted and observed IMDB ratings for the training data.

Naming suggestion: Call the resulting data frame office_train_pred.

Stretch goal. Add episode names, using geom_text(), for episodes with much higher and much lower observed IMDB ratings compared to others.
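
The predict-then-bind pattern described in this exercise is sketched below on the built-in mtcars data, with a simple formula in place of a recipe. The names cars_fit and cars_pred and the choice of predictors are assumptions made for this illustration.

```r
library(tidymodels)

# Fit a simple linear regression workflow on a built-in data set
cars_fit <- workflow() |>
  add_model(linear_reg() |> set_engine("lm")) |>
  add_formula(mpg ~ wt + hp) |>
  fit(data = mtcars)

# predict() returns a tibble with a single column, .pred,
# with one row per row of new_data; bind_cols() attaches the original columns
cars_pred <- predict(cars_fit, new_data = mtcars) |>
  bind_cols(mtcars)

# Observed versus predicted values; points near the dashed line are predicted well
ggplot(cars_pred, aes(x = .pred, y = mpg)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Predicted mpg", y = "Observed mpg")
```

Because predict() preserves row order, binding columns side by side lines each prediction up with its observed value.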

Exercise 11

Calculate the R-squared and RMSE for this model for predictions on the training data.
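
As a reminder of how these two metrics are computed, here is a minimal sketch using yardstick (loaded as part of tidymodels) on four made-up observed and predicted values. The data frame toy_pred and its column names are hypothetical.

```r
library(tidymodels)

# Hypothetical observed and predicted values, not from the lab data
toy_pred <- tibble(
  observed  = c(7.6, 8.3, 8.0, 9.1),
  predicted = c(7.9, 8.1, 8.2, 8.8)
)

# Each function returns a one-row tibble; the value is in the .estimate column
toy_rsq  <- rsq(toy_pred, truth = observed, estimate = predicted)
toy_rmse <- rmse(toy_pred, truth = observed, estimate = predicted)

toy_rsq
toy_rmse

# The same quantities by hand: rsq() is the squared correlation between
# observed and predicted, and RMSE is the root of the mean squared error
cor(toy_pred$observed, toy_pred$predicted)^2
sqrt(mean((toy_pred$observed - toy_pred$predicted)^2))
```

RMSE is in the units of the response, while R-squared is unitless, which is why the two are usually reported together.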

Team Member 4: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

Team Members 1, 2, 3: Once Team Member 4 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the responses to Exercises 9 - 11 in your .qmd file.

Now it’s time for another hand off…

⌨️ Team Member 2: Hands on the keyboard. Write the answers to Exercises 12 - 14.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!

Exercise 12

Repeat Exercise 10, but with testing data.

Naming suggestion: Call the resulting data frame office_test_pred.

Exercise 13

Based on your visualization in Exercise 12, do you expect the R-squared and RMSE for this model to be higher, lower, or about the same for predictions on the testing data compared to those on the training data? Explain your reasoning.

Exercise 14

Check your intuition in Exercise 13 by actually calculating the R-squared and RMSE for this model for predictions on the testing data. Comment on whether your intuition is confirmed or not.

Wrapping up

⌨️ Team Member 3: Hands on the keyboard. Make any edits as needed.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

One team member should submit the assignment:

Go to http://www.gradescope.com and log in using your NetID credentials.
Click on your STA 210 course.
Click on the assignment, and you’ll be prompted to submit it.
Select all team members’ names, so they receive credit on the assignment. Click here for a video on adding team members to an assignment on Gradescope.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading

Total points available: 50 points.

Component	Points
Ex 1 - 14	42
Workflow & formatting	5²
Complete team contract	3

Footnotes

Don’t trust yourself to keep your hands off the keyboard? Put them in your pocket or cross your arms. No matter how silly it might feel, resist the urge to touch your keyboard until otherwise instructed!↩︎
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least one meaningful commit from each team member and updating the team name and date in the YAML.↩︎