Lab 07: General Social Survey

Logistic regression

Important

Due:

  • Monday, November 14 , 11:59pm (Thursday labs)
  • Tuesday, November 15, 11:59pm (Friday labs)

Introduction

In this assignment, you’ll analyze data from the 2016 General Social Survey using logistic regression for interpretation and prediction.

Learning goals

By the end of the lab you will be able to…

  • Use logistic regression to explore the relationship between a binary response variable and multiple predictor variables

  • Conduct exploratory data analysis for logistic regression

  • Interpret coefficients of logistic regression model

  • Use the logistic regression model for prediction and classification

Getting started

  • A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.

  • Go to the sta210-fa22 organization on GitHub. Click on the repo with the prefix lab-07. It contains the starter documents you need to complete the lab.

  • Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.

Workflow: Using Git and GitHub as a team

Important

There are no Team Member markers in this lab; however, you should use a similar workflow as in Lab 04. Only one person should type in the group’s Qmd file at a time. Once that person has finished typing the group’s responses, they should render, commit, and push the changes to GitHub. All other teammates can pull to see the updates in RStudio.

Every teammate must have at least one commit in the lab. Everyone is expected to contribute to discussion even when they are not typing.

Packages

The following packages are used in the lab.

library(tidyverse)
library(tidymodels)
library(knitr)

# add other packages as needed

Data: General Social Survey

The General Social Survey (GSS) has been used to measure trends in attitudes and behaviors in American society since 1972. In addition to collecting demographic information, the survey includes questions used to gauge attitudes about government spending priorities, confidence in institutions, lifestyle, and many other topics. A full description of the survey may be found here.

The data for this lab are from the 2016 General Social Survey. The original data set contains 2867 observations and 935 variables. We will use and abbreviated data set that includes the following variables:

  • natmass: Respondent’s answer to the following prompt:

    “We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or about the right amount…are we spending too much, too little, or about the right amount on mass transportation?”

  • age: Age in years.

  • sex: Sex recorded as male or female

  • sei10: Socioeconomic index from 0 to 100

  • region: Region where interview took place

  • polviews: Respondent’s answer to the following prompt:

    “We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal - point 1 - to extremely conservative - point 7. Where would you place yourself on this scale?”

The data are in gss2016.csv in the data folder.

Exercises

The goal of today’s lab is to use the GSS to examine the relationship between US adults’ political views and attitudes towards government spending on mass transportation projects.

Part I: Exploratory data analysis

Exercise 1

Let’s begin by making a binary variable for respondents’ views on spending on mass transportation. Create a new variable that is equal to “1” if a respondent said spending on mass transportation is about right and “0” otherwise. Then make a plot of the new variable, using informative labels for each category.

Exercise 2

Recode polviews so it is a factor with levels that are in an order that is consistent with question on the survey.

Tip

Note how the categories are spelled in the data.

  • Make a plot of the distribution of polviews.
  • Which political view occurs most frequently in this data set?

Exercise 3

Make a plot displaying the relationship between satisfaction with mass transportation spending and political views. Use the plot to describe the relationship the two variables.

Exercise 4

We’d like to use age as a quantitative variable in your model; however, it is currently a character data type because some observations are coded as "89 or older".

  • Recode age so that is a numeric variable. Note: Before making the variable numeric, you will need to replace the values "89 or older" with a single value.
  • Then plot the distribution of age.

Part II: Logistic regression model

Exercise 5

Briefly explain why we should use a logistic regression model to predict the odds a randomly selected person is satisfied with spending on mass transportation.

Exercise 6

Split the data into training (75%) and testing sets (25%). Use a seed of 6.

Exercise 7

Let’s start by fitting a model using the demographic factors - age, sex, sei10, and region - to predict the odds a person is satisfied with spending on mass transportation.

Use the training data to fit the model described above. Use a recipe to make any necessary adjustments to the variables so the intercept will have a meaningful interpretation. Then, neatly display the model using 3 digits.

Exercise 8

  • Interpret the intercept in terms of the odds in the context of the data.

  • Consider the relationship between age and one’s opinion about spending on mass transportation. Interpret the coefficient of age in terms of the odds of being satisfied with spending on mass transportation.

Exercise 9

Now let’s see whether a person’s political views has a significant impact on their odds of being satisfied with spending on mass transportation, after accounting for the demographic factors. Use the training data to fit a model using all the variables from Exercise 7 along with polviews. Neatly display the model using 3 digits.

Exercise 10

Use the testing data to produce the ROC curve and calculate the area under curve (AUC) for the model fit in Exercise 7 and the one fit in Exercise 9. Which model is a better fit for the data? Briefly explain your choice.

Exercise 11

You have been tasked by a local political organization to identify adults who are satisfied with current spending on mass transportation. These adults will receive targeted political mailings that differ from adults who are not currently satisfied with spending on mass transportation. You will use the model selected in the previous exercise to predict which adults are satisfied with spending on mass transportation.

What cutoff probability will you use to classify observations in “satisfied with mass transportation spending” versus “not satisfied”? Briefly explain how you determined that cutoff probability.

Exercise 12

Make a confusion matrix using the cutoff probability from Exercise 11. Use the confusion matrix to calculate the following:

  • Sensitivity
  • Specificity
  • False negative rate
  • False positive rate

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

One team member submit the assignment:

  • Go to http://www.gradescope.com and log in using your NetID credentials.
  • Click on your STA 210 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Select all team members’ names, so they receive credit on the assignment. Click here for video on adding team members to assignment on Gradescope.
  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
  • Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading

Total points available: 50 points.

Component Points
Ex 1 - 12 45
Workflow & formatting 51

Footnotes

  1. The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least one meaningful commit from each team member and updating the team name and date in the YAML.↩︎