HW 04: Logistic regression

due Monday, November 21 at 11:59pm

Introduction

In this assignment you will analyze results from multiple studies that utilize logistic regression.

Learning goals

In this assignment, you will…

  • Explore how logistic regression is used in real-world analyses
  • Analyze and interpret results from studies using logistic regression
  • Evaluate performance of logistic regression models

Getting started

The repo for this assignment is available on GitHub at github.com/sta210-fa22 and starts with the prefix hw-04. See Lab 01 for more detailed instructions on getting started.

Packages

No R packages are needed for this assignment.

Exercises

Impacts of tea consumption

The 2016 article “Tea consumption reduces the incidence of neurocognitive disorders: Findings from the Singapore longitudinal aging study” (Feng et al. 2016) examined the association between tea consumption habits and neurocognitive disorders (NCD), such as Alzheimer’s disease, in adults age 55 and older. Portions of the abstract are below:

Objectives

To examine the relationships between tea consumption habits and incident neurocognitive disorders (NCD) and explore potential effect modification by gender and the apolipoprotein E (APOE) genotype.

Participants

957 community-living Chinese elderly who were cognitively intact at baseline.

Measurements

We collected tea consumption information at baseline from 2003 to 2005 and ascertained incident cases of neurocognitive disorders (NCD) from 2006 to 2010. Odds ratio (OR) of association were calculated in logistic regression models that adjusted for potential confounders.

Results

A total of 72 incident NCD cases were identified from the cohort. Tea intake was associated with lower risk of incident NCD, independent of other risk factors. Reduced NCD risk was observed for both green tea (OR=0.43) and black/oolong tea (OR=0.53) and appeared to be influenced by the changing of tea consumption habit at follow-up. Using consistent nontea consumers as the reference, only consistent tea consumers had reduced risk of NCD (OR=0.39). Stratified analyses indicated that tea consumption was associated with reduced risk of NCD among females (OR=0.32) and APOE e4 carriers (OR=0.14) but not males and non APOE e4 carriers.

Exercise 1

The odds ratios reported in the abstract are the adjusted odds ratios, i.e., the odds ratios after adjusting for potential confounders such as age, pre-existing health conditions, diet, and behavioral factors. Interpret the following odds ratios from the abstract. Write the interpretations in the context of the data.

  • OR = 0.39
  • OR = 0.32

Exercise 2

An online article based on the results of Feng et al. states the following:

“And for people who carry a gene that puts them at higher risk for Alzheimer’s disease (the APOE e4 gene), enjoying the beverage is even more important: Daily tea consumption could reduce their risk of cognitive decline by up to 86 percent.”

Is this statement supported by the results of the study? Briefly explain why or why not.

Understanding unemployment

In the 2014 article “The Biggest Predictor of How Long You’ll Be Unemployed Is When You Lose Your Job”, author Ben Casselman analyzes the relationship between numerous factors such as age, race, and education and the odds an adult is unemployed for over a year.

Exercise 3

According to the article, among those unemployed for over a year, 16% are under 25 years old, 62% are 25 to 54 years old, and 22% are 55 and up. Based on this data…

  • What are the odds a randomly selected person who has been unemployed over a year is 55 and up?
  • What are the odds a randomly selected person who has been unemployed over a year is not 25 to 54 years old?

Exercise 4

Casselman fits a logistic regression model using the unemployment rate at the time the person lost their job to predict whether an adult is unemployed for over a year. He states the following from the model:

“A one-point increase in the unemployment rate raises an individual’s odds of becoming long-term unemployed by 35 percent.”

What is the coefficient for unemployment rate in this model? Show how you calculated the answer.

Rearrest risk algorithms

In the paper “Employing Standardized Risk Assessment in Pretrial Release Decisions: Association With Criminal Justice Outcomes and Racial Equity” Marlowe et al. (2020) analyze the risk predictions produced by a black-box algorithm used to determine whether a defendant is considered “high risk” of being rearrested if they are released while awaiting trial. Such algorithms are used by judges in some states to help determine whether or not defendants are released while awaiting trial.

The authors examine the algorithm’s risk predictions and whether a person was rearrested for over 500 defendants released pretrial in a southern state. For each person, the algorithm produced one of the following predictions: “High Risk” or “Low Risk”. The observed outcome was “Rearrested” or “Not Rearrested”. Below are some results from the analysis:

  • Sensitivity: 86%
  • Specificity: 24%
  • Positive predictive power: 57%
  • Negative predictive power: 60%
Tip
  • Positive Predictive Power: P(Y = 1 | Y classified as 1 from the model)
  • Negative Predictive Power: P(Y = 0 | Y classified as 0 from the model)

Exercise 5

Explain what each of the following mean in the context of the analysis:

  • Sensitivity
  • Specificity
  • Positive predictive power
  • Negative predictive power

Exercise 6

What is the false positive rate? What does this value mean in the context of the analysis?

Exercise 7

The AUC for this algorithm is 0.55. Based on this value, do you think this algorithm a good fit for the population examined in the paper? Why or why not?

Identifying spam emails

Suppose you fit a logistic regression to aid in spam classification for individual emails. The output from the logistic regression model is below:

term estimate std.error statistic p.value
(Intercept) -0.81 0.09 -9.34 <0.0001
to_multiple1 -2.64 0.30 -8.68 <0.0001
winneryes 1.63 0.32 5.11 <0.0001
format1 -1.59 0.12 -13.28 <0.0001
re_subj1 -3.05 0.36 -8.40 <0.0001

Exercise 8

Use the model to answer the following:

  • Write down the model using the coefficients from the model fit.

  • Suppose we have an observation where to_muliple = 0, winner = 1 , format = 1, and re_subj = 0. What is the predicted probability that this message is spam?

  • Suppose you are a data scientist working on a spam filter. For a given message, how high must the probability a message is spam be before you think it would be reasonable to put it in a spambox/ junk folder (which the user is unlikely to check)? What are 2 tradeoffs you might consider?

Exercise 8 was adapted from an exercise in Introduction to Modern Statistics

Before submitting, make sure you render your document and commit (with a meaningful commit message) and push all updates.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
  • Click on your STA 210 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
  • Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading (50 points)

Component Points
Ex 1 - 8 47
Workflow & formatting 31

References

Feng, L., M. -S. Chong, W. -S. Lim, Q. Gao, M. S. Z. Nyunt, T. -S. Lee, S. L. Collinson, T. Tsoi, E. -H. Kua, and T. -P. Ng. 2016. “Tea Consumption Reduces the Incidence of Neurocognitive Disorders: Findings from the Singapore Longitudinal Aging Study.” The Journal of Nutrition, Health & Aging 20 (10): 1002–9. https://doi.org/10.1007/s12603-016-0687-0.
Marlowe, Douglas B., Timothy Ho, Shannon M. Carey, and Carly D. Chadick. 2020. “Employing Standardized Risk Assessment in Pretrial Release Decisions: Association with Criminal Justice Outcomes and Racial Equity.” Law and Human Behavior 44 (5): 361–76. https://doi.org/10.1037/lhb0000413.

Footnotes

  1. The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least 3 informative commit messages and updating the name and date in the YAML.↩︎