HW 02: Multiple linear regression

due Wednesday, October 19 at 11:59pm

Introduction

In this analysis you will use multiple linear regression to analyze relationships between variables in three different scenarios.

Learning goals

In this assignment, you will…

  • Fit and interpret multiple linear regression models
  • Evaluate and compare multiple linear regression models
  • Continue developing a workflow for reproducible data analysis.

Getting started

The repo for this assignment is available on GitHub at github.com/sta210-fa22 and starts with the prefix hw-02. See Lab 01 for more detailed instructions on getting started.

Packages

The following packages will be used in this assignment:

library(tidyverse)
library(tidymodels) 
library(knitr) 
library(palmerpenguins)
Important

All narrative should be written in complete sentences, and all visualizations should have informative titles and axis labels.

Part 1: Palmer Penguins

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. (Gorman, Williams, and Fraser 2014)


These data can be found in the palmerpenguins package. We’re going to be working with the penguins dataset from this package. The dataset contains data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica. Click here to see the codebook.

Exercise 1

Our first goal is to fit a model predicting body mass (which is more difficult to measure) from bill length, bill depth, flipper length, species, and sex.

We will start by preparing the data.

  • Use the drop_na() function to remove any observations from the penguins data frame that has missing values. Your resulting data frame should have 333 observations.

  • Split the data into training (75%) and testing (25%) sets. Use a seed of 123.

Exercise 2

  • Use the training data to fit a model predicting body mass (which is more difficult to measure) from the other variables listed above. Only include main effects, i.e., no interaction terms, in this model. Fit the model in a way such that the intercept has a meaningful interpretation.

    Neatly display the model using 3 digits.

  • Write estimated regression equation. Use the variable names in your equation.

Exercise 3

  • Interpret each slope coefficient in the context of the data.

  • Interpret the intercept in the context of the data.

Exercise 4

  • Calculate the residual for a male Adelie penguin that weighs 3750 grams with the following body measurements: bill_length_mm = 39.1, bill_depth_mm = 18.7, flipper_length_mm = 181. Does the model overpredict or underpredict this penguin’s weight?

  • Calculate \(R^2\) of this model based on the training data and interpret this value in context of the data and the model.

Exercise 5

Next, we will focus on a model using bill length and species to predict body mass.

Use the training data to make a visualization of the relationship between bill length and body mass by species. Does the visualization give evidence of a potential interaction term? Briefly explain your response.

Exercise 6

Use the training data to fit a model using bill length, species, and the interaction between the two variables to predict body mass. Fit the model in a way such that the intercept has a meaningful interpretation.

Neatly display the model using 3 digits.

Tip

You can use the step_interact() to add interactions in the recipe.

Exercise 7

Use the test data to compare the model fit in Exercise 2 to the model fit in Exercise 6. Which model is the “best” for predicting body mass? Briefly explain your response showing the code and output to support your choice.

Part 2: Perceived threat of Covid-19

Garbe, Rau, and Toppe (2020) published in June 2020, aims to examine the relationship between personality traits, perceived threat of Covid-19 and stockpiling toilet paper. For this study titled Influence of perceived threat of Covid-19 and HEXACO personality traits on toilet paper stockpiling, researchers conducted an online survey March 23 - 29, 2020 and used the results to fit multiple linear regression models to draw conclusions about their research questions. From their survey, they collected data on adults across 35 countries. Given the small number of responses from people outside of the United States, Canada, and Europe, only responses from people in these three locations were included in the regression analysis.

Let’s consider their results for the model looking at the effect on perceived threat of Covid-19. The model can be found on page 6 of the paper. The perceived threat of Covid was quantified using the responses to the following survey question:

How threatened do you feel by Coronavirus? [Users select on a 10-point visual analogue scale (Not at all threatened to Extremely Threatened)]

As stated on page 5 of the paper “To ease interpretation, continuous variables were z-standardized and categorical variables were dummy-coded in all models.”

You are familiar with dummy-coding, i.e., creating indicator variables for categorical predictors. For the continuous variables, “z-standardized” means each continuous predictor was shifted by its mean and rescaled by its standard deviation. For example, if \(X\) is continuous predictor, the z-standardized value of \(X\) is

\[z_X = \frac{x - \bar{x}}{s_X}\]

Exercise 8

  • Interpret the coefficient of Age (0.072) in the context of the analysis.

  • Interpret the coefficient of Place of residence in the context of the analysis.

Exercise 9

The model includes an interaction between Place of residence and Emotionality (capturing differential tendencies in to worry and be anxious).

  • What does the coefficient for the interaction (0.101) mean in the context of the data?

  • Interpret the estimated effect of Emotionality for a person who lives in the US/Canada.

  • Interpret the estimated effect of Emotionality for a person who lives in Europe.

Part 3: World Bank

Exercise 10

Data on countries’ Gross Domestic Product (GDP) and percentage of urban population was collected and made available by The World Bank in 2020. A description of the variables as defined by The World Bank are provided below.

  • GDP: “GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars.”

  • Urban Population (% of total): “Urban population refers to people living in urban areas as defined by national statistical offices. It is calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects.”

The linear model of the relationship between GDP and urban population is as follows

\[ \widehat{\log(GDP)} = 6.11 + 0.042 \times urban \]

  • Interpret the slope in terms of the GDP in the context of the data.

  • Interpret the intercept in terms of the GDP in the context of the data.

Before submitting, make sure you render your document and commit (with a meaningful commit message) and push all updates.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
  • Click on your STA 210 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
  • Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading (50 points)

Component Points
Ex 1 - 10 47
Workflow & formatting 31

References

Garbe, Lisa, Richard Rau, and Theo Toppe. 2020. “Influence of Perceived Threat of Covid-19 and HEXACO Personality Traits on Toilet Paper Stockpiling.” Edited by Valerio Capraro. PLOS ONE 15 (6): e0234232. https://doi.org/10.1371/journal.pone.0234232.
Gorman, Kristen B., Tony D. Williams, and William R. Fraser. 2014. “Ecological Sexual Dimorphism and Environmental Variability Within a Community of Antarctic Penguins (Genus Pygoscelis).” Edited by André Chiaradia. PLoS ONE 9 (3): e90081. https://doi.org/10.1371/journal.pone.0090081.

Footnotes

  1. The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least 3 informative commit messages and updating the name and date in the YAML.↩︎