library(tidyverse)
library(tidymodels)
library(palmerpenguins)
library(knitr)
# add other packages as needed
Lab 06: Adelie Penguins
Logistic regression intro
Introduction
In this assignment, you’ll get to put into practice the logistic regression skills you’ve developed to analyze data about Palmer Penguins.
Learning goals
By the end of the lab you will be able to:
- conduct exploratory data analysis for logistic regression
- fit logistic regression models and write the regression equation
- use the model to calculate predicted probabilities
- continue developing a collaborative workflow with your teammates
Getting started
A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta210-fa22 organization on GitHub. Click on the repo with the prefix lab-06. It contains the starter documents you need to complete the lab.
Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
Workflow: Using Git and GitHub as a team
Packages
The following packages are used in the lab.
Data: Palmer Penguins
We will go back to the Palmer penguins data used in HW 02.
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. (Gorman, Williams, and Fraser 2014)
These data can be found in the palmerpenguins package. We’re going to be working with the penguins dataset from this package. The dataset contains data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica. Click here to see the codebook.
We will focus on the following variables:
variable | class | description |
---|---|---|
species | factor | Penguin species (Adelie, Gentoo, Chinstrap) |
island | factor | Island where recorded (Biscoe, Dream, Torgersen) |
bill_depth_mm | double | Bill depth in mm |
Exercises
The goal of this analysis is to use logistic regression to understand the relationship between bill depth, island, and whether a penguin is from the Adelie species. First, we need to create a new response variable to identify whether a penguin is from the Adelie species.
penguins <- penguins |>
  mutate(adelie = factor(if_else(species == "Adelie", 1, 0)))
And let’s check to make sure the new variable looks how we would expect before we continue with the analysis.
penguins |>
  count(adelie, species)
# A tibble: 3 × 3
adelie species n
<fct> <fct> <int>
1 0 Chinstrap 68
2 0 Gentoo 124
3 1 Adelie 152
Exercise 1
Let’s start by examining the relationship between adelie and island.
Visualize the relationship between adelie and island. What is something you observe about the relationship between these two variables based on the plot?
Exercise 2
What does the values_fill argument do in the following chunk? The documentation for the function will be helpful in answering this question.
penguins |>
  count(island, adelie) |>
  pivot_wider(names_from = adelie, values_from = n, values_fill = 0)
# A tibble: 3 × 3
island `0` `1`
<fct> <int> <int>
1 Biscoe 124 44
2 Dream 68 56
3 Torgersen 0 52
Exercise 3
Calculate the probability a randomly selected penguin is from the Adelie species if it was recorded on Biscoe island.
Calculate the odds a randomly selected penguin is from the Adelie species if it was recorded on Biscoe island.
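As a reminder of how probability and odds relate, here is a minimal sketch in R using made-up counts. The numbers and the names n_event, n_other, p_hat, and odds_hat are hypothetical and are not taken from the penguins data.

```r
# Hypothetical counts for illustration only (not from the penguins data)
n_event <- 30   # observations where the event occurred
n_other <- 70   # observations where it did not

# Probability: events out of all observations
p_hat <- n_event / (n_event + n_other)

# Odds: probability of the event relative to probability of no event
odds_hat <- p_hat / (1 - p_hat)

p_hat
odds_hat
```

The odds can equivalently be computed directly as n_event / n_other, which is a quick way to check the result.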
Exercise 4
You want to fit a model using island to predict the odds of being from the Adelie species. Let \(\pi\) be the probability a penguin is from the Adelie species. The model has the form shown below.
\[ \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1~ Dream + \beta_2 ~ Torgersen \]
Fit the model and neatly display the model output using three digits.
What are the predicted odds of a penguin being from the Adelie species if it was recorded on Biscoe island?
What are the predicted odds of a penguin being from the Adelie species if it was recorded on Dream island?
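The general pattern for a model like this can be sketched as follows. To keep the illustration separate from the lab data, this example uses R's built-in mtcars data; the object names cars, cars_fit, coefs, and baseline_odds are hypothetical. It assumes the glm engine, which models the log-odds of the second level of the response factor.

```r
library(tidymodels)
library(knitr)

# Illustrative data only: built-in mtcars, not the penguins data
cars <- mtcars |>
  mutate(
    manual = factor(am),  # binary response as a factor with levels "0" and "1"
    cyl = factor(cyl)     # categorical predictor; "4" is the baseline level
  )

# Logistic regression with one categorical predictor
cars_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(manual ~ cyl, data = cars)

# Coefficients are reported on the log-odds scale
coefs <- tidy(cars_fit)
coefs |>
  kable(digits = 3)

# Predicted odds for the baseline level: exponentiate the intercept
baseline_odds <- exp(coefs$estimate[coefs$term == "(Intercept)"])
baseline_odds
```

For a non-baseline level, add that level's coefficient to the intercept before exponentiating.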
Exercise 5
Next, we’d like to add bill depth to the model. We’ll start by examining the relationship between these two variables.
Visualize the relationship between bill_depth_mm and adelie. What is something you observe about the relationship between these two variables based on the plot?
Exercise 6
Add bill depth to the previous model so that there are two predictors, island and bill_depth_mm. Neatly display the model output using three digits.
Write the estimated regression equation.
Exercise 7
Use the model from Exercise 6.
How do you expect the log-odds of being from the Adelie species to change when going from a penguin with bill depth 17 mm to a penguin with bill depth 20 mm? Assume both penguins were recorded on Dream island.
How do you expect the odds of being from the Adelie species to change when going from a penguin with bill depth 17 mm to a penguin with bill depth 20 mm? Assume both penguins were recorded on Dream island.
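One way to organize this kind of comparison is sketched below. It assumes the Exercise 6 model extends the equation from Exercise 4 with one additional term for bill_depth_mm; the symbol \(\beta_3\) for that coefficient, and the labels \(a\) and \(b\) for the two penguins, are introduced here only for illustration. For two penguins on the same island with bill depths \(x_a\) and \(x_b\), the intercept and island terms cancel when the log-odds are subtracted:

\[ \log\Big(\frac{\pi_b}{1-\pi_b}\Big) - \log\Big(\frac{\pi_a}{1-\pi_a}\Big) = \beta_3 (x_b - x_a) \]

Exponentiating both sides gives the corresponding statement for the odds:

\[ \frac{\pi_b / (1-\pi_b)}{\pi_a / (1-\pi_a)} = e^{\beta_3 (x_b - x_a)} \]

So a change in log-odds is additive, while the matching change in odds is multiplicative.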
Exercise 8
Use the model from Exercise 6.
How do you expect the log-odds of being from the Adelie species to change when going from a penguin with bill depth 18 mm recorded on Biscoe island to a penguin with bill depth 21 mm recorded on Dream island?
How do you expect the odds of being from the Adelie species to change when going from a penguin with bill depth 18 mm recorded on Biscoe island to a penguin with bill depth 21 mm recorded on Dream island?
Submission
One team member should submit the assignment:
- Go to http://www.gradescope.com and log in using your NetID credentials.
- Click on your STA 210 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Select all team members’ names, so they receive credit on the assignment. Click here for a video on adding team members to an assignment on Gradescope.
- Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.
Grading
Total points available: 50 points.
Component | Points |
---|---|
Ex 1 - 8 | 45 |
Workflow & formatting | 5¹ |
References
Footnotes
¹ The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least one meaningful commit from each team member and updating the team name and date in the YAML.