Lab 02: Ice duration and air temperature in Madison, WI
Inference for simple linear regression
Introduction
In today’s lab, you’ll use simple linear regression to analyze the relationship between air temperature and ice duration for two lakes in Wisconsin.
Learning goals
By the end of the lab you will…
- Be able to fit a simple linear regression model using R.
- Be able to interpret the slope and intercept for the model.
- Be able to use simulation-based inference to draw conclusions about the slope.
- Continue developing a workflow for reproducible data analysis.
Getting started
Go to the sta210-fa22 organization on GitHub. Click on the repo with the prefix lab-02-. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 01 instructions for details on cloning a repo, starting a new project in R, and configuring git.
Packages
The following packages are used in the lab:
library(tidyverse)
library(tidymodels)
library(knitr)
Data: Ice cover and air temperature
The datasets wi-icecover.csv and wi-air-temperature.csv contain information about ice cover and air temperature, respectively, at Lake Monona and Lake Mendota for days in 1885 through 2019. The data were obtained from the ntl_icecover and ntl_airtemp data frames in the lterdatasampler R package. The data were originally collected as part of the US Long Term Ecological Research (LTER) Network.
The primary variables for this analysis are:
- year: a number denoting the year of observation
- lakeid: a factor denoting the lake name
- ice_duration: a number denoting the number of days between the freeze and breakup dates of each lake
- ave_air_temp_adjusted: a number denoting the air temperature in degrees Celsius, collected in Madison, WI and adjusted for biases
icecover <- read_csv("data/wi-icecover.csv")
airtemp <- read_csv("data/wi-air-temperature.csv")
Exercises
Goal: The goal of this analysis is to use linear regression to understand the association between average air temperature and ice duration for two lakes in Madison, Wisconsin that freeze for a portion of the year. Because ice cover is impacted by various environmental factors, researchers are interested in examining the association between these two factors to understand how quickly the climate is changing.
Write all code and narrative in your Quarto file. Write all narrative in complete sentences. Throughout the assignment, you should periodically render your Quarto document to produce the updated PDF, commit the changes in the Git pane, and push the updated files to GitHub.
Exercise 1
Fill in the code below to create a new data frame, icecover_avg, of the average ice duration by lake and year. How many observations (rows) are in icecover_avg? How many variables (columns)?
icecover_avg <- icecover |>
  group_by(____, _____) |>
  summarise(avg_ice_duration = mean(_____))
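The same group-and-summarise pattern can be sketched on a small made-up tibble. The `toy` data frame and its columns `site`, `yr`, and `days` are hypothetical stand-ins, not the lab data, so the blanks above still need to be filled in from the variable list.

```r
library(tidyverse)

# Hypothetical toy data: two sites, two years, two measurements per site-year
toy <- tibble(
  site = c("A", "A", "A", "A", "B", "B", "B", "B"),
  yr   = c(2001, 2001, 2002, 2002, 2001, 2001, 2002, 2002),
  days = c(100, 110, 90, 95, 120, 130, 105, 115)
)

# One row per site-year combination, holding the mean of `days`
toy_avg <- toy |>
  group_by(site, yr) |>
  summarise(avg_days = mean(days), .groups = "drop")

# Number of rows, then number of columns
dim(toy_avg)
```

Setting `.groups = "drop"` returns an ungrouped tibble; without it, `summarise()` keeps the result grouped by the first grouping variable and prints a message saying so.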
Exercise 2
Create a new data frame, airtemp_avg, of the average air temperature by year. Then, join icecover_avg and airtemp_avg to make a new data frame, ice_air_joined. Remove the years that have missing data for the average annual air temperature or the average annual ice duration.
The data frame ice_air_joined has 268 observations.
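One way to join two summaries and drop incomplete rows is sketched below on made-up data. The tibbles `durations` and `temps` and their columns are hypothetical, and `left_join()` is only one of several joins that could fit here.

```r
library(tidyverse)

# Hypothetical toy data, not the lab data
durations <- tibble(
  site     = c("A", "A", "B", "B"),
  yr       = c(2001, 2002, 2001, 2002),
  avg_days = c(105, NA, 125, 110)
)
temps <- tibble(
  yr       = c(2001, 2002, 2003),
  avg_temp = c(7.5, 8.1, NA)
)

# Attach each year's temperature to every site-year row,
# then drop rows where either average is missing
joined <- durations |>
  left_join(temps, by = "yr") |>
  drop_na(avg_days, avg_temp)

nrow(joined)
```

Checking `nrow()` after the join is a quick way to confirm that the result has the number of observations you expect.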
Exercise 3
Create a histogram to examine the distribution of avg_ice_duration and calculate summary statistics for the center and spread of the distribution. Include informative axis labels and an informative title on the visualization.
Use the visualization and summary statistics to describe the distribution of avg_ice_duration. Include the shape, center, spread, and potential presence of outliers in the description.
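A minimal sketch of a labeled histogram with accompanying summary statistics, using simulated values rather than the lab data. The tibble `toy`, its column `x`, the bin width, and the label text are all assumptions for illustration.

```r
library(tidyverse)

set.seed(1)

# Hypothetical simulated measurements, not the lab data
toy <- tibble(x = rnorm(200, mean = 100, sd = 15))

# Histogram with axis labels and a title
ggplot(toy, aes(x = x)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(
    x = "Simulated measurement (units)",
    y = "Count",
    title = "Distribution of a simulated measurement"
  )

# Center (mean, median) and spread (standard deviation, IQR)
toy_summary <- toy |>
  summarise(
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    iqr    = IQR(x)
  )
toy_summary
```

Which pair of statistics to report (mean and standard deviation, or median and IQR) usually depends on the shape of the distribution, so it can help to look at the histogram first.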
Exercise 4
Create a histogram to examine the distribution of avg_air_temp and calculate summary statistics for the center and spread of the distribution. Include informative axis labels and an informative title on the visualization.
Use the visualization and summary statistics to describe the distribution of avg_air_temp. Include the shape, center, spread, and potential presence of outliers in the description.
This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.
Exercise 5
Fill in the code below to create a visualization of avg_ice_duration versus year by lakeid. Write two observations from the visualization.
ggplot(data = ice_air_joined, aes(x = _____, y = ______, color = _____)) +
  geom_line(alpha = 0.7) +
  labs(x = "_______",
       y = "_______",
       color = "______",
       title = "_______",
       subtitle = "________")
Exercise 6
Create a visualization of avg_air_temp versus year. Include informative axis labels and an informative title on the visualization.
Use the visualization to write two observations about the trend of average air temperature over time.
Based on the visualizations of average ice duration and average air temperature over time, would you expect the linear model describing the association between average ice duration and average air temperature to have a positive or negative slope? Briefly explain.
Exercise 7
Researchers would like to use a linear model to understand variability in the average ice duration based on the average air temperature. Write the form of the statistical model the researchers would like to estimate. Use mathematical notation and variable names (avg_air_temp and avg_ice_duration) in the equation. [1]
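As a notation reminder, the generic form of a simple linear regression model is sketched below. The symbols \(Y\) (response), \(X\) (predictor), and \(\epsilon\) (error term) are placeholders chosen for this sketch; your answer should use the lab's variable names instead.

```latex
Y = \beta_0 + \beta_1 X + \epsilon
```

Here \(\beta_0\) is the intercept and \(\beta_1\) is the slope. A fitted model replaces these with estimates, usually written \(\hat{\beta}_0\) and \(\hat{\beta}_1\), and has no error term.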
This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.
Exercise 8
Fit and display the model described in the previous exercise. Use the kable function to neatly display the model output using three decimal places.
Write the equation of the fitted model. Use mathematical notation and variable names (avg_air_temp and avg_ice_duration) in the equation.
Interpret the slope in the context of the data.
Does the intercept have a meaningful interpretation in this context? If so, interpret the intercept in the context of the data. Otherwise, explain why not.
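A sketch of fitting and displaying a simple linear regression with tidymodels, assuming the built-in `mtcars` data as a stand-in for the lab data. The object names `toy_fit` and `toy_coefs` are hypothetical.

```r
library(tidyverse)
library(tidymodels)
library(knitr)

# Fit mpg as a linear function of wt; mtcars stands in for the lab data
toy_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt, data = mtcars)

# tidy() returns one row per coefficient (intercept and slope)
toy_coefs <- tidy(toy_fit)

# kable() formats the table; digits = 3 rounds to three decimal places
kable(toy_coefs, digits = 3)
```

The `estimate` column holds the fitted intercept and slope, which are the values that go into the equation of the fitted model.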
Exercise 9
“Do the data provide sufficient evidence of a linear relationship between average air temperature and average ice duration, i.e., is \(\beta_1\) different from 0?”
Use a simulation-based hypothesis test to answer this question. In your response:
- Write the null and alternative hypotheses in words and using mathematical notation.
- Test these hypotheses using a permutation test.
  - First set a seed for simulating reproducibly. Use the seed 9.
  - Use the permutation pipeline with 1,000 permuted samples: simulate new samples from the original sample via permutation under the assumption that the null hypothesis is true, fit a model to each sample, and estimate the slope. Save the results as perm_fits.
  - Create a histogram of the slope estimates in perm_fits.
  - Approximate the p-value of the hypothesis test based on this distribution.
- State your conclusion for the test in the context of the data.
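The permutation steps above can be sketched with the infer package, which tidymodels loads. This sketch assumes the built-in `mtcars` data as a stand-in for the lab data; the seed and the names `observed_fit`, `toy_perm_fits`, and `toy_p` are arbitrary choices for illustration.

```r
library(tidyverse)
library(tidymodels)

# Observed intercept and slope on the stand-in data
observed_fit <- mtcars |>
  specify(mpg ~ wt) |>
  fit()

# Arbitrary seed for this sketch, not the one the exercise asks for
set.seed(123)

# Permute the response to simulate the null hypothesis of no
# linear relationship, then fit the model to each permuted sample
toy_perm_fits <- mtcars |>
  specify(mpg ~ wt) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

# Histogram of the permuted slope estimates (the null distribution)
toy_perm_fits |>
  filter(term == "wt") |>
  ggplot(aes(x = estimate)) +
  geom_histogram(bins = 30, color = "white") +
  labs(
    x = "Slope estimate from permuted sample",
    y = "Count",
    title = "Null distribution of the slope"
  )

# Two-sided p-value for each term, compared against the observed fit
toy_p <- get_p_value(
  toy_perm_fits,
  obs_stat = observed_fit,
  direction = "two-sided"
)
toy_p
```

When none of the permuted slopes is as extreme as the observed one, `get_p_value()` reports 0 and warns that this is an approximation based on the number of `reps`, so such a result is better described as a very small p-value than as exactly zero.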
Exercise 10
Next, use bootstrapping to construct a 95% confidence interval for the slope. Follow these steps to accomplish this:
- First set a seed for simulating reproducibly. Use the seed 10.
- Then, simulate the bootstrap distribution of the slope using 1,000 bootstrap samples.
- Visualize the bootstrap distribution.
- Calculate the bounds of the confidence interval using the percentile method.
- Interpret the confidence interval in the context of the data.
- Indicate whether or not the confidence interval is consistent with the results of the hypothesis test from the previous exercise. Briefly explain your response.
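The bootstrap steps above can be sketched with the same infer pipeline, again assuming the built-in `mtcars` data as a stand-in for the lab data. The seed and the names `observed_fit`, `toy_boot_fits`, and `toy_ci` are arbitrary choices for illustration.

```r
library(tidyverse)
library(tidymodels)

# Observed intercept and slope on the stand-in data
observed_fit <- mtcars |>
  specify(mpg ~ wt) |>
  fit()

# Arbitrary seed for this sketch, not the one the exercise asks for
set.seed(456)

# Resample rows with replacement and refit the model each time
toy_boot_fits <- mtcars |>
  specify(mpg ~ wt) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

# Bootstrap distribution of the slope
toy_boot_fits |>
  filter(term == "wt") |>
  ggplot(aes(x = estimate)) +
  geom_histogram(bins = 30, color = "white") +
  labs(
    x = "Slope estimate from bootstrap sample",
    y = "Count",
    title = "Bootstrap distribution of the slope"
  )

# 95% percentile interval: the 2.5th and 97.5th percentiles
# of the bootstrap estimates for each term
toy_ci <- get_confidence_interval(
  toy_boot_fits,
  point_estimate = observed_fit,
  level = 0.95,
  type = "percentile"
)
toy_ci
```

Unlike the permutation pipeline, there is no `hypothesize()` step here, because bootstrapping estimates the sampling variability of the slope rather than simulating a null hypothesis.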
This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.
Submission
In this class, we’ll be submitting PDF documents to Gradescope.
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
- Click on your STA 210 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
Grading (50 pts)
Component | Points |
---|---|
Ex 1 - 10 | 47 |
Workflow & formatting | 3 [2] |
Footnotes
1. Click here for a guide on writing mathematical symbols using LaTeX.
2. The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least 3 informative commit messages and updating the name and date in the YAML.