Lab 02: Ice duration and air temperature in Madison, WI

Inference for simple linear regression

Important

Due:

  • Monday, September 19, 11:59pm (Thursday labs)
  • Tuesday, September 20, 11:59pm (Friday labs)

Introduction

In today’s lab, you’ll use simple linear regression to analyze the relationship between air temperature and ice duration for two lakes in Wisconsin.

Learning goals

By the end of the lab you will…

  • Be able to fit a simple linear regression model using R.
  • Be able to interpret the slope and intercept for the model.
  • Be able to use simulation-based inference to draw conclusions about the slope.
  • Continue developing a workflow for reproducible data analysis.

Getting started

  • Go to the sta210-fa22 organization on GitHub. Click on the repo with the prefix lab-02-. It contains the starter documents you need to complete the lab.

  • Clone the repo and start a new project in RStudio. See the Lab 01 instructions for details on cloning a repo, starting a new project in R, and configuring git.

Packages

The following packages are used in the lab:

library(tidyverse)
library(tidymodels)
library(knitr)

Data: Ice cover and air temperature

The datasets wi-icecover.csv and wi-air-temperature.csv contain information about ice cover and air temperature, respectively, at Lake Monona and Lake Mendota for days in 1885 through 2019. The data were obtained from the ntl_icecover and ntl_airtemp data frames in the lterdatasampler R package. The data were originally collected at a US Long Term Ecological Research program (LTER) Network.

The primary variables for this analysis are

  • year: a number denoting the year of observation

  • lakeid: a factor denoting the lake name

  • ice_duration: a number denoting the number of days between the freeze and breakup dates of each lake

  • ave_air_temp_adjusted: a number denoting the air temperature in degrees Celsius, collected in Madison, WI and adjusted for biases

icecover <- read_csv("data/wi-icecover.csv")
airtemp <- read_csv("data/wi-air-temperature.csv")

Exercises

Goal: The goal of this analysis is to use linear regression understand the association between average air temperature and ice duration for two lakes in Madison, Wisconsin that freeze for a portion of the year. Because ice cover is impacted by various environmental factors, researchers are interested in examining the association between these two factors to understand how quickly the climate is changing.


Write all code and narrative in your Quarto file. Write all narrative in complete sentences. Throughout the assignment, you should periodically render your Quarto document to produce the updated PDF, commit the changes in the Git pane, and push the updated files to GitHub.

Tip

Make sure we can read all of your code in your PDF document. This means you will need to break up long lines of code. One way to help avoid long lines of code is is start a new line after every pipe (|>) and plus sign (+).

Exercise 1

Fill in the code below to create a new data frame, icecover_avg, of the average ice duration by lake and year. How many observations (rows) are in icecover_avg? How many variables (columns)?

icecover_avg <- icecover |>
  group_by(____, _____) |>
  summarise(avg_ice_duration = mean(_____))

Exercise 2

  • Create a new data frame airtemp_avg, of the average air temperature by year.

  • Then, join icecover_avg and airtemp_avg to make a new data frame ice_air_joined. Remove the years that have missing data for the average annual air temperature or the average annual ice duration.

    The data frame ice_air_joined has 268 observations.

Important

You will use ice_air_joined for the remainder of this lab.

Exercise 3

Create a histogram to examine the distribution of avg_ice_duration and calculate summary statistics for the center and spread of the distribution. Include informative axis labels and an informative title on the visualization.

Use the visualization and summary statistics to describe the distribution of avg_ice_duration . Include the shape, center, spread, and potential presence of outliers in the description.

Exercise 4

Create a histogram to examine the distribution of avg_air_temp and calculate summary statistics for the center and spread of the distribution. Include informative axis labels and an informative title on the visualization.

Use the visualization and summary statistics to describe the distribution of avg_air_temp. Include the shape, center, spread, and potential presence of outliers in the description.

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 5

Fill in the code below to create a visualization of avg_ice_duration versus year by lakeid. Write two observations from the visualization.

ggplot(data = ice_air_joined, aes(x = _____, y = ______, color = _____)) +
  geom_line(alpha = 0.7) +
  labs(x = "_______", 
       y = "_______", 
       color = "______", 
       title = "_______", 
       subtitle = "________")

Exercise 6

Create a visualization of avg_air_temp versus year. Include informative axis labels and an informative title on the visualization.

  • Use the visualization to write two observations about the trend of average air temperature over time.

  • Based on the visualizations of average ice duration and average air temperature over time, would you expect the linear model describing the association between average ice duration and average air temperature to have a positive or negative slope? Briefly explain.

Exercise 7

Researchers would like to use a linear model to understand variability in the average ice duration based on the average air temperature. Write the form of the statistical model the researchers would like to estimate. Use mathematical notation and variable names (avg_air_temp and avg_ice_duration) in the equation.1

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 8

Fit and display the model described in the previous exercise. Use the kable function to neatly display the model output using three decimal places.

  • Write the equation of the fitted model. Use mathematical notation and variable names (avg_air_temp and avg_ice_duration) in the equation.

  • Interpret the slope in the context of the data.

  • Does the intercept have a meaningful interpretation in this context? If so, interpret the intercept in the context of the data. Otherwise, explain why not.

Exercise 9

“Do the data provide sufficient evidence of a linear relationship between average air temperature and average ice duration , i.e. is \(\beta_1\) different from 0?“

Use a simulation-based hypothesis test to answer this question. In your response:

  • Write the null and alternative hypotheses in words and using mathematical notation.

  • Test these hypotheses using a permutation test.

    • First set a seed for simulating reproducibly. Use the seed 9.

    • Use the permutation pipeline with 1,000 permuted samples to simulate new samples from the original sample via permutation under the assumption that the null hypothesis is true and fit models to each of the samples and estimate the slope. Save the results as perm_fits.

    • Create a histogram of the slope estimates in perm_fits.

    • Approximate the p-value of the hypothesis test based on this distribution.

    • State your conclusion for the test in the context of the data.

Exercise 10

Next, use bootstrapping to construct 95% confidence interval for the slope Follow these steps to accomplish this:

  • First set a seed for simulating reproducibly. Use the seed 10.
  • Then, simulate the bootstrap distribution of the slope using 1,000 bootstrap samples.
  • Visualize the bootstrap distribution.
  • Calculate the bounds of the confidence interval using the percentile method.
  • Interpret the confidence interval in the context of the data.
  • Indicate whether or not the confidence interval is consistent with the results of the hypothesis test from the previous exercise. Briefly explain your response.

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Submission

In this class, we’ll be submitting PDF documents to Gradescope.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

  • Go to http://www.gradescope.comand click Log in in the top right corner.
  • Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
  • Click on your STA 210 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
  • Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading (50 pts)


Component Points
Ex 1 - 10 47
Workflow & formatting 32

Footnotes

  1. Click here for a guide on writing mathematical symbols using LaTex.↩︎

  2. The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least 3 informative commit messages and updating the name and date in the YAML.↩︎