library(tidyverse)
library(tidymodels)
library(openintro)
library(knitr)
AE 04: Simulation-based hypothesis testing
Houses in Duke Forest
Data
The data are on houses sold in the Duke Forest neighborhood of Durham, NC around November 2020. They were originally scraped from Zillow and can be found in the duke_forest data set in the openintro R package.
glimpse(duke_forest)
Rows: 98
Columns: 13
$ address <chr> "1 Learned Pl, Durham, NC 27705", "1616 Pinecrest Rd, Durha…
$ price <dbl> 1520000, 1030000, 420000, 680000, 428500, 456000, 1270000, …
$ bed <dbl> 3, 5, 2, 4, 4, 3, 5, 4, 4, 3, 4, 4, 3, 5, 4, 5, 3, 4, 4, 3,…
$ bath <dbl> 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 5.0, 3.0, 5.0, 2.0, 3.0, 3.0,…
$ area <dbl> 6040, 4475, 1745, 2091, 1772, 1950, 3909, 2841, 3924, 2173,…
$ type <chr> "Single Family", "Single Family", "Single Family", "Single …
$ year_built <dbl> 1972, 1969, 1959, 1961, 2020, 2014, 1968, 1973, 1972, 1964,…
$ heating <chr> "Other, Gas", "Forced air, Gas", "Forced air, Gas", "Heat p…
$ cooling <fct> central, central, central, central, central, central, centr…
$ parking <chr> "0 spaces", "Carport, Covered", "Garage - Attached, Covered…
$ lot <dbl> 0.97, 1.38, 0.51, 0.84, 0.16, 0.45, 0.94, 0.79, 0.53, 0.73,…
$ hoa <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url <chr> "https://www.zillow.com/homedetails/1-Learned-Pl-Durham-NC-…
Exploratory data analysis
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(alpha = 0.7) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Price and area of houses in Duke Forest"
  ) +
  scale_y_continuous(labels = label_dollar())
Model
df_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(price ~ area, data = duke_forest)
tidy(df_fit) |>
kable(digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |
area | 159.48 | 18.17 | 8.78 | 0.00 |
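Reading the two estimates off the table above, the fitted line can be written out as below. The coefficients are rounded to two decimals as in the table, and the hat notation is only a notational choice here, not something given in the original output.

$$
\widehat{\text{price}} = 116652.33 + 159.48 \times \text{area}
$$

Under this fit, each additional square foot of area is associated with a predicted sale price that is higher by about $159.48, on average.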
Hypothesis test
State the null and alternative hypotheses
[Add hypotheses in mathematical notation]
Generate null distribution using permutation
Fill in the code, then set eval: true.
n = 100
set.seed(09142022)
null_dist <- _____ |>
  specify(______) |>
  hypothesize(null = "independence") |>
  generate(reps = _____, type = "permute") |>
  fit()
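To make the idea behind the permutation step more concrete, here is a minimal base R sketch of what permuting does. It deliberately uses the built-in mtcars data rather than duke_forest, and the seed, the number of repetitions, and the variable names (reps, obs_slope, null_slopes) are all assumptions chosen for illustration. It is not the infer pipeline above, only a hand-rolled version of the same idea.

```r
# Illustration on the built-in mtcars data, NOT the Duke Forest data.
set.seed(1234)  # arbitrary seed, only for reproducibility
reps <- 1000    # assumed number of permutations

# Observed slope from regressing mpg on wt
obs_slope <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

# Shuffle the response to break any link with the predictor,
# refit the model, and keep the slope from each shuffled data set
null_slopes <- replicate(reps, {
  shuffled <- transform(mtcars, mpg = sample(mpg))
  coef(lm(mpg ~ wt, data = shuffled))[["wt"]]
})

# Under independence, the permuted slopes should be centered near 0
mean(null_slopes)
```

Because shuffling the response removes any real association with the predictor, the collection of refitted slopes approximates what the slope would look like if the null hypothesis of independence were true.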
Visualize distribution
# Code for histogram of null distribution
Calculate the p-value
# get observed fit
observed_fit <- duke_forest |>
  specify(price ~ area) |>
  fit()
# calculate p-value
get_p_value(
  ____,
  obs_stat = ____,
  direction = "two-sided"
)
- What does the warning message mean?
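As a rough sketch of how a simulation-based p-value can be computed by hand, the base R code below again uses the built-in mtcars data rather than duke_forest. The seed, the number of repetitions, and all variable names are assumptions for illustration. The two-sided rule used here (comparing absolute values) is one common definition, and the calculation inside get_p_value() may differ in detail.

```r
# Illustration on the built-in mtcars data, NOT the Duke Forest data.
set.seed(1234)  # arbitrary seed, only for reproducibility
reps <- 1000    # assumed number of permutations

# Observed slope from regressing mpg on wt
obs_slope <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

# Permutation null distribution of the slope
null_slopes <- replicate(reps, {
  shuffled <- transform(mtcars, mpg = sample(mpg))
  coef(lm(mpg ~ wt, data = shuffled))[["wt"]]
})

# Two-sided p-value: proportion of permuted slopes that are
# at least as far from 0 as the observed slope
p_value <- mean(abs(null_slopes) >= abs(obs_slope))
p_value
```

A p-value computed this way is a proportion out of reps, so it can only take values in steps of 1 / reps. With a strong association it may come out very small or exactly 0, which reflects the limited number of permutations rather than a true probability of zero.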
State conclusion
[Write your conclusion in the context of the data.]