Evaluate a claim about the slope using hypothesis testing
Define mathematical models to conduct inference for slope
Computational setup
# load packages
library(tidyverse)  # for data wrangling and visualization
library(tidymodels) # for modeling
library(usdata)     # for the county_2019 dataset
library(openintro)  # for Duke Forest dataset
library(scales)     # for pretty axis labels
library(glue)       # for constructing character strings
library(knitr)      # for neatly formatted tables
library(kableExtra) # also for neatly formatted tables

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))
# A tibble: 2 × 3
term lower_ci upper_ci
<chr> <dbl> <dbl>
1 area 91.7 211.
2 intercept -18290. 287711.
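The interval above could have been produced with a bootstrap pipeline along the following lines; this is a sketch, and the object name `boot_dist`, the seed, and `reps = 1000` are assumptions (exact bounds will vary with the seed):

```r
# a sketch of a bootstrap confidence interval for the slope with infer
library(tidymodels) # loads infer
library(openintro)  # for the duke_forest dataset

set.seed(1234)
boot_dist <- duke_forest |>
  specify(price ~ area) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()

# percentile interval for each term: columns term, lower_ci, upper_ci
get_confidence_interval(boot_dist, level = 0.95, type = "percentile")
```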
Hypothesis test for the slope
Research question and hypotheses
“Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?”
Null hypothesis: there is no linear relationship between area and price
H0:β1=0
Alternative hypothesis: there is a linear relationship between area and price
HA:β1≠0
Hypothesis testing as a court trial
Null hypothesis, H0: Defendant is innocent
Alternative hypothesis, HA: Defendant is guilty
Present the evidence: Collect data
Judge the evidence: “Could these data plausibly have happened by chance if the null hypothesis were true?”
Yes: Fail to reject H0
No: Reject H0
Hypothesis testing framework
Start with a null hypothesis, H0 that represents the status quo
Set an alternative hypothesis, HA that represents the research question, i.e. what we’re testing for
Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (probability of observed or more extreme outcome given that the null hypothesis is true)
if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
if they do, then reject the null hypothesis in favor of the alternative
Quantify the variability of the slope for testing
Two approaches:
Via simulation
Via mathematical models
Randomization to quantify the variability of the slope for the purpose of testing, under the assumption that the null hypothesis is true:
Simulate new samples from the original sample via permutation
Fit models to each of the samples and estimate the slope
Use features of the distribution of the permuted slopes to conduct a hypothesis test
Permutation, described
Permuting sets the null hypothesis to be true: it preserves the natural variability in the data due to sampling while removing any association between the variables
Permute one variable to eliminate any existing relationship between the variables
Each price value is randomly assigned to the area of a given house, i.e., area and price are no longer matched for a given house
Each of the observed values for area (and for price) exists in both the observed data plot and the permuted price plot
The permutation removes the linear relationship between area and price
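A minimal sketch of what a single permutation does: shuffle price so it is no longer matched to the area of its house (the seed is arbitrary):

```r
library(tidyverse)
library(openintro) # for the duke_forest dataset

set.seed(1234)
permuted <- duke_forest |>
  mutate(price = sample(price))

# the permuted data should show essentially no linear relationship,
# so the fitted slope will be close to 0
lm(price ~ area, data = permuted) |> coef()
```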
Permutation, repeated
Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true)
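One way to implement the repeated permutations is with the infer package; this sketch builds the `null_dist` object used in the plots that follow, with `reps = 1000` and the seed as assumptions:

```r
library(tidymodels) # loads infer
library(openintro)  # for the duke_forest dataset

set.seed(1234)
null_dist <- duke_forest |>
  specify(price ~ area) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

# one slope (and intercept) estimate per permuted sample
null_dist
```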
Concluding the hypothesis test
Is the observed slope of β̂1 = 159 (or an even more extreme slope) a likely outcome under the null hypothesis that β1 = 0? What does this mean for our original question: “Do the data provide sufficient evidence that β1 (the true slope for the population) is different from 0?”
null_dist |>
  filter(term == "area") |>
  ggplot(aes(x = estimate)) +
  geom_histogram(binwidth = 10, color = "white")
Reason around the p-value
In a world where there is no relationship between the area of a Duke Forest house and its price (β1 = 0), what is the probability that we observe a sample of 98 houses where the slope of the model predicting price from area is 159 or even more extreme?
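This probability can be computed by comparing the observed fit to the permutation distribution; the following is a sketch assuming `null_dist` is the permutation distribution of the slope built earlier:

```r
library(tidymodels) # loads infer
library(openintro)  # for the duke_forest dataset

# observed slope (and intercept) from the original sample
observed_fit <- duke_forest |>
  specify(price ~ area) |>
  fit()

# proportion of permuted estimates at least as extreme as the observed ones
null_dist |>
  get_p_value(obs_stat = observed_fit, direction = "two-sided")
```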
Warning: Please be cautious in reporting a p-value of 0. This result is an
approximation based on the number of `reps` chosen in the `generate()` step. See
`?get_p_value()` for more information.
# A tibble: 2 × 2
term p_value
<chr> <dbl>
1 area 0
2 intercept 0
Earlier we computed a confidence interval and conducted a hypothesis test via simulation:
CI: Bootstrap the observed sample to simulate the distribution of the slope
HT: Permute the observed sample to simulate the distribution of the slope under the assumption that the null hypothesis is true
Now we’ll do both based on theoretical results: we use the Central Limit Theorem to define the distribution of the slope, then use features (shape, center, spread) of this distribution to compute the bounds of the confidence interval and the p-value for the hypothesis test
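With the mathematical (CLT-based) approach, the standard error, test statistic, and p-value come directly from the model fit; a sketch:

```r
library(tidymodels)
library(openintro) # for the duke_forest dataset

df_fit <- linear_reg() |>
  fit(price ~ area, data = duke_forest)

tidy(df_fit)                   # estimate, std.error, statistic, p.value
tidy(df_fit, conf.int = TRUE)  # adds 95% confidence bounds per term
```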
Mathematical representation of the model
Y = Model + Error = f(X) + ϵ = μ_{Y|X} + ϵ = β0 + β1 X + ϵ
where the errors are independent and normally distributed:
independent: Knowing the error term for one observation doesn’t tell you anything about the error term for another observation
normally distributed: ϵ ∼ N(0, σ²ϵ)
Mathematical representation, visualized
Y | X ∼ N(β0 + β1 X, σ²ϵ)
Mean: β0 + β1 X, the predicted value based on the regression model
Variance: σ²ϵ, constant across the range of X
How do we estimate σ²ϵ?
Regression standard error
Once we fit the model, we can use the residuals to estimate the regression standard error (the spread of the distribution of the response, for a given value of the predictor variable):
σ̂ϵ = √( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2) ) = √( Σᵢ₌₁ⁿ eᵢ² / (n − 2) )
Why divide by n−2?
Why do we care about the value of the regression standard error?
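The formula above can be computed directly from the residuals; this sketch checks the hand computation against R's built-in `sigma()`:

```r
library(openintro) # for the duke_forest dataset

fit_lm <- lm(price ~ area, data = duke_forest)
n <- nrow(duke_forest)

# divide by n - 2 because two coefficients (beta0, beta1) were estimated
sigma_hat <- sqrt(sum(residuals(fit_lm)^2) / (n - 2))
sigma_hat

sigma(fit_lm) # built-in equivalent, should match sigma_hat
```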