# load packages
library(tidyverse)
library(tidymodels)
library(knitr)
ANOVA Output in R
We will use the Tips data set for this example.
The variables of interest in this analysis are
Party: Number of people in the party
Meal: Time of day (Lunch, Dinner, Late Night)
Age: Age category of person paying the bill (Yadult, Middle, SenCit)
Model fit
tip_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Tip ~ Party + Age, data = tips)
tidy(tip_fit) |>
  kable(digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.17 | 0.37 | -0.46 | 0.64 |
Party | 1.84 | 0.12 | 14.76 | 0.00 |
AgeMiddle | 1.01 | 0.41 | 2.47 | 0.01 |
AgeSenCit | 1.39 | 0.48 | 2.86 | 0.00 |
ANOVA
Below is the ANOVA output for the model fit above.
anova(tip_fit$fit) |>
  tidy() |>
  kable(digits = 2)
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
Party | 1 | 1188.64 | 1188.64 | 285.71 | 0.00 |
Age | 2 | 38.03 | 19.01 | 4.57 | 0.01 |
Residuals | 165 | 686.44 | 4.16 | NA | NA |
We will focus on the sums of squares in this document. They are as follows:
\[ \begin{aligned} &SS_{Party} = 1188.64 \\ &SS_{Age|Party} = 38.03 \\ &SS_{Error} = SS_{Residuals} = 686.44 \\ &SS_{Total} = 1913.11 \end{aligned} \]
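The total does not appear as a row in the ANOVA table. As a quick check on the arithmetic, it is the sum of the three sums of squares the table does report:

\[ \begin{aligned} SS_{Total} &= SS_{Party} + SS_{Age|Party} + SS_{Error} \\ &= 1188.64 + 38.03 + 686.44 \\ &= 1913.11 \end{aligned} \]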
Sum of squares in ANOVA table
R uses a sequential method to calculate the sums of squares for the variables in the model. This means that the sum of squares attributed to each variable is the additional variation in the response explained by that variable, after accounting for the variation explained by the variables that come before it in the model.
The order of the sequence is determined by the order of the variables in the model fit code. This order is reflected in the order the variables appear in the ANOVA output. The sequential sum of squares attributed to each variable will change if the order of the variables in the model changes; however, the sum of squares attributed to the model overall will not change, regardless of the order of the variables.
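The order dependence described above can be seen in a small sketch. The Tips data is not bundled with R, so this example assumes the built-in `mtcars` data as a stand-in and uses base `lm()` rather than the tidymodels workflow. The variables `mpg`, `wt`, and `hp` and the helper `seq_ss()` are illustrative only and are not part of the tips analysis.

```r
# Stand-in data: built-in mtcars, NOT the Tips data used in this document.
# Fit the same two predictors in both orders.
fit_wt_first <- lm(mpg ~ wt + hp, data = mtcars)
fit_hp_first <- lm(mpg ~ hp + wt, data = mtcars)

# Hypothetical helper: sequential sums of squares, named by term
seq_ss <- function(fit) {
  tab <- anova(fit)
  setNames(tab[["Sum Sq"]], rownames(tab))
}

ss_wt_first <- seq_ss(fit_wt_first)
ss_hp_first <- seq_ss(fit_hp_first)

# The value attributed to each variable depends on the order...
print(ss_wt_first)
print(ss_hp_first)

# ...but the sum over the model terms does not.
model_ss_wt_first <- sum(ss_wt_first[c("wt", "hp")])
model_ss_hp_first <- sum(ss_hp_first[c("wt", "hp")])
print(all.equal(model_ss_wt_first, model_ss_hp_first))
```

The same comparison could be run on the tips model by fitting `Tip ~ Age + Party` alongside `Tip ~ Party + Age`.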
Let’s take a look at the sum of squares for Party, the first variable in the model. This value is the total variation in Tip explained by Party alone. We can calculate it from the ANOVA table for the simple linear regression model in which Party is the only predictor.
party_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Tip ~ Party, data = tips)
anova(party_fit$fit) |>
  tidy() |>
  kable(digits = 2)
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
Party | 1 | 1188.64 | 1188.64 | 274 | 0 |
Residuals | 167 | 724.47 | 4.34 | NA | NA |
Notice that the sum of squares for Party in this table is the value of \(SS_{Party}\) above.
Next, let’s add Age to the model. The sum of squares associated with Age is the additional variation in Tip explained by Age after accounting for the variation explained by Party. This can be understood as the additional variation explained by the model with Party and Age compared to a model that only includes Party. We can calculate this additional variation as follows:
anova(party_fit$fit, tip_fit$fit) |>
  tidy() |>
  kable(digits = 2)
term | df.residual | rss | df | sumsq | statistic | p.value |
---|---|---|---|---|---|---|
Tip ~ Party | 167 | 724.47 | NA | NA | NA | NA |
Tip ~ Party + Age | 165 | 686.44 | 2 | 38.03 | 4.57 | 0.01 |
Notice that the sum of squares in this table is the value of \(SS_{Age|Party}\) above.
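The same value can be recovered by hand from the rss column of this table, as the drop in residual sum of squares when Age is added. Here \(RSS\) is shorthand, introduced for this check, for the residual sum of squares of each model:

\[ \begin{aligned} SS_{Age|Party} &= RSS_{Party} - RSS_{Party + Age} \\ &= 724.47 - 686.44 \\ &= 38.03 \end{aligned} \]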
When we use the ANOVA table, we are most interested in the variation in the response explained by the entire model, not the contribution from each variable. Therefore, we will primarily consider \(SS_{Model}\), the model sum of squares. Because sums of squares are additive, it can be calculated as
\[ \begin{aligned} SS_{Model} &= SS_{Total} - SS_{Error} \\ &= 1913.11 - 686.44 \\ & = 1226.67 \end{aligned} \]
It can also be calculated as
\[ \begin{aligned} SS_{Model} &= SS_{Party} + SS_{Age | Party} \\ & = 1188.64 + 38.03 \\ & = 1226.67 \end{aligned} \]
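Both routes to \(SS_{Model}\) can be checked in code. The sketch below again assumes the built-in `mtcars` data as a stand-in for the Tips data and uses base `lm()`. The variable names are illustrative, and the numbers it prints are not the tips results.

```r
# Stand-in data: built-in mtcars, NOT the Tips data used in this document.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Sequential sums of squares from the ANOVA table, named by term
tab <- anova(fit)
ss <- setNames(tab[["Sum Sq"]], rownames(tab))

# Total variation in the response around its mean
ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
ss_error <- ss[["Residuals"]]

# Route 1: total minus error
ss_model_by_subtraction <- ss_total - ss_error

# Route 2: add up the sequential sums of squares of the model terms
ss_model_by_addition <- sum(ss[c("wt", "hp")])

print(c(subtraction = ss_model_by_subtraction,
        addition = ss_model_by_addition))
```

The two printed values should agree up to floating-point rounding, mirroring the two hand calculations above.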