ANOVA Output in R

We will use the Tips data set for this example.

# load packages
library(tidyverse)
library(tidymodels)
library(knitr)

The variables of interest in this analysis are Tip (the response) and the predictors Party and Age.

Model fit

tip_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Tip ~ Party + Age, data = tips)

tidy(tip_fit) |>
  kable(digits = 2)
term         estimate  std.error  statistic  p.value
(Intercept)     -0.17       0.37      -0.46     0.64
Party            1.84       0.12      14.76     0.00
AgeMiddle        1.01       0.41       2.47     0.01
AgeSenCit        1.39       0.48       2.86     0.00

ANOVA

Below is the ANOVA output for the model fit above.

anova(tip_fit$fit) |>
  tidy() |>
  kable(digits = 2)
term        df    sumsq   meansq  statistic  p.value
Party        1  1188.64  1188.64     285.71     0.00
Age          2    38.03    19.01       4.57     0.01
Residuals  165   686.44     4.16         NA       NA

We will focus on the sums of squares in this document. They are as follows:

\[\begin{aligned} SS_{Party} &= 1188.64 \\ SS_{Age|Party} &= 38.03 \\ SS_{Error} = SS_{Residuals} &= 686.44 \\ SS_{Total} &= 1913.11 \end{aligned}\]
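
\(SS_{Total}\) does not appear as a row in the ANOVA output. Because the sums of squares in the table add up to the total variation in the response, it can be recovered from the three rows above:

\[ \begin{aligned} SS_{Total} &= SS_{Party} + SS_{Age|Party} + SS_{Error} \\ &= 1188.64 + 38.03 + 686.44 \\ &= 1913.11 \end{aligned} \]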

Sum of squares in ANOVA table

R uses a sequential method to calculate the sums of squares for the variables in the model. This means that the sum of squares attributed to each variable is the variation in the response explained by that variable after accounting for the variation explained by the variables that come before it in the model.

The order of the sequence is determined by the order of the variables in the model fit code. This order is reflected in the order the variables appear in the ANOVA output. The sequential sum of squares attributed to each variable will change if the order of the variables in the model changes; however, the sum of squares attributed to the model overall will not change, regardless of the order of the variables.
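
The effect of variable order can be seen in a short sketch. It is only an illustration: it uses R's built-in mtcars data as a stand-in for the Tips data, and calls lm() directly rather than going through linear_reg(). The object names (cars, fit_wt_first, and so on) are chosen for this example.

```r
# Stand-in data: built-in mtcars, NOT the Tips data used in this document.
# cyl is converted to a factor so it plays the role of a categorical predictor.
cars <- transform(mtcars, cyl = factor(cyl))

# Same two predictors, entered in opposite orders
fit_wt_first  <- lm(mpg ~ wt + cyl, data = cars)
fit_cyl_first <- lm(mpg ~ cyl + wt, data = cars)

aov_wt_first  <- anova(fit_wt_first)
aov_cyl_first <- anova(fit_cyl_first)

# Sequential sum of squares for wt when it enters first vs. second
ss_wt_first  <- aov_wt_first["wt", "Sum Sq"]
ss_wt_second <- aov_cyl_first["wt", "Sum Sq"]

# Model sum of squares: add every row except Residuals
ss_model_1 <- sum(aov_wt_first[c("wt", "cyl"), "Sum Sq"])
ss_model_2 <- sum(aov_cyl_first[c("cyl", "wt"), "Sum Sq"])

print(c(wt_first = ss_wt_first, wt_second = ss_wt_second))
print(all.equal(ss_model_1, ss_model_2))
```

The two values printed for wt differ because the predictors share some explained variation, while the comparison of the two model sums of squares should report that they agree.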

Let’s take a look at the sum of squares for Party, the first variable in the model. This value is calculated as the total variation in Tip explained by Party only. We can calculate this value by looking at the ANOVA table for the simple linear regression model where Party is the only predictor.

party_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Tip ~ Party, data = tips)

anova(party_fit$fit) |>
  tidy() |>
  kable(digits = 2)
term        df    sumsq   meansq  statistic  p.value
Party        1  1188.64  1188.64        274        0
Residuals  167   724.47     4.34         NA       NA

Notice that the sum of squares for Party in this table is the value of \(SS_{Party}\) above.

Next, let’s add Age to the model. The sum of squares associated with Age is the additional variation in Tip explained by Age after accounting for the variation explained by Party. This can be understood as the additional variation explained by the model with Party and Age compared to a model that only includes Party. We can calculate this additional variation as follows:

anova(party_fit$fit, tip_fit$fit) |>
  tidy() |>
  kable(digits = 2)
term               df.residual     rss  df  sumsq  statistic  p.value
Tip ~ Party                167  724.47  NA     NA         NA       NA
Tip ~ Party + Age          165  686.44   2  38.03       4.57     0.01

Notice that the sum of squares in this table is the value of \(SS_{Age|Party}\) above.
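
The same value can be checked from the rss column, since the additional sum of squares is the drop in residual sum of squares between the two models. The subscripted \(SS_{Error}\) labels below are notation introduced only for this check:

\[ \begin{aligned} SS_{Age|Party} &= SS_{Error,\,Party} - SS_{Error,\,Party + Age} \\ &= 724.47 - 686.44 \\ &= 38.03 \end{aligned} \]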

Note

When we input two models in the anova() function, e.g., anova(Model 1, Model 2), the output is the additional sum of squares accounted for by the new variable(s) in Model 2 after accounting for the variables in Model 1. In this case, it is the additional sum of squares accounted for by Age after accounting for Party.

When we use the ANOVA table, we are most interested in the variation in the response explained by the entire model, not the contribution from each variable. Therefore, we will primarily consider \(SS_{Model}\), the model sum of squares. Because sums of squares are additive, it can be calculated as

\[ \begin{aligned} SS_{Model} &= SS_{Total} - SS_{Error} \\ &= 1913.11 - 686.44 \\ & = 1226.67 \end{aligned} \]

It can also be calculated as

\[ \begin{aligned} SS_{Model} &= SS_{Party} + SS_{Age | Party} \\ & = 1188.64 + 38.03 \\ & = 1226.67 \end{aligned} \]
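
Both calculations can be reproduced in code. The sketch below is a minimal illustration rather than the document's own workflow: it again uses the built-in mtcars data as a stand-in for the Tips data and fits the model with lm() directly, and all object names are chosen for this example.

```r
# Stand-in data: built-in mtcars, NOT the Tips data used in this document
cars <- transform(mtcars, cyl = factor(cyl))

fit     <- lm(mpg ~ wt + cyl, data = cars)
aov_tab <- anova(fit)

# Total sum of squares: squared deviations of the response from its mean
ss_total <- sum((cars$mpg - mean(cars$mpg))^2)

# Error sum of squares: the Residuals row of the ANOVA table
ss_error <- aov_tab["Residuals", "Sum Sq"]

# Method 1: SS_Model = SS_Total - SS_Error
ss_model_by_subtraction <- ss_total - ss_error

# Method 2: SS_Model = sum of the sequential sums of squares
is_predictor_row     <- rownames(aov_tab) != "Residuals"
ss_model_by_addition <- sum(aov_tab[is_predictor_row, "Sum Sq"])

print(all.equal(ss_model_by_subtraction, ss_model_by_addition))
```

With the Tips model, the same steps would apply to anova(tip_fit$fit), assuming tips is already loaded as in the code earlier in this document.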