Prof. Maria Tackett
Oct 12, 2022
Log transformation on the response variable
Log transformation on the predictor variable
A high respiratory rate can potentially indicate a respiratory infection in children. In order to determine what indicates a “high” rate, we first want to understand the relationship between a child’s age and their respiratory rate.
The data contain the respiratory rate for 618 children ages 15 days to 3 years. The data were obtained from the Sleuth3 R package and are originally from a 1994 publication, “Reference Values for Respiratory Rate in the First 3 Years of Life”.
Variables:
Age: age in months
Rate: respiratory rate (breaths per minute)
What do you notice in this plot?
\[ \log(Y) = \beta_0+ \beta_1 X + \epsilon, \hspace{10mm} \epsilon \sim N(0,\sigma^2_\epsilon) \]
\[\widehat{\log(Y)} = \hat{\beta}_0+ \hat{\beta}_1 X\]
We want to interpret the model in terms of the original variable \(Y\), not \(\log(Y)\), so we need to write the model in terms of \(Y\).
\[\hat{Y} = \exp\{\hat{\beta}_0 + \hat{\beta}_1 X\} = \exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1X\}\]
\[\widehat{\text{Median}({Y|X})} = \exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 X\}\]
Intercept: When \(X=0\), the median of \(Y\) is expected to be \(\exp\{\hat{\beta}_0\}\)
Slope: For every one unit increase in \(X\), the median of \(Y\) is expected to multiply by a factor of \(\exp\{\hat{\beta}_1\}\)
Why is the interpretation in terms of a multiplicative change?
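One way to see this, as a short sketch that uses only the fitted equation above, is to compare the estimated median at \(X + 1\) with the estimated median at \(X\):

\[\frac{\widehat{\text{Median}(Y|X+1)}}{\widehat{\text{Median}(Y|X)}} = \frac{\exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 (X+1)\}}{\exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 X\}} = \exp\{\hat{\beta}_1\}\]

The ratio does not depend on \(X\), so a one unit increase in \(X\) always multiplies the estimated median by the same factor. An additive change on the log scale becomes a multiplicative change on the original scale.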
Suppose we have a set of values
Note: \(\overline{\log(x)} \neq \log(\bar{x})\)
Note: \(\text{Median}(\log(x)) = \log(\text{Median}(x))\)
\[\overline{\log(x)} \neq \log(\bar{x})\]
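The two notes above can be checked numerically. The sketch below uses a small set of hypothetical values (chosen for illustration, not the values shown on the original slide) and natural logs. It uses an odd number of values so that the median is one of the observations; with an even number, the median averages the two middle values and the equality holds only approximately.

```python
import math
import statistics

# Hypothetical values for illustration only (odd count, so the median
# is an actual observation rather than an average of two).
x = [3, 5, 6, 7, 10, 14, 17, 50, 79]
log_x = [math.log(v) for v in x]

# Mean: averaging the logs is not the same as logging the average.
mean_of_logs = statistics.mean(log_x)
log_of_mean = math.log(statistics.mean(x))

# Median: log is increasing, so it preserves the ordering of the values.
median_of_logs = statistics.median(log_x)
log_of_median = math.log(statistics.median(x))

print(f"mean(log(x))   = {mean_of_logs:.3f}")
print(f"log(mean(x))   = {log_of_mean:.3f}")
print(f"median(log(x)) = {median_of_logs:.3f}")
print(f"log(median(x)) = {log_of_median:.3f}")
```

The two mean-based quantities differ, while the two median-based quantities agree.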
Recall that in a model with an untransformed response, \(\beta_0 + \beta_1 X\) is the mean value of the response at the given value of the predictor \(X\). This doesn’t hold when we log-transform the response variable.
Mathematically, the mean of the logged values is not necessarily equal to the log of the mean value. Therefore, at a given value of \(X\):
\[\begin{aligned}\exp\{\text{Mean}(\log(Y|X))\} \neq \text{Mean}(Y|X) \\[5pt] \Rightarrow \exp\{\beta_0 + \beta_1 X\} \neq \text{Mean}(Y|X) \end{aligned}\]
\[\exp\{\text{Median}(\log(Y|X))\} = \text{Median}(Y|X)\]
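A small simulation can make this concrete. The sketch below assumes the log-linear model holds with normal errors and uses hypothetical parameter values (`beta0`, `beta1`, `sigma`, and the predictor value `x` are made up for illustration, not estimates from the respiratory data). It draws many responses at a single value of the predictor and compares the back-transformed quantity \(\exp\{\beta_0 + \beta_1 X\}\) with the simulated median and mean of \(Y\).

```python
import math
import random
import statistics

random.seed(210)  # arbitrary seed so the run is reproducible

# Hypothetical parameters for log(Y) = beta0 + beta1 * X + eps, eps ~ N(0, sigma^2)
beta0, beta1, sigma = 3.8, -0.02, 0.5
x = 12  # a single, fixed value of the predictor

# Simulate responses on the original scale at this value of X
y = [math.exp(beta0 + beta1 * x + random.gauss(0, sigma)) for _ in range(100_000)]

back_transformed = math.exp(beta0 + beta1 * x)
sim_median = statistics.median(y)
sim_mean = statistics.mean(y)

print(f"exp(beta0 + beta1 * x) = {back_transformed:.2f}")
print(f"simulated median of Y  = {sim_median:.2f}")
print(f"simulated mean of Y    = {sim_mean:.2f}")
```

Under these assumptions the back-transformed value tracks the simulated median closely, while the simulated mean is noticeably larger.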
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 3.831 | 0.015 | 259.086 | 0 |
Age | -0.018 | 0.001 | -21.243 | 0 |
Interpret the slope and intercept in the context of the data.
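As a starting point for the interpretation, the estimates can be back-transformed to the original scale. This sketch assumes the regression output above is the fit of log(Rate) on Age using the natural log; the variable names are illustrative.

```python
import math

# Estimates copied from the regression output above
# (assumed to be the model for log(Rate) with Age in months as the predictor).
b0_hat = 3.831
b1_hat = -0.018

# Intercept: estimated median respiratory rate when Age = 0
median_rate_at_age_0 = math.exp(b0_hat)

# Slope: multiplicative change in the median rate per one-month increase in Age
monthly_factor = math.exp(b1_hat)
pct_change_per_month = (monthly_factor - 1) * 100

print(f"exp(b0) = {median_rate_at_age_0:.2f} breaths per minute")
print(f"exp(b1) = {monthly_factor:.4f}")
print(f"percent change in median rate per month = {pct_change_per_month:.1f}%")
```

Note that Age = 0 falls just outside the observed range (the youngest children in the data are 15 days old), so the intercept is a slight extrapolation.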
Try a log transformation on \(X\) if the scatterplot shows some curvature but the variance is constant for all values of \(X\).
Suppose we have the following regression equation:
\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \log(X)\]
Intercept: When \(X = 1\) \((\log(X) = 0)\), \(Y\) is expected to be \(\hat{\beta}_0\) (i.e. the mean of \(Y\) is \(\hat{\beta}_0\))
Slope: When \(X\) is multiplied by a factor of \(\mathbf{C}\), the mean of \(Y\) is expected to change by \(\boldsymbol{\hat{\beta}_1}\mathbf{\log(C)}\) units
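The slope interpretation follows from the regression equation. As a short sketch, write \(\hat{Y}_{X}\) for the predicted value at \(X\) and \(\hat{Y}_{CX}\) for the predicted value at \(CX\) (this subscript notation is introduced here only for the derivation):

\[\hat{Y}_{CX} - \hat{Y}_{X} = \{\hat{\beta}_0 + \hat{\beta}_1 \log(CX)\} - \{\hat{\beta}_0 + \hat{\beta}_1 \log(X)\} = \hat{\beta}_1\{\log(C) + \log(X) - \log(X)\} = \hat{\beta}_1 \log(C)\]

The difference depends only on the factor \(C\), not on the starting value of \(X\).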
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 49.397 | 0.755 | 65.436 | 0 |
log_age | -5.668 | 0.311 | -18.248 | 0 |
Interpret the slope and intercept in the context of the data.
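As a starting point for the interpretation, the slope can be converted into the expected change for a chosen multiplicative change in age. This sketch assumes `log_age` in the output above is the natural log of Age in months; the factors 2 and 1.1 are arbitrary choices for illustration.

```python
import math

# Estimates copied from the regression output above
# (assumed to be the model for Rate with log(Age) as the predictor).
b0_hat = 49.397
b1_hat = -5.668

# Intercept: expected mean rate when Age = 1 month, since log(1) = 0
mean_rate_at_one_month = b0_hat

# Slope: expected change in mean rate when Age is multiplied by C is b1 * log(C)
change_when_age_doubles = b1_hat * math.log(2)
change_when_age_up_10pct = b1_hat * math.log(1.1)

print(f"mean rate at Age = 1 month: {mean_rate_at_one_month:.3f} breaths per minute")
print(f"change when Age doubles:    {change_when_age_doubles:.3f} breaths per minute")
print(f"change when Age rises 10%:  {change_when_age_up_10pct:.3f} breaths per minute")
```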
Recall the goal of the analysis:
In order to determine what indicates a “high” rate, we first want to understand the relationship between a child’s age and their respiratory rate.
Which is the preferred metric to compare the models - \(R^2\) or RMSE?
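One consideration is that RMSE carries the units of the response, while \(R^2\) is unitless. The sketch below illustrates this with hypothetical data (not the respiratory data) by fitting the same simple linear regression to a response measured in two different units. The helper functions are written for this example, and RMSE is computed here as \(\sqrt{SSE/n}\).

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def r2_and_rmse(x, y):
    """R-squared and RMSE (sqrt of SSE / n) for the least-squares line."""
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst, math.sqrt(sse / len(y))

# Hypothetical data for illustration only
age = [1, 3, 6, 9, 12, 18, 24, 30, 36]
rate_per_min = [52, 47, 44, 40, 37, 33, 30, 28, 26]
rate_per_hour = [60 * r for r in rate_per_min]  # same response, different units

r2_min, rmse_min = r2_and_rmse(age, rate_per_min)
r2_hour, rmse_hour = r2_and_rmse(age, rate_per_hour)

print(f"per minute: R^2 = {r2_min:.3f}, RMSE = {rmse_min:.3f}")
print(f"per hour:   R^2 = {r2_hour:.3f}, RMSE = {rmse_hour:.3f}")
```

Changing the units of the response rescales RMSE but leaves \(R^2\) unchanged. In the same way, the RMSE from a model for log(Rate) is in log units, so it is not on the same scale as the RMSE from a model for Rate.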
Rate vs. Age | log(Rate) vs. Age | Rate vs. log(Age) |
---|---|---|
0.549 | 0.596 | 0.559 |
Which model would you choose?
See Log Transformations in Linear Regression for more details about interpreting regression models with log-transformed variables.