Prof. Maria Tackett
Oct 12, 2022
Log transformation on the response variable
Log transformation on the predictor variable
A high respiratory rate can potentially indicate a respiratory infection in children. In order to determine what indicates a “high” rate, we first want to understand the relationship between a child’s age and their respiratory rate.
The data contain the respiratory rate for 618 children ages 15 days to 3 years. The data were obtained from the Sleuth3 R package and are originally from a 1994 publication, “Reference Values for Respiratory Rate in the First 3 Years of Life”.
Variables:
Age: age in months
Rate: respiratory rate (breaths per minute)
What do you notice in this plot?
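$$\log(Y) = \beta_0 + \beta_1 X + \epsilon, \quad \epsilon \sim N(0, \sigma^2_\epsilon)$$
$$\widehat{\log(Y)} = \hat{\beta}_0 + \hat{\beta}_1 X$$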
We want to interpret the model in terms of the original variable Y, not log(Y), so we need to write the model in terms of Y
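$$\hat{Y} = \exp\{\hat{\beta}_0 + \hat{\beta}_1 X\} = \exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 X\}$$
$$\widehat{\text{Median}}(Y|X) = \exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 X\}$$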
Intercept: When $X = 0$, the median of $Y$ is expected to be $\exp\{\hat{\beta}_0\}$.
Slope: For every one unit increase in $X$, the median of $Y$ is expected to multiply by a factor of $\exp\{\hat{\beta}_1\}$.
Why is the interpretation in terms of a multiplicative change?
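One way to see this is to compare the predicted medians at $X$ and $X + 1$ using the fitted equation. This is a short sketch using only the symbols already defined in these slides:

$$\frac{\widehat{\text{Median}}(Y|X+1)}{\widehat{\text{Median}}(Y|X)} = \frac{\exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 (X+1)\}}{\exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 X\}} = \exp\{\hat{\beta}_1\}$$

The ratio does not depend on $X$, so a one unit increase in $X$ always scales the predicted median by the same factor. An additive change on the log scale becomes a multiplicative change on the original scale.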
Suppose we have a set of values
Note: $\overline{\log(x)} \neq \log(\bar{x})$
Note: $\text{Median}(\log(x)) = \log(\text{Median}(x))$
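The two notes can be checked numerically. The sketch below uses a small set of hypothetical values (chosen for illustration, not taken from the respiratory data) and only the Python standard library.

```python
import math
import statistics

# Hypothetical, right-skewed values chosen for illustration only.
x = [1, 2, 4, 8, 100]
log_x = [math.log(v) for v in x]

# Mean: taking logs and averaging do not commute.
mean_of_logs = statistics.mean(log_x)
log_of_mean = math.log(statistics.mean(x))

# Median: log is increasing, so it preserves the ordering of the values
# and the middle value stays the middle value.
median_of_logs = statistics.median(log_x)
log_of_median = math.log(statistics.median(x))

print("mean of logs:  ", mean_of_logs, " log of mean:  ", log_of_mean)
print("median of logs:", median_of_logs, " log of median:", log_of_median)
```

With an odd number of values, as here, the two median quantities agree exactly. With an even number of values the median averages the two middle values, so the agreement is only approximate.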
$\overline{\log(x)} \neq \log(\bar{x})$
Recall that in a model with an untransformed response, $\beta_0 + \beta_1 X$ is the mean value of the response at a given value of the predictor $X$. This doesn’t hold when we log-transform the response variable.
Mathematically, the mean of the logged values is not necessarily equal to the log of the mean value. Therefore, at a given value of $X$,
$$
\exp\{\text{Mean}(\log(Y|X))\} \neq \text{Mean}(Y|X) \Rightarrow \exp\{\beta_0 + \beta_1 X\} \neq \text{Mean}(Y|X)
$$
$$
\exp\{\text{Median}(\log(Y|X))\} = \text{Median}(Y|X)
$$
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 3.831 | 0.015 | 259.086 | 0 |
Age | -0.018 | 0.001 | -21.243 | 0 |
Interpret the slope and intercept in the context of the data.
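As a starting point for the interpretation, the estimates can be back-transformed to the original scale. This is a minimal sketch that copies the two estimates from the table; the variable names are illustrative, and it assumes the model was fit with the natural log of Rate as the response.

```python
import math

# Estimates copied from the table above
# (response: log(Rate), predictor: Age in months; natural log assumed).
b0_hat = 3.831
b1_hat = -0.018

# Predicted median respiratory rate when Age = 0.
median_rate_at_age_0 = math.exp(b0_hat)

# Factor by which the predicted median rate is multiplied
# for each additional month of age.
monthly_factor = math.exp(b1_hat)
percent_change_per_month = (monthly_factor - 1) * 100

print(round(median_rate_at_age_0, 2))      # about 46.11 breaths per minute
print(round(monthly_factor, 3))            # about 0.982
print(round(percent_change_per_month, 2))  # about -1.78 percent
```

Since the youngest children in the data are 15 days old, the intercept describes an age slightly outside the observed range.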
Try a transformation on X if the scatterplot shows some curvature but the variance is constant for all values of X
Suppose we have the following regression equation:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \log(X)$$
Intercept: When $X = 1$ (so $\log(X) = 0$), the mean of $Y$ is expected to be $\hat{\beta}_0$.
Slope: When $X$ is multiplied by a factor of $C$, the mean of $Y$ is expected to change by $\hat{\beta}_1 \log(C)$ units.
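The slope interpretation follows from comparing the predictions at $CX$ and at $X$. A short sketch using the fitted equation:

$$\hat{\beta}_0 + \hat{\beta}_1 \log(CX) - \left(\hat{\beta}_0 + \hat{\beta}_1 \log(X)\right) = \hat{\beta}_1 \left(\log(C) + \log(X) - \log(X)\right) = \hat{\beta}_1 \log(C)$$

The difference does not depend on the starting value of $X$, only on the multiplier $C$.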
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 49.397 | 0.755 | 65.436 | 0 |
log_age | -5.668 | 0.311 | -18.248 | 0 |
Interpret the slope and intercept in the context of the data.
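To make the interpretation concrete, the sketch below plugs the table estimates into the fitted equation. The function name and the choice of $C = 2$ (doubling age) are illustrative, and the code assumes `log_age` is the natural log of age in months.

```python
import math

# Estimates copied from the table above
# (response: Rate, predictor: log_age; natural log of age in months assumed).
b0_hat = 49.397
b1_hat = -5.668

def predicted_mean_rate(age_months):
    """Predicted mean respiratory rate (breaths per minute) at a given age in months."""
    return b0_hat + b1_hat * math.log(age_months)

# Intercept: predicted mean rate at Age = 1 month, where log(Age) = 0.
rate_at_1_month = predicted_mean_rate(1)

# Slope: expected change in mean rate when age doubles (C = 2).
change_when_age_doubles = b1_hat * math.log(2)

print(round(rate_at_1_month, 3))          # 49.397
print(round(change_when_age_doubles, 2))  # about -3.93 breaths per minute
```

The change is the same whether age goes from 3 to 6 months or from 12 to 24 months, because only the multiplier matters.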
Recall the goal of the analysis:
In order to determine what indicates a “high” rate, we first want to understand the relationship between a child’s age and their respiratory rate.
Which is the preferred metric to compare the models: $R^2$ or RMSE?
Rate vs. Age | log(Rate) vs. Age | Rate vs. log(Age) |
---|---|---|
0.549 | 0.596 | 0.559 |
Which model would you choose?
See Log Transformations in Linear Regression for more details about interpreting regression models with log-transformed variables.