data(mtcars)
$hp_z <- scale(mtcars$hp) mtcars
Scaling variables in R
Herein I present a walkthrough of a recent improvement I made to my analyses when scaling (i.e., normalizing) variables.
Scaling or normalizing variables in R is super convenient by using the scale(x, center = TRUE, scale = TRUE)
function and default arguments. Here, the variable x
is mean centered and scaled to unit variance, also called a Z-score.
Let’s do this on the mtcars
dataset variable hp
:
Here’s proof that the newly scaled hp_z
variable was indeed scaled:
library(tidyverse)
%>%
mtcars summarise(across(contains("hp"), list(mean = mean, sd = sd))) %>%
round(2)
hp_mean hp_sd hp_z_mean hp_z_sd
1 146.69 68.56 0 1
As we can see, the M=0 and Standard Deviation(SD)=1 for hp_z
.
The scale()
function, however, does something really useful that I used to ignore for the longest time; it saves the scaling factors as attributes in the column/variable. This is seen in a few ways:
class(mtcars$hp_z)
[1] "matrix" "array"
$hp_z mtcars
[,1]
[1,] -0.53509284
[2,] -0.53509284
[3,] -0.78304046
[4,] -0.53509284
[5,] 0.41294217
[6,] -0.60801861
[7,] 1.43390296
[8,] -1.23518023
[9,] -0.75387015
[10,] -0.34548584
[11,] -0.34548584
[12,] 0.48586794
[13,] 0.48586794
[14,] 0.48586794
[15,] 0.85049680
[16,] 0.99634834
[17,] 1.21512565
[18,] -1.17683962
[19,] -1.38103178
[20,] -1.19142477
[21,] -0.72469984
[22,] 0.04831332
[23,] 0.04831332
[24,] 1.43390296
[25,] 0.41294217
[26,] -1.17683962
[27,] -0.81221077
[28,] -0.49133738
[29,] 1.71102089
[30,] 0.41294217
[31,] 2.74656682
[32,] -0.54967799
attr(,"scaled:center")
[1] 146.6875
attr(,"scaled:scale")
[1] 68.56287
The attribute for the mean-centering (i.e., the mean) can be accessed like this:
attr(mtcars$hp_z, "scaled:center")
[1] 146.6875
And the attribute for the scaling factor (i.e., the SD) can be accessed like this:
attr(mtcars$hp_z, "scaled:scale")
[1] 68.56287
In sum, the output of scale()
is a "matrix" "array"
with the centering and scaling factors, if requested, attached as attributes. Notice how they match the M and SD of the original mtcars$hp
variable above.
The problem
However, there are a few issues with using scale()
. The "matrix" "array"
output is little annoying because it prints differently:
as_tibble(select(mtcars, mpg, contains("hp")), rownames = "car")
# A tibble: 32 × 4
car mpg hp hp_z[,1]
<chr> <dbl> <dbl> <dbl>
1 Mazda RX4 21 110 -0.535
2 Mazda RX4 Wag 21 110 -0.535
3 Datsun 710 22.8 93 -0.783
4 Hornet 4 Drive 21.4 110 -0.535
5 Hornet Sportabout 18.7 175 0.413
6 Valiant 18.1 105 -0.608
7 Duster 360 14.3 245 1.43
8 Merc 240D 24.4 62 -1.24
9 Merc 230 22.8 95 -0.754
10 Merc 280 19.2 123 -0.345
# ℹ 22 more rows
and it does not play nicely with predict()
:
<- lm(mpg ~ 1 + hp_z, data = mtcars)
mod <- data.frame(hp_z = seq(-2, 2, length.out = 100))
new_data predict(mod, newdata = new_data, type = "response", se.fit = TRUE)
Error: variable 'hp_z' was fitted with type "nmatrix.1" but type "numeric" was supplied
In the past, I would normally strip these features by wrapping the call with as.numeric()
:
$hp_zn <- as.numeric(scale(mtcars$hp)) mtcars
Although this will work, you lose the awesome feature of having access to the centering/scaling attributes.
The solution
I found a nice elegant solution to this issue via this Stack Overflow post.
The drop()
function will drop the formatting of scale()
without the loss of attributes:
$hp_z <- drop(scale(mtcars$hp)) # using drop
mtcarsclass(mtcars$hp_z)
[1] "numeric"
as_tibble(select(mtcars, mpg, contains("hp")), rownames = "car")
# A tibble: 32 × 5
car mpg hp hp_z hp_zn
<chr> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 110 -0.535 -0.535
2 Mazda RX4 Wag 21 110 -0.535 -0.535
3 Datsun 710 22.8 93 -0.783 -0.783
4 Hornet 4 Drive 21.4 110 -0.535 -0.535
5 Hornet Sportabout 18.7 175 0.413 0.413
6 Valiant 18.1 105 -0.608 -0.608
7 Duster 360 14.3 245 1.43 1.43
8 Merc 240D 24.4 62 -1.24 -1.24
9 Merc 230 22.8 95 -0.754 -0.754
10 Merc 280 19.2 123 -0.345 -0.345
# ℹ 22 more rows
Using drop()
no longer causes an issue with predict()
:
<- lm(mpg ~ 1 + hp_z, data = mtcars)
mod <- data.frame(hp_z = seq(-2, 2, length.out = 100))
new_data <- predict(mod, newdata = new_data, type = "response", se.fit = TRUE)
pred <- cbind(new_data, pred) new_data
# plots
ggplot(new_data, aes(hp_z, fit)) +
geom_line () +
geom_ribbon(aes(ymin = fit-se.fit, ymax = fit+se.fit), alpha = 1/3) +
labs(x = "HP (in SD units)", y = "Predicted MPG", caption = "SE shading.") +
scale_x_continuous(breaks = seq(-2, 2, .5)) +
theme_bw()
But what if we want to represent the x-axis in its original units? No problem! Just use the attributes from the scaled column to convert back to the original unstandardized units:
<- attr(mtcars$hp_z, "scaled:center")
m <- attr(mtcars$hp_z, "scaled:scale")
sd $hp_o <- new_data$hp_z * sd + m
new_data
# plots
ggplot(new_data, aes(hp_o, fit)) +
geom_line () +
geom_ribbon(aes(ymin = fit-se.fit, ymax = fit+se.fit), alpha = 1/3) +
labs(x = "HP (in original units)", y = "Predicted MPG", caption = "SE shading.") +
theme_bw()
Thank you drop()
!
Summary
When scaling variables in R via scale()
, be sure to use the drop()
function to strip away the dimensions of scale()
’s output. This new variable/column will play nicely with predict()
and retain the center and scale attributes for later use.