Have you ever thought about centering your variables before running a regression-based analysis? From my personal experience, chances are high that you haven’t: Although a useful data transformation procedure for many statistical analyses, centering is seldom taught in fundamental statistics courses. Although the mathematical principles behind it may seem arbitrary, its implications are oftentimes quite strong. Although centering is already useful in standard statistical modeling (e.g., OLS regression), its usefulness is particularly evident in multilevel analyses. Here, data is clustered within two or more levels, and it should be carefully investigated what type of centering procedure is most appropriate. Although great papers on this topic do exist^{1}, I felt that there is not enough hands-on R code and corresponding documentation of how to apply it to actual multilevel problems. In the following, I will hence discuss different centering approaches and how they affect the results in multilevel models while providing example code.

Let us first review what centering actually means. It generally refers to establishing a meaningful zero point on scales that otherwise lack such a value^{2}. In communication science, for example, we often measure some sort of psychological concept (e.g., attitudes, concerns, feelings…) on rather arbitrary metrics (e.g., 1 = do not at all agree to 5 = agree fully and completely). The range of a mean index computed from such variables only includes values between 1 and 5. Why is this problematic? Let us consider a basic linear regression example.

### Centering in linear regression models

Let us quickly simulate two correlated variables which we can use to exemplify a classical centering approach. We use the function `rnorm()` to simulate a normally distributed variable `x` and then write a regression formula for variable `y`.

```
library(tidyverse)  # for data wrangling
library(multilevel) # for simulating multilevel data
library(lme4)       # for estimating multilevel models
library(texreg)     # for better outputs
select <- dplyr::select # to avoid clashes with the MASS package
set.seed(1000)
x <- rnorm(100, 2, 0.5)
y <- 3 + 3*x + rnorm(100, 2, .65)
d <- cbind(x, y) %>%
  as_tibble
summary(d)
```

```
##        x                y         
##  Min.   :0.8388   Min.   : 7.363  
##  1st Qu.:1.7204   1st Qu.:10.052  
##  Median :2.0206   Median :11.000  
##  Mean   :2.0082   Mean   :11.090  
##  3rd Qu.:2.2858   3rd Qu.:12.068  
##  Max.   :3.3350   Max.   :15.375  
```

As we can see, both variables only take values above 0. Let us now estimate a simple linear regression model.

```
# predicting y with the raw variable x
lm1 <- lm(y ~ x, d)
# Results
lm1$coef %>%
  round(2)
```

```
## (Intercept)           x 
##        5.20        2.93 
```

We get two coefficients. Whereas the second parameter – the slope – can be interpreted easily (a 1 unit increase in `x` results in a 2.93 unit increase in `y`), the first parameter – the intercept – remains somewhat mysterious. For now, let’s assume that `x` represents attitude towards environmental protection on a scale from 1 to 5 (with higher values indicating a more positive attitude towards environmental protection) and `y` represents the frequency of taking public transport per month on a scale from 1 to 15. In this case, the intercept could be interpreted as follows: Participants with a value of `x` = 0 would take the bus 5.20 times per month. The problem is obvious: such a person does not exist in the sample. A value of 0 on `x` is not meaningful if the scale ranges from 1 to 5.

This is where centering comes into play. In this case, we could simply center around the population’s mean. This can be done by subtracting the population’s mean from each person’s individual value. Such a procedure does not change the variable per se, and most importantly not the variable’s relation to `y`, yet the values do differ.

```
# centering by subtracting mean
d$x.c <- d$x - mean(d$x)
d[, c("x", "x.c")]
```

```
## # A tibble: 100 x 2
## x x.c
## <dbl> <dbl>
## 1 1.78 -0.231
## 2 1.40 -0.611
## 3 2.02 0.0124
## 4 2.32 0.312
## 5 1.61 -0.401
## 6 1.81 -0.201
## 7 1.76 -0.246
## 8 2.36 0.352
## 9 1.99 -0.0174
## 10 1.31 -0.695
## # … with 90 more rows
```

```
# plot distributions
d %>%
  gather(Variable, val, c("x", "x.c")) %>%
  ggplot(aes(x = val, group = Variable, fill = Variable)) +
  geom_density(alpha = .4) +
  geom_vline(xintercept = mean(d$x),
             linetype = "dashed") +
  geom_vline(xintercept = mean(d$x.c),
             linetype = "dashed") +
  xlim(-1.5, 4) +
  labs(x = "", y = "Density") +
  theme_classic()
```

`ggsave("plot.png", height = 3.4, width = 7.5)`

In a next step, let us estimate the same linear regression model with the mean-centered `x.c` variable and compare the results with the model estimated above.

```
# predicting y with the mean-centered variable x.c
lm2 <- lm(y ~ x.c, d)
# Comparing result from lm1 and lm2
cbind(uncentered = lm1$coef,
      centered = lm2$coef) %>%
  round(2)
```

```
##             uncentered centered
## (Intercept)       5.20    11.09
## x                 2.93     2.93
```

As we can see, the relationship (slope) between both variables remains the same (which of course it should, because we don’t want to mess with that relationship). But the intercept changed! It is still the value of `y` when `x` equals 0. But 0 in the second model equals the population’s mean. We can thus interpret: Participants with an average value of `x` (e.g., an average attitude toward environmental protection) have a value of 11.09 on the `y` variable (e.g., the frequency of using public transport). This is now a meaningful value – a value that provides useful information about the central tendency of the studied sample.

Centering around the population’s mean is the most frequently used centering approach. It becomes particularly important when we include interaction terms in classical linear regression models. When we conduct such moderation analyses, we receive conditional effects that (who would have guessed) represent the relationship between the variables for those participants who score zero on the moderator variable. It thus makes sense to center the independent variables (including the moderator) to get more meaningful coefficients.
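As a minimal sketch, this could look as follows. Note that the moderator `z` is purely hypothetical here – it is not part of the data simulated above and is only created for illustration:

```
# Hypothetical sketch: centering predictors before a moderation analysis.
# z is an invented moderator; x.c is the mean-centered x from above.
set.seed(42)
d$z.c <- rnorm(100, 3, 1) %>% {. - mean(.)} # simulate and mean-center z
lm3 <- lm(y ~ x.c * z.c, d)
# The coefficient for x.c now refers to the effect of x for participants
# at the *mean* of z (and vice versa), instead of at an impossible z = 0.
coef(lm3)
```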

Centering thus helps us to establish meaningful zero points, which, in turn, affects our regression output. It is important to note that centering is not only meaningful for scales that measure some abstract construct on arbitrary metrics. When naturally continuous variables (e.g., age or TV use duration in minutes) are included in a regression model, we likewise get intercepts (or conditional effects in moderation analyses) that are bound to their original zero value (e.g., participants whose age is 0 years or who watch TV 0 minutes per day). Also bear in mind that we do not have to center around a population’s mean (although this approach is quite common). We could similarly center around the minimum or the maximum, one standard deviation below or above the population’s mean, and so on. The appropriate centering strategy again depends on the question of interest.
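To sketch two such alternative zero points with the simulated `x` from above (the new variable names are just for illustration):

```
# Alternative centering choices; the slope is unaffected,
# only the meaning of the intercept changes.
d$x.min <- d$x - min(d$x)              # 0 = sample minimum
d$x.sd  <- d$x - (mean(d$x) + sd(d$x)) # 0 = one SD above the mean
cbind(minimum  = lm(y ~ x.min, d)$coef,
      sd.above = lm(y ~ x.sd, d)$coef) %>%
  round(2)
```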

### Centering in multilevel analyses

Although mean-centering is pretty straightforward in simple linear regression models with non-hierarchical data, it becomes a bit more complex when we deal with clustered data and want to estimate multilevel models. For the following example, let us assume we conducted an experience sampling study in which 100 participants answered 10 situational questionnaires (e.g., over the course of 5 days). Repeated measurements (level 1) are hence nested in persons (level 2). Of course, we could also think of other multilevel structures (e.g., pupils in schools, articles in newspapers, citizens in countries, diaries of persons…).

Let us further assume that we want to estimate the relationship between two level 1 variables that were assessed in each situational questionnaire (the relationship will be simulated as b = .50). As the assumption of independent observations is violated because situational measures within each person correlate more strongly than situational measures between persons (expressed in an intraclass correlation coefficient of .30), we need to estimate a multilevel model. To simulate the data, we can use the package `multilevel` and the function `sim.icc()`.

```
# Simulation of multilevel data with two level 1 variables
set.seed(15324)
d <- sim.icc(gsize = 10,
             ngrp = 100,
             icc1 = .30,
             nitems = 2,
             item.cor = .50)
# Renaming variables
d <- d %>%
  rename(id = GRP,
         dv = VAR1,
         iv = VAR2) %>%
  as_tibble
d
```

```
## # A tibble: 1,000 x 3
## id dv iv
## <dbl> <dbl> <dbl>
## 1 1 -2.62 -1.99
## 2 1 -4.47 -2.66
## 3 1 -2.63 -1.85
## 4 1 -2.09 -1.98
## 5 1 -1.48 -1.50
## 6 1 -1.38 -1.97
## 7 1 -2.33 -0.633
## 8 1 -2.53 -2.02
## 9 1 -0.693 -0.691
## 10 1 -1.49 -0.459
## # … with 990 more rows
```

The resulting data set is organized in long format: Each row represents one situational assessment. The `id` variable shows us which situational assessments belong to which person. We can quickly check how well the simulation worked by estimating the null model and using its variance parameters to compute the ICC.^{3}

```
# estimating the null model using lme4
m0 <- lmer(dv ~ 1 + (1|id), d)
# Function to compute the intraclass correlation coefficient (ICC)
compute_icc <- function(lmer_object){
  var_dat <- lmer_object %>% VarCorr %>% as.data.frame
  icc <- var_dat$vcov[1] / (var_dat$vcov[1] + var_dat$vcov[2])
  return(icc)
}
compute_icc(m0) %>%
  round(2)
```

`## [1] 0.32`

The ICC of the dependent variable is .32 and thus very close to the value that we wanted to simulate. It means that about 32% of the variance in this variable is explainable by interindividual differences (i.e., person-related characteristics). This also means that a larger part of the variance is potentially explainable by situational characteristics (e.g., by other variables measured on level 1).

If we had variables on level 2, we could center them around the population mean (just as we did earlier in the linear regression example). Level 1 variables can be centered in two ways:^{4}

- Centering around the grand mean (grand mean centering = GMC)
- Centering around the person’s mean (also known as centering within clusters = CWC)

In the first case, we ignore that the situational assessments are clustered within persons. We hence just take the overall mean of all situations, regardless of any interindividual differences. In the second case, we center around each person’s mean. Bear in mind that we have to create each person’s mean of the situational variable in order to subtract it from the raw variable and thereby create the person-mean centered variable. It is useful to keep the person’s mean as an additional variable, because we will need it later. Creating a person’s mean is also known as aggregation: We take a level 1 variable and aggregate it to the next higher level (in this case, the person level or level 2). As this new variable is essentially a level 2 variable, we can again center it around the grand mean.

```
d <- d %>%
  # Grand mean centering (GMC)
  mutate(iv.gmc = iv - mean(iv)) %>%
  # Person mean centering (more generally, centering within cluster)
  group_by(id) %>%
  mutate(iv.cm = mean(iv),
         iv.cwc = iv - iv.cm) %>%
  ungroup %>%
  # Grand mean centering of the aggregated variable
  mutate(iv.cmc = iv.cm - mean(iv.cm))
# Comparing the results of the centering approaches
d %>%
  select(iv, iv.gmc, iv.cm,
         iv.cwc, iv.cmc) %>%
  summary
```

```
## iv iv.gmc iv.cm iv.cwc
## Min. :-4.11272 Min. :-4.09607 Min. :-1.57442 Min. :-3.08968
## 1st Qu.:-0.80006 1st Qu.:-0.78342 1st Qu.:-0.51000 1st Qu.:-0.61803
## Median :-0.03193 Median :-0.01529 Median : 0.04509 Median :-0.02651
## Mean :-0.01665 Mean : 0.00000 Mean :-0.01665 Mean : 0.00000
## 3rd Qu.: 0.79000 3rd Qu.: 0.80665 3rd Qu.: 0.47743 3rd Qu.: 0.64295
## Max. : 3.78723 Max. : 3.80388 Max. : 1.48612 Max. : 2.68093
## iv.cmc
## Min. :-1.55777
## 1st Qu.:-0.49336
## Median : 0.06173
## Mean : 0.00000
## 3rd Qu.: 0.49407
## Max. : 1.50276
```

First, we can see that both GMC (`iv.gmc`) and CWC (`iv.cwc`) can be used to establish a meaningful zero point. However, bear in mind that these zeros differ: In the GMC variable, the value zero represents the overall mean across all situations. In the CWC variable, zero represents the average value across one person’s situations. It is important to understand that the variable representing the persons’ mean (`iv.cm`) and the person-mean centered variable (`iv.cwc`) represent two parts of the variance that was originally contained in the raw variable (`iv`). The first is the trait or stable component that stays the same across all situations. It differs between persons but not between situations. The second represents situational deviations from this person mean across the 10 situations. We have thus separated the variable `iv` into level 2 and level 1 variance. The grand-mean centered person mean (`iv.cmc`) is qualitatively similar to `iv.cm` but is again centered around the population’s mean.

Let us now use these newly created variables and estimate several multilevel models.

```
m1.nc  <- lmer(dv ~ 1 + iv + (1|id), d)              # no centering
m1.gmc <- lmer(dv ~ 1 + iv.gmc + (1|id), d)          # grand mean centering
m1.cwc <- lmer(dv ~ 1 + iv.cwc + (1|id), d)          # centering within cluster
m1.cmc <- lmer(dv ~ 1 + iv.cmc + iv.cwc + (1|id), d) # including the person's mean (cluster's mean)
# Create a nice output (use screenreg() to get a readable output in the console)
screenreg(list(m1.nc, m1.gmc, m1.cwc, m1.cmc),
          single.row = TRUE,
          stars = numeric(0),
          caption = "",
          custom.note = "Model 1 = Uncentered predictor,
          Model 2 = grand-mean centered predictor,
          Model 3 = person-mean centered predictor,
          Model 4 = person-mean centered predictor and centered person mean")
```

```
##
## ==========================================================================
## Model 1 Model 2 Model 3 Model 4
## --------------------------------------------------------------------------
## (Intercept) 0.07 (0.04) 0.06 (0.04) 0.06 (0.07) 0.06 (0.04)
## iv 0.57 (0.03)
## iv.gmc 0.57 (0.03)
## iv.cwc 0.50 (0.03) 0.50 (0.03)
## iv.cmc 0.90 (0.05)
## --------------------------------------------------------------------------
## AIC 2616.85 2616.85 2719.30 2578.19
## BIC 2636.48 2636.48 2738.93 2602.73
## Log Likelihood -1304.43 -1304.43 -1355.65 -1284.09
## Num. obs. 1000 1000 1000 1000
## Num. groups: id 100 100 100 100
## Var: id (Intercept) 0.11 0.11 0.47 0.05
## Var: Residual 0.72 0.72 0.71 0.71
## ==========================================================================
## Note: Model 1 = Uncentered predictor,
## Model 2 = grand-mean centered predictor,
## Model 3 = person-mean centered predictor,
## Model 4 = person-mean centered predictor and centered person mean
```

Several things in the result table are noteworthy: First, Models 1 and 2 differ only with regard to the intercept coefficients. Due to the GMC, the intercept in Model 2 can be interpreted more meaningfully. It is the value of the dependent variable when the independent variable equals the mean across all situations (neglecting its clustered nature). The slope is unfortunately pretty hard to interpret in both models: It represents a combination of both trait (level 2) and state (level 1) influences of the independent variable on both variance components of the dependent variable. Furthermore, the model fit indices (e.g., AIC, BIC) show that both models are identical.

Second, we can observe that using only the CWC variable (Model 3) changes the coefficients and also the model fit. The intercept now represents the average value of `dv` when `iv` equals the person’s average value on `iv`. The slope coefficient represents the situational effect of `iv` on the level 1 variance of `dv`. Hence, a positive 1 unit deviation from the person’s mean results in a 0.50 increase in the dependent variable `dv`. It is important to bear in mind that we quantify only the situational effects in this model. It is hence not surprising that the model fits a bit worse than the other two models.

Third, we can also include both components of `iv` in one model (Model 4). In this case, we get two coefficients which represent both trait and state influences of `iv` on `dv`. Interestingly, this model (in which both variance components are neatly separated) fits better than Models 1 and 2. Bear in mind that, once again, the centering also helped to make the intercept interpretable. In this last model, the intercept represents the value of `dv` when both the trait and the situational deviation from this trait are 0.

Finally, we can look at the amount of explained variance for each model.

```
# Variance reduction approach according to Hox, 2010, pp. 69-78
var_reduction <- function(m0, m1){
  VarCorr(m0) %>%
    as.data.frame %>%
    select(grp, var_m0 = vcov) %>%
    left_join(VarCorr(m1) %>%
                as.data.frame %>%
                select(grp, var_m1 = vcov),
              by = "grp") %>%
    mutate(var_red = 1 - var_m1 / var_m0)
}
# Create comparison table
cbind(M1 = var_reduction(m0, m1.nc)[,4],
      M2 = var_reduction(m0, m1.gmc)[,4],
      M3 = var_reduction(m0, m1.cwc)[,4],
      M4 = var_reduction(m0, m1.cmc)[,4]) %>%
  round(2)
```

```
## M1 M2 M3 M4
## [1,] 0.76 0.76 -0.05 0.88
## [2,] 0.24 0.24 0.25 0.25
```

The table shows the explained variance on level 2 (first row) and on level 1 (second row). As we can see, the first model – which included the raw variable – explains 76% on the person level and 24% on the situational level. The second model in which we used the GMC variable yields similar results. This is not surprising because we did not change the variable fundamentally, but only centered it around a new value (the mean).

The third model now yields different results. We only included the CWC variable, which represents situational deviations from the person’s mean. It should hence not be too surprising that it explains only variance on level 1. The negative explained variance on level 2 should not be too alarming, as this is a known phenomenon in multilevel analyses^{5}.

Finally, the last model – which includes both the grand-mean centered person’s mean (the aggregated variable `iv.cmc`) and the situational deviations (`iv.cwc`) – of course explains variance on both levels. Interestingly, however, it is able to explain more variance than Models 1 and 2. This is due to the better model fit already identified by looking at the AIC and BIC, which results from the clean separation of both variance components and thus the reduction of covariance. This further shows the usefulness of this last approach.

### Conclusion

If we are interested in relationships between level 1 variables, it makes sense to separate the dependent variable into level 1 and level 2 variance. We can do so by first aggregating the variable to the second level (i.e., estimating the cluster’s mean) and second, by using this aggregated variable to produce a cluster mean centered variable (i.e., subtracting the cluster mean from the original variable) to produce the level 1 deviations from the cluster’s mean. This way, we can investigate both the influence of level 2 and level 1 variance (e.g., trait and state variance in experience sampling studies, school and pupil variance in school surveys, person and wave variance in panel studies, and so on) on the dependent variable and thereby acquire more precise estimates of the multilevel relationship.

*Note: Post was slightly edited on the 03-02-2021 to account for smaller mistakes and some comments made by readers (see comments). *

### References

1. e.g., Enders & Tofighi, 2007
2. e.g., Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models: A new look at an old issue. *Psychological Methods, 12*(2), 121-138. http://psycnet.apa.org/record/2007-07830-001; Aiken, L. S., & West, S. G. (1991). *Multiple regression: Testing and interpreting interactions.* Newbury Park, CA: Sage.
3. See e.g., Hox, J. J. (2010). *Multilevel analysis: Techniques and applications* (2nd ed.). Quantitative methodology series. New York: Routledge, p. 15. A preview of the book is available here: http://joophox.net/mlbook1/preview.pdf
4. Depending on the value that we center around, we could of course distinguish many more ways. Yet, in general, we either center on the first or the second level.
5. Hox, 2010, pp. 72-73.

Hello,

I am running a binomial logistic regression model, and it turns out that one of my IVs has an extremely small coefficient in the models, such as 0.0000014, which is difficult to report in the tables. This IV has extreme min and max values and therefore a very high range. I thought of centering this variable while keeping other IVs at their original values, but I am not sure whether this is the right thing to do. Would you mind expressing your idea on this issue and maybe giving some advice?

Thank you very much in advance.

Hey Mleda,

In your particular case, I would not necessarily recommend centering. Instead, you should rescale your variable so that it is more interpretable. Let’s say you have an IV that has values from 1 to 1000 and your DV ranges from 1 to 5. Assuming a small effect size, a regression would probably reveal a very small unstandardized coefficient. Why is that so? It is because a change of 1 in the IV (e.g., 105 to 106) is not a big difference. So why not simply transform the IV so that the change that we get a coefficient for is more meaningful? In this case, we could, e.g., divide the IV by 100 and would thus get a variable ranging from 1 to 10. The resulting coefficient would then refer to a change of 100 in the IV. As long as you bear that in mind in your interpretation, everything’s fine.
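A quick toy sketch of this rescaling idea (the variable names and values are made up for the example):

```
# Toy illustration of rescaling (not centering) a wide-ranged IV
set.seed(1)
iv <- runif(500, 1, 1000)
dv <- 2 + 0.001 * iv + rnorm(500)
coef(lm(dv ~ iv))           # tiny, hard-to-report slope (around 0.001)
coef(lm(dv ~ I(iv / 100)))  # same fit; slope now refers to a 100-unit change
```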

In your particular case, I would rescale the IV and then compute odds ratios (Exp(B)). Unstandardized coefficients in logistic regressions are hard to interpret anyway.

Hope this helps! 😉

Could it be possible that you interchanged some variables at and after the summary? CMC is summarized and not mentioned, while CWC is not summarized but mentioned. Furthermore, you use the CM in model 4, but calling it the *centered* aggregated mean, which sounds like it should be CMC? Great subject anyway, I am just a bit confused.

EDIT: you named the model variables correctly. Then what’s left over of my confusion is the summary part. Where did we lose CMC on the line? I think this is a more valuable variable if one implements interactions, compared to the absolute CM.

Dear Hub, please excuse the late reply. I am sorry for your confusion. I just realized that my model naming convention (m1.nc, m1.gmc…) is indeed annoyingly confusing and unnecessary. And you’re right, I used CM (cluster mean or person mean) instead of CMC (grand-mean centered cluster mean or person mean). Although the results should not change (at least not the resulting slope), it would change the intercept. And I do agree that in many cases, using a centered person-mean makes a lot of sense.

Will update as soon as I have some time. Thanks for your comment! 😉

I spot a small error in this line “d %>% select(iv, iv.gmc, iv.cm, iv.cmc, iv.cmc) %>%”, the first “iv.cmc” should be “iv.cwc”. I guess this could explain Hub’s question a bit.

Small update: I corrected the post in this regard.

Amazing tutorial on the effects of centering, Philipp! Very clear and easy to follow. In fact, it’s probably the only tutorial I can find online that comes with clear and well-documented R code. Do you know of any other tutorials or resources (with R code) that explain the effects of centering? Your tutorial is incredible, but it’s always good to read explanations from different perspectives!

I would recommend the paper by Enders & Tofighi, 2007 (see references).

Very clear and useful explanation!! Thank You!

I have one question: When I create level 1 interaction terms, is it wrong to create the interaction term from the raw scores and then center it later around the group mean? I have multilevel data from employees nested in departments, and the two versions (centering the L1 variables before or after creating the interaction term) change the results quite a lot! I am curious what you think about this.

It really depends on what you are interested in. If it is just the level 1 variance, I would always use centering within cluster (CWC) first, and then build interaction terms.

Dear Philipp,

Thank you for such a great post, I think that’s a great explanation.

I just have one comment regarding mean centering: it is not the same in a wide-format dataset as in a long-format dataset.

Thus, I share with you my code to perform grand mean centering in a wide-format dataset (grand mean centering is computed for each wave):

```
library(dplyr)

# Grand mean centering function
gmc <- function(column) {
  column - mean(column, na.rm = TRUE)
}

# Mean centering main predictor: "predict1"
data <- data %>%
  mutate(w1_predict1_gmc = gmc(w1_predict1),
         w2_predict1_gmc = gmc(w2_predict1),
         w3_predict1_gmc = gmc(w3_predict1))

# Show new mean values together with old ones
data %>%
  select(w1_predict1, w1_predict1_gmc,
         w2_predict1, w2_predict1_gmc,
         w3_predict1, w3_predict1_gmc) %>%
  summary
```

Dear Armando,

yes, it is of course not the same. Thanks for the extra code.


Just a quick comment. why `iv.cmc` (Grand mean centering of the aggregated variable) was created but never used? What was its purpose?

Thanks for your comment. This was already mentioned earlier, and I finally got around to adjusting the post in this regard.

Short answer: No, it was not on purpose, and it does make a lot of sense to use the grand-mean centered person mean (`iv.cmc`) instead of the simple person mean (`iv.cm`). This is also referred to as double-mean centering.

It is a good article, thank you!

Thank you for sharing this article!
