I recently gave a workshop on data visualization. One of the topics was visualizing interaction effects (or “moderation” analyses). I think it is a great topic because it exemplifies quite well that there is not one solution to all problems. In general, I would argue that trying to visualize interaction effects is great idea. Due to the conditional nature of the effects obtained from standard regression analyses that include an interaction term, it is often hard to understand the direction and size of an interaction effect only based on regression results tables. In this post, I simply want to share some code that can be used to visualize a variety of different interaction effects.
Visualizing a 2×2 experimental design
# Loading packages
library(tidyverse) # includes 'ggplot2' which is probably the best package for data viz
library(ggExtra) # extends ggplot2
# Simulating the data
set.seed(100)
m1 <- sample(rep(c(0, 1), 250), 500) # Factor A
m2 <- sample(rep(c(0, 1), 250), 500) # Factor B
y <- 8 + -2*m1 + 2*m2 + 3*(m1*m2) + rnorm(500, 0, 2)
(data <- data.frame(m1 = factor(m1,
label = c("A1", "A2")),
m2 = factor(m2,
label = c("B1", "B2")),
y) %>%
as.tibble)
## # A tibble: 500 x 3 ## m1 m2 y ## <fct> <fct> <dbl> ## 1 A2 B1 3.11 ## 2 A1 B1 8.63 ## 3 A2 B1 5.31 ## 4 A1 B2 6.14 ## 5 A1 B2 10.5 ## 6 A2 B1 5.27 ## 7 A2 B2 15.9 ## 8 A1 B2 11.2 ## 9 A1 B1 6.85 ## 10 A2 B2 11.8 ## # ... with 490 more rows
As we can see, we have a dataset that resembles data obtained from a 2×2 experimental design: m1
and m2
represent the experimental manipulations A and B resulting in four conditions (A1, A2, B1, B2). In such a case, were we are interesting in the effects of both conditions on the dependent variable y
and in the effect of a potential interaction, a simple bar chart with a specific type of facetting is probably the best soluation.
ggplot(data,
aes(x = m1,
fill = m1,
y = y)) +
stat_summary(fun.y = mean,
geom = "bar") +
stat_summary(fun.data = mean_cl_normal,
geom="errorbar",
width = 0.25) +
ylim(0, 17) +
facet_wrap(~m2) +
labs(x = "Experimental Manipulations",
y = "Dependent Variable (M)") +
theme_minimal() +
scale_fill_brewer(palette = "Blues") +
theme(legend.position="none")
An alternative is a modified line plot.
# Summarizing data
data2 <- data %>%
group_by(m1, m2) %>%
summarise(y_mean = mean(y),
y_se = psych::describe(y)$se)
# Creating the plot
data2 %>%
ggplot(aes(x = m1,
y = y_mean,
color = m2)) +
geom_line(aes(group = m2)) +
geom_point() +
geom_errorbar(aes(ymin = y_mean-1.96*y_se,
ymax = y_mean+1.96*y_se),
width = .1) +
ylim(0, 17) +
labs(x = "Factor A",
color = "Factor B",
y = "Dependent Variable") +
theme_minimal() +
scale_color_brewer(palette = "Blues")
Visualizing the moderation effect of a dichotomous variable
# Simulating the data
x <- rnorm(500, 2, 3)
m <- rep(c(0,1), 250)
y <- 5 + -4*x + 2*m + 4*(x*m) + rnorm(500, 0, 5)
(data <- data.frame(x, m = factor(m), y) %>%
as.tibble)
## # A tibble: 500 x 3 ## x m y ## <dbl> <fct> <dbl> ## 1 5.29 0 -21.9 ## 2 5.54 1 9.32 ## 3 3.76 0 -6.41 ## 4 5.23 1 9.05 ## 5 5.41 0 -23.9 ## 6 4.28 1 6.91 ## 7 2.44 0 3.70 ## 8 5.20 1 7.05 ## 9 0.728 0 4.57 ## 10 0.875 1 7.83 ## # ... with 490 more rows
The moderator m
here is a factor variable. The independent variable x
and the dependent variable y
, however, are continuous.
When the relationship of interested is between to continuous variables and only the moderator is dichotomous (e.g., gender, an experimental manipulation, intervention vs. no intervention,…), we can try to show as much information as possible and visualize the relationships fully using a colored scatterplot with separate regression lines for each group.
# Setting up the building blocks
basic_plot <- ggplot(data,
aes(x = x,
y = y,
color = m)) +
theme_bw() +
labs(x = "Independent variable",
y = "Dependent variable",
color = "Moderator")
# Colored scatterplot
basic_plot +
geom_point()
# Colored scatterplot and regression lines
basic_plot +
geom_point(alpha = .3,
size = .9) +
geom_smooth(method = "lm")
ggsave("test2.png",
height = 4,
width = 5)
# Only regression lines
basic_plot +
geom_smooth(method = "lm",
se = F)
Note: Many papers only show the last plot. However, I would always prefer the second plot which shows the actual scatterplots for each groups, the regression lines and confidence bounds.
Visualizing the moderation effect of metric variable
# Simulating the data
n <- 500
x <- runif(n)
m <- rnorm(n, 0.5, 1)
y <- 2*x + -2*m + -6*(x*m) + rnorm(500, 0, 5.88)
(data <- data.frame(x, m, y) %>%
as.tibble)
## # A tibble: 500 x 3 ## x m y ## <dbl> <dbl> <dbl> ## 1 0.911 2.29 -19.5 ## 2 0.352 -0.0184 3.78 ## 3 0.409 0.754 3.61 ## 4 0.867 -0.598 4.12 ## 5 0.0416 0.629 -10.6 ## 6 0.121 2.58 -10.0 ## 7 0.265 1.43 -8.19 ## 8 0.766 0.492 -0.599 ## 9 0.405 0.410 3.19 ## 10 0.290 -0.502 7.92 ## # ... with 490 more rows
If the moderator (in this case m
) is also continuous, visualizing interaction effects becomes more complex. There are generally three ways of presenting results from such a moderation analysis:
- Estimating the regression for different values of the moderator (e.g., -SD, M, +SD) and reporting all three results in a table
- Visualizing thee regression lines for three (arbitrary) groups, also know as the “pick-a-point” approach
- Plotting the conditional effects for all values of the moderator (Conditional plot)
Pick-a-point approach
For this approach, we have to manually recode the moderator variable into three meaningful subgroups. Cutting points could be -SD, M, +SD or certain quantiles. In this case, we simply use the cut
function to create subgroups1
data$m_groups <- cut(data$m, breaks = 3) %>%
factor(., labels = c("small", "average", "large"))
table(data$m_groups)
## ## small average large ## 95 345 60
Next, we simply repeat the procedure that we already used for visualizing interaction effects with a dichotomuous moderator. We use the newly computed variable to color the points in the scatterplot and create separate regression lines.
ggplot(data,
aes(x = x,
y = y,
color = m_groups)) +
geom_point(size = .9,
alpha = .3) +
geom_smooth(method = "lm") +
theme_bw() +
scale_color_brewer(type = "qual",
palette = 3) +
labs(x = "Independent variable",
y = "Dependent variable",
color = "Moderator")
Conditional effect plot
A problem with the pick-a-point approach is, however, that we only see the conditinal effects for three (somewhat arbitrary) groups. A conditional plot which visualizes the conditional effects for all possible values of the moderator is much more valuable.
For this approach, we first need to specify the interaction model itself.
model <- lm(y ~ x + m + x:m, data)
summary(model)
## ## Call: ## lm(formula = y ~ x + m + x:m, data = data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -18.8403 -3.8552 -0.2198 3.9432 17.5529 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.2601 0.5798 -0.449 0.6539 ## x 2.3908 0.9955 2.402 0.0167 * ## m -2.1500 0.4717 -4.558 6.51e-06 *** ## x:m -6.0053 0.8369 -7.176 2.63e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 5.869 on 496 degrees of freedom ## Multiple R-squared: 0.4768, Adjusted R-squared: 0.4736 ## F-statistic: 150.6 on 3 and 496 DF, p-value: < 2.2e-16
Now, we have to create a function that uses the output of this model and creates a data frame that includes the conditional effects for several values of the moderator.
# Function that estimates the conditinal effects for certain values of the moderator
conditional.effects <- function(model, x, m, quantiles = 10){
interact = paste0(x,':',m)
beta.hat = coef(model)
covs = vcov(model)
z0 = quantile(model$model[,m], seq(0 , 1, 1/quantiles))
dy.dx = beta.hat[x] + beta.hat[interact]*z0
se.dy.dx = sqrt(covs[x, x] + z0^2*covs[interact, interact] + 2*z0*covs[x, interact])
upr = dy.dx+1.96*se.dy.dx
lwr = dy.dx-1.96*se.dy.dx
data.frame(m = z0, b = dy.dx, lwr, upr)
}
(con_effects <- conditional.effects(model,
x = "x",
m = "m",
quantiles = 500) %>%
as.tibble)
## # A tibble: 501 x 4 ## m b lwr upr ## * <dbl> <dbl> <dbl> <dbl> ## 1 -2.54 17.6 12.3 22.9 ## 2 -2.24 15.8 11.0 20.7 ## 3 -2.20 15.6 10.8 20.4 ## 4 -2.13 15.2 10.5 19.8 ## 5 -2.10 15.0 10.4 19.6 ## 6 -2.05 14.7 10.2 19.3 ## 7 -1.85 13.5 9.23 17.7 ## 8 -1.79 13.1 8.98 17.3 ## 9 -1.70 12.6 8.56 16.6 ## 10 -1.68 12.5 8.49 16.5 ## # ... with 491 more rows
The dataframe includes the values of the moderator (m
), the respective conditional effects (b
) and their 95% confidence intervals. We can now use this data frame to create a new plot.
(mod_plot <- ggplot(con_effects,
aes(x = m,
y = b,
ymin = lwr,
ymax = upr)) +
geom_smooth(stat = "identity",
color = "black") +
geom_hline(yintercept = 0, linetype = "dashed", color = "darkgrey") +
theme_bw() +
labs(x = "Moderator",
y = "Conditional effect of x on y"))
This plot now shows the conditional effects for each value of the moderator. We can see that the effect of x on y is positive for lower values of m
(the moderator) and turns negative for higher values of m
. We can also asses at which values of m
, the conditional effect of x
on y
becomes non-significant (Johnson-Neyman technique: see also https://rpubs.com/bachl/jn-plot).
I would suggest to also plot a histogram of the moderator so that one can assess how many values provide information for the information of the conditional effects.
ggMarginal(mod_plot, # function of the package `ggExtra`
margins = c("x"),
type = "histogram",
color = "white",
fill = "darkgrey",
size = 3)
Hope this is helpful to some people! 😉
Thank you so much!
This is awesome! Is there also the code for visualizing poisson interaction effects?
Whether or not you can visualize poisson effects might depend on the type of variables you use (ie, categorical or continuous) for both x and m. For two continuous variables the ‘Conditional effect plot’ model should work.
I successfully created the ‘Conditional effect plot’ visual with a glm model specified as a ‘poisson’ model, ie.:
model <-glm(y ~ x + m + x:m, family = "poisson", data = dataframe) [1]
Whether or not the model is true to the actual effects of the moderator, I am unsure; I would think the 'conditional.effects' function creates conditional effects which can be used without inferential errors for a 'poisson' specified glm model with an interaction effect similar to a non-specified linear model (lm) with an interaction effect because I believe the variables requested by the 'conditional.effects' function come from the same objects within the model, ie, coef(model). If both the glm and lm models store the requested variables in the same locations, the conditional effects (represented in the 'con_effects' data frame) should provide accurate estimations; that is, unless of course, users must 'correct' the variables requested in the 'conditional.effects' for 'poisson' specified generalized linear models. I'm curious to see if the author will respond to this point.
Assuming users do not need to correct for any mathematical or to be clear, inferential, based errors, the 'family' argument set to 'poisson' within the glm function should run your model under the 'poisson' parametric. You can use the code Philipp provides under the '# Function that estimates the conditi(o)nal effects for certain values of the moderator' with the poisson specified model to get the conditional effect model.
For more information on the different types of models you can specify within a generalized linear model (glm) I recommend:
The page provides examples of the functions with different specifications (similar to my examples ([1]), above). Just make sure to ignore the link part of the argument within the argument (As I did, and as they do in their examples! :)).
I really love the pick a point approach! However, is it possible to use the estimates of a previous estimated model? How would the code look like, if we would call the previous estimates model “model1”?
A quick word of advice: when running the conditional effects function (created under the “# Function that estimates the conditi(o)nal effects for certain values of the moderator” notation) you’ll want to be sure both your x and m variable are numerics within your original dataset. If both your x and m variables are not numerics (ie, you may run one of them ‘as a numeric’ with the ‘as.numeric’ function within your model) the function will not work. To ensure your x and m variables are numeric you can simply run:
dataframe$x <- as.numeric(dataframe$x)
dataframe$m <- as.numeric(dataframe$m)
This will ensure both your x and m values are coded as numbers within your dataset (it should also ensure you don't have to run the 'as.numeric' function within your model (ie, model <- lm(y ~ x + m + x:m)) which is helpful! 🙂