A computational description of simple mediation analysis

Simple mediation analysis is an increasingly popular statistical analysis in psychology and in other social sciences. However, there are very few detailed accounts of the computations within the model. Articles more often focus on explaining mediation analysis conceptually rather than mathematically. Thus, the purpose of the current paper is to introduce the computational modelling within simple mediation analysis, accompanied by examples in R. First, mediation analysis is described. Then, a method to simulate data in R (with standardized coefficients) is presented. Finally, the bootstrap method, the Sobel test and the Baron and Kenny test, all used to evaluate mediation (i.e., the indirect effect), are developed. The R code to implement the computations is provided, as well as a script to carry out a power analysis and a complete example.


Introduction
Mediation analysis is an increasingly popular statistical analysis in psychology and in other social sciences. It seeks to explain the (biological, psychological, cognitive, etc.) mechanism that underlies the relationship between an independent variable and a dependent variable by the inclusion of a third variable, i.e., the mediator variable. As mediation analysis becomes more and more popular, there is also an increasing body of scientific literature on the subject. However, very few articles detail the computation within the model. They more often focus on explaining mediation conceptually rather than mathematically (see, for instance, Kane & Ashbaugh, 2017). This paper adopts the latter approach to help the reader understand and apply the modelling within mediation analysis.
The purpose of the current article is to introduce the computational modelling within simple mediation analysis. It is worth noting that only simple mediation (a single mediator variable) will be presented, but that other forms of mediation (parallel, serial or moderated) may be understood by extending the presented formulas. The first part consists of the description of mediation analysis. Then, a method to simulate data (with standardized coefficients) will be presented. Finally, the bootstrap method used to evaluate mediation (i.e., the indirect effect), the Baron and Kenny test and the Sobel test will be developed. The R code to implement the computation will be presented. For the sake of simplicity and without loss of generality, the presentation will mainly focus on standardized regression coefficients. The computation to unstandardize data will also be presented. As a cautionary reminder, strong statistical analyses do not supersede a strong theoretical framework and experimental design, which are imperative when investigating a potential mediating variable. Mediation analysis is useful, but must be used properly.

Simple mediation analysis
Mediation analysis is a subset of path analysis in which the researcher is interested in the effect of the independent variable ($x$) on the dependent variable ($y$) through the mediator variable ($m$). The path diagram corresponding to a simple mediation model is presented in the top panel of Figure 1. When there is no $m$, the existing relation between $x$ and $y$ is said to be the total effect, represented by $c_{xy}$. It corresponds to the regression coefficient between $x$ and $y$. The total effect can be divided into two other effects: the direct effect ($c'$) and the indirect effect ($ab$). Deterministically, the indirect effect is the interpretation that $x$ causes variability in $m$, which then causes variability in $y$. Mathematically speaking, the indirect effect is the product of the paths between $x$ and $m$, and between $m$ and $y$ (paths $a$ and $b$ in the top panel of Figure 1). The indirect effect is the effect of interest in mediation analysis. The other effect is the direct effect, which is the relation remaining between $x$ and $y$ when the effect of $m$ has been partialled out. As such, the more correct mathematical representation of $c'$ is $c_{xy|m}$.
The Quantitative Methods for Psychology
Mediation analysis can be seen as a regression analysis carried out in two steps. The first step is to regress $m$ on $x$ to obtain the parameter $a$. Then, the second step is to regress $y$ on $x$ and $m$ to obtain $c'$ and $b$ respectively. Finally, the product $ab$ is tested to see if it is statistically different from 0, which would support a mediating effect. As should become apparent, the top panel of Figure 1, even though it is widespread, is conceptually ambiguous and can be misleading. For instance, $a$ and $c$ are simple coefficients whereas $b$ and $c'$ are partial coefficients. We will thus define each parameter in the mediation model more clearly. The bottom panel of Figure 1 depicts the mediation model with more appropriately labelled parameters. The path $a$ is more appropriately the path $a_{xm}$, which represents the correlation between $x$ and $m$. As already pointed out for the relations between $x$ and $y$, there is a total effect, $c_{xy}$, and a direct effect, $c_{xy|m}$. The parameter $b$ usually presented in mediation analysis is $b_{my|x}$, that is, the relation between $m$ and $y$ when controlling for the effect of $x$. There is also a parameter for the simple relation between $m$ and $y$, $b_{my}$, which exists but is neglected because it plays no role in the interpretation of mediation analysis. Both depend on one another through the partial correlation equation:

$$b_{my} = b_{my|x}\left(1 - a_{xm}^2\right) + a_{xm} c_{xy} \tag{1}$$
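The two regression steps described above can be sketched in R as follows. This is only a minimal illustration on toy data: the coefficients used to simulate `m` and `y`, the seed and the object names are assumptions made for the example, not values from the paper. With ordinary least squares, the total effect equals the direct effect plus the product $ab$ exactly within the sample, which the last lines make visible.

```r
# Minimal sketch: simple mediation as two regressions on toy data.
# All numeric values below are illustrative assumptions.
set.seed(1)
n <- 200
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)            # toy mediator
y <- 0.3 * x + 0.4 * m + rnorm(n)  # toy outcome

fit_m     <- lm(m ~ x)      # step 1: regress m on x to obtain a_xm
fit_y     <- lm(y ~ x + m)  # step 2: regress y on x and m to obtain c_xy|m and b_my|x
fit_total <- lm(y ~ x)      # total effect c_xy (no mediator in the model)

a       <- unname(coef(fit_m)["x"])
cp      <- unname(coef(fit_y)["x"])      # direct effect c_xy|m
b       <- unname(coef(fit_y)["m"])      # partial coefficient b_my|x
c_total <- unname(coef(fit_total)["x"])  # total effect c_xy

indirect <- a * b
# The three quantities satisfy total = direct + indirect
print(c(total = c_total, direct = cp, indirect = indirect))
```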

Generating data
Modelling of the data is presented using, as previously mentioned, standardized coefficients. Parameters can take any value between -1 and 1. In mediation analysis, there are two predictors ($x$ and $m$) and two dependent variables ($m$ and $y$). In order to generate data, we must first generate data for $X$ (capital letters represent data).
Let $X$ be a normally distributed variable with a mean of 0 and a variance of 1, $X \sim N(0, 1)$. Then $M$ is computed by

$$M = a_{xm} X + e_m \tag{3}$$

where $e_m$ is the error in $M$ (i.e., $\mathrm{var}(e_m)$ is the variance of the residual). The structural equation modelling of the mediation analysis (showing the error parameters) is presented in the bottom panel of Figure 1. The residual error has a mean of 0 and, to keep the variance of $M$ at 1, the error variance, $\mathrm{var}(e_m)$, is set to

$$\mathrm{var}(e_m) = 1 - a_{xm}^2 \tag{4}$$

Because variance is additive, to get a variance equal to 1, the variance of the other sources has to be subtracted. Finally, $Y$ is generated in the following manner

$$Y = c_{xy|m} X + b_{my|x} M + e_y \tag{5}$$

which corresponds to the second regression analysis of mediation analysis. The variance of the error term of $Y$, $\mathrm{var}(e_y)$, is computed by

$$\mathrm{var}(e_y) = 1 - \left(c_{xy|m}^2 + b_{my|x}^2 + 2\, a_{xm}\, b_{my|x}\, c_{xy|m}\right) \tag{6}$$

so that $Y$ follows a normal distribution, $Y \sim N(0, 1)$. The first two terms refer to the coefficients in equation 5 and the last one refers to the covariance between $x$ and $m$ (that is, the variance of the sum of two correlated variables is the sum of their variances plus twice their covariance; Howell, 2012). Equation 6 comes from the fact that the sum of two normally distributed correlated random variables is

$$X_1 + X_2 \sim N\left(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2 + 2\rho\sigma_1\sigma_2\right) \tag{7}$$

Listing 1 shows the code to implement the generation of data in R with $a_{xm} = .50$, $b_{my|x} = .60$, $c_{xy} = .000$.
From equation 1, we can compute $b_{my}$, which is .45. The mediation model is presented in Figure 2. The resulting variance-covariance matrix is shown in Table 1. The covariance matrix is approximately the same as a correlation matrix in this case; since the data contain some sampling error, it is only approximately the same. Given that the sample size was $10^6$, the results are highly accurate. As such, the above values are very close to the population parameters. It is worth noting that the variance of each variable is very close to 1.000, as expected from equations 3 to 6.
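The arithmetic behind this example can be checked with a few lines of R. The sketch below assumes the relations used in this section, namely that the direct effect is the total effect minus $ab$ and that each error variance is 1 minus the variance explained by the predictors; the variable names are illustrative and not taken from the listings.

```r
# Population parameters of the example (standardized)
a    <- 0.50   # a_xm
b    <- 0.60   # b_my|x
c_xy <- 0.00   # total effect

# Derived quantities (assumed relations, see the lead-in)
cp     <- c_xy - a * b                       # direct effect c_xy|m
b_my   <- b * (1 - a^2) + a * c_xy           # simple relation between m and y
var_em <- 1 - a^2                            # error variance of M
var_ey <- 1 - (cp^2 + b^2 + 2 * a * b * cp)  # error variance of Y

print(c(cp = cp, b_my = b_my, var_em = var_em, var_ey = var_ey))
```

The value of `b_my` agrees with the .45 reported in the text, and a positive `var_ey` confirms that the chosen parameters are admissible for standardized data.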
Figure 1. Illustration of mediation analysis. Top panel depicts the usual diagram describing mediation. Bottom panel shows the parameters with a more appropriate notation, which is used throughout the current paper. It depicts a mediation analysis from a structural equation modelling perspective as it includes error parameters.

To unstandardize data, if needed, the data contained in a standardized variable ($X$, $M$, or $Y$), after being computed, must be multiplied by the desired standard deviation, $\sigma$ (the square root of the variance, $\sigma^2$), and the mean, $\mu$, has to be added, such as, for the variable $x$:

$$x_{unstd} = x_{std}\,\sigma + \mu$$

in which $x_{unstd}$ represents unstandardized data and $x_{std}$ refers to standardized data. One could also use the code in Listing 1 to generate unstandardized data by specifying means and standard deviations.
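As a small illustration of this rescaling, the following R sketch standardizes a random sample and then gives it a chosen mean and standard deviation. The target values (a mean of 50 and a standard deviation of 10) and the variable names are arbitrary assumptions for the example.

```r
set.seed(2)
# scale() returns a variable whose sample mean is 0 and sample sd is 1
x_std <- as.numeric(scale(rnorm(1000)))

mu    <- 50   # desired mean (illustrative)
sigma <- 10   # desired standard deviation (illustrative)

# Multiply by the desired standard deviation, then add the desired mean
x_unstd <- x_std * sigma + mu

print(c(mean = mean(x_unstd), sd = sd(x_unstd)))
```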

Hypothesis testing
There are three ways to determine if $ab$ is statistically significant. The first is the Baron and Kenny (1986) method, which is a three-step regression analysis. The first step is to check if the relation between $x$ and $y$, that is $c_{xy}$, is significant, meaning there is a relation to be potentially explained by a mediator. The second step is to check if $a_{xm}$ is significant, that is, to test if there is a relation between the predictor and the mediator. Finally, the last step is to regress $y$ on $x$ and $m$ to obtain $b_{my|x}$ and $c_{xy|m}$. If $b_{my|x}$ is significant, then the method suggests that a mediation process occurred. If $c_{xy|m}$ is no longer significant (compared to $c_{xy}$), the mediation is said to be complete; otherwise it is deemed partial. We offer an R script to carry out the Baron and Kenny method in Listing 2. The Baron and Kenny method has fallen out of favor because of its inappropriate assumptions, mostly on whether the hierarchical steps have to be followed, and because of the rise of newer and more powerful statistical techniques (Hayes, 2013).

Listing 1 (fragment recovered from the text)

    # The function header and the definition of ey did not survive in the text.
    # They are reconstructed here as assumptions: the function name is
    # hypothetical and ey follows equation 6.
    GenerateData <- function(n, a, b, cp,
                             mean.x = 0, sd.x = 1, mean.m = 0, sd.m = 1,
                             mean.y = 0, sd.y = 1) {
      x <- rnorm(n, mean = 0, sd = 1)
      em <- sqrt(1 - a^2)                        # sd of the error in M
      m <- a * x + em * rnorm(n, mean = 0, sd = 1)
      ey <- 1 - (cp^2 + b^2 + 2 * a * cp * b)    # error variance of Y (assumed)
      ey2 <- sqrt(ey)                            # sd of the error in Y
      y <- cp * x + b * m + ey2 * rnorm(n, mean = 0, sd = 1)
      # Unstandardize the data if means and standard deviations are given
      x <- x * sd.x + mean.x
      m <- m * sd.m + mean.m
      y <- y * sd.y + mean.y
      data <- as.data.frame(cbind(x, m, y))
      return(data)
    }
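The three steps of the Baron and Kenny method can be sketched in R as below. This is not the paper's Listing 2 (which is not reproduced in the text) but a minimal illustration under stated assumptions: the function name `baron_kenny_sketch`, the column names `x`, `m` and `y`, the requirement that all three steps be significant, and the toy data are all choices made for the example.

```r
# Hypothetical helper: runs the three regressions and applies the decision rule
baron_kenny_sketch <- function(data, alpha = .05) {
  # p-value of one term in a fitted lm model
  p_of <- function(fit, term) summary(fit)$coefficients[term, "Pr(>|t|)"]

  step1 <- lm(y ~ x, data = data)      # total effect c_xy
  step2 <- lm(m ~ x, data = data)      # a_xm
  step3 <- lm(y ~ x + m, data = data)  # b_my|x and c_xy|m

  p_c  <- p_of(step1, "x")
  p_a  <- p_of(step2, "x")
  p_b  <- p_of(step3, "m")
  p_cp <- p_of(step3, "x")

  mediation <- (p_c < alpha) && (p_a < alpha) && (p_b < alpha)
  type <- if (!mediation) "none" else if (p_cp < alpha) "partial" else "complete"
  list(mediation = mediation, type = type,
       p = c(c = p_c, a = p_a, b = p_b, cp = p_cp))
}

# Toy data with a clear total, direct and indirect effect (illustrative values)
set.seed(3)
n <- 500
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.4 * x + 0.5 * m + rnorm(n)
res <- baron_kenny_sketch(data.frame(x, m, y))
print(res$type)
```

Because the first step requires a significant total effect, this rule cannot detect an indirect effect when $c_{xy}$ is close to zero.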
The second test to assess mediation is the Sobel test, which is a z-distributed statistic computed from the indirect effect as

$$z = \frac{a_{xm}\, b_{my|x}}{SE}$$

where $SE$ is the standard error of the indirect effect computed with the following equation:

$$SE = \sqrt{a_{xm}^2\, s_{b_{my|x}}^2 + b_{my|x}^2\, s_{a_{xm}}^2}$$

and where $s_i^2$ represents the variance of the path $i$, $i = a_{xm}, b_{my|x}$. Listing 3 shows the R code to implement the Sobel test. This test assumes that the product of two correlation coefficients is normally distributed, which is not always true in practice. Consequently, it is less powerful than the last method, the bootstrap method, emphasized by Hayes (2013). The bootstrap test resamples the data in order to build a 95% confidence interval (or any other percentage) of the indirect effect and tests whether it contains the value under the null hypothesis (i.e., an indirect effect of 0). As it is a bootstrap method, it is free from the distributional assumption of the Sobel test (and therefore more robust), because even if the data are normally distributed, this is not necessarily true for the indirect effect. It is also more powerful (fewer type II errors) than the Baron and Kenny test and the Sobel test (Preacher, Rucker, & Hayes, 2007).
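The Sobel statistic can be computed from two `lm()` fits, as in the sketch below. It is not the paper's Listing 3: the function name `sobel_sketch`, the column names and the toy data are assumptions, and the standard error used is the usual first-order form, $\sqrt{a^2 s_b^2 + b^2 s_a^2}$.

```r
# Hypothetical helper: Sobel z test for the indirect effect a * b
sobel_sketch <- function(data) {
  fit_m <- summary(lm(m ~ x, data = data))$coefficients
  fit_y <- summary(lm(y ~ x + m, data = data))$coefficients

  a   <- fit_m["x", "Estimate"]
  s_a <- fit_m["x", "Std. Error"]
  b   <- fit_y["m", "Estimate"]
  s_b <- fit_y["m", "Std. Error"]

  se <- sqrt(a^2 * s_b^2 + b^2 * s_a^2)  # first-order standard error of a * b
  z  <- (a * b) / se
  p  <- 2 * pnorm(-abs(z))               # two-sided p-value
  c(ab = a * b, se = se, z = z, p = p)
}

# Toy data (illustrative values)
set.seed(5)
n <- 200
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.6 * m + rnorm(n)
res <- sobel_sketch(data.frame(x, m, y))
print(round(res, 3))
```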
The bootstrap method (Efron & Tibshirani, 1979) is a computer-based method which treats the sample as a pseudo-population (that is, the sample distributions reflect the population distribution). It randomly selects, with replacement, subjects of the original sample in order to generate another sample and compute the desired statistics. Then, it repeats this last step a large number of times (for instance, a general recommendation is over 5 000) in order to create an empirical sampling distribution of the desired statistics. Confidence intervals can be computed from the sampling distribution and inference regarding hypothesis testing can be done. Bootstrapping is easily implemented in R. The bias-corrected and accelerated (BCa) bootstrap interval is a method introduced to correct bias and skewness in the distribution of bootstrap estimates. Listing 4 shows the code to apply the bootstrap method to mediation analysis. It also uses an additional function to compute the indirect effect for the boot function, which needs to be called in the primary function.

Figure 2. Illustration of the mediation for the example. The population parameters are also used for the power analysis.
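The resampling logic described above can be sketched in base R. Note that this sketch swaps in a simple percentile interval rather than the BCa interval used in the paper, and it does not use the boot package; the function names, the number of resamples and the toy data are assumptions made for the illustration.

```r
# Indirect effect a * b computed on the rows of `data` selected by `idx`
indirect_effect <- function(data, idx = seq_len(nrow(data))) {
  d <- data[idx, ]
  a <- coef(lm(m ~ x, data = d))[["x"]]
  b <- coef(lm(y ~ x + m, data = d))[["m"]]
  a * b
}

# Hypothetical helper: percentile bootstrap interval for the indirect effect
boot_percentile_sketch <- function(data, reps = 1000, level = .95) {
  n <- nrow(data)
  # Resample subjects with replacement and recompute the statistic each time
  est <- replicate(reps, indirect_effect(data, sample.int(n, n, replace = TRUE)))
  alpha <- 1 - level
  ci <- quantile(est, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  list(estimate = indirect_effect(data),
       ci = ci,
       significant = (ci[1] > 0) || (ci[2] < 0))  # interval excludes 0
}

# Toy data (illustrative values)
set.seed(6)
n <- 200
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.6 * m + rnorm(n)
res <- boot_percentile_sketch(data.frame(x, m, y), reps = 1000)
print(res$ci)
```

The percentile interval is used here only to keep the example short; a BCa interval additionally adjusts the quantiles for bias and skewness.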

Power analysis
It may also be interesting to put the previous tutorial into practice. For instance, let us consider a power analysis to evaluate the type II error rate of the BootTest(), SobelTest() and BaronKenny() functions. Power refers to the probability of finding a significant result when the null hypothesis is false (there is an indirect effect). Failure to find a significant result is a type II error. Listings 5 and 6 show the code to implement a power analysis. The purpose of a power analysis is to simulate an experiment with known and non-null population parameters, check whether the result is significant or not, and repeat this a large number of times. There are two main components in the script: the generation of data (Listing 1) and the indirect effect test (Listings 2 to 4). The outcome of the function is the power of the mediation test given a sample size $n$.
To conclude this section, three power analyses were carried out following the parameters of the previous example with a sample size of 40. Table 2 shows the results of the power analysis of the three tests. The Baron and Kenny test performed poorly (power of .029) because of the null total effect, which is a tricky scenario for that test. The Sobel test had a power of .606. Finally, the bootstrap method obtained a power of .786. To sum up, the results demonstrate the lack of power of the Baron and Kenny test and the Sobel test, and the greater power of the bootstrap test.
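The simulation loop just described can be sketched as follows. This is not the paper's Listings 5 and 6: for speed, the sketch uses only a Sobel test as the significance check, and the function names, the number of replications and the seed are assumptions. The population parameters follow the earlier example ($a_{xm} = .50$, $b_{my|x} = .60$, $c_{xy} = 0$, hence a direct effect of -.30).

```r
# Generate standardized mediation data from population parameters
gen_data <- function(n, a, b, cp) {
  x  <- rnorm(n)
  m  <- a * x + sqrt(1 - a^2) * rnorm(n)
  ey <- 1 - (cp^2 + b^2 + 2 * a * b * cp)  # error variance of Y
  y  <- cp * x + b * m + sqrt(ey) * rnorm(n)
  data.frame(x, m, y)
}

# Two-sided p-value of the Sobel test for the indirect effect
sobel_p <- function(d) {
  fm <- summary(lm(m ~ x, data = d))$coefficients
  fy <- summary(lm(y ~ x + m, data = d))$coefficients
  a <- fm["x", "Estimate"]; s_a <- fm["x", "Std. Error"]
  b <- fy["m", "Estimate"]; s_b <- fy["m", "Std. Error"]
  z <- (a * b) / sqrt(a^2 * s_b^2 + b^2 * s_a^2)
  2 * pnorm(-abs(z))
}

# Power = proportion of simulated experiments with a significant result
power_sketch <- function(n, a, b, cp, reps = 500, alpha = .05) {
  mean(replicate(reps, sobel_p(gen_data(n, a, b, cp)) < alpha))
}

set.seed(4)
pw <- power_sketch(n = 40, a = .50, b = .60, cp = -.30)
print(pw)
```

With only 500 replications the estimate is noisy; a larger number of replications gives a more stable value at the cost of computing time.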

A complete example
In order to illustrate mediation analysis, a complete example will be carried out. Listing 7 shows the complete script to run the example. Four hundred twenty-nine people were asked to complete the Beck Depression Inventory (BDI; Beck, Steer, & Brown, 1996) and a short survey that included questions about their average weekly alcoholic beverage intake (hereafter, alcohol intake) and their number of weekly positive social interactions (hereafter, positive social interaction). The BDI is a short self-report questionnaire used to assess the intensity of depression. The main hypothesis is that the effect of alcohol intake (the independent variable $x$) on depression (the dependent variable $y$) will be partially mediated by positive social interaction (the mediator $m$). Table 3 presents the population parameters of the example.
Table 4 shows the descriptive analysis, and the histograms with density curves (see Figure 3) for the three variables show a normal distribution of the data. This information can be obtained with the psych package (Revelle, 2017). Table 5 presents the variance-covariance matrix obtained with the function cov() and the correlation matrix obtained with the function cor(). It is worth noting that the correlation matrix approximately reproduces the expected relations given by the population parameters. Table 6 depicts the first step of the mediation analysis, conducted by testing a regression model of alcohol intake on positive social interaction (using the function lm() in R), with a significant model, F(1, 427) = 67.72, p < .001, and a significant effect of alcohol intake on positive social interaction, β = 0.189, p < .001.
Step two tests the regression model of alcohol intake and positive social interaction on depression (see Table 6) and found a significant model, F(2, 426) = 38.08, p < .001, and significant effects of alcohol intake, β = 0.824, p < .001, and positive social interaction, β = −1.629, p < .001, on BDI score.
To test for a significant mediation effect, the three methods are used with the unstandardized data in order to demonstrate that a standardized dataset is not necessary. Table 7 summarizes the results. All tests yield the same outcome (regardless of whether the data were standardized or not). The Baron and Kenny test suggests a significant partial mediation. The Sobel test shows a significant mediation, z = −5.445, p < .001, for both datasets. Finally, the bootstrap BCa confidence interval had a lower limit of 1.043 and an upper limit of 1.754. The confidence interval does not include 0 and, therefore, the indirect effect is deemed significant. We could interpret the results as showing that the number of weekly positive social interactions partially mediates the effect of weekly alcoholic beverage intake on depression by reducing the latter scores, but these data were generated using the code provided in this article.

Discussion
The purpose of the current paper was to introduce the computation within mediation analysis. Firstly, we detailed the parameters in the conceptual diagram and labelled them appropriately. We then showed some examples using R and gave the code for the readers to implement it themselves. We hope this work will encourage statistical research in the analysis of mediation models and help the reader to better understand them.
The indirect effect, $ab$, is the product of $a_{xm}$ and $b_{my|x}$. The indirect effect, $ab$, and the direct effect, $c_{xy|m}$, sum to the total effect $c_{xy}$. Hence, mathematically, $c_{xy} = c_{xy|m} + a_{xm} \times b_{my|x}$. In order to simulate a mediation model, three parameters must be known and defined because $a$, $b$ and $c$ are interrelated. To help illustrate, the next section explains how to generate data containing mediation.

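Table 3 (recovered fragment; most parameter values did not survive)
$b_{my} = b_{my|x}\left(1 - a_{xm}^2\right) + a_{xm} c_{xy}$
$c_{xy|m} = c_{xy} - ab$
$ab = a_{xm} \times b_{my|x} = -0.140$
$n = 429$
Note. a because their counterpart ($b_{my}$ and $c_{xy|m}$) were fixed first.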

Table 1
Variance-covariance matrix of simulated data (a)
(only the row for Y survived)

         X        M        Y
Y     -.006    0.451    1.003

Note. (a) Obtained with the function var().

Table 2
Summary of power analyses

Table 3
Population parameters of the complete example