Today, on Twitter, I was involved in a discussion with statistical psychologist (or psychological statistician) Daniël Lakens on replication. Not to break the rule that any Twitter-discussion Daniël is involved in ends up in a blog-post, I’ve decided to write a blog-post on it myself.

**Introduction**

Essentially, our discussion was about the following. Data were collected with a certain sample size *n*, and subsequently some type of standard (frequentist) statistical test, such as a *t*-test, ANOVA or linear regression test, was performed (for the sake of simplicity we assume that all statistical assumptions are met). Is there any benefit to the following approach: splitting the data into two equal parts, such that you have a smaller sample and a replication of the test? One might think so, given that replication and reproducibility are the new hypes in psychological methodology.

However, in my opinion, the main strength of replication lies in having an experiment that took place in Laboratory A replicated in Laboratory B. Perhaps the most obvious benefit of performing a replication is that you increase the sample size. If Laboratory A performed a study with *n* = 40, and you performed one with *n* = 40, then in the end you have *n* = 80. With this type of replication, you can check whether the significant result in Laboratory A was not simply due to coincidence (which happens in α = 5% of cases when there is no true effect). Obviously, this benefit is lost when you don’t really replicate, but cut your sample in half and call one half the replication.

Some other benefits of “real” replication are concerned with checking whether the experiment is reproducible and generalisable at all. If the experimenter used *n* = 40 local undergraduate students for his experiment (because it is so easy to oblige your students to be participants), it is of course unclear whether this result is generalisable to the population of interest (e.g. “everyone”). It helps if someone re-does the study with undergraduate students from another university. It is still very unclear whether the study is generalisable to non-students, but at least you can sort of find out whether students at different universities are similar. Again, this benefit is only there for real replications.

**Formalisation**

Let’s formalise the setting a bit and keep things simple (it’s too sunny to stay behind the computer for too long); it doesn’t get much simpler than the one-sample *t*-test. Given is a random sample *X*_{1}, …, *X*_{n} from a *N*(*μ*, *σ*^{2}) distribution. Required is the test for H_{0}: *μ* = 0 versus the two-sided alternative H_{1}: *μ* ≠ 0 and, specifically, the *p*-value of this test. For the sake of simplicity, assume that we are in the ideal world: the sample is truly random and the population distribution is indeed truly normal. Also, we assume that *n* is even (otherwise we can’t split the sample into exact halves).

*Standard Approach (SA).* The standard approach would be to perform the standard *t*-test on the data. Any textbook on statistics will tell you how to do this.
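For readers who do not have a textbook at hand, here is a minimal sketch of the SA in R. The data vector `x`, the seed and the chosen mean are illustrative assumptions and are not taken from the simulations in this post; the sketch only shows that the hand-computed two-sided *p*-value agrees with what `t.test()` reports.

```r
# One-sample t-test of H0: mu = 0, once by hand and once via t.test().
# The data below are made up for illustration only.
set.seed(1)                                 # arbitrary seed, for reproducibility
x <- rnorm(40, mean = 0.375, sd = 1)        # hypothetical sample of size n = 40

n      <- length(x)
t_stat <- mean(x) / (sd(x) / sqrt(n))       # t = xbar / (s / sqrt(n))
p_hand <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value, n - 1 df

p_builtin <- t.test(x, mu = 0)$p.value      # the same test, built in

all.equal(p_hand, p_builtin)                # the two p-values should agree
```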

*Replication Approach (RA).* The “replication” approach would be to perform two *t*-tests: one on observations 1 up to *n*/2 and one on observations (*n*/2 + 1) up to *n*. This way we obtain two *p*-values, which we need to combine into one overall *p*-value. For this, we can simply use Fisher’s method, which boils down to the following. If H_{0} is true, then both *p*-values are independent and uniformly distributed on [0, 1]. Standard distribution theory then provides that *X* = -2(ln(p_{1}) + ln(p_{2})) follows a χ^{2}-distribution with 4 degrees of freedom, and for this distribution we can compute the *p*-value given *X*.
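The RA can be sketched in a few lines of R. The function names `fisher_combine` and `split_half_p` are hypothetical helpers introduced here for illustration; they are not part of base R. For two *p*-values, the χ^{2} tail probability with 4 degrees of freedom also has the closed form p_{1}p_{2}(1 − ln(p_{1}p_{2})), which the last line uses as a sanity check.

```r
# Fisher's method for two independent p-values (hypothetical helper).
fisher_combine <- function(p1, p2) {
  stat <- -2 * (log(p1) + log(p2))          # chi-square with 4 df under H0
  pchisq(stat, df = 4, lower.tail = FALSE)  # upper-tail probability
}

# Replication Approach: t-test on each half, then combine (hypothetical helper).
split_half_p <- function(x) {
  n <- length(x)
  stopifnot(n %% 2 == 0)                    # n must be even to split in halves
  half <- n / 2
  p1 <- t.test(x[1:half])$p.value
  p2 <- t.test(x[(half + 1):n])$p.value
  fisher_combine(p1, p2)
}

# Sanity check against the closed form p1 * p2 * (1 - log(p1 * p2)).
all.equal(fisher_combine(0.05, 0.05), 0.05^2 * (1 - log(0.05^2)))
```

Calling `split_half_p(x)` on a numeric vector of even length returns the single combined *p*-value of the RA.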

**Answer using mathematical statistics**

Now that we have both approaches, we can return to the fundamental question: *is there a benefit in applying RA over SA*? The direct answer is **no**, there is not. For the given setting, the *t*-test is the so-called Uniformly Most Powerful Unbiased (UMPU) test (see, e.g., Lehmann, 1959, Testing Statistical Hypotheses). This means that (i) the test is unbiased (when there is no effect, i.e. H_{0} is true, the test rejects in α = 5% of cases, and when there is an effect it rejects at least that often) and (ii) among unbiased tests it is uniformly most powerful: no other unbiased test has higher power, whatever the true value of *μ*. In layman’s terms: under the settings of the experiment, no other sensible test can perform better. This is obviously quite a good property for a test to have, both in general and here: it automatically answers our question. The replication approach is another test based on the same data and can therefore not perform better than the standard approach (it can, at best, perform just as well). This answer also holds when we move away from the super-simplified *t*-test setting to ANOVA or linear regression: there, too, the default tests have comparable optimality properties.

**Answer using simulations**

The theory behind most powerful tests does answer the question “is there a benefit in the replication approach” (with “no”) but it does *not* quantify the difference between both approaches.

To this end, I ran the following simulation. For given settings of sample size *n* (either 40 or 80) and true population mean *μ* (from 0 to 1 in steps of 0.125), I’ve simulated 10,000 data sets of size *n* from a N(*μ*, 1) distribution. For each data set, I’ve computed the corresponding *p*-value for SA and RA. Furthermore, I’ve dichotomised these *p*-values into “significant”/“not significant” based on α = 5%. *R* code is provided at the bottom of this post.

Let’s focus first on *n* = 40. The first figure compares the average *p*-value (over the 10,000 replications) for the SA (black) and the RA (red). (Please note that the uncertainty due to simulation error is really small, since I work with 10,000 repetitions. At first, I created this plot including 95% CIs, but these intervals were so narrow that they were often only one or two pixels wide.)

When *μ* = 0, H_{0} is true: the *p*-values are distributed according to a U(0, 1) distribution and thus should have mean 1/2 and variance 1/12 ≈ 0.0833. Both SA and RA yield values very close to this (SA: mean = 0.5005, var = 0.0834; RA: mean = 0.5004, var = 0.0833). This is consistent with both methods having a Type I error rate of (about) 5%, which is what you want. When *μ* > 0, the alternative hypothesis is true, so you hope to reject the null and you want small *p*-values. As expected, the larger *μ*, and thus the larger the effect size, the smaller the average *p*-value. The figure shows that the Standard Approach beats the Replication Approach.
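The null-hypothesis check can be repeated on a small scale with the sketch below. It is not the simulation behind the figures: the seed and the number of repetitions (`reps`) are arbitrary assumptions chosen to keep the run short, so the numbers will differ slightly from those quoted above.

```r
# Small-scale check under H0 (mu = 0): both p-values should look U(0, 1).
set.seed(2024)   # arbitrary seed
reps <- 5000     # fewer repetitions than in the post, to keep this quick
n    <- 40

p_sa <- numeric(reps)   # Standard Approach p-values
p_ra <- numeric(reps)   # Replication Approach p-values

for (i in seq_len(reps)) {
  x  <- rnorm(n, mean = 0, sd = 1)
  p_sa[i] <- t.test(x)$p.value
  p1 <- t.test(x[1:(n / 2)])$p.value
  p2 <- t.test(x[(n / 2 + 1):n])$p.value
  # Fisher's method: -2 * sum(log p) is chi-square with 4 df under H0
  p_ra[i] <- pchisq(-2 * (log(p1) + log(p2)), df = 4, lower.tail = FALSE)
}

# Compare with the U(0, 1) values: mean 1/2, variance 1/12, and 5% below 0.05
rbind(SA = c(mean = mean(p_sa), var = var(p_sa), rejected = mean(p_sa < .05)),
      RA = c(mean = mean(p_ra), var = var(p_ra), rejected = mean(p_ra < .05)))
```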

Next, we look at the proportion of results that are flagged as significant (at a nominal level of 5%). For *μ* = 0, you expect this to be 5% (the Type I error rate), and it is 5% for both SA and RA. For *μ* > 0, this proportion is 1 minus the Type II error rate, i.e. the power, and you expect it to go up when *μ* goes up. And it does. Again, it is clear that the Standard Approach performs better than the Replication Approach, especially for smaller effect sizes. (When the effect size is huge, even clearly sub-optimal procedures have no problem classifying the result as ‘significant’.) The difference in power between SA and RA is certainly non-negligible; it goes up to 0.117 when *μ* = 0.375 (in which case SA has power 0.640 and RA has power 0.523).

The last two images are concerned with the simulations for *n* = 80. They show a similar pattern: the standard approach is indeed the better approach. Now, the maximal difference in power is 0.108 when *μ* = 0.25 (in which case SA has power 0.600 and RA has power 0.492).
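The simulated power of the SA can be cross-checked analytically, because under the alternative the *t*-statistic follows a noncentral *t*-distribution with *n* − 1 degrees of freedom and noncentrality parameter √*n*·*μ*/*σ*. The helper `sa_power` below is a hypothetical name introduced for this sketch; it only covers the SA, since the RA has no equally simple closed form.

```r
# Analytic power of the two-sided one-sample t-test (hypothetical helper).
sa_power <- function(n, mu, sigma = 1, alpha = 0.05) {
  df   <- n - 1
  ncp  <- sqrt(n) * mu / sigma          # noncentrality parameter
  crit <- qt(1 - alpha / 2, df)         # two-sided critical value
  # P(T > crit) + P(T < -crit) under the noncentral t-distribution
  pt(crit, df, ncp = ncp, lower.tail = FALSE) + pt(-crit, df, ncp = ncp)
}

sa_power(40, 0.375)   # compare with the simulated SA power for n = 40
sa_power(80, 0.25)    # compare with the simulated SA power for n = 80
```

Both values should come out close to the simulated SA powers reported above; base R’s `power.t.test()` with `type = "one.sample"` and `strict = TRUE` offers the same calculation.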

**Conclusion**

This type of replication is not useful, at least not in the current setting. It would be more useful if one, for instance, seriously doubts the distributional assumption underlying the one-sample *t*-test or doubts the independence of observations. In such cases, non-parametric approaches could be preferred over parametric ones, and the Replication Approach applied here is a basic version of split-half cross-validation, a commonly used non-parametric technique.

In the above, I’ve limited myself to the frequentist setting. However, in a Bayesian setting under similar circumstances, the RA would also not be beneficial. Just as in the frequentist setting, the Bayesian version of the *t*-test is developed to be *uniformly optimal* in some (Bayesian) sense. Other approaches, based on the same data, can therefore never do better.

**R code**

Below is the R-code. The first part runs the simulations, which could take some time, and the second part creates the figures.

```r
## Part 1: simulations
set.seed(31415)
X <- (0:8)/8                 # true population means: 0, 0.125, ..., 1
n <- 40
mu <- 0
repetitions <- 10^5          # as posted; the text describes 10,000 (10^4) repetitions

# Rows: values of mu; columns: simulated data sets
p.all.40 <- matrix(NA, nrow=9, ncol=repetitions)
p.split.40 <- p.all.40
p.all.80 <- matrix(NA, nrow=9, ncol=repetitions)
p.split.80 <- p.all.80

for(i in 1:repetitions){
  basedata <- rnorm(n,0,1)
  for(j in 0:8){
    data <- basedata + j/8   # shift the same noise to mean j/8
    dataA <- data[1:(n/2)]
    dataB <- data[(n/2 +1):n]
    # Standard approach: one t-test on all data
    p.all.40[j+1,i] <- t.test(data,var.equal=TRUE)$p.value
    # Replication approach: two t-tests combined with Fisher's method
    p.split.40[j+1,i] <- pchisq(-2*(log(t.test(dataA, var.equal=TRUE)$p.value) +
                                    log(t.test(dataB, var.equal=TRUE)$p.value)),
                                df=4, lower.tail=FALSE)
  }
}

set.seed(31415)
n <- 80
for(i in 1:repetitions){
  basedata <- rnorm(n,0,1)
  for(j in 0:8){
    data <- basedata + j/8
    dataA <- data[1:(n/2)]
    dataB <- data[(n/2 +1):n]
    p.all.80[j+1,i] <- t.test(data,var.equal=TRUE)$p.value
    p.split.80[j+1,i] <- pchisq(-2*(log(t.test(dataA, var.equal=TRUE)$p.value) +
                                    log(t.test(dataB, var.equal=TRUE)$p.value)),
                                df=4, lower.tail=FALSE)
  }
}

## Part 2: figures
p.all <- p.all.40      # or p.all.80; manually change
p.split <- p.split.40  # or p.split.80; manually change

issig.all <- (p.all < .05)
issig.split <- (p.split < .05)

# Mean p-value as a function of mu
plot(X,apply(p.all,1,mean),type="b", ylab="mean p-value", xlab=expression(mu),
     ylim=c(0,.5),main="n = ...",pch=19, col=rgb(0,.5,.5,.8))
lines(X,apply(p.split,1,mean), col=rgb(.5,0,0,.8), type="b",pch=19)
legend("topright",c("Standard approach","Replication approach"),
       col=c(rgb(0,.5,.5,.8), rgb(.5,0,0,.8)), lty=c(1,1),pch=c(19,19))

# Proportion of significant results as a function of mu
plot(X,apply(issig.all,1,sum)/repetitions,type="b",
     ylab="% significant results (alpha = 5%)",xlab="mu", ylim=c(0,1.02),
     pch=19, yaxs="i", col=rgb(0,.5,.5,.8), main = "n = ...")
lines(X,apply(issig.split,1,sum)/repetitions,type="b", col=rgb(.5,0,0,.8),pch=19)
lines(c(-1,2),c(.05,.05),lty=2)   # nominal 5% level
legend("bottomright",c("Standard approach","Replication approach"),
       col=c(rgb(0,.5,.5,.8), rgb(.5,0,0,.8)), lty=c(1,1),pch=c(19,19))
```