Yesterday, my attention was grabbed by a paper, published in Nature Communications, titled “Regulation of REM and non-REM sleep by periaqueductal GABAergic neurons”. It seems to be a very complex paper, into which the authors put a lot of hard work. In addition, the journal deserves praise for having an open peer review system (the reviews are available online), something which I strongly support.

The reason I looked at this paper – being so far outside my own field of work – is one of the last paragraphs, on sample size. Here, the authors write:

For optogenetic activation experiments, cell-type-specific ablation experiments, and in vivo recordings (optrode recordings and calcium imaging), we continuously increased the number of animals until statistical significance was reached to support our conclusions.

I was extremely surprised to read this, for three reasons:

- This is not a correct way to decide upon the sample size. To be more precise: this is a very wrong way of doing so, kind of invalidating all the results;
- The authors were so open about this – usually questionable research practices are more hidden;
- None of the three reviewers, nor the editor, spotted this blatant statistical mistake – even though it’s a textbook example of a questionable research practice (QRP) and the journal has an astronomical impact factor.

The second reason is reassuring to some extent: it is clear that there’s no ill intent from the authors. Without proper and thorough statistical training, it actually sounds like a good idea. Rather than collecting a sample of, say, size *n* = 50, let’s see step by step if we can work with a smaller sample. Especially when you are conducting animal studies (like these authors are), it’s your ethical obligation to select the sample size as efficiently as possible.

My tweet about this yesterday received quite some attention: clearly I’m not the only one who was surprised to read this. Andrew Gelman wrote a blog post after seeing the tweet, in which he indicates that this type of sequential analysis doesn’t have to be problematic, if you steer away from null hypothesis significance testing (NHST). He makes some valid points but I think that in practice researchers often want to use NHST anyway. Below, I will outline (i) what the problem is with sequential analyses with unadjusted testing; (ii) what you could do to avoid this issue.

**Unadjusted sequential testing**

The story here holds true for all kinds of tests, but let’s stick to a straightforward independent *t*-test. You begin with 2 mice in each group (with 1 mouse per group, you cannot compute the within-group variance, and thus cannot conduct a *t*-test). You put some electrodes in their brains, or whatever it is you have to do for your experiment, take your measurements and conduct your *t*-test. It gives a *p*-value above 0.05. It must be because of the small sample, so let’s add another mouse per group. Again, non-significant. You go on, and on, and on, until you reach significance.

If there is no effect, a single statistical test will yield a false positive, so *p* < 0.05, 5% of the time. This 5% is what we generally consider an acceptable false positive (Type I error) rate (although you can make a motivated choice for another rate – but that’s another discussion). If you were to do two independent tests (and there is no effect), the probability of at least one significant result is 1 – (1 – 0.05)^{2} = 9.75%, and with *k* tests, this is 1 – (1 – 0.05)^{k}, which goes towards 1 pretty fast as *k* goes up. This is the reasoning behind the Bonferroni correction.
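To get a feel for how quickly 1 – (1 – 0.05)^{k} grows, here is a minimal Python sketch. The function name `familywise_error_rate` is my own choice for illustration; the formula is the one above and assumes the *k* tests are independent.

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among k independent tests,
    each carried out at level alpha, when there is no effect."""
    return 1 - (1 - alpha) ** k


for k in (1, 2, 5, 10, 50, 100):
    print(f"k = {k:3d}: {familywise_error_rate(k):.4f}")
```

For *k* = 2 this gives 0.0975, i.e. 9.75%.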

Here, the situation is slightly different: you’re not performing independent tests. The *p*-value for a *t*-test with 30 measurements will not be too dissimilar from the *p*-value for a *t*-test with those 30 measurements and 1 more. Still, the multiple testing issue remains – albeit not as severe as with independent tests. You can prove mathematically (don’t worry, I won’t do that here) that with this sequential approach it actually is guaranteed (i.e. probability of 1) that you will reach significance at some point. Even if there is no effect! This approach will give a guaranteed false positive rate of 1 – and that is as bad as it sounds…

*Example*

We can use a computer simulation to see what happens. This is a situation in which H0 is true: there is no effect, i.e. the two groups do not differ. Rejecting H0 in this situation is an error (Type I error). In the picture below, I did just what I described: starting with *n* = 2, I kept on increasing *n* by 1. As you can see, the *p*-value ‘converged to significance’ at *n* = 42. But it also moved away from it! At *n* = 150, we’re kind of back where we started, with a very non-significant *p*-value.

*Simulation*

So, in this instance it happened at *n* = 42. With a new simulation it might happen at some other point, but two things are for sure: you will reach significance and you will reach non-significance after that…
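A single run like this can be sketched in a few lines of Python. This is not the code behind the picture: to stay within the standard library, the sketch swaps the *t*-test for a *z*-test with a known standard deviation of 1, and the function name, the seed and the defaults are my own assumptions. The exact *n* at which significance is first reached depends on the seed.

```python
import math
import random
from statistics import NormalDist


def sequential_p_values(max_n=150, start_n=2, seed=1):
    """Two-sided p-value after each added pair of observations, with H0 true.

    Assumption: known standard deviation of 1, so a z-test stands in
    for the t-test.
    """
    rng = random.Random(seed)
    normal = NormalDist()
    sum_a = sum_b = 0.0
    p_values = {}
    for n in range(1, max_n + 1):
        # One new observation per group; both groups have the same mean.
        sum_a += rng.gauss(0.0, 1.0)
        sum_b += rng.gauss(0.0, 1.0)
        if n >= start_n:
            # (mean_a - mean_b) / sqrt(2 / n) simplifies to this expression.
            z = (sum_a - sum_b) / math.sqrt(2 * n)
            p_values[n] = 2 * (1 - normal.cdf(abs(z)))
    return p_values


p_values = sequential_p_values()
first_hit = next((n for n, p in p_values.items() if p < 0.05), None)
print("first n with p < 0.05:", first_hit)
print("p-value at n = 150:", round(p_values[150], 3))
```

Changing `seed` gives a different path each time; `first_hit` is `None` when the run never dips below 0.05 within `max_n` steps.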

Let’s now study how bad the problem is. I simulated 1000 of these sequential strategies, and recorded at what value of *n* significance was reached for the first time. Sometimes you’re “lucky” and have it with a small *n*, sometimes you have to wait for ages. The simulation results are as follows:

As you can see, the problem is huge. Even if you applied a rule to stop the strategy at *n* = 25 at the latest, your false positive rate would exceed 25%, more than five times what you want.
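The same experiment can be approximated with the sketch below. It is not the original simulation code: it again uses a *z*-test with known standard deviation 1 as a stand-in for the *t*-test (to avoid any dependency beyond the standard library), so the numbers will differ somewhat from those reported above. The function names, the seed and the stopping points are assumptions chosen for illustration.

```python
import math
import random

Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal


def first_significant_n(rng, max_n, start_n=2):
    """First n at which the unadjusted test is significant, or None if never."""
    diff_sum = 0.0  # running sum of (observation in A) - (observation in B)
    for n in range(1, max_n + 1):
        diff_sum += rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)
        if n >= start_n and abs(diff_sum) / math.sqrt(2 * n) > Z_CRIT:
            return n
    return None


def false_positive_rate(max_n, runs=1000, seed=2024):
    """Share of null simulations that hit significance at or before max_n."""
    rng = random.Random(seed)
    hits = sum(first_significant_n(rng, max_n) is not None for _ in range(runs))
    return hits / runs


for max_n in (10, 25, 50, 100):
    rate = false_positive_rate(max_n)
    print(f"stop at n = {max_n:3d} at the latest: false positive rate ~ {rate:.3f}")
```

The later you allow yourself to stop, the larger the share of runs that end in a false positive.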

Note that this problem not only affects the *p*-values, but also the estimates. Using this strategy, the distance between the means of both groups will sometimes increase, sometimes decrease – just as a consequence of coincidence. If we continue sampling until the means of the experimental and control group are sufficiently far apart in order to call it significant, it means we overestimate the effects. Not only is the significance biased, so is the effect size.
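The bias in the estimate can be illustrated as well. In the sketch below there is a real but small difference between the groups (`true_diff = 0.2`, an arbitrary value chosen for illustration), and sampling stops at the first significant result or at *n* = 100. As before, a *z*-test with known standard deviation 1 stands in for the *t*-test, and all names and defaults are my own assumptions.

```python
import math
import random

Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal


def estimate_at_stop(rng, true_diff, max_n, start_n=2):
    """Estimated mean difference at the first significant look, or None."""
    diff_sum = 0.0
    for n in range(1, max_n + 1):
        # Experimental group has mean true_diff, control group has mean 0.
        diff_sum += rng.gauss(true_diff, 1.0) - rng.gauss(0.0, 1.0)
        if n >= start_n and abs(diff_sum) / math.sqrt(2 * n) > Z_CRIT:
            return diff_sum / n  # difference between the two group means
    return None


def stopped_estimates(true_diff=0.2, max_n=100, runs=2000, seed=11):
    """Estimates from all runs that stopped because significance was reached."""
    rng = random.Random(seed)
    results = (estimate_at_stop(rng, true_diff, max_n) for _ in range(runs))
    return [d for d in results if d is not None]


estimates = stopped_estimates()
print("runs that reached significance:", len(estimates))
print("true difference: 0.2")
print("mean estimate at stopping:", round(sum(estimates) / len(estimates), 3))
```

With these settings, a run can only stop when the observed difference is at least 1.96 × √(2/100) ≈ 0.28 in absolute value, so every "significant" estimate is further from zero than the true difference of 0.2.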

So, in an attempt to ‘use’ as few animals as possible – something that should be applauded – the authors actually and accidentally invalidated their study, which leads to more test animals being used unnecessarily…

**So, what can we do?**

Hopefully, I’ve managed to explain that unadjusted sequential analysis is problematic. It is, however, possible to still apply this approach – increasing your sample size in small bits until you meet some threshold. The main difference is that the threshold should not be fixed at 5%, but should take the issue of multiple testing into account. The mathematical backbone of this approach was developed in the 1940s by Abraham Wald, with a pivotal paper in 1945. Around the same time, and independently of Wald, British war hero and polymath Alan Turing derived a similar approach based on Bayesian reasoning. This sequential approach helped Turing to crack the German Enigma machines and thus saved millions of lives.
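To make the idea of an adjusted threshold concrete, here is a sketch of one well-known adjustment, the Pocock boundary. This is not Wald’s procedure itself, just one example of a corrected sequential design. With five equally spaced, pre-planned looks, each look uses a critical value of about 2.413 (a nominal *p* of roughly 0.0158) instead of 1.96, which keeps the overall Type I error near 5%. The look schedule (*n* = 10, 20, 30, 40, 50 per group), the function names and the seed are assumptions for illustration, and a *z*-test with known standard deviation 1 again stands in for the *t*-test.

```python
import math
import random

NAIVE_Z = 1.959964  # unadjusted two-sided 5% critical value
POCOCK_Z = 2.413    # Pocock critical value: 5 equally spaced looks, overall alpha 0.05


def rejects_h0(rng, looks, critical_z):
    """Simulate one null experiment; True if any planned look is significant."""
    diff_sum, n = 0.0, 0
    for look_n in looks:
        while n < look_n:
            # Collect data up to the next planned look; H0 is true.
            diff_sum += rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)
            n += 1
        if abs(diff_sum) / math.sqrt(2 * n) > critical_z:
            return True  # stop early and reject H0
    return False


def type_one_error(critical_z, looks=(10, 20, 30, 40, 50), runs=4000, seed=5):
    """Share of null experiments that reject H0 at some look."""
    rng = random.Random(seed)
    return sum(rejects_h0(rng, looks, critical_z) for _ in range(runs)) / runs


print("unadjusted threshold at every look:", type_one_error(NAIVE_Z))
print("Pocock-adjusted threshold:         ", type_one_error(POCOCK_Z))
```

The important design choice is that the number and timing of the looks are fixed before data collection starts; the boundary value depends on that plan.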

These sequential approaches are more technical than the standard *t*-test, and they are usually not included in easy-to-use software packages. Recently, several people have written accessible tutorial papers on how to perform such a sequential analysis. A good starting point is this paper by Daniel Lakens.

**Conclusion**

In their paper, Franz Weber and colleagues used an incorrect method to decide upon the sample size. As a consequence, all test results in this paper are invalid. How this passed peer review in a top journal is difficult to understand, but these things happen. It’d be interesting to see how Nature Communications deals with the aftermath of this paper…