Yesterday, my attention was grabbed by a paper, published in Nature Communications, titled “Regulation of REM and non-REM sleep by periaqueductal GABAergic neurons“. It seems to be a very complicated paper, where the authors put in a lot of hard work. In addition, the journal deserves praise for having an open peer review system (the reviews are available online), something which I strongly support.

The reason I looked at this paper – being so far outside my own field of work – is one of the last paragraphs, on sample size. Here, the authors write:

For optogenetic activation experiments, cell-type-specific ablation experiments, and in vivo recordings (optrode recordings and calcium imaging), we continuously increased the number of animals until statistical significance was reached to support our conclusions.

I was extremely surprised to read this, for three reasons:

- This is not a correct way to decide upon the sample size. To be more precise: this is a very wrong way of doing so, kind of invalidating all the results;
- The authors were so open about this – usually questionable research practices are more hidden;
- None of the three reviewers, nor the editor, has spotted this blatant statistical mistake – even though it’s a textbook example of a QRP and the journal has an astronomical impact factor.

The second reason is reassuring to some extent: it is clear that there’s no ill intent from the authors. Without proper and thorough statistical training, it actually sounds like a good idea. Rather than collecting a sample of, say, size *n* = 50, let’s see step by step if we can work with a smaller sample. Especially when you are conducting animal studies (like these authors are), it’s your ethical obligation to select the sample size as efficiently as possible.

My tweet about this yesterday received quite some attention: clearly I’m not the only one who was surprised to read this. Andrew Gelman wrote a blog post after seeing the tweet, in which he indicates that this type of sequential analysis doesn’t have to be problematic, if you steer away from null hypothesis significance testing (NHST). He makes some valid points but I think that in practice researchers often want to use NHST anyway. Below, I will outline (i) what the problem is with sequential analyses with unadjusted testing; (ii) what you could do to avoid this issue.

**Unadjusted sequential testing**

The story here holds true for all kinds of tests, but let’s stick to a straightforward independent *t*-test. You begin with 2 mice in each group (with 1 mouse per group, you cannot compute the within-group-variance, thus cannot conduct a *t*-test). You put some electrodes in their brains, or whatever it is you have to do for your experiment, take your measurements and conduct your *t*-test. It gives a *p*-value above 0.05. It must be because of the small sample, let’s add another mouse per group. Again, non-significant. You go on, and on, and on, until you reach significance.

If there is no effect, a single statistical test will yield a false positive, so *p* < 0.05, 5% of the time. This 5% is something we think is an acceptable percentage for the false discovery rate (although you can make a motivated choice for another rate – but that’s another discussion). If you would do two independent tests (and there is no effect), the probability of reaching at least one significant result would be 1 – (1 – 0.05)^{2} = 9.75%, and with *k* tests, this is 1 – (1 – 0.05)^{k}, which goes towards 1 pretty fast if *k* goes up. This is the basis behind the Bonferroni correction.
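To make the arithmetic concrete, here is a minimal sketch of the formula above. The function names are made up for illustration, and the calculation assumes the tests are fully independent.

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive in k independent tests when H0 is true."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_threshold(k: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall rate at or below alpha."""
    return alpha / k


if __name__ == "__main__":
    for k in (1, 2, 5, 10, 50, 100):
        print(f"k = {k:3d}: {familywise_error_rate(k):.4f}")
```

With the Bonferroni threshold plugged in as the per-test alpha, the overall rate stays just below the nominal 5%.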

Here, the situation is slightly different: you’re not performing independent tests. The *p*-value for a *t*-test with 30 measurements will not be too dissimilar from a *p*-value for a *t*-test with those 30 measurements and 1 more. Still, the multiple testing issue remains – albeit not as severe as with independent tests. You can prove mathematically (don’t worry, I won’t do that here) that with this sequential approach it actually is guaranteed (i.e. probability of 1) that you will reach significance at some point. Even if there is no effect! This approach will give a guaranteed false discovery rate of 1 – and that is as bad as it sounds…

*Example*

We can use a computer simulation to see what happens. This is a situation in which H0 is true: there is no effect, i.e. both groups are not different. Rejecting H0 in this situation is an error (Type I error). In the picture below, I did just what I described: starting with *n* = 2, I kept on increasing *n* by 1. As you can see, the *p*-value ‘converged to significance’ at *n* = 42. But it also moved away from it! At *n* = 150, we’re kind of back where we started, with a very non-significant *p*-value.

*Simulation*

So, in this instance it happened at *n* = 42. With a new simulation it might happen at some other point, but two things are for sure: you will reach significance and you will reach non-significance after that…
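A single run of this kind can be sketched as follows. This is not the code behind the picture: to stay within the Python standard library it uses a z-test with known variance 1 instead of a *t*-test, and the function name and seed are arbitrary choices.

```python
import random
from statistics import NormalDist


def p_value_path(max_n: int, seed: int = 0) -> list[tuple[int, float]]:
    """Two-sided z-test p-value after each added observation per group.

    Both groups are drawn from N(0, 1), so H0 is true by construction.
    The known-variance z-test stands in for the t-test (an assumption).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    diff_sum = 0.0  # running sum of (group A value - group B value)
    path = []
    for n in range(1, max_n + 1):
        diff_sum += rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)
        if n >= 2:  # start testing at n = 2, as in the text
            # The difference of two group means has standard deviation sqrt(2 / n).
            z = (diff_sum / n) / (2.0 / n) ** 0.5
            p = 2.0 * (1.0 - std_normal.cdf(abs(z)))
            path.append((n, p))
    return path


if __name__ == "__main__":
    path = p_value_path(150, seed=0)
    crossings = [n for n, p in path if p < 0.05]
    print("first n with p < 0.05:", crossings[0] if crossings else "none up to 150")
    print("p-value at n = 150:", round(path[-1][1], 3))
```

Trying a few different seeds shows how erratic the path is: some runs dip below 0.05 early, others never do within 150 observations per group.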

Let’s now study how bad the problem is. I simulated 1000 of these sequential strategies, and recorded at what value of *n* significance was reached for the first time. Sometimes you’re “lucky” and have it with a small *n*, sometimes you have to wait for ages. The simulation results are as follows:

As you can see, the problem is huge. Even if you would apply some rule where you stop the strategy once *n* = 25, your False Discovery Rate exceeds 25%, more than five times what you want.
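A rough way to reproduce this kind of number is sketched below. Again, this is a simplified stand-in rather than the original simulation: it uses a known-variance z-test instead of a *t*-test, so the exact percentages will differ somewhat, and the function names are hypothetical.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value, about 1.96


def first_significant_n(max_n: int, seed: int):
    """Smallest n (per group) at which the z-test is significant, or None.

    Data come from N(0, 1) in both groups, so every "hit" is a false positive.
    """
    rng = random.Random(seed)
    diff_sum = 0.0
    for n in range(1, max_n + 1):
        diff_sum += rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)
        if n >= 2 and abs(diff_sum / n) > Z_CRIT * (2.0 / n) ** 0.5:
            return n
    return None


def false_positive_rate(max_n: int, runs: int = 1000) -> float:
    """Share of simulated studies that reach significance by n = max_n."""
    hits = sum(first_significant_n(max_n, seed) is not None for seed in range(runs))
    return hits / runs


if __name__ == "__main__":
    for max_n in (10, 25, 50, 100, 500):
        print(f"stop at n = {max_n:3d}: {false_positive_rate(max_n):.3f}")
```

Because each run is seeded separately, a longer maximum *n* simply extends the same simulated studies, so the estimated rate can only go up as the stopping point moves out.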

Note that this problem not only affects the *p*-values, but also the estimates. Using this strategy, the distance between the means of both groups will sometimes increase, sometimes decrease – just as a consequence of coincidence. If we continue sampling until the means of the experimental and control group are sufficiently far apart in order to call it significant, it means we overestimate the effects. Not only is the significance biased, so is the effect size.
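The bias in the estimates can be illustrated with the same kind of simplified simulation (known-variance z-test, hypothetical function names). The true difference is set to zero, yet every study that stops at significance reports a sizeable difference, simply because a small one would not have crossed the threshold.

```python
import random
from statistics import NormalDist, mean

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value, about 1.96


def estimate_at_stop(max_n: int, seed: int, true_diff: float = 0.0):
    """Observed mean difference at the first significant look, or None."""
    rng = random.Random(seed)
    diff_sum = 0.0
    for n in range(1, max_n + 1):
        diff_sum += rng.gauss(true_diff, 1.0) - rng.gauss(0.0, 1.0)
        estimate = diff_sum / n
        if n >= 2 and abs(estimate) > Z_CRIT * (2.0 / n) ** 0.5:
            return estimate
    return None


def significant_estimates(max_n: int, runs: int = 1000, true_diff: float = 0.0):
    """Estimates from all simulated studies that stopped at significance."""
    results = (estimate_at_stop(max_n, seed, true_diff) for seed in range(runs))
    return [e for e in results if e is not None]


if __name__ == "__main__":
    estimates = significant_estimates(25)
    print("studies that 'found' an effect:", len(estimates))
    print("mean absolute estimate:", round(mean(abs(e) for e in estimates), 2))
    print("smallest estimate that can be significant:",
          round(Z_CRIT * (2.0 / 25) ** 0.5, 2))
```

With a stopping limit of *n* = 25 and unit variance, no significant estimate can be smaller in absolute value than about 0.55 standard deviations, even though the true difference is exactly zero.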

So, in an attempt to ‘use’ as few animals as possible – something that should be applauded – the authors actually and accidentally invalidated their study, leading to more test animals that are used unnecessarily…

**So, what can we do?**

Hopefully, I’ve managed to explain that unadjusted sequential analysis is problematic. It is, however, possible to still apply this approach – increasing your sample size in small bits until you meet some threshold. The main difference is that the threshold should not be taken fixed at 5%, but should take the issue of multiple testing into account. The mathematical backbone to this approach was developed in the 1940s by Abraham Wald, with a pivotal paper in 1945. Around the same time, and independently of Wald, British war hero and polymath Alan Turing derived a similar approach based on Bayesian reasoning. This sequential approach helped Turing to crack the German Enigma machines and thus saved millions of lives.
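To give a flavour of what an adjusted procedure looks like, here is a minimal sketch of Wald's sequential probability ratio test (SPRT) in its simplest textbook form. It rests on several assumptions that do not hold in a real experiment: the observations are normal with known standard deviation `sigma`, the test is one-sided, and a specific alternative `delta` has been chosen in advance. The function names and default values are illustrative only, and practical designs for *t*-tests are more involved.

```python
import math
import random


def sprt(observations, delta, sigma=1.0, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: mean = 0 versus H1: mean = delta (known sigma).

    Returns (decision, number of observations used), where decision is
    "reject H0", "accept H0", or "undecided" if the data run out first.
    """
    upper = math.log((1.0 - beta) / alpha)  # crossing this rejects H0
    lower = math.log(beta / (1.0 - alpha))  # crossing this accepts H0
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    n = 0
    for n, x in enumerate(observations, start=1):
        llr += (delta * x - delta ** 2 / 2.0) / sigma ** 2
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n


def rejection_rate(true_mean, delta=1.0, runs=2000, max_n=1000):
    """Share of simulated studies in which the SPRT rejects H0."""
    rejections = 0
    for seed in range(runs):
        rng = random.Random(seed)
        data = (rng.gauss(true_mean, 1.0) for _ in range(max_n))
        decision, _ = sprt(data, delta)
        rejections += decision == "reject H0"
    return rejections / runs


if __name__ == "__main__":
    print("rejection rate when H0 is true:", rejection_rate(0.0))
    print("rejection rate when H1 is true:", rejection_rate(1.0))
```

The key difference from the unadjusted strategy is that the stopping thresholds are derived from the desired error rates, and that there is also a threshold for stopping in favour of H0, so sampling does not continue indefinitely in the hope of a significant result.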

These sequential approaches are more technical than the standard *t*-test, and they are usually not included in easy-to-use software packages. Recently, several people have written accessible tutorial papers on how to perform such a sequential analysis. A good starting point is this paper by Daniel Lakens.

**Conclusion**

In their paper, Franz Weber and colleagues used an incorrect method to decide upon the sample size. As a consequence, all test results in this paper are invalid. How this passed peer review in a top journal is difficult to understand, but these things happen. It’d be interesting to see how Nature Communications deals with the aftermath of this paper…

I’m surprised you didn’t mention that Bayesian analyses (e.g. Bayes Factors) do not have that problem, and you can do optional stopping in the way the authors did without having to worry about increasing your false discovery rate.

That does depend on how you do the Bayesian analysis – see also the comments on Gelman’s blog post. I do mention Turing though, who derived the first Bayesian sequential approach.

Thanks for this fascinating and detailed post, Casper. It’s inspired me to write a statistics resource for high-school students for the NRICH site (nrich.maths.org), exploring these ideas.

Question: can you give a reference to the proof that sequential approaches have probability 1 of eventually reaching significance? I haven’t seen this result before (though it is probably well-known to professional statisticians). Also, how does it depend upon the nature of the underlying distribution of the random variable being repeatedly measured?

Thanks for the comment, Julian. Please give me a link to your post once you’ve written it.

I recall this property from a course on Mathematical Statistics I took when I was an undergraduate (ages ago). We used some chapters from (an older edition of) Siegmund’s Sequential Analysis for that course, but I wouldn’t know where exactly in the book to find this.

I do know that this is a very theoretical property. As you can see from Fig. 2, the curve gets flatter and flatter if n goes up, and with n = 500, the FDR still is below 50%.

In a tweet response to my post, Luke Jostins did some simulations (see https://twitter.com/lukejostins/status/992079066995003392) and showed that n already must be in the trillions to have the FDR > 0.90.

It’s interesting to think about how this depends on the nature of the underlying distribution – but also difficult. Under H0, the distribution of the p-value based on a sample of size n of course is uniform – but not anymore once you’ve seen the p-value based on n-1 measurements: the dependence of the p-values is the whole reason why the problem is smaller here than with independent tests. What exactly this dependence looks like, I don’t know.

Thanks, Casper, I shall do. It will probably be published at the end of the month, though my initial GeoGebra activity is already live at https://ggbm.at/TpDC4jWJ – this allows one to perform repeated binomial trials (up to 200 of them) and look at how the p-value changes as the trials are performed.

I looked up Siegmund’s book – cool stuff for a newbie, good pointer! It turns out that this result is mentioned in the introduction as being a consequence of, for example, the law of the iterated logarithm, https://en.wikipedia.org/wiki/Law_of_the_iterated_logarithm

Best wishes,

Julian
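For readers who want the connection spelled out, here is a sketch for the simplest case only: a z-statistic on i.i.d. observations $X_1, X_2, \ldots$ with mean 0 and known variance $\sigma^2$ under H0. This setting is an assumption made for illustration; the *t*-test case needs more care. The law of the iterated logarithm states that

$$
\limsup_{n \to \infty} \frac{|S_n|}{\sqrt{2 n \log\log n}} = \sigma \quad \text{almost surely}, \qquad S_n = \sum_{i=1}^{n} X_i .
$$

Dividing numerator and denominator by $\sigma\sqrt{n}$ gives, for the test statistic $Z_n = S_n / (\sigma \sqrt{n})$,

$$
\limsup_{n \to \infty} \frac{|Z_n|}{\sqrt{2 \log\log n}} = 1 \quad \text{almost surely}.
$$

Since $\sqrt{2 \log\log n} \to \infty$, $|Z_n|$ exceeds any fixed critical value such as 1.96 infinitely often with probability 1. Because $\log\log n$ grows extremely slowly, the first crossing can take a very long time.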

Thank you for this very informative post. However, an additional complication is that it is never made clear by the investigators what the experimental unit is, so the true sample size is actually unknown. It would be easy to assume that the individual mouse is the EU, but the total number of mice used is never identified; group sizes range from 3 to 12, a control group is described as n = 5, animals are housed in groups of up to 6, etc. However, the investigators also refer to neural “units” and REM “episodes” within each mouse in the context of an EU (e.g. 972 episodes in 27 mice). This no doubt is why they obtained ridiculous p-values such as p = 2.8 x 10^-48.

Indeed, there were more problems with this paper than the one I described here.

I decided to stick to just this one, as (i) I think this mistake is often made, so a blog about it would be useful; (ii) it is not my intent to list everything that is wrong with this single paper (especially since I hardly understand anything of the non-stats in there).