Daniel Wilson's Blog: January 2019

Published on Friday in Proceedings of the National Academy of Sciences USA, "The harmonic mean p-value for combining dependent tests" reports a new method for performing combined tests. A revised R package with detailed examples is now available online as the harmonicmeanp package on CRAN.

The method has two stages:

Compute a test statistic: the harmonic mean of the p-values (HMP) of the tests to be combined. Remarkably, this HMP is itself a valid p-value for small values (e.g. below 0.05).
Calculate an asymptotically exact p-value from the test statistic using the generalized central limit theorem. The distribution is a type of Stable distribution first described by Lev Landau.
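To make the first stage concrete, here is a minimal Python sketch of the test statistic only. The function name `harmonic_mean_p`, the optional `weights` argument and the example p-values are my own illustrative choices, not the interface of the harmonicmeanp package. The second stage (the asymptotically exact p-value based on the Landau distribution) is deliberately left out; the R package is the place to go for that.

```python
def harmonic_mean_p(pvalues, weights=None):
    """Weighted harmonic mean of p-values (stage 1 only).

    Hypothetical helper for illustration. With no weights given,
    every test gets equal weight 1/L, where L is the number of tests.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights and pvalues must have the same length")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    # HMP = sum(w_i) / sum(w_i / p_i)
    return sum(weights) / sum(w / p for w, p in zip(weights, pvalues))


# Made-up p-values, purely for illustration
example_pvalues = [0.01, 0.02, 0.5, 0.8]
hmp = harmonic_mean_p(example_pvalues)
print(f"HMP = {hmp:.4f}")
```

Note how the two small p-values dominate the result: the harmonic mean is pulled towards the smallest values, which is what lets it pick up a signal present in only a few of the tests.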

The method, which controls the strong-sense family-wise error rate (ssFWER), has several advantages over existing alternatives for combining p-values:

Combining p-values allows information to be aggregated over multiple tests and requires less stringent significance thresholds.
The HMP procedure is robust to positive dependence between the p-values, making it more widely applicable than Fisher's method, which assumes independence.
The HMP procedure is more powerful than the Bonferroni and Simes procedures.
The HMP procedure is more powerful than the Benjamini-Hochberg (BH) procedure, even though BH only controls the weaker false discovery rate (FDR) and weak-sense family-wise error rate (wsFWER). It is more powerful in the sense that whenever the BH procedure detects one or more significant p-values, the HMP procedure will detect one or more significant p-values or groups of significant p-values.
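As a rough illustration of the comparison with Bonferroni and Simes, the sketch below computes the three combined p-values for the same set of tests using equal weights. The function names and the example p-values are hypothetical. For any set of p-values the raw harmonic mean is never larger than the Simes combined p-value, which in turn is never larger than the Bonferroni one. That ordering is only suggestive: the raw HMP still needs the calibration of the second stage before it is compared to a significance threshold, so this sketch does not by itself demonstrate the power claims above.

```python
def bonferroni_combined_p(pvalues):
    """Bonferroni combined p-value: L times the smallest p-value, capped at 1."""
    return min(1.0, len(pvalues) * min(pvalues))


def simes_combined_p(pvalues):
    """Simes combined p-value: min over ranks i of L * p_(i) / i."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))


def raw_hmp(pvalues):
    """Equal-weight harmonic mean of the p-values (uncalibrated)."""
    return len(pvalues) / sum(1.0 / p for p in pvalues)


# Made-up p-values, purely for illustration
example_pvalues = [0.01, 0.02, 0.5, 0.8]
bonferroni = bonferroni_combined_p(example_pvalues)
simes = simes_combined_p(example_pvalues)
hmp = raw_hmp(example_pvalues)
print(f"Bonferroni = {bonferroni:.4f}, Simes = {simes:.4f}, HMP = {hmp:.4f}")
```

The ordering follows because the sum of the reciprocals of all the p-values is at least i divided by the i-th smallest p-value, for every rank i.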

The ssFWER can be considered gold-standard control of false positives because it aims to control the probability of one or more false positives even in the presence of true positives. The HMP is inspired by Bayesian model averaging and approximates a model-averaged Bayes factor under certain conditions.

In researching and revising the paper, I looked high and low for previous uses of the harmonic mean p-value, because most ideas have usually been had already. Although there is a class of methods that use different types of average p-value (without compelling motivation), I did not find a precedent. That is, until today, a few days too late, so I may as well get in there and declare it before anyone else does. In 1958, I. J. Good published a paper on what he called the "harmonic mean rule-of-thumb", effectively for model-averaging; it mysteriously appeared when I googled the new publication. Undeniably, I did not do my homework thoroughly enough. Still, I would be interested if others know more about the history of this rule-of-thumb.

Good's paper, available on Jstor, proposes that the HMP "should be regarded as an approximate tail-area probability" [i.e. p-value], although he did not propose the asymptotically exact test (Eq. 4) or the multilevel test procedure (Eq. 6) that are important to my approach. His presentation is amusingly apologetic, e.g. "an approximate rule of thumb is tentatively proposed in the hope of provoking discussion", "this rule of thumb should not be used if the statistician can think of anything better to do" and "The 'harmonic-mean rule of thumb' is presented with some misgivings, because, like many other statistical techniques, it is liable to be used thoughtlessly". Perhaps this is why the method (as far as I could tell) had disappeared from the literature. Hopefully the aspects new to my paper will shake off these misgivings and provide users with confidence that the procedure is interpretable and well-motivated on theoretical as well as empirical grounds. Please give it a read!

Work cited

R. A. Fisher (1934) Statistical Methods for Research Workers (Oliver and Boyd, Edinburgh), 5th Ed.
L. D. Landau (1944) On the energy loss of fast particles by ionization. Journal of Physics U.S.S.R. 8: 201-205.
I. J. Good (1958) Significance tests in parallel and in series. Journal of the American Statistical Association 53: 799-813. (Jstor)
R. J. Simes (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika 73: 751-754.
Y. Benjamini and Y. Hochberg (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57: 289-300.
D. J. Wilson (2019) The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences U.S.A. published ahead of print January 4, 2019. (PNAS)

Monday, 7 January 2019

New paper in PNAS: harmonic mean p-value