Tuesday, 22 October 2019
The correction to "The harmonic mean p-value for combining dependent tests" has been published at www.pnas.org/content/early/2019/10/02/1914128116 and the main article has been corrected online at www.pnas.org/content/116/4/1195.
I have posted the source code for the harmonicmeanp R package on GitHub. There is now a development version with the latest updates, together with instructions for installing it: github.com/danny-wilson/harmonicmeanp/tree/dev.
Monday, 19 August 2019
Updated correction: The harmonic mean p-value for combining dependent tests
— To view a PDF version of this announcement click here —
Important: Please update to version 3.0 of the harmonicmeanp R package to address critical errors in the main function p.hmp.
I would like to update the correction I issued on 6 July 2019 to cover a second error I discovered that affects the main function of the R package, p.hmp. There are two errors in the original paper:
- The paper (Wilson 2019 PNAS 116: 1195-1200) erroneously stated that the test \(\overset{\circ}{p}_\mathcal{R} \leq \alpha_{|\mathcal{R}|}\,w_\mathcal{R}\) controls the strong-sense family-wise error rate asymptotically when it should read \(\overset{\circ}{p}_\mathcal{R} \leq \alpha_{L}\,w_\mathcal{R}\).
- The paper incorrectly stated that one can produce adjusted p-values that are asymptotically exact, as intended in the original Figure 1, by transforming the harmonic mean p-value with Equation 4 before adjusting by a factor \(1/w_{\mathcal{R}}\). In fact the harmonic mean p-value must be multiplied by \(1/w_{\mathcal{R}}\) before transforming with Equation 4.
The p.hmp function prior to version 3.0 of the R package was affected by these errors. A code sketch of the corrected calculations follows the definitions below.
In the above,
- L is the total number of individual p-values.
- \(\mathcal{R}\) represents any subset of those p-values.
- \(\overset{\circ}{p}_\mathcal{R} = \left(\sum_{i\in\mathcal{R}} w_i\right)/\left(\sum_{i\in\mathcal{R}} w_i/p_i\right)\) is the HMP for subset \(\mathcal{R}\).
- \(w_i\) is the weight for the ith p-value. The weights must sum to one: \(\sum_{i=1}^L w_i=1\). For equal weights, \(w_i=1/L\).
- \(w_\mathcal{R}=\sum_{i\in\mathcal{R}}w_i\) is the sum of weights for subset \(\mathcal{R}\).
- \(|\mathcal{R}|\) gives the number of p-values in subset \(\mathcal{R}\).
- \(\alpha_{|\mathcal{R}|}\) and \(\alpha_{L}\) are significance thresholds provided by the Landau distribution (Table 1).
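The following is a minimal sketch in R of the corrected calculations, assuming the package's pharmonicmeanp function implements the Equation 4 transform, as its documentation describes; the subset p-values below are placeholders for illustration:
library(harmonicmeanp)
L <- 1000                                  # total number of individual p-values
p.R <- c(1e-5, 2e-5, 5e-6, 4e-5)           # p-values in an example subset R
w.R <- length(p.R) / L                     # subset weight, assuming equal weights
hmp.R <- w.R / sum((1/L) / p.R)            # HMP for the subset (about 1.1e-5)
# Error 1, corrected: compare hmp.R against alpha_L * w.R, with alpha_L
# taken from Table 1 for the desired ssFWER alpha and the total number L.
# Error 2, corrected: divide by w.R BEFORE transforming with Equation 4:
p.R.adj <- pharmonicmeanp(hmp.R / w.R, L)  # asymptotically exact adjusted p-value
p.R.adj <= 0.05                            # assess against the desired alpha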
The tutorial, available as a vignette in the R package and online, is affected quantitatively by both errors, and has been extensively updated for version 3.0.
The second error affects only one line of the corrected paper (issued July 2019). I have updated the annotated copy to address the second error and two typos in the legends of Figures 1 and 2: http://www.danielwilson.me.uk/files/wilson_2019_annotated_corrections.v2.pdf. You will need Adobe Reader to view the annotations and the embedded corrections to Figures 1 and 2 properly.
I would like to deeply apologise to users for the inconvenience the two errors have caused.
More information follows under the headings:
- Why does this matter?
- How does it affect the paper?
- Where did the errors come from?
- How do I update the R package?
- What if I have already reported results?
Why does this matter?
The family-wise error rate (FWER) controls the probability of falsely rejecting any null hypotheses, or groups of null hypotheses, when they are true. The strong-sense FWER maintains control even when some null hypotheses are false, thereby offering control across much broader and more relevant scenarios than the weak-sense FWER.
The ssFWER is not controlled at the expected rate if:
- The more lenient threshold \(\alpha_{|\mathcal{R}|}\) is used rather than the corrected threshold \(\alpha_L\), both derived via Table 1 of the paper from the desired ssFWER \(\alpha\).
- Raw p-values are transformed with Equation 4 before adjusting by a factor \(w_{\mathcal{R}}^{-1}\), rather than adjusting the raw p-values by a factor \(w_{\mathcal{R}}^{-1}\) before transforming with Equation 4.
The p.hmp function of the R package suffers both issues in version 1.0, and the second issue in version 2.0. Please update to version 3.0.
In practice, tests whose significance is marginal or non-significant at \(\alpha=0.05\) are far more likely to be affected.
Regarding error 1, individual p-values need to be assessed against the threshold \(\alpha_{L}/L\) when the HMP is used, not the more lenient \(\alpha_{1}/L\) nor the still more lenient \(\alpha/L\) (assuming equal weights). This shows that there is a cost to using the HMP compared to Bonferroni correction in the evaluation of individual p-values (and indeed small groups of p-values). For one billion tests \(\left(L=10^9\right)\) and a desired ssFWER of \(\alpha=0.01\), the fold difference in thresholds from Table 1 would be \(\alpha/\alpha_L=0.01/0.008=1.25\).
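This arithmetic can be checked numerically, assuming the package's qharmonicmeanp function returns the significance threshold \(\alpha_L\) tabulated in Table 1, as its name suggests:
library(harmonicmeanp)
alpha <- 0.01; L <- 1e9
alpha.L <- qharmonicmeanp(alpha, L)   # should be near 0.008, as in Table 1
alpha / alpha.L                       # fold difference in thresholds, about 1.25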
However, it remains the case that HMP is more powerful than Bonferroni for assessing the significance of large groups of hypotheses. This is the motivation for using the HMP, and combined tests in general, because the power to find significant groups of hypotheses will be higher than the power to detect significant individual hypotheses when the total number of tests (L) is large and the aim is to control the ssFWER.
How does it affect the paper?
I have submitted a request to correct the paper to PNAS. It is up to the editors whether to agree to this request. A copy of the published paper, annotated with the requested corrections, is available here: http://www.danielwilson.me.uk/files/wilson_2019_annotated_corrections.v2.pdf. Please use Adobe Reader to properly view the annotations and the embedded corrections to Figures 1 and 2.
Where did the errors come from?
Regarding the first error, page 11 of the supplementary information gave a correct version of the full closed testing procedure that controls the ssFWER (Equation 37). However, it went on to erroneously claim that "one can apply weighted Bonferroni correction to make a simple adjustment to Equation 6 by substituting \(\alpha_{|\mathcal{R}|}\) for \(\alpha\)." This reasoning would only be valid if the subsets of p-values to be combined were pre-selected and did not overlap. However, this would no longer constitute a flexible multilevel test in which every combination of p-values can be tested while controlling the ssFWER. The examples in Figures 1 and 2 pursued multilevel testing, in which the same p-values were assessed multiple times in subsets of different sizes, and in partially overlapping subsets of equal sizes. For the multilevel test, a formal shortcut to Equation 37, which makes it computationally practicable to control the ssFWER, is required. The simplest such shortcut procedure is the corrected test
$$\overset{\circ}{p}_\mathcal{R} \leq \alpha_{L}\,w_\mathcal{R}$$
One can show this is a valid multilevel test because if \(\overset{\circ}{p}_\mathcal{R}\leq\alpha_L\,w_\mathcal{R}\) then $$\overset{\circ}{p}=\left(w_\mathcal{R}\,\overset{\circ}{p}^{-1}_\mathcal{R}+w_{\mathcal{R}^\prime}\,\overset{\circ}{p}^{-1}_{\mathcal{R}^\prime}\right)^{-1} \leq w^{-1}_\mathcal{R}\,\overset{\circ}{p}_\mathcal{R}\leq\alpha_L,$$ where \(\mathcal{R}^\prime\) is the complement of \(\mathcal{R}\); the first inequality holds because dropping the non-negative term \(w_{\mathcal{R}^\prime}\,\overset{\circ}{p}^{-1}_{\mathcal{R}^\prime}\) from the sum can only increase its reciprocal. This argument mirrors the logic of Equation 7 for direct interpretation of the HMP (an approximate procedure), which is not affected by this correction.
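A minimal numerical illustration of this inequality in base R (all names below are hypothetical placeholders):
set.seed(2)
L <- 100; p <- runif(L); w <- rep(1/L, L)       # placeholder p-values and weights
R <- 1:20                                       # an arbitrary subset
hmp <- function(idx) sum(w[idx]) / sum(w[idx] / p[idx])
hmp(1:L) <= hmp(R) / sum(w[R])                  # always TRUE, as argued above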
The second error, which was also caused by carelessness on my part, occurred in the main text in the statement "(Equivalently, one can compare the exact p-value from Eq. 4 with \(\alpha\,w_{\mathcal{R}}\).)" I did not identify it sooner because the corrected version of the paper no longer uses Equation 4 to transform p-values in Figure 1.
How do I update the R package?
The R package is maintained at https://cran.r-project.org/package=harmonicmeanp.
In R, test whether you have version 3.0 of the package installed as follows:
packageVersion("harmonicmeanp")
The online binaries take a few days or weeks to update, so to ensure you install the most recent version of the package, install from source by typing:
install.packages("harmonicmeanp", dependencies=TRUE, type="source")
You can additionally specify the CRAN repository, for example:
install.packages("harmonicmeanp", dependencies=TRUE, type="source", repos="https://cran.wu.ac.at")
After installation, check the version number again:
stopifnot(packageVersion("harmonicmeanp") >= "3.0")
What if I have already reported results?
I am very sorry for the inconvenience caused in this case.
As long as the 'headline' test was significant with p.hmp under R package versions 1.0 or 2.0, then the weak-sense FWER can be considered to have been controlled. The 'headline' test is the test in which all p-values are included in a single combined test. The headline test is not affected by either error, because \(|\mathcal{R}|=L\) and \(w_{\mathcal{R}}=1\). The headline test controls the weak-sense FWER, and therefore so does a two-step procedure in which subsets are only deemed significant when the headline test is significant (Hochberg and Tamhane, 1987, Multiple Comparison Procedures, p. 3, Wiley).
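As a hedged sketch of the headline test (assuming the p.hmp(p, w, L) call signature used in the package tutorial; the inputs below are null placeholders):
library(harmonicmeanp)
set.seed(3)
L <- 1000; p <- runif(L); w <- rep(1/L, L)   # placeholder inputs
p.headline <- p.hmp(p, w, L)                 # |R| = L and w.R = 1: unaffected by the errors
p.headline <= 0.05                           # typically FALSE for these null p-values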
If the headline test was not significant, re-running the analysis with version 3.0 will not produce significant results either because the stringency is greater for controlling the strong-sense FWER. If the headline test was significant, you may wish to reanalyse the data with version 3.0 to obtain strong-sense FWER control, because this was the criterion the HMP procedure was intended to control.
If some results that were significant under version 1.0 or 2.0 of the R package are no longer significant, you may conclude they are not significant or you may report them as significant subject to making clear that only the weak-sense FWER was controlled.
More information
For more information please leave a comment below, or get in touch via the contact page.
Saturday, 6 July 2019
Correction: The harmonic mean p-value for combining dependent tests
Important: this announcement has been superseded. Please see the updated correction.
I would like to issue the following correction to users of the harmonic mean p-value (HMP), with apologies: The paper (Wilson 2019 PNAS 116: 1195-1200) erroneously states that the following asymptotically exact test controls the strong-sense family-wise error rate for any subset of p-values \(\mathcal{R}\):
$$\overset{\circ}{p}_\mathcal{R} \leq \alpha_{|\mathcal{R}|}\,w_\mathcal{R}$$
when it should read
$$\overset{\circ}{p}_\mathcal{R} \leq \alpha_{L}\,w_\mathcal{R}$$
where:
- L is the total number of individual p-values.
- \(\mathcal{R}\) represents any subset of those p-values.
- \(\overset{\circ}{p}_\mathcal{R} = \left(\sum_{i\in\mathcal{R}} w_i\right)/\left(\sum_{i\in\mathcal{R}} w_i/p_i\right)\) is the HMP for subset \(\mathcal{R}\).
- \(w_i\) is the weight for the ith p-value. The weights must sum to one: \(\sum_{i=1}^L w_i=1\). For equal weights, \(w_i=1/L\).
- \(w_\mathcal{R}=\sum_{i\in\mathcal{R}}w_i\) is the sum of weights for subset \(\mathcal{R}\).
- \(|\mathcal{R}|\) gives the number of p-values in subset \(\mathcal{R}\).
- \(\alpha_{|\mathcal{R}|}\) and \(\alpha_{L}\) are significance thresholds provided by the Landau distribution (Table 1).
An updated tutorial is available as a vignette in the R package and online here: http://www.danielwilson.me.uk/harmonicmeanp/hmpTutorial.html
Why does this matter?
The family-wise error rate (FWER) controls the probability of falsely rejecting any null hypotheses, or groups of null hypotheses, when they are true. The strong-sense FWER maintains control even when some null hypotheses are false, thereby offering control across much broader and more relevant scenarios.
Using the more lenient threshold \(\alpha_{|\mathcal{R}|}\) rather than the corrected threshold \(\alpha_L\), both derived via Table 1 of the paper from the desired ssFWER \(\alpha\), means the ssFWER is not controlled at the expected rate.
Tests with small numbers of p-values are far more likely to be affected in practice. In particular, individual p-values should be assessed against the threshold \(\alpha_{L}/L\) when the HMP is used, not the more lenient \(\alpha_{1}/L\) nor the still more lenient \(\alpha/L\) (assuming equal weights). This shows that there is a cost to using the HMP compared to Bonferroni correction in the evaluation of individual p-values. For one billion tests \(\left(L=10^9\right)\) and a desired ssFWER of \(\alpha=0.01\), the fold difference in thresholds from Table 1 would be \(\alpha/\alpha_L=0.01/0.008=1.25\).
However, it remains the case that HMP is much more powerful than Bonferroni for assessing the significance of groups of hypotheses. This is the motivation for using the HMP, and combined tests in general, because the power to find significant groups of hypotheses will be much higher than the power to detect significant individual hypotheses when the total number of tests (L) is large and the aim is to control the ssFWER.
How does it affect the paper?
I have submitted a request to correct the paper to PNAS. It is up to the editors whether to agree to this request. A copy of the published paper, annotated with the requested corrections, is available here: http://www.danielwilson.me.uk/files/wilson_2019_annotated_corrections.pdf. Please use Adobe Reader to properly view the annotations and the embedded corrections to Figures 1 and 2.
Where did the error come from?
Page 11 of the supplementary information gave a correct version of the full closed testing procedure that controls the ssFWER (Equation 37). However, it went on to erroneously claim that "one can apply weighted Bonferroni correction to make a simple adjustment to Equation 6 by substituting \(\alpha_{|\mathcal{R}|}\) for \(\alpha\)." This reasoning would only be valid if the subsets of p-values to be combined were pre-selected and did not overlap. However, this would no longer constitute a flexible multilevel test in which every combination of p-values can be tested while controlling the ssFWER. The examples in Figures 1 and 2 pursued multilevel testing, in which the same p-values were assessed multiple times in subsets of different sizes, and in partially overlapping subsets of equal sizes. For the multilevel test, a formal shortcut to Equation 37, which makes it computationally practicable to control the ssFWER, is required. The simplest such shortcut procedure is the corrected test
$$\overset{\circ}{p}_\mathcal{R} \leq \alpha_{L}\,w_\mathcal{R}$$
One can show this is a valid multilevel test because if \(\overset{\circ}{p}_\mathcal{R}\leq\alpha_L\,w_\mathcal{R}\) then $$\overset{\circ}{p}=\left(w_\mathcal{R}\,\overset{\circ}{p}^{-1}_\mathcal{R}+w_{\mathcal{R}^\prime}\,\overset{\circ}{p}^{-1}_{\mathcal{R}^\prime}\right)^{-1} \leq w^{-1}_\mathcal{R}\,\overset{\circ}{p}_\mathcal{R}\leq\alpha_L,$$ an argument that mirrors the logic of Equation 7 for direct interpretation of the HMP, which is not affected by this correction.
More information
For more information please leave a comment below, or get in touch via the contact page.
Monday, 25 February 2019
New paper: PVL toxin associated with pyomyositis
In a new collaborative study published this week in eLife, we report a strong association between Staphylococcus aureus that carry the PVL toxin and pyomyositis, a muscle infection often afflicting children in the tropics.
Catrin Moore and colleagues at the Angkor Children's Hospital in Siem Reap, Cambodia, spent more than a decade collecting S. aureus bacteria from pyomyositis infections in young children, and built a comparable control group of S. aureus carried asymptomatically in children of similar age and location.
When Bernadette Young in our group compared the genomes of cases and controls using statistical tools we developed, she found some strong signals:
- Most, but not all, pyomyositis was caused by the CC-121 strain, common in Cambodia.
- The association with CC-121 was driven by the PVL toxin which it carries.
The ability to pinpoint the association to PVL came about because (i) a sub-group of CC-121 that lacked PVL caused no pyomyositis and (ii) pyomyositis-causing S. aureus from backgrounds that rarely caused pyomyositis were unusual in also possessing PVL.
The PVL-pyomyositis association was extraordinarily strong, so strong that PVL appeared all but necessary for disease. Moreover, disease appeared to be monogenic, with no other genes involved elsewhere in the bacterial genome. To discover an apparently monogenic disease mechanism for a common disease is very unusual nowadays.
The discovery has immediate practical implications because it draws parallels between pyomyositis and toxin-driven bacterial diseases like tetanus and diphtheria that have proven amenable to immunization. The fact that anti-PVL vaccines have already been developed in other contexts offers hope for the future treatment of this debilitating tropical infection.
Our study throws much-needed light on a question that has been the subject of heated debate over previous years. Many bacterial toxins, PVL included, have been implicated in diverse S. aureus disease manifestations, often without sound evidence. Because PVL is known to contribute to angry, pus-filled skin infections, and has been observed in bacteria causing rare and severe S. aureus infections, some authors have implicated it in dangerous diseases including necrotizing pneumonia, septic arthritis and pyomyositis, but detailed meta-analyses have dismissed these claims as unsubstantiated. Our GWAS approach offers unprecedented robustness over previous generations of candidate gene studies by accounting for bacterial genetic variation across the entire genome.
If you are interested, please take a closer look at the paper.
Monday, 7 January 2019
New paper in PNAS: harmonic mean p-value
Published on Friday in Proceedings of the National Academy of Sciences USA, "The harmonic mean p-value for combining dependent tests" reports a new method for performing combined tests. A revised R package with detailed examples is now available online as the harmonicmeanp package on CRAN.
The method has two stages:
- Compute a test statistic: the harmonic mean of the p-values (HMP) of the tests to be combined. Remarkably, this HMP is itself a valid p-value for small values (e.g. below 0.05).
- Calculate an asymptotically exact p-value from the test statistic using the generalized central limit theorem. The distribution involved is a type of stable distribution first described by Lev Landau (see the sketch below).
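The following is a minimal sketch of the two stages, assuming the hmp.stat and pharmonicmeanp functions of the harmonicmeanp package compute the HMP and its asymptotically exact p-value, as their documentation describes; the example p-values are placeholders:
library(harmonicmeanp)
p <- c(0.03, 0.002, 0.2, 0.45)        # p-values of the tests to combine
L <- length(p)
hmp <- hmp.stat(p)                    # stage 1: harmonic mean p-value
p.comb <- pharmonicmeanp(hmp, L)      # stage 2: asymptotically exact p-value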
The method, which controls the strong-sense family-wise error rate (ssFWER), has several advantages over existing alternatives to combining p-values:
- Combining p-values allows information to be aggregated over multiple tests and requires less stringent significance thresholds.
- The HMP procedure is robust to positive dependence between the p-values, making it more widely applicable than Fisher's method, which assumes independence.
- The HMP procedure is more powerful than the Bonferroni and Simes procedures (see the sketch following this list).
- The HMP procedure is more powerful than the Benjamini-Hochberg (BH) procedure, in the sense that whenever BH detects one or more significant p-values, the HMP procedure detects one or more significant p-values or groups of significant p-values, even though BH controls only the weaker false discovery rate (FDR) and weak-sense family-wise error rate (wsFWER).
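As a hedged illustration of the power comparison with Bonferroni for a group of hypotheses (assuming qharmonicmeanp(alpha, L) returns the threshold \(\alpha_L\) of Table 1; the numbers are contrived for illustration):
library(harmonicmeanp)
L <- 1e6; alpha <- 0.05
p.R <- rep(1e-7, 100)                            # 100 modestly small p-values
w.R <- length(p.R) / L                           # equal weights
min(p.R) <= alpha / L                            # Bonferroni: FALSE, not detected
hmp.stat(p.R) <= qharmonicmeanp(alpha, L) * w.R  # HMP group test: TRUE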
The ssFWER can be considered gold-standard control of false positives because it aims to control the probability of one or more false positives even in the presence of true positives. The HMP is inspired by Bayesian model averaging and approximates a model-averaged Bayes factor under certain conditions.
In researching and revising the paper, I looked high and low for previous uses of the harmonic-mean p-value, because most ideas have already been had by somebody. Although there is a class of methods that use different types of average p-value (without compelling motivation), I did not find a precedent. Until today, that is, a few days too late, so I may as well get in there and declare it before anyone else: when I googled the new publication, a 1958 paper by I. J. Good mysteriously appeared, on what he called the "harmonic mean rule-of-thumb", effectively for model-averaging. Undeniably, I did not do my homework thoroughly enough. Still, I would be interested if others know more about the history of this rule-of-thumb.
Good's paper, available on JSTOR, proposes that the HMP "should be regarded as an approximate tail-area probability" [i.e. p-value], although he did not propose the asymptotically exact test (Eq. 4) or the multilevel test procedure (Eq. 6) that are important to my approach. His presentation is amusingly apologetic, e.g. "an approximate rule of thumb is tentatively proposed in the hope of provoking discussion", "this rule of thumb should not be used if the statistician can think of anything better to do" and "The 'harmonic-mean rule of thumb' is presented with some misgivings, because, like many other statistical techniques, it is liable to be used thoughtlessly". Perhaps this is why the method (as far as I could tell) had disappeared from the literature. Hopefully the aspects new to my paper will shake off these misgivings and give users confidence that the procedure is interpretable and well-motivated on theoretical as well as empirical grounds. Please give it a read!
Work cited
- R. A. Fisher (1934) Statistical Methods for Research Workers (Oliver and Boyd, Edinburgh), 5th Ed.
- L. D. Landau (1944) On the energy loss of fast particles by ionization. Journal of Physics U.S.S.R. 8: 201-205.
- I. J. Good (1958) Significance tests in parallel and in series. Journal of the American Statistical Association 53: 799-813. (JSTOR)
- R. J. Simes (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika 73: 751-754.
- Y. Benjamini and Y. Hochberg (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57: 289-300.
- D. J. Wilson (2019) The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences U.S.A. published ahead of print January 4, 2019. (PNAS)