Monday, 12 May 2025

Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing

Helen Fryer, Nick Arning and I have posted our new preprint to arXiv. This is the first version of the paper that we have submitted for peer review. Doublethink addresses long-standing questions about how evidence is assessed for the purposes of hypothesis testing.

Hypothesis testing is central to scientific enquiry, but conclusions can be heavily influenced by model specification, particularly by which variables are included. Bayesian model-averaged hypothesis testing offers a solution, but the sensitivity of posterior odds and Bayesian false discovery rate (FDR) guarantees to prior assumptions limits its appeal. In hypothesis testing, we lack unifying results – like the Bernstein-von Mises theorem – that predict the convergence of Bayesian and frequentist conclusions, even in large samples.

Our paper introduces new theory and a practical method, Doublethink, motivated by these issues:
  • A key, and perhaps surprising, result is that Bayesian model-averaged hypothesis testing natively controls not only the Bayesian FDR but also the frequentist strong-sense familywise error rate (FWER). This duality holds generally, yet appears to have gone unrecognized.
  • For practical application, we derive large-sample asymptotic theory that quantifies the rate at which the FWER is controlled. Specifically, we use a BIC-like approximation to characterize the tail probability of the model-averaged posterior odds via a chi-squared distribution (see the code sketch after this list).
  • This result enables simultaneous control of Bayesian FDR and frequentist FWER at quantifiable levels and – equivalently – simultaneous reporting of posterior odds and asymptotic p-values.
  • We explore the method’s benefits, such as post-hoc variable selection, and its limitations, such as finite-sample inflation of error rates, through a Mendelian randomization study and detailed simulations that compare Doublethink to the Lasso, stepwise regression, the Benjamini-Hochberg procedure and e-values.
Besides the practical benefits of model-averaged hypothesis testing with frequentist guarantees, and the implications this entails for objective Bayesian hypothesis testing, these results offer fundamental insights that are likely to renew discussion of the FDR, the FWER and the reconcilability of p-values with evidence.
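
To make the asymptotic calibration in the second bullet concrete, here is a minimal sketch of one way to compute model-averaged posterior odds under a BIC-style approximation. It makes simplifying assumptions – a Gaussian linear model, exhaustive enumeration of variable subsets, and exp(-BIC/2) as a stand-in for each model's marginal likelihood – and the function names are ours; the paper's exact construction and calibration differ.

```python
# Illustrative only: BIC-approximated model-averaged posterior odds against
# the null (intercept-only) model in a Gaussian linear model. The paper's
# exact priors and calibration differ; all names here are hypothetical.
from itertools import combinations

import numpy as np
from scipy.special import logsumexp

def bic(y, X):
    """BIC of an ordinary least squares fit; X must include an intercept
    column. For simplicity the parameter count is the number of columns."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n  # maximum-likelihood error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return k * np.log(n) - 2 * loglik

def model_averaged_posterior_odds(y, X, mu=1.0):
    """Posterior odds against the null, averaged over all subsets of the
    candidate variables, with prior inclusion odds mu per variable and
    exp(-BIC/2) standing in for each model's marginal likelihood."""
    n, nu = X.shape
    intercept = np.ones((n, 1))
    log_ml_null = -0.5 * bic(y, intercept)
    log_terms = []
    for size in range(1, nu + 1):
        for cols in combinations(range(nu), size):
            Xm = np.hstack([intercept, X[:, list(cols)]])
            log_terms.append(size * np.log(mu) - 0.5 * bic(y, Xm))
    return float(np.exp(logsumexp(log_terms) - log_ml_null))
```

Rejecting the grand null whenever the posterior odds exceed a threshold is the Bayesian decision rule; the point of the chi-squared theory is that the same rule simultaneously controls the frequentist FWER at a quantifiable level.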

Doublethink is a novel addition to the emerging class of heavy-tailed combination tests. Since 2019, methods like the Cauchy combination test and the harmonic mean p-value have gained prominence as powerful tools for combining hypothesis tests despite inter-test dependence. Doublethink improves on these methods by accommodating model uncertainty in the null hypothesis and by increasing power.
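
For readers who have not met these tests, here is a minimal sketch of the two combination rules named above, using their published formulas; only the function names and packaging are ours.

```python
# Heavy-tailed combination of p-values p_1..p_L; both tests remain usable
# under dependence between the component tests.
import numpy as np
from scipy import stats

def cauchy_combination(pvals):
    """Cauchy combination test (Liu and Xie, 2020): under the null, the
    average of tan((0.5 - p) * pi) is approximately standard Cauchy,
    regardless of dependence between the p-values."""
    p = np.asarray(pvals, dtype=float)
    t = np.mean(np.tan((0.5 - p) * np.pi))
    return stats.cauchy.sf(t)

def harmonic_mean_p(pvals):
    """Harmonic mean p-value (Wilson, 2019): the raw HMP is approximately
    a valid p-value when it is small; exact calibration uses the Landau
    distribution (implemented in the harmonicmeanp R package)."""
    p = np.asarray(pvals, dtype=float)
    return len(p) / np.sum(1.0 / p)
```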

We believe this paper will be of broad interest, addressing questions of importance to statistical methodology, big data analysis and scientific enquiry more generally.

Explanation of variables above

  • The model-averaged p-value, adjusted for multiple testing, is p*.
  • The model-averaged posterior odds, calculated from a Bayesian analysis, is PO.
  • The number of variables in the analysis is ν.
  • The prior odds of including each variable are μ.
  • The sample size n enters through the quantity ξn, which decreases as √n increases.
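
As a rough numerical illustration of how these quantities interact (our own sketch, not the paper's formulas): thresholding PO bounds the Bayesian FDR directly, and a chi-squared tail can attach an asymptotic p-value to the same decision. The exact degrees of freedom and the roles of μ and ξn in the calibration are given in the paper; the mapping below is a labeled placeholder.

```python
import numpy as np
from scipy import stats

def bayesian_fdr_bound(T):
    """Rejecting whenever PO >= T means each rejected null has posterior
    probability at most 1/(1+T), so the Bayesian FDR is bounded by 1/(1+T)."""
    return 1.0 / (1.0 + T)

def p_star_placeholder(PO, nu):
    """Placeholder: map posterior odds PO to an asymptotic p* via a
    chi-squared tail with nu degrees of freedom. The paper's theorem gives
    the exact calibration, which also involves mu and xi_n."""
    return stats.chi2.sf(2.0 * np.log(PO), df=nu)

print(bayesian_fdr_bound(19.0))        # 0.05: PO >= 19 caps the Bayesian FDR at 5%
print(p_star_placeholder(19.0, nu=5))  # illustrative adjusted p-value for nu = 5
```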
