Tuesday, 14 October 2025

New paper: Machine learning and statistical inference in microbial population genomics

We have published a new review article in Genome Biology contrasting machine learning and statistics in microbial genomics. This is joint work with Sam Sheppard, Nick Arning and David Eyre.

The availability of large genome datasets has changed the microbiology research landscape. Analyzing such data is computationally demanding, and new approaches have emerged from different data-analysis philosophies. Machine learning and statistical inference have overlapping knowledge discovery aims and approaches.

In this review, we highlight how machine learning focuses on optimizing prediction, whereas statistical inference focuses on understanding the processes relating variables. We outline the different aims, assumptions, and resulting methodologies, with examples from microbial genomics. These approaches are essentially complementary, and we argue that exploiting both machine learning and statistics – selecting the right tool for the job – has the greatest potential for advancing pathogen research in the big data era.

Monday, 12 May 2025

Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing

Helen Fryer, Nick Arning and I have posted our new preprint to arXiv. This is the first version of the paper that we have submitted for peer review. Doublethink addresses some long-standing questions in assessing evidence for the purposes of hypothesis testing.

Hypothesis testing is central to scientific enquiry, but conclusions can be heavily influenced by model specification, particularly which variables are included. Bayesian model-averaged hypothesis testing offers a solution, but the sensitivity of posterior odds and Bayesian false discovery rate (FDR) guarantees to prior assumptions limits its appeal. In hypothesis testing, we lack unifying results – like the Bernstein–von Mises theorem – that predict convergence of Bayesian and frequentist results, even in large samples.

Our paper introduces new theory and a practical method, Doublethink, motivated by these issues:
  • A key, and perhaps surprising, result is that Bayesian model-averaged hypothesis testing natively controls not only the Bayesian FDR, but also the frequentist strong-sense familywise error rate (FWER). This duality – which is general – seems to be unknown, or forgotten.
  • For practical application, we derive large-sample asymptotic theory to quantify the rate at which the FWER is controlled. Specifically, we use a BIC-like approximation to characterize the tail probability of the model-averaged posterior odds via a chi-squared distribution.
  • This result enables simultaneous control of Bayesian FDR and frequentist FWER at quantifiable levels and – equivalently – simultaneous reporting of posterior odds and asymptotic p-values.
  • We explore the method’s benefits – like post-hoc variable selection – and limitations – like inflation – through a Mendelian Randomization study and detailed simulations, comparing Doublethink to Lasso, stepwise regression, the Benjamini-Hochberg procedure and e-values.
Besides the practical benefits of model-averaged hypothesis testing with frequentist guarantees, and the implications that entails for objective Bayesian hypothesis testing, these results offer fundamental insights likely to trigger renewed discussion of FDR, FWER and the reconcilability of p-values with evidence.
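To make the BIC-based model averaging concrete, here is a minimal sketch (not the paper's actual implementation; the function names and the default prior odds `mu` are illustrative). It averages exp(−BIC/2) approximate marginal likelihoods over all variable-inclusion models, with independent prior odds `mu` per variable, and returns the log model-averaged posterior odds of H1 (at least one variable included) versus H0 (intercept only):

```python
import itertools
import math
import numpy as np

def bic_linear(y, X):
    """BIC of an ordinary least-squares fit of y on X (X includes the intercept)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * math.log(rss / n) + X.shape[1] * math.log(n)

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_posterior_odds(y, X, mu=0.1):
    """Log model-averaged posterior odds for H1 (at least one variable
    included) vs H0 (intercept only), using exp(-BIC/2) as an approximate
    marginal likelihood and independent prior odds mu per variable."""
    n, nu = X.shape
    ones = np.ones((n, 1))
    log_null = -bic_linear(y, ones) / 2
    log_alts = []
    for k in range(1, nu + 1):
        for subset in itertools.combinations(range(nu), k):
            Xm = np.hstack([ones, X[:, subset]])
            log_alts.append(k * math.log(mu) - bic_linear(y, Xm) / 2)
    return logsumexp(log_alts) - log_null
```

For three candidate variables this averages over the 2³ − 1 = 7 non-null models; log posterior odds above 0 favour including at least one variable. Working in log space avoids the underflow that exp(−BIC/2) causes at realistic sample sizes.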

Doublethink is a novel addition to the emerging class of heavy-tailed combination tests. Since 2019, methods like the Cauchy combination test and harmonic mean p-value have emerged as powerful tools for combining hypothesis tests despite inter-test dependence. Doublethink improves on these methods by accommodating model uncertainty in the null hypothesis and by increasing power.
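As a concrete example of a heavy-tailed combination test, here is a minimal implementation of the Cauchy combination test (one of the methods mentioned above, not Doublethink itself). Each p-value is mapped to a standard Cauchy deviate; the heavy tail of the Cauchy distribution is what makes the combined p-value robust to dependence between the tests:

```python
import math

def cauchy_combination(pvalues, weights=None):
    """Cauchy combination test: transform each p-value to a standard
    Cauchy deviate, take a weighted average (weights sum to 1), and
    map the statistic back to a combined p-value."""
    L = len(pvalues)
    if weights is None:
        weights = [1.0 / L] * L
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(t) / math.pi
```

With equal weights, combining identical p-values returns them unchanged, and a smaller constituent p-value always yields a smaller combined p-value.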

We believe this paper will be of broad interest, addressing questions of importance to statistical methodology, big data analysis and scientific enquiry more generally.

Explanation of variables above

  • The model-averaged p-value, adjusted for multiple testing, is p*.
  • The model-averaged posterior odds, calculated from a Bayesian analysis, is PO.
  • The number of variables in the analysis is ν.
  • The prior odds of including each variable are μ.
  • The effect of the sample size n enters through the term ξn, which decreases as √n increases.

Wednesday, 2 April 2025

Machine Learning versus Statistical Inference in Microbial Genomics

My talk given today at the 2025 Microbiology Society Conference in Liverpool:

Abstract

The advent of vast genomic datasets has transformed microbiology, presenting opportunities and challenges for data analysis. The distinct philosophies of machine learning (ML) and statistical inference give them complementary strengths and weaknesses in tackling big data problems in pathogen research. While statistical inference prioritizes understanding underlying relationships, ML focuses on optimizing predictive performance. In this talk I will contrast the approaches and offer a view on their relative utility for three problems: source attribution, bacterial genome-wide association studies, and predicting antimicrobial resistance phenotypes from whole genome sequences.
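The antimicrobial resistance example can be made concrete with a toy sketch on synthetic data (all names and the simulated effect sizes are illustrative): the statistical route tests whether a gene is associated with the phenotype via a 2×2 chi-squared statistic, while the ML route builds a simple one-rule classifier and scores it by held-out predictive accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: presence/absence of a candidate resistance gene
# and a binary resistance phenotype for 400 isolates.
n = 400
gene = rng.integers(0, 2, n)
# Resistant with probability 0.9 if the gene is present, 0.2 otherwise.
resistant = (rng.random(n) < np.where(gene == 1, 0.9, 0.2)).astype(int)

def chi2_2x2(a, b):
    """Pearson chi-squared statistic for the 2x2 table of binary a vs b."""
    table = np.array([[np.sum((a == i) & (b == j)) for j in (0, 1)]
                      for i in (0, 1)], float)
    expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

# Statistical inference: is the gene associated with resistance?
stat = chi2_2x2(gene, resistant)  # compare to 3.84, the 5% critical value (1 df)

# Machine learning: a one-rule classifier, trained on half the isolates
# and scored by predictive accuracy on the held-out half.
train = np.arange(n) < n // 2
test = ~train
maj1 = int(resistant[train][gene[train] == 1].mean() > 0.5)  # majority label, gene present
maj0 = int(resistant[train][gene[train] == 0].mean() > 0.5)  # majority label, gene absent
pred = np.where(gene[test] == 1, maj1, maj0)
accuracy = float((pred == resistant[test]).mean())
```

The two outputs answer different questions: the chi-squared statistic quantifies evidence that gene and phenotype are related, while the held-out accuracy quantifies how well the gene predicts the phenotype for new isolates.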