Showing posts with label Statistics. Show all posts

Tuesday, 14 October 2025

New paper: Machine learning and statistical inference in microbial population genomics

We have published a new review article in Genome Biology contrasting machine learning and statistics in microbial genomics. This is joint work with Sam Sheppard, Nick Arning and David Eyre.

The availability of large genome datasets has changed the microbiology research landscape. Analyzing such data is computationally demanding, and new approaches have emerged from different data analysis philosophies. Machine learning and statistical inference have overlapping aims and approaches to knowledge discovery.

In this review, we highlight how machine learning focuses on optimizing prediction, whereas statistical inference focuses on understanding the processes relating variables. We outline the different aims, assumptions, and resulting methodologies, with examples from microbial genomics. These approaches are essentially complementary, and we argue that exploiting both machine learning and statistics - selecting the right tool for the job - has the greatest potential for advancing pathogen research in the big data era.

Monday, 12 May 2025

Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing

Helen Fryer, Nick Arning and I have posted our new preprint to arXiv. This is the first version of the paper that we have submitted for peer review. Doublethink addresses some long-standing questions in assessing evidence for the purposes of hypothesis testing.

Hypothesis testing is central to scientific enquiry, but conclusions can be heavily influenced by model specification, particularly which variables are included. Bayesian model-averaged hypothesis testing offers a solution, but the sensitivity of posterior odds and Bayesian false discovery rate (FDR) guarantees to prior assumptions limits its appeal. In hypothesis testing, we lack unifying results – like the Bernstein–von Mises theorem – that predict convergence of Bayesian and frequentist results, even in large samples.

Our paper introduces new theory and a practical method, Doublethink, motivated by these issues:
  • A key, and perhaps surprising, result is that Bayesian model-averaged hypothesis testing natively controls not only the Bayesian FDR, but also the frequentist strong-sense familywise error rate (FWER). This duality – which is general – seems to be unknown, or forgotten.
  • For practical application, we derive large-sample asymptotic theory to quantify the rate at which the FWER is controlled. Specifically, we use a BIC-like model to characterize the tail probability of the model-averaged posterior odds via a chi-squared distribution.
  • This result enables simultaneous control of Bayesian FDR and frequentist FWER at quantifiable levels and – equivalently – simultaneous reporting of posterior odds and asymptotic p-values.
  • We explore the method’s benefits – like post-hoc variable selection – and limitations – like inflation – through a Mendelian Randomization study and detailed simulations, comparing Doublethink to Lasso, stepwise regression, the Benjamini-Hochberg procedure and e-values.
Besides the practical benefits of model-averaged hypothesis testing with frequentist guarantees, and the implications that entails for objective Bayesian hypothesis testing, these results offer fundamental insights likely to trigger renewed discussion of FDR, FWER and the reconcilability of p-values with evidence.
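To make the two error rates concrete, here is a minimal sketch of two standard textbook procedures: Bonferroni correction, which controls the FWER, and the Benjamini-Hochberg step-up procedure, which controls the FDR under independence. This is not the Doublethink method and does not reproduce anything from the paper; the function names and the example p-values are hypothetical, chosen only for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the familywise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.012, 0.02, 0.035, 0.5]
fwer_decisions = bonferroni_reject(example_p)
fdr_decisions = benjamini_hochberg_reject(example_p)
```

With these made-up inputs, Bonferroni rejects only the smallest p-value, whereas Benjamini-Hochberg rejects the four smallest, which illustrates why FWER control is usually regarded as the stricter of the two criteria.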

Doublethink is a novel addition to the emerging class of heavy-tailed combination tests. Since 2019, methods like the Cauchy combination test and harmonic mean p-value have surfaced as powerful tools for combining hypothesis tests despite inter-test dependence. Doublethink improves on these methods by allowing model uncertainty in the null hypothesis and by improving power.
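For readers unfamiliar with the two earlier combination tests named above, the following is a minimal sketch of their basic forms, assuming p-values strictly between 0 and 1 and weights that are normalised to sum to one. The function names are hypothetical, and this is not the Doublethink procedure. Note that the raw harmonic mean is only approximately a p-value when it is small; the calibrated version described in the harmonic mean p-value literature is not implemented here.

```python
import math


def _normalise(weights, n):
    """Return weights summing to one (equal weights if none are given)."""
    if weights is None:
        return [1.0 / n] * n
    total = sum(weights)
    return [w / total for w in weights]


def cauchy_combination(p_values, weights=None):
    """Cauchy combination test: map each p-value to a Cauchy variate,
    take the weighted sum, and convert back with the Cauchy tail."""
    w = _normalise(weights, len(p_values))
    t = sum(wi * math.tan((0.5 - p) * math.pi) for wi, p in zip(w, p_values))
    return 0.5 - math.atan(t) / math.pi


def harmonic_mean_p(p_values, weights=None):
    """Weighted harmonic mean of the p-values (uncalibrated)."""
    w = _normalise(weights, len(p_values))
    return 1.0 / sum(wi / p for wi, p in zip(w, p_values))


# Hypothetical p-values, for illustration only.
example_p = [0.01, 0.2, 0.6]
combined_cauchy = cauchy_combination(example_p)
combined_hmp = harmonic_mean_p(example_p)
```

Both statistics are dominated by the smallest p-values, which is the heavy-tailed behaviour that makes them robust to dependence between tests.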

We believe this paper will be of broad interest, addressing questions of importance to statistical methodology, big data analysis and scientific enquiry more generally.

Explanation of variables above

  • The model-averaged p-value, adjusted for multiple testing, is p*.
  • The model-averaged posterior odds, calculated from a Bayesian analysis, is PO.
  • The number of variables in the analysis is ν.
  • The prior odds of including each variable are μ.
  • The sample size n is represented by ξn, which decreases as √n increases.

Wednesday, 2 April 2025

Machine Learning versus Statistical Inference in Microbial Genomics

My talk given today at the 2025 Microbiology Society Conference in Liverpool:

Abstract

The advent of vast genomic datasets has transformed microbiology, presenting opportunities and challenges for data analysis. The distinct philosophies of machine learning (ML) and statistical inference give them complementary strengths and weaknesses in tackling big data problems in pathogen research. While statistical inference prioritizes understanding underlying relationships, ML focuses on optimizing predictive performance. In this talk I will contrast the approaches and offer a view on their relative utility for three problems: source attribution, bacterial genome-wide association studies, and predicting antimicrobial resistance phenotypes from whole genome sequences.

Monday, 29 July 2024

Doublethink methods paper

Today we release the first full draft of the Doublethink methods paper. This is an evolution of what was originally conceived as the supplement to the Doublethink COVID-19 paper. The wider significance of the results persuaded us to separate the two, which now focus on:

  • Doublethink methods paper: Broad connections between Bayesian and classical hypothesis testing that we hope bring the best of both worlds by enabling scientists to simultaneously control the Bayesian false discovery rate and the classical familywise error rate in big data settings.
  • Doublethink COVID-19 paper: Identifying direct risk factors for COVID-19 hospitalization among 2000 candidate variables in 200,000 UK Biobank participants. Compares results to the literature and considers the limitations imposed by mediation and complex 'exposome-wide' association studies.
After soliciting comments from colleagues and another round of editing, we will move toward submission later this year.

Wednesday, 23 February 2022

Seeking Postdoc in Statistical Genetics and Infectious Disease

I am seeking a senior postdoc in Statistical Genetics and Infectious Disease to join my research group at the Big Data Institute, University of Oxford. Our research into Infectious Disease Genomics is focused on developing and applying big data methods to identify genetic risk factors for disease, both microbial virulence factors and human susceptibility genes. We are focused on a range of bacterial and viral diseases including staphylococcal sepsis and COVID-19.

The Big Data Institute, part of Oxford Population Health, provides an excellent environment for multi-disciplinary research and teaching. Situated on the modern Old Road Campus in the heart of the medical sciences neighbourhood of Headington, we benefit from outstanding facilities and opportunities to collaborate with world-leading scientists and clinicians to help expand knowledge and improve global health.

As a Senior Postdoc the post-holder will work closely with me to jointly lead the implementation, design and application of new statistical tools for genome-wide association studies, and to lead the biological interpretation of key findings. They will develop novel methodologies for analysis and data collection, take the lead in the production of scientific reports and publications and supervise junior group members.

To be considered, applicants will have a PhD and post-doctoral experience in a relevant subject, with direct experience in statistical genetics, demonstrable expertise in and knowledge of the statistical genetics literature or a closely related, relevant discipline, and a first-author publication record in statistical genetics.

The position is full time (part time considered) and fixed-term for 3 years.

The closing date for applications is 12.00 noon GMT on 18th March.

Click here for more information including how to apply.

Tuesday, 25 January 2022

Announcing ProbGen22 in Oxford 28-30 March

The organizing committee is pleased to announce the 7th Probabilistic Modeling in Genomics Conference (ProbGen22) to be held at the Blavatnik School of Government and Somerville College Oxford from 28th-30th March 2022.

The meeting will be a hybrid in-person and online event. Talk sessions will feature live speakers, both in-person and online, and will take place during the afternoons (making live attendance feasible for US timezones). Talks will be recorded and made available to registrants for a period of one month. Poster sessions will be held online during the evenings.

The conference will cover probabilistic models, algorithms, and statistical methods across a broad range of applications in genetics and genomics. We invite abstract submissions on a range of topics including population genetics, natural selection, quantitative genetics, methods for GWAS, applications to cancer and other diseases, causal inference in genetic studies, functional genomics, assembly and variant identification, phylogenetics, single-cell 'omics, deep learning in genomics and pathogen genomics.

The registration deadline is 28th February 2022.

For more details visit the conference website. 

Tuesday, 7 December 2021

Two new positions: Senior Statistical Geneticist and Bioinformatician

Two new positions are available in my Infectious Disease Genomics group at the Big Data Institute, University of Oxford.

A Senior Postdoctoral Statistical Geneticist to jointly lead the implementation, design and application of new statistical tools for genome-wide association studies, lead the biological interpretation of key findings, develop methodologies and supervise junior group members. This post would suit a candidate with a PhD and relevant post-doctoral experience including direct experience in statistical genetics. Candidates without post-doctoral experience may be considered for a less senior appointment.

A Bioinformatician to provide expertise for computationally intensive analyses including genome-wide association studies and RNAseq studies of differential gene expression, as well as contributing to informatics projects as part of a wider collaboration with national biomedical cohorts. This post would suit a candidate with either a post-graduate degree related to Bioinformatics, Statistics, and Computing or equivalent experience in industry.

The application deadline for both posts is Noon GMT on Friday 7th January 2022.

Monday, 7 September 2020

Postdoc position available in Statistical Genomics

I am seeking someone with a track record in methods development for Statistical Genomics and an interest in Infectious Disease to join the group. The aim of the post is to conduct innovative research within the group's range of interests and to make use of the opportunities afforded by our outstanding collaborators. I would welcome candidates who wish to use the opportunity as a stepping stone to independent funding.

The postdoc will join a team with expertise in microbiology, genomics, evolution, population genetics and statistical inference. Responsibilities will include planning a research project and milestones with help and guidance from the group, preparing manuscripts for publication, keeping records of results and methods and tracking milestones, and disseminating results, including through academic conferences.

We will consider applicants who hold, or are close to completion of, a PhD/DPhil involving statistical methods development, and who have experience of large-scale statistical data analysis, evidence of originating and executing independent academic research ideas, excellent interpersonal skills and the ability to work closely with others in a team.

The position is advertised to 31 December 2021. The application deadline is noon on Thursday 1st October 2020. Visit the University recruitment page to apply.

Friday, 21 August 2020

The group's research response to COVID-19

This is an update on the group's research response to the COVID-19 pandemic. As an infectious disease group we have been keen to contribute to the international research effort where we could be useful, while recognising the need to continue our research on other important infections where possible.

  • Bugbank. Thanks to a pre-existing collaboration between our group, Public Health England and UK Biobank, we were in a position to help rapidly facilitate COVID-19 research via SARS-CoV-2 PCR-based swab test results. Beginning mid-March, we worked to provide regular (usually weekly) updates of test results, which were made available to all UK Biobank researchers beginning April 17th. This is one of several resources on COVID-19 linked to UK Biobank. Beginning in May we provided feeds to other cohorts: INTERVAL, COMPARE, Genes & Health and the NIHR BioResource. We provide updates on this work through the project website www.bugbank.uk. We have published a paper describing the dynamic data linkage in Microbial Genomics (press release). Key collaborators in this project are Jacob Armstrong (Big Data Institute), Naomi Allen (UK Biobank), and David Wyllie and Anne Marie O'Connell (Public Health England).


  • Epidemiological risk factors for COVID-19. Graduate student Nicolas Arning and I are developing an approach to quantify the effects of lifestyle and medical risk factors for COVID-19 in the UK Biobank that accounts for inherent uncertainty in which risk factors to consider. The new method employs the harmonic mean p-value, a model-averaging approach for big data that we published previously. We are in the process of evaluating the performance of the approach, comparing it to machine learning, and interpreting the results.

  • Antibody testing for the UK Government. Postdoc Justine Rudkin has been working in the lab with Derrick Crook, Sir John Bell and others to measure the efficacy of antibody tests for the UK Government. They have tested many hundreds of kits to establish the sensitivity and specificity of the tests to help evaluate the utility of a national testing programme. This work was crucial in demonstrating the limitations of early blood-spot based tests, and the credibility of subsequent generations of antibody tests. The work has been published in Wellcome Open Research.


Work on other infections has continued during the lockdown. Postdoc Sarah Earle continues research into pathogen genetic risk factors for diseases including tuberculosis and meningococcal meningitis, while Steven Lin has continued to pursue work on hepatitis C virus genetics and epidemiology. Many of our close collaborators are infection doctors and they have of course been recalled to clinical duties. Laboratory work in the group has been severely disrupted, particularly several of Justine's Staphylococcus aureus projects. We are keen to pick up those projects where we left off when the chance arrives.