Thanks to Jeff Chen who asked Google NotebookLM to produce this radio-style audio description of the harmonic mean p-value paper: hmp-google-notebooklm.m4a
Friday, 15 November 2024
Monday, 29 July 2024
Doublethink methods paper
Today we release the first full draft of the Doublethink methods paper. This is an evolution of what was originally conceived as the supplement to the Doublethink COVID-19 paper. The wider significance of the results persuaded us to separate the two, which now focus on:
- Doublethink methods paper: Broad connections between Bayesian and classical hypothesis testing that we hope bring the best of both worlds by enabling scientists to simultaneously control the Bayesian false discovery rate and the classical familywise error rate in big data settings.
- Doublethink COVID-19 paper: Identifying direct risk factors for COVID-19 hospitalization among 2000 candidate variables in 200,000 UK Biobank participants. Compares results to the literature and considers the limitations imposed by mediation and complex 'exposome-wide' association studies.
Tuesday, 23 January 2024
Festival of Genomics 2024
I will be talking at the Festival of Genomics on Wednesday 24 January about identifying virulence and antimicrobial resistance genes in bacteria using genome-wide association studies. You can preview my talk here.
Wednesday, 3 January 2024
Introducing Doublethink: joint Bayesian-frequentist model-averaged hypothesis testing
This week Nick Arning, Helen Fryer and I released two related preprints describing a new method called Doublethink, and its application to identifying risk factors for COVID-19 hospitalization in UK Biobank:
- Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing. Fryer, Arning, Wilson (2023) arXiv doi: 10.48550/arXiv.2312.17566
- Identifying direct risk factors in UK Biobank with simultaneous Bayesian-frequentist model-averaged hypothesis testing using Doublethink. Arning, Fryer, Wilson (2024) medRxiv doi: 10.1101/2024.01.01.24300687
Doublethink: Bayesian-frequentist model-averaged hypothesis testing
- In classical tests, the statistical evidence that one variable directly affects an outcome generally depends on which other variables are assumed to directly affect it.
- In Bayesian tests, the statistical evidence that one variable directly affects an outcome depends on the prior assumptions.
Identifying direct risk factors in UK Biobank with Doublethink
- The ability to discover unexpected results.
- Stringent control for multiple testing.
- Avoidance of bias in choosing candidate risk factors or deciding to publish.
- Recapitulated several commonly reported direct risk factors, e.g. age, sex, and obesity.
- Excluded others, e.g. diabetes, cardiovascular disease, and hypertension, which might be mediated through other variables that measure general comorbidity.
- Identified some infrequently reported direct risk factors, both individually, e.g. lung infection, and as groups, e.g. constipation/urinary tract infection, which might reflect underlying kidney disease.
Wednesday, 28 September 2022
Rewley House Lecture: Role of data science in the pandemic
This year I was invited to give the Rewley House Lecture, a multidisciplinary research talk open to all, at the Department for Continuing Education, where I am Director of Studies in Data Science.
I talked about how data science has been used during the COVID-19 pandemic, spanning vaccine design, clinical trials, surveillance and policy advice, and highlighting the identification of risk factors for disease.
If you like this talk, you might be interested in the following courses available this academic year:
- Infectious Disease Modelling: Mathematical Techniques (September 2022)
- Infectious Disease Modelling: Applied Methods in R (January 2023)
- Pandemic Data Science (April 2023)
Thursday, 25 August 2022
Identifying resistance genes in tuberculosis
Newly published in PLOS Biology is our work identifying genes that confer resistance to common and last-resort antibiotics in bacteria that cause tuberculosis. Resistance to these drugs contributes to mortality and sickness on a pandemic scale every year, and disproportionately affects the poorest people in the world.
This new article is one of a series presenting results generated by more than 100 scientists in 23 countries over 5+ years as part of a collaboration called CRyPTIC.
Our role in CRyPTIC was the discovery of genes and mutations likely to cause drug resistance by applying a tool known as a genome-wide association study (GWAS), an approach we helped adapt to bacteria.
Using GWAS, we identified previously uncatalogued genes and mutations underlying resistance to every one of the 13 drugs we investigated. These include new and repurposed drugs, as well as the first- and second-line drugs more often used to treat tuberculosis.
Thanks to its generous funders, CRyPTIC dedicated scale (10,000+ genomes) and technical innovation (new high-throughput MIC assays) to help decode the DNA blueprint of antibiotic resistance. Pushing these boundaries has yielded a steep increase of up to 36% in the variation in resistance attributable to the genome for the important and previously understudied new and repurposed drugs.
Science at this scale can produce a seemingly overwhelming wealth of new information. We avoided the temptation to over-emphasize any individual result for the sake of simple narrative. Instead, we highlighted discoveries of uncatalogued genes or genetic variants that we found for every drug investigated:
• The amidase AmiA2 and GTPase Era for bedaquiline.
• The cytochrome P450 enzyme Cyp142 for clofazimine.
• The serine/threonine protein kinase PknH for delamanid.
• The antitoxin VapB20 for linezolid.
• The PPE-motif family outer membrane protein PPE42 for amikacin and kanamycin.
• The antibiotic-induced transcriptional regulator WhiB7 for ethionamide.
• The rRNA methylase TlyA for levofloxacin.
• The DNA gyrase subunit B GyrB for moxifloxacin.
• The putative rhodaneses CysA2 and CysA3 for rifabutin.
• The tRNA/rRNA methylase SpoU for ethambutol and rifampicin.
• The multidrug efflux transport system repressor Rv1219 for isoniazid.
All these hits passed stringent evidence thresholds that take into account the large amount of data crunched. For each hit, we identified possible relationships between gene functions, such as they are known, and the mechanism of action of the antibiotics.
Beyond the biological discoveries of primary interest, this new paper unveils methodological advances in bacterial GWAS. We introduced a systematic, whole-genome approach to analysing not just short DNA sequences (so-called oligonucleotide or “kmer”-based approaches), but also short sequences of the proteins that the DNA codes for (an oligopeptide-based approach). We have released our software on an open-source GitHub repository.
We also discovered a relationship that may help disentangle a technical issue in bacterial GWAS, where the co-occurrence of traits can trick us into thinking that a gene influences one trait when it influences another instead. For antimicrobial resistance, this issue is known as artefactual cross resistance. We observed that true associations tended to have larger effect sizes (as measured by the 'coefficient', rather than the p-value), providing a possible way to prioritize signals in the future.
This paper was published alongside the CRyPTIC Data Compendium in PLOS Biology, in which we released our data open source to the community, with resources provided by the European Bioinformatics Institute.
Some of the results of CRyPTIC have already been rushed into service by the World Health Organization on the grounds of exceptional importance based on a candidate gene approach; this includes the DNA gyrase subunit B – moxifloxacin association spotlighted above (Walker et al 2022). However, the new results go beyond a candidate gene approach, detecting a range of previously uncatalogued genes via an agnostic, whole-genome strategy.
Unpicking the genetics of antimicrobial resistance is a priority for improving rapid susceptibility tests for individual patients, selecting drug regimens that inhibit the evolution of multidrug resistance, and developing improved treatment options. The need is particularly great in M. tuberculosis, which killed 1.4 million people in 2019, owing to the slow (6-12 week) turnaround of traditional susceptibility testing, and the alarming threat of multidrug resistant tuberculosis. The discovery of many new candidate resistance variants therefore represents an advance that we hope will contribute to progress in reducing the burden of disease.
Wednesday, 23 February 2022
Seeking Postdoc in Statistical Genetics and Infectious Disease
I am seeking a senior postdoc in Statistical Genetics and Infectious Disease to join my research group at the Big Data Institute, University of Oxford. Our research into Infectious Disease Genomics is focused on developing and applying big data methods to identify genetic risk factors for disease, both microbial virulence factors and human susceptibility genes. We are focused on a range of bacterial and viral diseases including staphylococcal sepsis and COVID-19.
The Big Data Institute, part of Oxford Population Health, provides an excellent environment for multi-disciplinary research and teaching. Situated on the modern Old Road Campus in the heart of the medical sciences neighbourhood of Headington, we benefit from outstanding facilities and opportunities to collaborate with world-leading scientists and clinicians to help expand knowledge and improve global health.
As a Senior Postdoc the post-holder will work closely with me to jointly lead the implementation, design and application of new statistical tools for genome-wide association studies, and to lead the biological interpretation of key findings. They will develop novel methodologies for analysis and data collection, take the lead in the production of scientific reports and publications and supervise junior group members.
To be considered, applicants will have a PhD and post-doctoral experience in a relevant subject, with direct experience in statistical genetics, demonstrable expertise in and knowledge of the literature of statistical genetics or a closely related discipline, and a first-author publication record in statistical genetics.
The position is full time (part time considered) and fixed-term for 3 years.
The closing date for application is 12.00 noon GMT on 18th March.
Announcing the Oxford Statistical Genomics Summer School 2022
Join us at St Hilda's College Oxford, overlooking the River Cherwell and Christ Church Meadow, for an immersive week-long residential post-graduate summer school on Statistical Genomics on 19th-24th June 2022. This course aims to connect post-graduate and post-doctoral researchers from academia and industry with experts at Oxford's Big Data Institute, Wellcome Centre for Human Genetics, and Department of Statistics.
Our friendly tutors, internationally recognised for their scientific expertise, will offer specialist instruction and hands-on computer practicals across five broad areas of Statistical Genomics: Next-generation Sequence Data Analysis, Gene and Variant Association Testing, Genomics of Infectious Diseases, Genealogical Inference and Analysis, and Medical Genomics.
The course is aimed at trainee scientists actively engaged in statistical genomics research, who wish to expand their knowledge of concepts and techniques.
Click here for more information including how to apply.
Wednesday, 26 January 2022
Postdoctoral and Ph.D. positions in the group
If you are interested in joining the group, please contact me (details here) with a brief explanation and a copy of an up-to-date CV.
Tuesday, 25 January 2022
Announcing ProbGen22 in Oxford 28-30 March
The organizing committee is pleased to announce the 7th Probabilistic Modeling in Genomics Conference (ProbGen22) to be held at the Blavatnik School of Government and Somerville College Oxford from 28th-30th March 2022.
The meeting will be a hybrid in-person and online event. Talk sessions will feature live speakers, both in-person and online, and will take place during the afternoons (making live attendance feasible for US timezones). Talks will be recorded and made available to registrants for a period of one month. Poster sessions will be held online during the evenings.
The conference will cover probabilistic models, algorithms, and statistical methods across a broad range of applications in genetics and genomics. We invite abstract submissions on a range of topics including population genetics, natural selection, quantitative genetics, methods for GWAS, applications to cancer and other diseases, causal inference in genetic studies, functional genomics, assembly and variant identification, phylogenetics, single cell 'omics, deep learning in genomics and pathogen genomics.
The registration deadline is 28th February 2022.
For more details visit the conference website.
Tuesday, 7 December 2021
Two new positions: Senior Statistical Geneticist and Bioinformatician
Two new positions are available in my Infectious Disease Genomics group at the Big Data Institute, University of Oxford.
A Senior Postdoctoral Statistical Geneticist to jointly lead the implementation, design and application of new statistical tools for genome-wide association studies, lead the biological interpretation of key findings, develop methodologies and supervise junior group members. This post would suit a candidate with a PhD and relevant post-doctoral experience including direct experience in statistical genetics. Candidates without post-doctoral experience may be considered for a less senior appointment.
A Bioinformatician to provide expertise for computationally intensive analyses including genome-wide association studies and RNAseq studies of differential gene expression, as well as contributing to informatics projects as part of a wider collaboration with national biomedical cohorts. This post would suit a candidate with either a post-graduate degree related to Bioinformatics, Statistics, and Computing or equivalent experience in industry.
The application deadline for both posts is Noon GMT on Friday 7th January 2022.
New paper: Machine learning to predict the source of campylobacteriosis using whole genome data
This study, published in October in PLOS Genetics, brings together machine learning, large bacterial isolate collections and whole genome sequencing to address the general problem of how to trace the source of human infections.
Specifically, we investigated campylobacteriosis, a common infection of animal origin causing ~1.5 million cases of gastroenteritis and 10,000 hospitalizations every year in the United States alone. We show that our combined machine learning/genomics analyses:
- Improve the accuracy with which infections can be traced back to farm reservoirs.
- Identify evolutionary shifts in bacterial affinity for livestock host species.
- Detect changes in human infection capability within related strains.
These results will improve understanding not only of Campylobacter but of bacterial pathogens more generally, as these technologies can readily be applied to other important species.
This paper builds on previous work published by the group, including our well cited Tracing the source of campylobacteriosis (Wilson et al 2008, PLOS Genetics 4:e1000203). The use of these methods for tracing infection has influenced public health policy and contributed to reducing disease burden.
This work demonstrates the potential for modern genomics and artificial intelligence approaches to address common and serious problems that affect our everyday lives. The awareness of the importance of infection to society has rarely been higher than in 2021, and while the current pandemic imposes an acute global problem, other infections continue to present long-term threats to health and productivity.
New paper: Antimicrobial resistance determinants are associated with Staphylococcus aureus bacteraemia and adaptation to the healthcare environment
Staphylococcus aureus is a leading cause of infectious disease deaths in all countries, with bloodstream infection leading to sepsis a major concern. This new study, published in November in Microbial Genomics, reports genes and genetic variants in Staph. aureus associated with severe disease vs asymptomatic carriage, and with healthcare vs community carriage.
Our genome-wide association study of 2000 bacterial genomes showed that antibiotic resistance in Staph. aureus is associated with severe disease and the hospital environment:
- A mutation conferring trimethoprim resistance (dfrB F99Y) and the presence of a gene conferring methicillin resistance (mecA) were both associated with bloodstream infection vs asymptomatic nose carriage.
- Separately, we demonstrated that a mutation conferring fluoroquinolone resistance (gyrA L84S) and variation in a gene involved in resistance to multiple antibiotics (prsA) were preferentially associated with healthcare-associated carriage vs community-acquired carriage.
New paper: Genome-wide association studies reveal the role of polymorphisms affecting factor H binding protein expression in host invasion by Neisseria meningitidis
In this paper, published in October in PLOS Pathogens, we discovered a novel genetic association between life-threatening invasive meningococcal disease (IMD) and bacterial genetic variation in factor H binding protein (fHbp) through two bacterial genome-wide association studies (GWAS), which we validated experimentally. This was a collaboration with the groups of Chris Tang and Martin Maiden, with the work in my group led by Sarah Earle.
fHbp is an important component of meningococcal vaccines that directly interacts with human complement factor H (CFH). Intriguingly, our discovery that bacterial genetic variation in fHbp associates with increased virulence mirrors an earlier discovery that human genetic variation in CFH associates with increased susceptibility to IMD (Nature Genetics 42: 772).
Our experiments showed that the fHbp risk allele increased expression. Interestingly, increased susceptibility to IMD has been previously associated with elevated CFH expression. Therefore over-expression of either fHbp by the bacterium or CFH by the host appears to increase the risk of IMD. Since complement evasion is necessary for pathogenesis, these insights offer new leads for improving treatment.
Key results from the paper:
- A GWAS for IMD in 261 meningococci from the Czech Republic highlighted a highly polygenic architecture of meningococcal virulence (see Figure), including capsule biosynthesis genes, the meningococcal disease association island and the new signal near the fba and fHbp genes.
- A replication GWAS for IMD in 1295 meningococcal genomes belonging to strain ST41/44 downloaded from pubMLST.org validated the novel signal of association near fba and fHbp.
- SHAPE reactivity analyses revealed that IMD-associated variation in the regulatory region of fHbp disrupted the ability of the cell machinery to commence gene expression.
- Flow cytometry assays of newly constructed genetically engineered strains, in different temperatures and in the presence and absence of human serum, attributed changes in gene expression to a non-synonymous candidate mutation in the fHbp gene.
In this study, our GWAS relied exclusively on publicly available genome sequences and metadata, highlighting the untapped potential of large-scale open source databases like pubMLST.org, and the value of big data for improving our understanding of disease.
Tuesday, 13 April 2021
New positions: Data Scientist in Public Health Epidemiology and Postdoc in Statistical Methods
I am looking to fill two positions at the Big Data Institute, Nuffield Department of Population Health, University of Oxford: a Data Scientist in Public Health Epidemiology and a Postdoctoral Researcher in Statistical Methods.
The Big Data Institute (BDI) is an interdisciplinary research centre that develops, evaluates and deploys efficient methods for acquiring and analysing biomedical data at scale and for exploiting the opportunities arising from such studies. The Nuffield Department of Population Health (NDPH), a key partner in the BDI, contains world-renowned population health research groups and is an excellent environment for multi-disciplinary teaching and research.
The role of the Data Scientist in Public Health Epidemiology is to help pilot a project developing systems for continuous record linkage between a large Public Health England (PHE) data source and other population health records, with the aim of facilitating research into infectious diseases.
The post holder will manage and develop record linkage algorithms comparing records with relational databases containing health records via appropriate anonymization protocols, and manage and develop systems for identifying incoming records of interest, for near-real time updating of SQL databases, and for issuing email and SMS alerts in response to these events. The responsibilities will also include contributing to large-scale statistical studies using public health records to investigate disease epidemiology, and analysing and interpreting results, reviewing and refining working hypotheses, writing reports and presenting findings to colleagues.
To be considered, applicants will hold a degree in Computer Science, Data Science, Statistics, or another relevant subject with a strong quantitative component, or have equivalent experience. They will also need an understanding of relational database construction and SQL queries, experience coding in at least one common programming language (e.g. C#, Java, Python) and good interpersonal skills with the ability to work closely with others as part of a team, while taking personal responsibility for assigned tasks.
The role of the Postdoctoral Researcher in Statistical Methods is to develop statistical methods based on the harmonic mean p-value (HMP) approach. The HMP bridges classical and Bayesian approaches to model-averaged hypothesis testing, with applications to very large-scale data analysis problems in biomedical science.
The post holder will join a team with expertise in statistical inference, population genetics, genomics, evolution, epidemiology and infectious disease. The responsibilities will include developing statistical methods based on the HMP, undertaking research under the direction of the principal investigator, helping with supervision within the project as required, driving forward manuscripts for publication in collaboration with group members and disseminating results through other means such as academic conferences.
To be considered, applicants will hold, or be close to completion of, a PhD/DPhil involving statistical methods development, and will have a track record of publication-quality work in statistical theory or methods. The ability to work independently in pursuing the goals of an agreed research plan, excellent interpersonal skills and the ability to work closely with others as a team are also essential.
The closing date for both positions is noon on the 5th May 2021. Only applications received through the online system will be considered:
- Click here to apply for the Data Scientist in Public Health Epidemiology position
- Click here to apply for the Postdoctoral Researcher in Statistical Methods position
Presentation: Genome-wide association studies of COVID-19
An updated version of this talk, given at the Nuffield Department of Population Health's annual symposium 2021:
Monday, 7 September 2020
Postdoc position available in Statistical Genomics
I am seeking someone with a track record in methods development for Statistical Genomics and an interest in Infectious Disease to join the group. The aim of the post is to conduct innovative research within the group's range of interests and to make use of the opportunities afforded by our outstanding collaborators. I would welcome candidates who wish to use the opportunity as a stepping stone to independent funding.
The postdoc will join a team with expertise in microbiology, genomics, evolution, population genetics and statistical inference. Responsibilities will include planning a research project and milestones with help and guidance from the group, preparing manuscripts for publication, keeping records of results and methods and tracking milestones, and disseminating results, including through academic conferences.
We will consider applicants who hold, or are close to completion of, a PhD/DPhil involving statistical methods development, and who have experience of large-scale statistical data analysis, evidence of originating and executing independent academic research ideas, excellent interpersonal skills and the ability to work closely with others in a team.
The position is advertised to 31 December 2021. The application deadline is noon on Thursday 1st October 2020. Visit the University recruitment page to apply.
Friday, 21 August 2020
Presentation: Genome-wide Association Studies of COVID-19
An online recording of the talk about Genome-wide Association Studies of COVID-19 at the UK Biobank 2020 meeting on 23 June 2020. The full conference is also online.
The group's research response to COVID-19
This is an update on the group's research response to the COVID-19 pandemic. As an infectious disease group we have been keen to contribute to the international research effort where we could be useful, while recognising the need to continue our research on other important infections where possible.
- Bugbank. Thanks to a pre-existing collaboration between our group, Public Health England and UK Biobank, we were in a position to help rapidly facilitate COVID-19 research via SARS-CoV-2 PCR-based swab test results. Beginning mid-March, we worked to provide regular (usually weekly) updates of test results, which were made available to all UK Biobank researchers beginning April 17th. This is one of several resources on COVID-19 linked to UK Biobank. Beginning in May we provided feeds to other cohorts: INTERVAL, COMPARE, Genes & Health and the NIHR BioResource. We provide updates on this work through the project website www.bugbank.uk. We have published a paper describing the dynamic data linkage in Microbial Genomics (press release). Key collaborators in this project are Jacob Armstrong (Big Data Institute), Naomi Allen (UK Biobank), and David Wyllie and Anne Marie O'Connell (Public Health England).
- COVID-19 Host Genetics Initiative. Along with groups at McGill and the Broad Institute, I have contributed analyses of UK Biobank to investigate genetic risk factors for COVID-19. The Host Genetics Initiative is drawing on more than 200 cohorts from around the world to conduct meta-analysis to tackle questions such as why some people suffer severe COVID-19 while others get mild or asymptomatic infection (media coverage). The meta-analysis results are publicly released periodically. This led to the independent replication of a genetic risk factor on chromosome 3 discovered by a Spanish-Italian cohort and published in the New England Journal of Medicine.
- Epidemiological risk factors for COVID-19. Graduate student Nicolas Arning and I are developing an approach to quantify the effects of lifestyle and medical risk factors for COVID-19 in the UK Biobank that accounts for inherent uncertainty in which risk factors to consider. The new method employs the harmonic mean p-value, a model-averaging approach for big data that we published previously. We are in the process of evaluating the performance of the approach, comparing it to machine learning, and interpreting the results.
- Antibody testing for the UK Government. Postdoc Justine Rudkin has been working in the lab with Derrick Crook, Sir John Bell and others to measure the efficacy of antibody tests for the UK Government. They have tested many hundreds of kits to establish the sensitivity and specificity of the tests to help evaluate the utility of a national testing programme. This work was crucial in demonstrating the limitations of early blood-spot based tests, and the credibility of subsequent generations of antibody tests. The work has been published in Wellcome Open Research.
- Office for National Statistics Antibody Survey. Justine has also been working in the lab with Sarah Walker to set up robots for her large antibody testing cohort study. Sarah is leading an infection survey with the Office for National Statistics to investigate exposure to SARS-CoV-2 in England and Wales.
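For readers unfamiliar with the harmonic mean p-value mentioned in the epidemiological risk factors item above, the following is a minimal Python sketch of the raw statistic only: a weighted harmonic mean of individual p-values. The function name and the example values are illustrative assumptions, not taken from our software. The raw statistic is not by itself a calibrated p-value; the published method applies a further adjustment that is not shown here.

```python
def harmonic_mean_p(p_values, weights=None):
    """Raw weighted harmonic mean of p-values (illustrative helper, no calibration).

    p_values: individual p-values, each in (0, 1].
    weights:  optional non-negative weights; equal weights are assumed if omitted.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one p-value is required")
    if weights is None:
        weights = [1.0 / n] * n  # assumption: equal weight for every test
    if len(weights) != n:
        raise ValueError("weights and p_values must have the same length")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    if any(w < 0.0 for w in weights) or sum(weights) <= 0.0:
        raise ValueError("weights must be non-negative with a positive sum")
    # Weighted harmonic mean: sum of weights divided by sum of weight/p.
    return sum(weights) / sum(w / p for w, p in zip(weights, p_values))


if __name__ == "__main__":
    # Made-up p-values for illustration only.
    example = [0.01, 0.5, 0.5]
    print(harmonic_mean_p(example))
```

Because small p-values dominate the sum of reciprocals, the combined value stays close to the strongest signals, which is what makes the statistic attractive when averaging over many alternative models.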
Teaching: Online lectures and practical on Phylogenetics in Practice
On March 16th, we were in the interesting position of running an infectious disease course at the Big Data Institute on the day the national lockdown was announced in response to the COVID-19 pandemic. As a result, we were among the first in the university to do remote teaching, something Katrina Lythgoe and the rest of us had prepared for a week earlier, in anticipation of a lockdown that did not happen at that point.
These are the two online lectures in the Health Data Sciences CDT that I gave called Phylogenetics in Practice.
The online practical, which applies phylogenetics approaches to understand the Zika virus epidemic, is implemented as a Docker container, and available here.
Presentation on identifying COVID-19 inpatients from Public Health England data
This is a presentation I gave at the COVID-19 Host Genetics Initiative meeting on 2nd July 2020 about using Public Health England's Second Generation Surveillance System to identify COVID-19 inpatients among SARS-CoV-2 positive individuals in England.
For further information, please see this bugbank blog post comparing inpatients identified using SGSS and Hospital Episode Statistics.
Monday, 20 July 2020
Royal Society Summer Science Exhibition 2020
Friday, 20 March 2020
New paper: GenomegaMap for dN/dS in over 10,000 genomes
The dN/dS ratio is a popular statistic in evolutionary genetics that quantifies the relative rates of protein-altering and non-protein-altering mutations. The ratio is normalized so that under neutral evolution, i.e. when the survival and reproductive advantage of all variants is the same, it equals 1. Typically, dN/dS is observed to be less than 1, meaning that new mutations tend to be disfavoured, implying they are harmful to survival or reproduction. Occasionally, dN/dS is observed to be greater than 1, meaning that new mutations are favoured, implying they provide some survival or reproductive advantage. The aim of estimating dN/dS is usually to identify mutations that provide an advantage.
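To make the ratio described above concrete, here is a minimal Python sketch of a simple counting-style estimate. It is not the genomegaMap method: it assumes the numbers of protein-altering (non-synonymous) and non-protein-altering (synonymous) differences and sites have already been counted, and it applies a standard Jukes-Cantor correction for multiple hits. The function names and the example counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Correct a proportion of differing sites for multiple substitutions."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Counting-style dN/dS from pre-computed difference and site counts.

    dN is the corrected rate of protein-altering differences per
    non-synonymous site; dS is the corrected rate of non-protein-altering
    differences per synonymous site.
    """
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    if ds == 0.0:
        raise ValueError("dS is zero, so the ratio is undefined")
    return dn / ds


if __name__ == "__main__":
    # Made-up counts: fewer protein-altering differences per site than
    # synonymous ones, so the ratio falls below 1.
    ratio = dn_ds(nonsyn_diffs=5, nonsyn_sites=1000, syn_diffs=20, syn_sites=500)
    print(f"dN/dS = {ratio:.3f}")
```

Dividing each count by its own number of sites is what normalizes the ratio to 1 under neutrality: if both kinds of mutation accumulate at the same per-site rate, the numerator and denominator are equal.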
Theoreticians are often critical of dN/dS because it is more of a descriptive statistic than a process-driven model of evolution. This overlooks the problem that currently available models make simplifying assumptions such as minimal interference between adjacent mutations within genes. These assumptions are not obviously appropriate in many species, including infectious micro-organisms, that exchange genetic material infrequently.
There are many methods for measuring dN/dS. This new paper overcomes two common problems:
- It is fast no matter how many genomes are analysed together.
- It is robust whether there is frequent genetic exchange (which causes phylogenetic methods to report spurious signals of advantageous mutation) or infrequent genetic exchange.
Software that implements genomegaMap is available on Docker Hub, and the source code and documentation are available on GitHub.
With the steady rise in the number of genome sequences, data analysis becomes an increasing challenge even with modern computers. We hope this new method provides a useful way to exploit the opportunities in such large datasets to gain new insights into evolution.
Monday, 16 March 2020
Postdoc Available in Statistical Genetics
We are seeking an exceptional researcher with a track record in methods development for Statistical Genomics and an interest in Infectious Disease to join our group at the Big Data Institute. Our research focuses on Bacterial Genomics, Genome-Wide Association Studies and Population Genetics. The aim of the post is to conduct innovative research within the group's range of interests and to make use of the opportunities afforded by our outstanding collaborators. We welcome candidates who wish to use the opportunity as a stepping stone to independent funding.
The Oxford University Big Data Institute (BDI) is an interdisciplinary research centre aiming to develop, evaluate and deploy efficient methods for acquiring and analysing biomedical data at scale and for exploiting the opportunities arising from such studies. The Nuffield Department of Population Health, a partner in the BDI, contains world-renowned population health research groups and is an excellent environment for multi-disciplinary teaching and research.
The Postdoctoral Researcher in Statistical Genomics will join our team which has expertise in microbiology, genomics, evolution, population genetics and statistical inference. Responsibilities include planning a research project and milestones with help and guidance from the group, preparing manuscripts for publication, keeping records of results and methods and tracking milestones, and disseminating results.
To be considered, you need to hold, or be close to completion of, a PhD/DPhil involving statistical methods development. You also need experience of large-scale statistical data analysis, evidence of originating and executing your own academic research ideas and excellent interpersonal skills and the ability to work closely with others in a team.
For informal enquiries, please contact me.
Further details, including how to apply, are here: https://my.corehr.com/pls/uoxrecruit/erq_jobspec_details_form.jobspec?p_id=145506
Tuesday, 22 October 2019
Correction published and R package on GitHub
I have posted the source code for the harmonicmeanp R package on GitHub. This means there is now a development version with the latest updates, and instructions for installing it: github.com/danny-wilson/harmonicmeanp/tree/dev.
Monday, 19 August 2019
Updated correction: The harmonic mean p-value for combining dependent tests
I would like to update the correction I issued on July 3, 2019 to cover a second error I discovered that affects the main function of the R package, p.hmp. There are two errors in the original paper:
- The paper (Wilson 2019 PNAS 116: 1195-1200) erroneously stated that the test \(\overset{\circ}{p}_\mathcal{R} \leq \alpha_{|\mathcal{R}|}\,w_\mathcal{R}\) controls the strong-sense family-wise error rate asymptotically when it should read \(\overset{\circ}{p}_\mathcal{R} \leq \alpha_{L}\,w_\mathcal{R}\).
- The paper incorrectly stated that one can produce adjusted p-values that are asymptotically exact, as intended in the original Figure 1, by transforming the harmonic mean p-value with Equation 4 before adjusting by a factor \(1/w_{\mathcal{R}}\). In fact the harmonic mean p-value must be multiplied by \(1/w_{\mathcal{R}}\) before transforming with Equation 4.
In the above,
- L is the total number of individual p-values.
- \(\mathcal{R}\) represents any subset of those p-values.
- \(\overset{\circ}{p}_\mathcal{R} = \left(\sum_{i\in\mathcal{R}} w_i\right)/\left(\sum_{i\in\mathcal{R}} w_i/p_i\right)\) is the HMP for subset \(\mathcal{R}\).
- \(w_i\) is the weight for the ith p-value. The weights must sum to one: \(\sum_{i=1}^L w_i=1\). For equal weights, \(w_i=1/L\).
- \(w_\mathcal{R}=\sum_{i\in\mathcal{R}}w_i\) is the sum of weights for subset \(\mathcal{R}\).
- \(|\mathcal{R}|\) gives the number of p-values in subset \(\mathcal{R}\).
- \(\alpha_{|\mathcal{R}|}\) and \(\alpha_{L}\) are significance thresholds provided by the Landau distribution (Table 1).
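The definitions above can be sketched in a few lines of Python. This is an illustration only, not the code of the harmonicmeanp R package: the function names are hypothetical, and the threshold \(\alpha_L\) must be looked up in Table 1 of the paper for the chosen \(\alpha\) and L. The value used in the usage example is a placeholder, not a Table 1 entry.

```python
def hmp_subset(p, w, subset):
    """Weighted harmonic mean p-value for the p-values indexed by `subset`.

    Returns the HMP and w_R, the sum of the weights in the subset.
    The weights w are assumed to sum to one over all L p-values.
    """
    w_R = sum(w[i] for i in subset)
    hmp = w_R / sum(w[i] / p[i] for i in subset)
    return hmp, w_R


def hmp_significant(p, w, subset, alpha_L):
    """Corrected test: HMP of the subset <= alpha_L * w_R."""
    hmp, w_R = hmp_subset(p, w, subset)
    return hmp <= alpha_L * w_R


# Toy example with L = 4 p-values and equal weights w_i = 1/L
p = [0.01, 0.02, 0.5, 0.8]
w = [0.25] * 4
alpha_L = 0.04  # placeholder only: take the real value from Table 1 of the paper

for subset in ([0, 1, 2, 3], [0, 1], [2, 3]):
    hmp, w_R = hmp_subset(p, w, subset)
    print(subset, round(hmp, 4), hmp_significant(p, w, subset, alpha_L))
```

Note that the same threshold \(\alpha_L\), determined by the total number of p-values L, is used for every subset, whatever its size.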
The tutorial, available as a vignette in the R package and online, is affected quantitatively by both errors, and has been extensively updated for version 3.0.
The second error affects only one line of the corrected paper (issued July 2019). I have updated it to address the second error and two typos in Figure legends 1 and 2: http://www.danielwilson.me.uk/files/wilson_2019_annotated_corrections.v2.pdf. You will need Adobe Reader to properly view the annotations and the embedded corrections to Figures 1 and 2.
I would like to deeply apologise to users for the inconvenience the two errors have caused.
More information follows under the headings:
- Why does this matter?
- How does it affect the paper?
- Where did the errors come from?
- How do I update the R package?
- What if I have already reported results?
Why does this matter?
The ssFWER is not controlled at the expected rate if:
- The more lenient threshold \(\alpha_{|\mathcal{R}|}\) is used rather than the corrected threshold \(\alpha_L\), both derived via Table 1 of the paper from the desired ssFWER \(\alpha\).
- Raw p-values are transformed with Equation 4 before adjusting by a factor \(w_{\mathcal{R}}^{-1}\), rather than adjusting the raw p-values by a factor \(w_{\mathcal{R}}^{-1}\) before transforming with Equation 4.
Regarding error 1, individual p-values need to be assessed against the threshold \(\alpha_{L}/L\) when the HMP is used, not the more lenient \(\alpha_{1}/L\) nor the still more lenient \(\alpha/L\) (assuming equal weights). This shows that there is a cost to using the HMP compared to Bonferroni correction in the evaluation of individual p-values (and indeed small groups of p-values). For one billion tests \(\left(L=10^9\right)\) and a desired ssFWER of \(\alpha=0.01\), the fold difference in thresholds from Table 1 would be \(\alpha/\alpha_L=0.01/0.008=1.25\).
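The arithmetic in this example can be checked with a short Python sketch. The values of L, \(\alpha\) and \(\alpha_L\) are those quoted above; the variable names are my own, and equal weights are assumed.

```python
L = 10**9        # total number of tests
alpha = 0.01     # desired ssFWER
alpha_L = 0.008  # threshold quoted above from Table 1 for this L and alpha

# Per-test thresholds for an individual p-value, assuming equal weights
bonferroni_threshold = alpha / L
hmp_individual_threshold = alpha_L / L

# Fold difference between the two thresholds
fold_difference = alpha / alpha_L

print(f"Bonferroni threshold:     {bonferroni_threshold:.1e}")
print(f"HMP individual threshold: {hmp_individual_threshold:.1e}")
print(f"Fold difference:          {fold_difference:.2f}")
```

The HMP threshold for an individual p-value is the smaller of the two, which is the cost referred to above.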
However, it remains the case that the HMP is more powerful than Bonferroni for assessing the significance of large groups of hypotheses. This is the motivation for using the HMP, and combined tests in general: when the total number of tests (L) is large and the aim is to control the ssFWER, the power to find significant groups of hypotheses will be higher than the power to detect significant individual hypotheses.
How does it affect the paper?
Where did the errors come from?
The second error, which was also caused by carelessness on my part, occurred in the main text in the statement "(Equivalently, one can compare the exact p-value from Eq. 4 with \(\alpha\,w_{\mathcal{R}}\).)" I did not identify it sooner because the corrected version of the paper no longer uses Equation 4 to transform p-values in Figure 1.
How do I update the R package?
After installation, check the version number again:
What if I have already reported results?
More information
Saturday, 6 July 2019
Correction: The harmonic mean p-value for combining dependent tests
I would like to issue the following correction to users of the harmonic mean p-value (HMP), with apologies: The paper (Wilson 2019 PNAS 116: 1195-1200) erroneously states that the following asymptotically exact test controls the strong-sense family-wise error rate for any subset of p-values \(\mathcal{R}\):
$$\overset{\circ}{p}_\mathcal{R} \leq \alpha_{|\mathcal{R}|}\,w_\mathcal{R}$$
when it should read
$$\overset{\circ}{p}_\mathcal{R} \leq \alpha_{L}\,w_\mathcal{R}$$
In the above,
- L is the total number of individual p-values.
- \(\mathcal{R}\) represents any subset of those p-values.
- \(\overset{\circ}{p}_\mathcal{R} = \left(\sum_{i\in\mathcal{R}} w_i\right)/\left(\sum_{i\in\mathcal{R}} w_i/p_i\right)\) is the HMP for subset \(\mathcal{R}\).
- \(w_i\) is the weight for the ith p-value. The weights must sum to one: \(\sum_{i=1}^L w_i=1\). For equal weights, \(w_i=1/L\).
- \(w_\mathcal{R}=\sum_{i\in\mathcal{R}}w_i\) is the sum of weights for subset \(\mathcal{R}\).
- \(|\mathcal{R}|\) gives the number of p-values in subset \(\mathcal{R}\).
- \(\alpha_{|\mathcal{R}|}\) and \(\alpha_{L}\) are significance thresholds provided by the Landau distribution (Table 1).
An updated tutorial is available as a vignette in the R package and online here: http://www.danielwilson.me.uk/harmonicmeanp/hmpTutorial.html
Why does this matter?
How does it affect the paper?
Where did the error come from?
More information
Monday, 25 February 2019
New paper: PVL toxin associated with pyomyositis
Catrin Moore and colleagues at the Angkor Children's Hospital in Siem Reap, Cambodia, spent more than a decade collecting S. aureus bacteria from pyomyositis infections in young children, and built a comparable control group of S. aureus carried asymptomatically in children of similar age and location.
When Bernadette Young in our group compared the genomes of cases and controls using statistical tools we developed, she found some strong signals:
- Most, but not all, pyomyositis was caused by the CC-121 strain, common in Cambodia.
- The association with CC-121 was driven by the PVL toxin which it carries.
Monday, 7 January 2019
New paper in PNAS: harmonic mean p-value
The method has two stages:
- Compute a test statistic: the harmonic mean of the p-values (HMP) of the tests to be combined. Remarkably, this HMP is itself a valid p-value for small values (e.g. below 0.05).
- Calculate an asymptotically exact p-value from the test statistic using the generalized central limit theorem. The distribution is a type of Stable distribution first described by Lev Landau.
- Combining p-values allows information to be aggregated over multiple tests and requires less stringent significance thresholds.
- The HMP procedure is robust to positive dependence between the p-values, making it more widely applicable than Fisher's method which assumes independence.
- The HMP procedure is more powerful than the Bonferroni and Simes procedures.
- The HMP procedure is more powerful than the Benjamini-Hochberg (BH) procedure, in the sense that whenever the BH procedure detects one or more significant p-values, the HMP procedure will detect one or more significant p-values or groups of significant p-values. This holds even though BH controls only the weaker false discovery rate (FDR) and weak-sense family-wise error rate (wsFWER).
- R. A. Fisher (1934) Statistical Methods for Research Workers (Oliver and Boyd, Edinburgh), 5th Ed.
- L. D. Landau (1944) On the energy loss of fast particles by ionization. Journal of Physics U.S.S.R. 8: 201-205.
- I. J. Good (1958) Significance tests in parallel and in series. Journal of the American Statistical Association 53: 799-813. (Jstor)
- R. J. Simes (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika 73: 751-754.
- Y. Benjamini and Y. Hochberg (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57: 289-300.
- D. J. Wilson (2019) The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences U.S.A. published ahead of print January 4, 2019. (PNAS)
Wednesday, 18 July 2018
Bacterial Doubling Times in the Wild
We have done it by comparing two known quantities and taking the ratio: the rate at which DNA mutates in bacteria per year, and the rate at which it mutates per replication. In theory, this tells us how many replications there are per year.
The mutation rate per replication has long been studied in the laboratory, and is around once per billion letters. Meanwhile, the recent avalanche of genomic data has allowed microbiologists to quantify the rate at which bacteria evolve over short time scales such as a year, including during outbreaks and even within individual infected patients. Most bugs mutate about once per million letters per year, with ten-fold variation above and below this not uncommon among different species.
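As a rough illustration of the ratio, here is a minimal Python sketch using the round numbers quoted above rather than the species-specific estimates from the paper. The function name and the year length of 365.25 days are my own assumptions.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed average year length


def doubling_time_hours(rate_per_letter_per_year, rate_per_letter_per_replication):
    """Average doubling time implied by the two mutation rates (hypothetical helper)."""
    # Ratio of the two rates gives the number of replications per year
    replications_per_year = rate_per_letter_per_year / rate_per_letter_per_replication
    return HOURS_PER_YEAR / replications_per_year


# Round figures: about once per million letters per year,
# and about once per billion letters per replication
typical = doubling_time_hours(1e-6, 1e-9)
print(f"About {typical:.1f} hours per doubling")
```

With these round figures the ratio is about 1,000 replications per year, or roughly one doubling every nine hours. A ten-fold faster or slower yearly rate would shift this to under an hour or several days respectively.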
For five species, both these quantities exist. The fastest bug we looked at causes cholera and we estimate it doubles once every hour on average (give or take 30 minutes). The slowest was Salmonella, which we estimate doubles once a day on average (give or take 8 hours). In between were Staph. aureus and Pseudomonas at about two hours each, and E. coli at 15 hours. These are averages over the very diverse and often hostile conditions that a bacterial cell may find itself in during the course of its natural lifecycle. To find out more about the work, please check out the paper.
Friday, 29 June 2018
PhD Studentship: Genomic prediction of antimicrobial resistance spread
An opportunity has arisen for a D.Phil. (Ph.D.) place on the BBSRC-funded Oxford Interdisciplinary Bioscience Doctoral Training Partnership in the area of Artificial Intelligence, specifically Predicting the spread of antimicrobial resistance from genomics using machine learning.
If successful in a competitive application process, the candidate will join a cohort of students enrolled in the DTP’s one-year interdisciplinary training programme, before commencing the research project and joining my research group at the Big Data Institute.
This project addresses the BBSRC priority area “Combatting antimicrobial resistance” by using ML to predict the spread of antimicrobial resistance in human, animal and environmental bacteria exemplified by Escherichia coli. Understanding how quickly antimicrobial resistance (AMR) will spread helps plan effective prevention, improved biosecurity, and strategic investment into new measures. We will develop ML tools for large genomic datasets to predict the future spread of AMR in humans, animals and the environment. The project will create new methods based on award-winning probabilistic ML tools pioneered in my group (BASTA, SCOTTI) by training models using genomic and epidemiological data informative about past spread of AMR. We will apply the tools collaboratively to genomic studies of E. coli in Kenya, the UK and across Europe from humans, animals and the environment, Enterobacteriaceae in North-West England, and Campylobacter in Wales. Genomics has proven effective for asking “what went wrong” in the context of outbreak investigation and AMR spread; here we will address the greater challenge of repurposing such information using ML for forward prediction of future spread of AMR. Scrutiny will be intense because future predictions can and will be tested, raising the bar for the biological realism required while producing computationally efficient tools.
Attributes of suitable applicants: Understanding of genomics. Interest in infectious disease. Some numeracy, e.g. mathematics A-level, desirable. Experience of coding would help.
Funding notes: BBSRC eligibility criteria for studentship funding apply (https://www.ukri.org/files/legacy/news/training-grants-january-2018-pdf/). Successful students will receive a stipend of no less than the standard RCUK stipend rate, currently set at £14,777 per year.
How to apply: send me a CV and brief covering letter/email (no more than 1 page) explaining why you are interested and suitable by the Wednesday 11 July initial deadline. I will invite the best applicant/s to submit with me a formal application in time for the Friday 13 July second-stage deadline.