Phylogenetic analysis is essential to genomic epidemiology, for example in tracing the origin and evolution of SARS-CoV-2 variants during the COVID-19 pandemic. We previously introduced CMAPLE, a single-threaded implementation of the MAPLE algorithm designed for large-scale epidemiological genomic datasets. CMAPLE can reconstruct phylogenetic trees from up to one million SARS-CoV-2 genomes. Here, we present CMAPLE 2, a multi-threaded version of CMAPLE with parallel sample placement and subtree pruning and regrafting (SPR) search algorithms. CMAPLE 2 also reduces memory consumption by compressing data structures using multiple references along the tree instead of a single reference genome. It further implements two advanced models of highly site- and nucleotide-specific mutation patterns as observed in pandemic-scale genome data. Additionally, CMAPLE 2 parallelizes SPR-based Tree Assessment (SPRTA), an efficient and interpretable approach for assessing phylogenetic tree uncertainty, and supports ancestral state and mutation inference via mutation-annotated tree (MAT) reconstruction. When inferring a phylogeny from 500,000 SARS-CoV-2 genomes using 48 CPU cores, CMAPLE 2 reduces runtime from 5 days (with sequential CMAPLE) to 9 hours (a 13-fold speedup) while decreasing peak RAM usage from 11.1 GB to 7.3 GB. CMAPLE 2 can now reconstruct a tree of nearly four million SARS-CoV-2 genomes from scratch within 12 days using 41 GB of RAM, a task that the sequential CMAPLE and MAPLE cannot realistically complete. CMAPLE 2 is applicable to many pathogen genome datasets and enhances our preparedness for future pandemics.
Ly-Trong, N., Martin, S., Goldman, N., De Maio, N., Minh, B. Q.
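
The memory saving from using multiple references along the tree, rather than a single reference genome, can be illustrated with a toy sketch. This is not CMAPLE 2's actual data structure: the helper names (`diff`, `apply_diff`), the 12-base sequences, and the choice of a single clade-level local reference are all hypothetical, chosen only to show why storing differences against a nearby reference needs fewer entries than storing them against one global reference.

```python
# Toy illustration only: hypothetical names and sequences, not CMAPLE 2 code.

def diff(seq, ref):
    """Return {position: base} for the sites where seq differs from ref."""
    return {i: b for i, (a, b) in enumerate(zip(ref, seq)) if a != b}

def apply_diff(ref, changes):
    """Rebuild a sequence from a reference and its stored differences."""
    bases = list(ref)
    for pos, base in changes.items():
        bases[pos] = base
    return "".join(bases)

global_ref = "ACGTACGTACGT"

# Assumed local reference: the root of a clade that carries three mutations
# relative to the global reference.
local_ref = "ACGAACCTACTT"

# Four samples from that clade; each differs from local_ref at no more than one site.
samples = ["ACGAACCTACTT", "ACGAACCTACTA", "ACGAACCTGCTT", "TCGAACCTACTT"]

# Single-reference scheme: every sample repeats the shared clade mutations.
single = [diff(s, global_ref) for s in samples]
single_ref_entries = sum(len(d) for d in single)

# Multi-reference scheme: store the local reference once (as a diff against
# the global reference), then store each sample against the local reference.
local_ref_diff = diff(local_ref, global_ref)
multi = [diff(s, local_ref) for s in samples]
multi_ref_entries = len(local_ref_diff) + sum(len(d) for d in multi)

# Both schemes are lossless: every sample can be reconstructed exactly.
rebuilt_local = apply_diff(global_ref, local_ref_diff)
rebuilt_single = [apply_diff(global_ref, d) for d in single]
rebuilt_multi = [apply_diff(rebuilt_local, d) for d in multi]

print("entries with a single reference:", single_ref_entries)
print("entries with a local reference: ", multi_ref_entries)
```

In this toy case the shared clade mutations are stored once instead of once per sample, so the saving grows with the number of samples in the clade. How CMAPLE 2 chooses and places its references along the tree is not described in the abstract and is not modelled here.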
