Tiny Subsamples and Upsampling Tame Big Data Evolutionary Analysis in Phylogenomics

Preprint, bioRxiv. Created on 23 Jun 2026.

Long runtime, high memory demands, and reliance on high-performance computing increasingly limit the evolutionary analysis of long phylogenomic datasets. We review a scalable framework based on phylogenomic subsampling and upsampling (PSU), in which many small subsamples of sites from a long concatenated sequence alignment are extended by upsampling prior to inference, and the resulting analyses are then aggregated to obtain stable evolutionary estimates. PSU exploits a useful distinction between the computational burden and the inferential power of statistical methods in molecular phylogenetics: computational cost is strongly influenced by the number of distinct site patterns in the concatenated alignment, whereas statistical power depends primarily on the amount of evolutionary information represented by sites and substitutions. By reducing the former while restoring the latter through upsampling, PSU can approximate many full-data analyses at substantially lower computational cost. Evidence from simulated and empirical datasets shows that PSU can accurately estimate bootstrap support values, select optimal substitution models, test evolutionary hypotheses, and infer branch lengths, divergence times, and associated uncertainty measures, while often reducing runtime and memory requirements by orders of magnitude. The same subsampling-upsampling-aggregation principle underlies all of these applications. PSU also provides distributions of inferred clade support across independent subsamples, enabling detection of concordant and conflicting phylogenetic signals that may remain hidden in conventional concatenated phylogenomic analyses. Adaptive procedures for selecting the subsample size, the number of subsamples, and the number of upsampling replicates make the framework practical across diverse datasets. We suggest that PSU is a general strategy for scalable phylogenomic inference across a broad range of statistical methods. 
By enabling rigorous analyses of genome-scale alignments on standard computing hardware, PSU expands access to computationally intensive evolutionary methods while reducing the environmental and infrastructural costs of big-data phylogenomics.
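
The subsample, upsample, and aggregate steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the resampling rule, so the sketch assumes that upsampling means resampling the subsampled sites with replacement up to the original alignment length, and that aggregation is a simple mean. All function names are hypothetical, and the toy `p_distance` estimator stands in for real phylogenetic inference.

```python
import random
from collections import Counter


def subsample_sites(n_sites, size, rng):
    """Pick `size` distinct column indices from an alignment with n_sites columns."""
    return rng.sample(range(n_sites), size)


def upsample_sites(subsample, target, rng):
    """Resample the chosen columns with replacement up to `target` columns.

    Assumption: this resampling rule is illustrative, not taken from the abstract.
    """
    return rng.choices(subsample, k=target)


def site_patterns(alignment, columns):
    """Count how often each site pattern (one character per sequence) occurs."""
    return Counter(tuple(seq[c] for seq in alignment) for c in columns)


def p_distance(patterns):
    """Toy estimator: weighted share of sites where the first two sequences differ."""
    total = sum(patterns.values())
    differing = sum(w for p, w in patterns.items() if p[0] != p[1])
    return differing / total


def psu_estimate(alignment, estimator, size, n_subsamples, n_replicates, seed=0):
    """Average an estimator over upsampled replicates of several small subsamples."""
    rng = random.Random(seed)
    n_sites = len(alignment[0])
    subsample_means = []
    for _ in range(n_subsamples):
        sub = subsample_sites(n_sites, size, rng)
        values = []
        for _ in range(n_replicates):
            cols = upsample_sites(sub, n_sites, rng)
            # The estimator only sees weighted patterns: at most `size` distinct ones.
            values.append(estimator(site_patterns(alignment, cols)))
        subsample_means.append(sum(values) / len(values))
    # Return the aggregate and the per-subsample values (their spread is informative).
    return sum(subsample_means) / len(subsample_means), subsample_means


if __name__ == "__main__":
    # Hypothetical two-sequence alignment with 1000 sites, built for illustration only.
    toy = ["A" * 1000, "A" * 800 + "C" * 200]
    estimate, per_subsample = psu_estimate(toy, p_distance, size=50,
                                           n_subsamples=10, n_replicates=20)
    print(estimate, per_subsample)
```

Each upsampled replicate has as many sites as the full alignment but no more distinct site patterns than the subsample, which is the cost and power distinction the abstract draws. The per-subsample values returned alongside the aggregate correspond loosely to the distributions across independent subsamples mentioned above.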

Kumar, S., Tamura, K., Sharma, S.
