samsampleX: Distribution-aware downsampling for benchmarking next-generation sequencing data

Preprint Created on 06 Jun 2026 bioRxiv

High-throughput next-generation sequencing (NGS) is essential for genetic variant discovery across diverse applications. As NGS technologies evolve, there is a growing need for benchmarking tools that support realistic data simulation and downsampling. Existing downsampling tools sample sequencing reads uniformly, which models realistic coverage distributions poorly, particularly in difficult-to-sequence regions and hybrid sequencing designs. Here we present samsampleX, a Python-based tool implementing a novel distribution-aware downsampling algorithm that dynamically adjusts read retention probabilities to emulate coverage profiles derived from real sequencing data. Using ultra-high-coverage reference datasets, samsampleX accurately reproduces coverage patterns observed in typical sequencing experiments and outperforms uniform downsampling methods at preserving depth variability, both in genomic regions such as the HLA locus and in hybrid whole-exome/genome sequencing configurations. samsampleX extends current downsampling strategies by offering greater flexibility for specialized NGS benchmarking scenarios, facilitating improved assessment of sequencing data analysis methods.
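
To make the idea of dynamically adjusted retention probabilities concrete, here is a minimal Python sketch. It is not the samsampleX implementation, which the abstract does not describe in detail. It assumes reads are simple half-open `(start, end)` intervals on one contig and that the target profile is a list of per-base depths; the names `coverage` and `downsample` are hypothetical. Each read is kept with a probability equal to the depth still needed over its span divided by the source depth still available there, so the probability changes as reads are accepted or rejected.

```python
import random


def coverage(reads, length):
    """Per-base depth of half-open (start, end) reads on a contig of `length` bases."""
    depth = [0] * length
    for start, end in reads:
        for i in range(start, end):
            depth[i] += 1
    return depth


def downsample(reads, target, seed=0):
    """Illustrative distribution-aware downsampling (an assumption, not samsampleX).

    reads  : list of (start, end) intervals with 0 <= start < end <= len(target)
    target : desired per-base depth profile, e.g. taken from a real experiment
    """
    rng = random.Random(seed)
    length = len(target)
    remaining = coverage(reads, length)  # source depth not yet processed
    kept_depth = [0] * length            # depth of the reads retained so far
    kept = []
    for start, end in sorted(reads):
        span = range(start, end)
        # Depth still missing from the target over this read's span.
        needed = sum(max(target[i] - kept_depth[i], 0) for i in span)
        # Source depth that could still fill it (includes the current read).
        available = sum(remaining[i] for i in span)
        p = min(1.0, needed / available) if available else 0.0
        if rng.random() < p:
            kept.append((start, end))
            for i in span:
                kept_depth[i] += 1
        for i in span:
            remaining[i] -= 1
    return kept


if __name__ == "__main__":
    reads = [(0, 5), (2, 7), (4, 9), (5, 10), (0, 10)]
    target = [1] * 5 + [2] * 5  # hypothetical profile: deeper second half
    subset = downsample(reads, target, seed=1)
    print(coverage(subset, 10))
```

Uniform downsampling corresponds to using one fixed probability for every read. In this sketch the probability instead rises where retained depth lags the target and falls where the target is already met, which is the behaviour the abstract attributes to distribution-aware downsampling.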

Demiriz, S., Taliun, D.
