DNA foundation models are increasingly proposed as general-purpose representations for genomic prediction and design, yet their evaluation remains largely centered on conventional regulatory tasks. This leaves a critical question unresolved: do DNA foundation models generalize to sequence biology beyond conventional gene regulation? To answer this question, we introduce RloopBench, a systematic benchmark for R-loop-forming sequence prediction as a biophysically distinct, genome-stability-associated task. We compare rule-based methods, task-specific models, classical sequence encodings, and foundation model representations across in-distribution, cross-platform, consensus-level, and cross-species evaluations. Foundation models achieve strong performance when positive and negative sequences are compositionally separable, but this advantage does not consistently transfer to cross-platform and cross-species settings, where they are often comparable to classical k-mer representations. Unexpectedly, a one-hot classifier baseline shows the strongest overall sensitivity to R-loop-forming sequences, exceeding more complex models across several generalization tests. Rule-based and task-specific models also exhibit limited transfer outside their original training regimes. Performance is further shaped by sequence properties, negative-control design, experimental platform, and species-specific genomic context. Together, RloopBench establishes genome-stability-associated sequence prediction as a complementary direction for DNA foundation model development and evaluation, while underscoring that simple sequence encodings remain necessary baselines for assessing model generalization beyond conventional regulatory tasks.
Zhang, Y., Ganesan, A., Lin, X.
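The abstract names one-hot and k-mer encodings as the classical baselines but does not describe how they were computed. The following is a minimal sketch of what such encodings typically look like, not the authors' pipeline: the function names `one_hot` and `kmer_frequencies`, the choice of k = 3, and the handling of ambiguous bases are all illustrative assumptions.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"


def one_hot(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) matrix; non-ACGT bases such as N become all-zero rows.

    Hypothetical helper: the benchmark's actual one-hot baseline is not specified.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def kmer_frequencies(seq: str, k: int = 3) -> np.ndarray:
    """Return a length-4**k vector of normalized overlapping k-mer counts.

    k = 3 is an assumed default, not a value taken from the paper.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers), dtype=np.float32)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # skip windows containing non-ACGT characters
            counts[index[kmer]] += 1.0
    total = counts.sum()
    return counts / total if total > 0 else counts
```

Either output can be flattened and passed to any standard classifier. The one-hot matrix preserves base position, while the k-mer vector keeps only composition, which may be relevant to the abstract's observation that performance depends on how compositionally separable the positive and negative sequences are.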
