MINA: linear probes reveal coding-sequence family signal in frozen DNA encoders

Preprint (bioRxiv), created on 29 May 2026

Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark for testing whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene's natural-language summary. We compare recoverability between canonical coding sequence (CDS) and genomic contexts centred on the transcription start site (TSS). In 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / kappa 0.821, compared with macro-F1 0.672 / kappa 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.
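To make the baseline and the two reported metrics concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the 4-mer baseline uses normalised 4-mer frequencies over the ACGT alphabet and that "kappa" means Cohen's kappa; the abstract confirms neither detail. The function names `kmer_frequencies`, `macro_f1`, and `cohen_kappa` are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return a normalised k-mer frequency vector over the ACGT alphabet.

    Assumption: the baseline features are plain overlapping k-mer
    frequencies; the source does not specify the exact featurisation.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing non-ACGT symbols (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0
```

In a probing setup of this kind, vectors such as those from `kmer_frequencies` (or frozen encoder embeddings) would be passed to a linear classifier, and the held-out predictions scored with `macro_f1` and `cohen_kappa`. Macro-averaging weights the five families equally regardless of their size.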

Wijaya, A. S., Leung, H., Yoo, H.
