Frozen DNA encoders are often used as genomic feature extractors, but downstream fine-tuning does not show what information is already linearly accessible in their unchanged embeddings. We introduce MINA (Model Interrogation of Nucleotide Architectures), a lightweight probing benchmark that tests whether frozen DNA embeddings can recover (i) a 5-way protein-family label for each gene and (ii) the 1,536-dimensional GenePT embedding of each gene's natural-language summary. We compare recoverability between canonical coding sequence (CDS) and genomic contexts centred on the transcription start site (TSS). Across 3,244 human protein-coding genes from five families, frozen encoders recovered the family-annotation target most clearly from coding sequence. NT-v2 with meanD pooling reached macro-F1 0.828 / kappa 0.821, compared with macro-F1 0.672 / kappa 0.702 for a CDS 4-mer baseline. Alignment to GenePT natural-language descriptions was weaker. Replacing CDS with 196,608 bp TSS-centred windows substantially reduced performance across all four encoders, indicating that the recoverable signal is primarily coding-sequence family signal rather than generic gene-function signal from arbitrary genomic context.
Wijaya, A. S., Leung, H., Yoo, H.
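For readers less familiar with the two scores quoted for the family-label probe, the following is a minimal sketch of how macro-F1 and kappa can be computed from true and predicted labels. It assumes that "kappa" refers to Cohen's kappa, which the abstract does not state explicitly. The function names, the family labels "A" to "E", and the toy predictions are hypothetical illustrations, not data or code from the benchmark.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa (assumed metric): agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the product of the two marginal label frequencies.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        # Degenerate case: a single label everywhere, so agreement is total.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy labels for a 5-way family task (two genes per family).
toy_true = ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"]
toy_pred = ["A", "A", "B", "C", "C", "C", "D", "D", "E", "A"]

if __name__ == "__main__":
    print(f"macro-F1: {macro_f1(toy_true, toy_pred):.3f}")
    print(f"kappa:    {cohen_kappa(toy_true, toy_pred):.3f}")
```

Macro-F1 weights each family equally regardless of its size, while kappa discounts the agreement expected from the label marginals alone, so the two scores can rank models differently when families are imbalanced.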