DNALONGBENCH is a benchmark suite for evaluating models on long-range DNA prediction tasks, addressing the need for standardized resources in genomics. It comprises five tasks that require modeling long-range dependencies in DNA sequences, with dependencies spanning up to 1 million base pairs.
The study evaluates the performance of three types of models, among them task-specific expert models and DNA foundation models.
The benchmarking results indicate that expert models consistently outperform DNA foundation models across all tasks. The study highlights the importance of context length in capturing long-range dependencies, with expert models achieving the highest scores in tasks requiring extensive genomic context.
One notable limitation of the study is the exclusion of transformer-based models, owing to the computational cost of training them on long-range tasks. The cost of the self-attention mechanism grows quadratically with sequence length, which makes these models impractical at the sequence lengths the benchmark requires.
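To make the quadratic cost concrete, the following is a rough back-of-the-envelope sketch rather than anything taken from the study. It estimates the memory needed to store a single full attention score matrix, assuming one token per base pair, 16-bit values, and a naive implementation that materializes the whole matrix. The function name `attention_matrix_bytes` and its parameters are hypothetical, chosen only for illustration.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 2, num_heads: int = 1) -> int:
    """Bytes needed to hold a full seq_len x seq_len attention score matrix.

    Assumptions (not from the study): one token per base pair, 16-bit values
    by default, and a naive implementation that stores the entire matrix.
    """
    return num_heads * seq_len * seq_len * bytes_per_value


# Compare a few context lengths, up to the 1 million base pairs mentioned above.
for seq_len in (1_000, 10_000, 100_000, 1_000_000):
    gib = attention_matrix_bytes(seq_len) / 2**30
    print(f"{seq_len:>9,} tokens: {gib:,.3f} GiB per head per layer")
```

Under these assumptions, doubling the sequence length quadruples the memory, and a 1 million token input needs about 2 x 10^12 bytes for one head in one layer. Memory-efficient attention kernels avoid storing the full matrix, but the number of pairwise score computations still grows quadratically.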
Datasets included in DNALONGBENCH are available at DNALONGBENCH Datasets and the source code can be accessed at GitHub Repository.
import plotly.graph_objects as go

# Data for the bar chart
tasks = ['Enhancer-Target Gene', 'eQTL', '3D Genome', 'Regulatory Activity', 'Transcription Signal']
expert_model_scores = [0.85, 0.90, 0.80, 0.75, 0.88]
dna_foundation_scores = [0.70, 0.75, 0.65, 0.60, 0.68]

# Create bar chart with one trace per model type
fig = go.Figure()
fig.add_trace(go.Bar(x=tasks, y=expert_model_scores, name='Expert Model'))
fig.add_trace(go.Bar(x=tasks, y=dna_foundation_scores, name='DNA Foundation Models'))

# Update layout: group the bars side by side for each task
fig.update_layout(title='Model Performance Across Tasks', barmode='group')
fig.show()