CGRclust is a novel unsupervised clustering method that utilizes Chaos Game Representation (CGR) and twin contrastive learning to cluster unlabelled DNA sequences. It has been evaluated across various metagenomic datasets, including mitochondrial genomes from fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences.
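To make the Chaos Game Representation concrete, the sketch below builds a frequency CGR matrix for a single DNA sequence. It is a minimal illustration of the general CGR technique, not CGRclust's own implementation. The function name `fcgr`, the corner assignment, the default resolution `k`, and the choice to skip non-ACGT characters are all assumptions made for this example.

```python
import numpy as np

# Assumed corner layout (one common convention); other assignments exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def fcgr(sequence: str, k: int = 3) -> np.ndarray:
    """Return a 2^k x 2^k matrix counting CGR points for a DNA sequence.

    Hypothetical helper for illustration only.
    """
    size = 2 ** k
    grid = np.zeros((size, size), dtype=np.int64)
    x, y = 0.5, 0.5  # start at the centre of the unit square
    steps = 0        # number of valid bases consumed so far

    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Simplification: ignore ambiguous or unknown characters (e.g. N).
            continue
        # Move halfway from the current point towards the base's corner.
        x = (x + corner[0]) / 2.0
        y = (y + corner[1]) / 2.0
        steps += 1
        # After k moves the cell is fully determined by the last k bases,
        # so only count points from that step onwards.
        if steps >= k:
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            grid[row, col] += 1

    return grid


if __name__ == "__main__":
    matrix = fcgr("ACGTACGTTGCA", k=2)
    print(matrix)
```

Each cell of the resulting matrix corresponds to one k-mer, so the matrix can be treated as a fixed-size grayscale image regardless of sequence length, which is what makes alignment-free comparison possible.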
Despite its strengths, CGRclust has limitations: its performance is sensitive to hyperparameter choices, its computational cost may be a concern for very large datasets, and its clustering results depend on the quality of the input sequences.
Overall, CGRclust represents a significant advancement in the field of metagenomic data analysis, providing a robust tool for clustering diverse DNA sequences without the need for alignment or taxonomic labels.
For further details, refer to the study: CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences [2024].
import pandas as pd
import plotly.express as px

# Sample data representing CGRclust performance
data = {
    'Dataset': ['Fish', 'Fungi', 'Protists', 'Viral'],
    'Accuracy': [85.79, 82.50, 80.00, 100.00],
    'Taxonomic Level': ['Phylum', 'Subphylum', 'Genus', 'Species'],
}

# Create a DataFrame
df = pd.DataFrame(data)

# Create a bar chart to visualize accuracy
fig = px.bar(
    df,
    x='Dataset',
    y='Accuracy',
    color='Taxonomic Level',
    title='CGRclust Performance on Diverse Datasets',
    labels={'Accuracy': 'Clustering Accuracy (%)', 'Dataset': 'Dataset Type'},
)

# Show the figure
fig.show()