This research presents a novel approach to protein engineering that integrates machine learning with optimization techniques to design high-performance proteins. The authors introduce a semi-supervised neural network model called Seq2Fitness for predicting protein fitness and a new optimization algorithm named Biphasic Annealing for Diverse Adaptive Sequence Sampling (BADASS). This combination aims to enhance the efficiency and effectiveness of protein design by exploring a broader sequence space and maintaining diversity in the generated sequences.
The study used datasets from two protein families: alpha-amylase (AMY_BACSU) and endonuclease (NucB). The authors trained the Seq2Fitness model on several published fitness datasets and compared its performance against alternative models. BADASS was then used to sample batches of sequences and score them with the fitness predictions from Seq2Fitness.
The results showed that Seq2Fitness outperformed other models in predicting protein fitness, with higher Spearman correlation against experimental measurements. BADASS consistently identified higher-scoring sequences than existing methods: 100% of the top 10,000 sequences found by BADASS had higher Seq2Fitness predictions than the wildtype sequence. Competing approaches varied widely, with between 3% and 99% of their top sequences outperforming the wildtype.
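The two metrics mentioned above can be illustrated with a minimal, dependency-free sketch. It is not the authors' evaluation code: the function names (`average_ranks`, `spearman`, `fraction_above_wildtype`) and the toy numbers are assumptions chosen for illustration. Spearman correlation is computed as the Pearson correlation of the rank vectors, and the second function reports what share of the top-k scored sequences beat the wildtype score.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(predicted), average_ranks(measured)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def fraction_above_wildtype(scores: Sequence[float],
                            wildtype_score: float,
                            top_k: int) -> float:
    """Share of the top_k highest-scoring sequences that beat the wildtype."""
    top = sorted(scores, reverse=True)[:top_k]
    return sum(s > wildtype_score for s in top) / len(top)


if __name__ == "__main__":
    # Toy numbers, not data from the paper.
    predicted = [0.1, 0.4, 0.35, 0.8]
    measured = [1.0, 2.5, 2.0, 3.0]
    print(spearman(predicted, measured))
    print(fraction_above_wildtype(predicted, wildtype_score=0.3, top_k=3))
```

Rank-based correlation suits fitness prediction because the model only needs to order variants correctly, not match the measured values on an absolute scale.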
The authors developed a theoretical framework to explain the behavior of BADASS, focusing on how dynamic temperature control and mutation energy adjustments contribute to its effectiveness in exploring sequence space.
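To make the idea of dynamic temperature control and mutation energy adjustment concrete, here is a toy sketch. It is not the BADASS algorithm as published: the update rules, the constants (cooling factor 0.8, heating factor 1.5, novelty threshold 0.5), and `toy_fitness` (a stand-in for a learned model such as Seq2Fitness) are all assumptions made for illustration. Mutations are drawn from a Boltzmann distribution over per-mutation energies, energies are updated from the scores of sampled variants, and the temperature cools while batches remain novel and reheats when diversity collapses.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(seq: str, target: str) -> float:
    """Stand-in for a learned fitness model: fraction of positions matching a target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def boltzmann_sample(energies: dict, temperature: float, rng: random.Random):
    """Draw one key with probability proportional to exp(energy / temperature)."""
    keys = list(energies)
    peak = max(energies.values())  # subtract the max for numerical stability
    weights = [math.exp((energies[k] - peak) / temperature) for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def sample_variant(wildtype, energies, temperature, n_mut, rng):
    """Apply up to n_mut sampled mutations to the wildtype sequence."""
    picks = {}
    for _ in range(n_mut):
        pos, aa = boltzmann_sample(energies, temperature, rng)
        picks[pos] = aa  # a later draw at the same position overrides an earlier one
    seq = "".join(picks.get(i, res) for i, res in enumerate(wildtype))
    return seq, list(picks.items())


def anneal(wildtype, target, steps=40, batch_size=32, n_mut=2, seed=0):
    rng = random.Random(seed)
    # One energy per (position, residue) mutation; higher means more promising.
    energies = {(i, aa): 0.0
                for i, res in enumerate(wildtype)
                for aa in AMINO_ACIDS if aa != res}
    wt_score = toy_fitness(wildtype, target)
    temperature, t_min, t_max = 1.0, 0.02, 1.0
    best_seq, best_score = wildtype, wt_score
    seen, history = set(), []
    for _ in range(steps):
        novel = 0
        for _ in range(batch_size):
            seq, picks = sample_variant(wildtype, energies, temperature, n_mut, rng)
            score = toy_fitness(seq, target)
            if seq not in seen:
                seen.add(seq)
                novel += 1
            for key in picks:
                # Exponential moving average of the score gain over wildtype.
                energies[key] += 0.3 * ((score - wt_score) - energies[key])
            if score > best_score:
                best_seq, best_score = seq, score
        # Cool while batches are mostly novel; reheat when diversity collapses.
        if novel / batch_size >= 0.5:
            temperature = max(t_min, temperature * 0.8)
        else:
            temperature = min(t_max, temperature * 1.5)
        history.append(temperature)
    return best_seq, best_score, history


if __name__ == "__main__":
    best_seq, best_score, history = anneal("ACDEFGHIKL", "ACDWFGHYKL")
    print(best_seq, best_score, history[-1])
```

Low temperatures concentrate sampling on mutations that have scored well so far, while reheating flattens the distribution again so that the sampler keeps proposing sequences it has not seen.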
This study highlights the potential of integrating machine learning with optimization algorithms in protein engineering. The Seq2Fitness model and BADASS algorithm provide a powerful framework for designing diverse and high-performance proteins, which could have significant implications for biotechnology and synthetic biology applications.
For further details, refer to the original paper: "Designing diverse and high-performance proteins with a large language model in the loop" (2024).
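# Illustrative Python sketch of a Seq2Fitness + BADASS workflow for protein design.
# NOTE: the modules `seq2fitness` and `badass`, their classes and methods, and the
# helper `load_fitness_data` are placeholder names, not a confirmed API. Replace
# them with the interfaces of the implementation you actually use.
from seq2fitness import Seq2FitnessModel
from badass import BADASSOptimizer

# Load datasets (load_fitness_data must be supplied by the user)
fitness_data = load_fitness_data('path/to/dataset')

# Initialize the fitness model and the optimizer that queries it
seq2fitness_model = Seq2FitnessModel()
badass_optimizer = BADASSOptimizer(model=seq2fitness_model)

# Train Seq2Fitness model
seq2fitness_model.train(fitness_data)

# Generate sequences
sequences = badass_optimizer.optimize(num_sequences=10000)

# Evaluate sequences
fitness_scores = seq2fitness_model.predict(sequences)

# Output results (only the first few, since 10,000 sequences are generated)
for sequence, score in list(zip(sequences, fitness_scores))[:10]:
    print(f'Generated sequence: {sequence}  Fitness score: {score}')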