De novo generation of functional terpene synthases using TpsGPT
Summary
This paper introduces TpsGPT, a generative model built by fine-tuning the protein language model ProtGPT2 on 79,000 terpene synthase (TPS) sequences, to design novel TPS enzymes without the cost and delay of directed evolution. The model generated a pool of 28,000 de novo enzyme candidates, of which seven satisfied every in silico validation criterion. Experimental testing confirmed enzymatic activity in at least two of these seven sequences, showing that a fine-tuned protein language model can generate functional, evolutionarily distant enzymes.
Medical Relevance
Terpene synthases generate the diverse terpene scaffolds that underpin many natural products, including anticancer drugs such as Taxol. A faster, more scalable way to design novel TPS enzymes could accelerate the development and production of new therapeutic compounds and improve access to existing medicines derived from complex natural products.
AI Health Application
TpsGPT is a generative AI model for scalable de novo protein design, applied here to functional terpene synthases. Its relevance to health and medicine lies in accelerating the discovery, engineering, and production of enzymes that build the terpene scaffolds of therapeutically valuable natural products, most notably anticancer drugs like Taxol. By generating and computationally filtering novel enzyme candidates, TpsGPT could reduce the time and cost of obtaining these compounds, potentially leading to new drug candidates or improved synthesis routes for existing pharmaceuticals.
Key Points
- Terpene synthases (TPS) produce the scaffolds of many natural products, including anticancer drugs like Taxol, but designing new TPS enzymes through directed evolution is slow and costly.
- TpsGPT is a novel generative model developed by fine-tuning the protein language model ProtGPT2 on a curated dataset of 79,000 TPS sequences from UniProt.
- The model generated an initial pool of 28,000 de novo enzyme candidates in silico, which were then subjected to rigorous multi-metric computational validation (e.g., EnzymeExplorer, ESMFold pLDDT, InterPro, Foldseek).
- Seven putative TPS enzymes satisfied all in silico validation criteria, narrowing the candidate pool from 28,000 sequences to seven.
- Experimental validation confirmed that at least two of these computationally designed and filtered sequences exhibited functional TPS enzymatic activity.
- The research demonstrates that fine-tuning protein language models on enzyme-class-specific datasets, combined with robust computational filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.
- This approach offers a scalable and potentially cost-effective method for designing enzymes for natural product synthesis, overcoming limitations of traditional directed evolution.
Methodology
The study involved fine-tuning the pre-trained protein language model ProtGPT2 on a curated dataset of 79,000 terpene synthase (TPS) sequences obtained from UniProt to create TpsGPT. TpsGPT then generated a large library of 28,000 de novo enzyme candidates in silico. These generated sequences underwent rigorous computational validation using multiple metrics: EnzymeExplorer for classification, ESMFold for structural confidence (pLDDT), sequence diversity analysis, CLEAN classification, InterPro for domain detection, and Foldseek for structure alignment. A subset of seven candidates that passed all in silico criteria was selected for subsequent experimental validation to confirm functional enzymatic activity.
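The "pass every criterion" selection step described above can be sketched as a simple conjunction of filters. This is a minimal illustration, not the authors' pipeline: the `Candidate` fields, the threshold values, and the example scores are all hypothetical, since the source does not state the cutoffs or data formats that were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One generated sequence with precomputed scores (all fields hypothetical)."""
    seq_id: str
    tps_probability: float        # classifier score, e.g. from an EnzymeExplorer-style model
    mean_plddt: float             # ESMFold structural confidence, assumed 0-100 scale
    max_identity_to_known: float  # highest identity to any known sequence, 0-1
    clean_predicts_tps: bool      # CLEAN assigns a TPS-like function
    has_tps_domain: bool          # InterPro reports a TPS domain
    foldseek_hit: bool            # Foldseek finds a structural match to a known TPS


# Illustrative thresholds only; the source does not report the actual cutoffs.
MIN_TPS_PROBABILITY = 0.5
MIN_MEAN_PLDDT = 70.0
MAX_IDENTITY = 0.6


def passes_all_filters(c: Candidate) -> bool:
    """Return True only if the candidate satisfies every criterion."""
    return (
        c.tps_probability >= MIN_TPS_PROBABILITY
        and c.mean_plddt >= MIN_MEAN_PLDDT
        and c.max_identity_to_known <= MAX_IDENTITY
        and c.clean_predicts_tps
        and c.has_tps_domain
        and c.foldseek_hit
    )


def select_candidates(pool):
    """Keep the candidates that pass all filters, preserving input order."""
    return [c for c in pool if passes_all_filters(c)]


# Made-up example pool: one sequence passes, two fail on a single criterion each.
pool = [
    Candidate("gen_001", 0.91, 82.4, 0.41, True, True, True),
    Candidate("gen_002", 0.88, 55.0, 0.38, True, True, True),   # low pLDDT
    Candidate("gen_003", 0.95, 90.1, 0.97, True, True, True),   # too similar to a known sequence
]
selected = select_candidates(pool)
print([c.seq_id for c in selected])
```

Requiring all criteria at once, rather than a weighted score, is the reading most consistent with "satisfied all validation criteria"; it trades recall for a small, high-confidence shortlist.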
Key Findings
TpsGPT generated novel and diverse terpene synthase candidates. A multi-stage in silico validation process reduced the 28,000 generated sequences to seven putative TPS enzymes. Experimental validation confirmed enzymatic activity in at least two of these seven, demonstrating de novo generation of functional enzymes that are evolutionarily distant from known sequences.
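One rough way to quantify how far a generated sequence sits from known ones is an alignment-free k-mer comparison. The sketch below is an assumption for illustration only: the source lists "sequence diversity" as a metric but does not say how it was computed, and the function names, the choice of k = 3, and the toy sequences are hypothetical. A real analysis would more likely use alignment-based sequence identity.

```python
def kmers(seq: str, k: int = 3) -> set:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of two sequences' k-mer sets (0 = disjoint, 1 = identical sets)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0  # both sequences shorter than k: treat as no evidence of similarity
    return len(ka & kb) / len(ka | kb)


def max_similarity(candidate: str, references, k: int = 3) -> float:
    """Highest k-mer similarity between a candidate and any reference sequence."""
    return max((jaccard(candidate, r, k) for r in references), default=0.0)


# Toy sequences, not real terpene synthases.
references = ["ACDEFW", "WWYYHH"]
score = max_similarity("ACDEFG", references)
print(round(score, 2))
```

A low maximum similarity against every reference is a necessary, not sufficient, sign of novelty, which is why it would sit alongside structural and functional checks rather than replace them.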
Clinical Impact
This technology could have clinical impact by shortening the drug discovery pipeline for terpene-based therapeutics. Rapidly designing the enzymes needed to produce such compounds could accelerate the identification and synthesis of novel drug candidates, including candidates for cancer. It could also lead to more efficient and cost-effective biosynthetic routes for existing complex natural product drugs (e.g., Taxol), which often face supply chain challenges or laborious chemical synthesis.
Limitations
The abstract reports seven candidates as putative TPS enzymes but confirms activity experimentally in only "at least two" of them, and it does not say whether the remaining five were inactive or not conclusively tested. In silico validation, however stringent, therefore does not guarantee functional activity. The abstract also does not report the level or efficiency of the confirmed activity relative to natural enzymes, nor the experimental cost or time required after the computational phase.
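The funnel implied by these numbers can be worked out directly. The short calculation below uses only the figures in the abstract; note that "28k" is a rounded count and "at least two" is a lower bound, so both rates are approximate, and the variable names are illustrative.

```python
generated = 28_000        # approximate size of the initial pool ("28k")
passed_in_silico = 7      # candidates satisfying all validation criteria
confirmed_active = 2      # lower bound: "at least two" showed activity

# Fraction of generated sequences that survived computational filtering.
in_silico_pass_rate = passed_in_silico / generated

# Minimum fraction of filtered candidates confirmed active in the lab.
min_experimental_hit_rate = confirmed_active / passed_in_silico

print(f"In silico pass rate: {in_silico_pass_rate:.4%}")                   # 0.0250%
print(f"Experimental hit rate (lower bound): {min_experimental_hit_rate:.1%}")  # 28.6%
```

Roughly one in 4,000 generated sequences reached the bench, and at least two in seven of those were active.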
Future Directions
Future research directions could include further optimizing TpsGPT to increase the success rate of generating functional enzymes, exploring its application to design other classes of enzymes, and conducting detailed biochemical characterization (e.g., catalytic efficiency, substrate specificity, stability) of the generated enzymes. Integrating this de novo design methodology into high-throughput screening platforms and exploring multi-objective optimization for specific therapeutic properties would also be valuable.
Abstract
Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products, including front-line anticancer drugs such as Taxol. However, de novo TPS design through directed evolution is costly and slow. We introduce TpsGPT, a generative model for scalable TPS protein design, built by fine-tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt. TpsGPT generated de novo enzyme candidates in silico and we evaluated them using multiple validation metrics, including EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment. From an initial pool of 28k generated sequences, we identified seven putative TPS enzymes that satisfied all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least two of these sequences. Our results show that fine-tuning of a protein language model on a carefully curated, enzyme-class-specific dataset, combined with rigorous filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.
Comments
11 pages, 8 figures, Accepted at the NeurIPS 2025 AI for Science and MLSB 2025 workshops