T-SeGAT: Molecular Property and CPI Prediction Model for Imbalanced Data
Fund project: Key Project of the Guangdong Provincial Regional Joint Fund (2022B1515120075)


    Abstract:

    Molecular property prediction and compound-protein interaction (CPI) prediction are key steps in drug discovery. However, traditional graph convolutional networks (GCNs) are limited by their local receptive fields and cannot fully capture the complexity of chemical structures, the dynamic changes of molecular conformation, or long-range electronic interactions, which creates a bottleneck in prediction performance. To address this, this study proposes a deep learning model, T-SeGAT, designed to improve the accuracy and generalization ability of molecular property and CPI prediction. T-SeGAT integrates the ESM-2 protein language model, the ChemBERTa molecular language model, and a graph neural network based on the graph attention network (GAT) and Set2Set, enabling multi-level feature extraction and fusion from sequence to structure. To handle the imbalance of experimental data, the model acts at three levels: weighted random sampling at data loading, balanced/focal/adaptive loss functions at loss calculation, and a dynamic threshold search mechanism at prediction decision-making. It further combines an overfitting suppression method based on the AUC difference, an early stopping strategy, and learning rate scheduling to enhance training stability and generalization ability. Experiments are conducted on the BACE, P53, and hERG datasets for molecular property prediction and on the Human and C. elegans datasets for CPI prediction, with stratified five-fold cross-validation used for performance evaluation in all cases. The results show that T-SeGAT outperforms existing baseline models on all datasets. On the BACE and hERG datasets, AUC and precision improve over the second-best model by 0.022 and 0.010, and by 0.004 and 0.022, respectively, while on the Human dataset precision improves by 0.013. Overall, T-SeGAT demonstrates clear advantages in accuracy, stability, and practicality, providing strong support for molecular property and CPI prediction in drug discovery.
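
    The three levels of imbalance handling named in the abstract (data loading, loss calculation, prediction decision-making) can be illustrated with a minimal sketch. It is not the authors' implementation: the inverse-class-frequency weights, the standard binary focal loss with `gamma=2.0` and `alpha=0.25`, the F1 criterion for the threshold search, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def sampling_weights(labels):
    """Data-loading level: weight each sample by 1 / (size of its class).

    Assumption: inverse class frequency, so that every class carries the
    same total sampling weight.
    """
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def focal_loss(prob, label, gamma=2.0, alpha=0.25):
    """Loss level: binary focal loss for one sample.

    prob is the predicted probability of the positive class. gamma and
    alpha are common default values, not values taken from the paper.
    """
    eps = 1e-12  # guards against log(0)
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def f1_at(threshold, probs, labels):
    """F1 score when predicting positive for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_threshold(probs, labels, steps=99):
    """Decision level: grid search over thresholds in (0, 1).

    Assumption: the search maximizes F1 on validation predictions.
    Ties are resolved in favor of the lowest threshold.
    """
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(candidates, key=lambda t: f1_at(t, probs, labels))
```

    In a PyTorch pipeline, weights of this kind would typically be passed to `torch.utils.data.WeightedRandomSampler`. The threshold would be searched on validation predictions only and then held fixed when scoring the test fold, so that the reported metrics are not tuned on test data.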

Cite this article

石全宾, 吴萌, 何芊平, 王亚琪. T-SeGAT: Molecular Property and CPI Prediction Model for Imbalanced Data. 计算机系统应用 (Computer Systems & Applications): 1-12

History
  • Received: 2025-09-02
  • Last revised: 2025-09-22
  • Published online: 2026-01-19