Abstract: Molecular property prediction and compound-protein interaction (CPI) prediction are key steps in drug discovery. However, traditional graph convolutional networks (GCNs) are limited by local receptive fields and cannot fully capture the complexity of chemical structures, dynamic changes in molecular conformation, or long-range electronic interactions, which creates a bottleneck in prediction performance. To address this, this study proposes T-SeGAT, a deep learning model designed to improve the accuracy and generalization of molecular property and CPI prediction. T-SeGAT integrates the ESM-2 protein language model, the ChemBERTa molecular language model, and a graph neural network built on a graph attention network (GAT) and Set2Set, enabling multi-level feature extraction and fusion from sequence to structure. To handle imbalance in the experimental data, the model introduces weighted random sampling at the data-loading level, balanced, focal, and adaptive loss functions at the loss-calculation level, and a dynamic threshold search mechanism at the prediction-decision level. Furthermore, it combines an AUC difference-based overfitting suppression method, early stopping, and learning rate scheduling to improve training stability and generalization. Experiments are conducted on the BACE, P53, and hERG datasets for molecular property prediction and on the Human and C. elegans datasets for CPI prediction, with stratified five-fold cross-validation used for performance evaluation. The results show that T-SeGAT consistently outperforms existing baseline models on all datasets. On the BACE and hERG datasets, AUC and precision improve by 0.022, 0.010 and 0.004, 0.022, respectively, over the second-best model, while on the Human dataset precision increases by 0.013.
In conclusion, T-SeGAT demonstrates advantages in accuracy, stability, and practicality, providing effective support for molecular property and CPI prediction in drug discovery.
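The three levels of imbalance handling named in the abstract (data loading, loss calculation, prediction decision-making) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the focal-loss settings `gamma=2.0` and `alpha=0.25`, and the use of F1 as the threshold-selection criterion are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def inverse_frequency_weights(labels):
    """Data-loading level: weight each sample by 1 / (size of its class).

    Assumed weighting scheme; such weights are the kind of input a
    weighted random sampler expects.
    """
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return [1.0 / counts[y] for y in labels]


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Loss level: mean binary focal loss, -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
        p_t = p if y == 1 else 1.0 - p            # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(labels)


def f1_at_threshold(probs, labels, threshold):
    """F1 score when predicting positive for probabilities >= threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_threshold(probs, labels, steps=99):
    """Decision level: grid-search the cutoff that maximizes F1 (assumed metric)."""
    best_t, best_f1 = 0.5, -1.0
    for i in range(1, steps + 1):
        t = i / (steps + 1)                       # 0.01, 0.02, ..., 0.99
        f1 = f1_at_threshold(probs, labels, t)
        if f1 > best_f1:                          # keep the first best cutoff
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

In a PyTorch pipeline, weights like these could be passed to `torch.utils.data.WeightedRandomSampler`, and the threshold would normally be searched on validation predictions rather than on the test fold, so that the chosen cutoff does not leak test information.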