[1] Qiao Liang, Xie Dongqing. Protein-metal-ion interaction sites prediction based on class imbalance learning[J]. Journal of Nanjing University of Science and Technology (Natural Science Edition), 2018, 42(06):707. [doi:10.14177/j.cnki.32-1397n.2018.42.06.011]

Protein-metal-ion interaction sites prediction based on class imbalance learning

Journal of Nanjing University of Science and Technology (Natural Science Edition) [ISSN:1005-9830/CN:32-1397/N]

Volume:
42
Issue:
2018, No. 06
Pages:
707
Column:
Publication date:
2018-12-30

Article Info

Title:
Protein-metal-ion interaction sites prediction based on class imbalance learning
Article ID:
1005-9830(2018)06-0707-09
Author(s):
Qiao Liang, Xie Dongqing
School of Mathematics and Information Science, Guangzhou University, Guangzhou 510006, China
Keywords:
class imbalance learning; protein-metal-ion; interaction sites; prediction; support vector machine
CLC number:
TP391.4
DOI:
10.14177/j.cnki.32-1397n.2018.42.06.011
Abstract:
To improve the accuracy of protein-metal-ion interaction site (PMIIS) prediction, a class imbalance learning algorithm that combines under-sampling and over-sampling is proposed to address the imbalanced data distribution. The majority-class and minority-class samples are sampled at the same time, so that information on the minority class is supplemented while redundant information in the majority class is reduced. Based on this algorithm and the support vector machine (SVM), a sequence-based prediction method is designed. To evaluate PMIIS prediction performance objectively, a relatively complete standard dataset containing the interaction sites of proteins with Zn2+, Ca2+ and Fe3+ is constructed. Experimental results on this dataset show that the proposed method achieves an average Matthews correlation coefficient (MCC) of 0.646 for protein-Zn2+, protein-Ca2+ and protein-Fe3+ interaction site prediction, outperforming TargetS and IonCom.
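
The abstract describes sampling both classes at once and scoring predictions with the MCC. The sketch below illustrates that general idea only; it is not the authors' algorithm, whose sampling rules are not given here. As stand-ins it assumes plain random under-sampling of the majority class and linear interpolation between random pairs of minority samples (a simplification that skips the nearest-neighbour search used by methods such as SMOTE). The names `hybrid_resample`, `target_size` and `matthews_corrcoef` are hypothetical.

```python
import math
import random


def hybrid_resample(majority, minority, target_size, seed=0):
    """Bring both classes towards target_size samples each.

    Assumption: random under-sampling plus pairwise interpolation stand in
    for the paper's (unspecified) sampling rules. Needs >= 2 minority samples
    whenever over-sampling is required.
    """
    rng = random.Random(seed)

    # Under-sampling: keep a random subset of the majority class,
    # drawn without replacement, to cut redundant information.
    kept_majority = rng.sample(majority, min(target_size, len(majority)))

    # Over-sampling: keep every original minority sample, then add
    # synthetic points on the segment between two random minority samples.
    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])

    return kept_majority, list(minority) + synthetic


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is reported as 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy data: 100 non-binding residues versus 3 binding residues.
    non_binding = [[float(i), 0.0] for i in range(100)]
    binding = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    kept, boosted = hybrid_resample(non_binding, binding, target_size=10)
    print(len(kept), len(boosted))
    print(matthews_corrcoef(tp=40, tn=45, fp=5, fn=10))
```

The balanced sets returned by `hybrid_resample` would then be passed to a classifier such as an SVM. The MCC is a natural choice for this task because, unlike plain accuracy, it stays informative when binding residues are a small fraction of all residues.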

Memo:
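Received: 2018-06-14; Revised: 2018-07-05
Funding: National Natural Science Foundation of China (61772007)
About the author: Qiao Liang (b. 1973), male, PhD candidate; main research interest: bioinformatics; E-mail: qiaoliang_gu@126.com.
Citation format: Qiao Liang, Xie Dongqing. Protein-metal-ion interaction sites prediction based on class imbalance learning[J]. Journal of Nanjing University of Science and Technology, 2018, 42(6):707-715.
Submission website: http://zrxuebao.njust.edu.cn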
Last Update: 2018-12-30