Mao Zhengchong, Wang Junjun. Speaker recognition system based on fusion of cochlear filter cepstral coefficients and Teager energy operator phase[J]. Journal of Nanjing University of Science and Technology, 2018, 42(01): 83. [doi:10.14177/j.cnki.32-1397n.2018.42.01.012]

Speaker recognition system based on fusion of cochlear filter cepstral coefficients and Teager energy operator phase
Mao Zhengchong, Wang Junjun
Key Laboratory of Advanced Process Control for Light Industry, Ministry of Education, Jiangnan University, Wuxi 214122, China
Keywords: energy operator; cochlear filter cepstral coefficient; Gaussian mixture model-universal background model; speaker recognition
To improve the performance of a speaker recognition system, this paper proposes a method that compensates auditory cepstral features with phase features, building on traditional features. The method exploits the Teager energy operator (TEO), which faithfully models the nonlinear vortex interaction of the airflow passing through the vocal tract system. The Hilbert transform is then used to derive the instantaneous phase of the analytic signal from the TEO profile, and this phase is combined with cochlear filter cepstral coefficients (CFCC) to obtain fused feature parameters. The fusion compensates the feature parameters and raises the recognition rate of the speaker recognition system. Experiments are conducted on the NIST-2002 Speaker Recognition Evaluation (SRE) database with a Gaussian mixture model-universal background model (GMM-UBM) speaker recognition system. The results show that the combination of the TEO phase and CFCC outperforms CFCC alone: its recognition accuracy improves by 8.32% and 3.15% over the existing CFCC features and the linear prediction Mel-frequency cepstral coefficient (LPMFCC) features, respectively. This indicates that the TEO phase carries information complementary to the CFCC features and yields a high recognition rate.
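The abstract does not spell out the feature extraction pipeline. As an illustrative sketch only (not the authors' code), the discrete TEO in its standard Kaiser form, Ψ[x(n)] = x(n)² − x(n−1)x(n+1), followed by the Hilbert-transform instantaneous phase, could look like this in Python with NumPy/SciPy; function names are hypothetical:

```python
import numpy as np
from scipy.signal import hilbert

def teager_energy(x):
    """Discrete Teager energy operator (Kaiser's form):
    Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # replicate edge samples
    return psi

def teo_phase(x):
    """Instantaneous (unwrapped) phase of the analytic signal
    derived from the TEO profile via the Hilbert transform."""
    psi = teager_energy(x)
    analytic = hilbert(psi)
    return np.unwrap(np.angle(analytic))
```

For a pure sinusoid x(n) = cos(ωn), the operator returns the constant sin²(ω), which is a convenient sanity check for the implementation. The per-frame TEO phase would then be appended to the CFCC vectors to form the fused features described in the paper.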

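The back end is a GMM-UBM system: a universal background model is trained on pooled data, each target speaker's model is MAP-adapted from it, and trials are scored by a log-likelihood ratio. A minimal sketch of that standard recipe (means-only Reynolds-style adaptation, not the paper's exact configuration) using scikit-learn, with hypothetical function names:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=8, seed=0):
    """Fit a diagonal-covariance UBM on pooled background features."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    ubm.fit(features)
    return ubm

def map_adapt_means(ubm, X, relevance=16.0):
    """MAP-adapt only the UBM means to one speaker's frames X (T, D)."""
    resp = ubm.predict_proba(X)                  # (T, K) responsibilities
    n_k = resp.sum(axis=0)                       # soft counts per component
    E_k = (resp.T @ X) / np.maximum(n_k, 1e-10)[:, None]  # 1st-order stats
    alpha = (n_k / (n_k + relevance))[:, None]   # data-dependent mixing
    spk = GaussianMixture(n_components=ubm.n_components,
                          covariance_type="diag")
    # Reuse UBM weights/covariances; interpolate only the means
    spk.weights_ = ubm.weights_
    spk.covariances_ = ubm.covariances_
    spk.precisions_cholesky_ = ubm.precisions_cholesky_
    spk.means_ = alpha * E_k + (1 - alpha) * ubm.means_
    return spk

def llr_score(spk, ubm, X):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    return spk.score(X) - ubm.score(X)
```

A trial is accepted when `llr_score` exceeds a tuned threshold; matched speaker data should score higher than mismatched data.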



Received: 2017-03-27; Revised: 2018-01-11. Funding: National Natural Science Foundation of China (60973095); Natural Science Foundation of Jiangsu Province (BK20131107). Author: Mao Zhengchong (1964-), male, associate professor, research interests: robotic audiovisual recognition, industrial control, E-mail: 1297709187@qq.com. Corresponding author: Wang Junjun (1991-), male, master's student, research interest: speech signal processing, E-mail: 850722750@qq.com. Citation format: Mao Zhengchong, Wang Junjun. Speaker recognition system based on fusion of cochlear filter cepstral coefficients and Teager energy operator phase[J]. Journal of Nanjing University of Science and Technology, 2018, 42(1): 83-88. Submission website: http://zrxuebao.njust.edu.cn
Last Update: 2018-02-28