[1] Zhang Bingyi, Wei Bo, Chen Jiancheng, et al. Chinese word segmentation algorithm based on pair coding[J]. Journal of Nanjing University of Science and Technology, 2014, 38(04): 526-530.

Chinese Word Segmentation Algorithm Based on Pair Coding

Journal of Nanjing University of Science and Technology (Natural Science Edition) [ISSN: 1005-9830 / CN: 32-1397/N]

Volume:
38
Issue:
2014, No. 04
Pages:
526-530
Publication date:
2014-08-31

Article Info

Title:
Chinese word segmentation algorithm based on pair coding
Author(s):
Zhang Bingyi(1,2), Wei Bo(1), Chen Jiancheng(3), Wei Jie(4), Rao Guozheng(1,2)
1. College of Computer Science and Technology, Tianjin University, Tianjin 300072, China; 2. Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300072, China; 3. Xiping County Electric Power Company, Zhumadian 463900, China; 4. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
Keywords:
pair coding; Chinese word segmentation; characteristic matching; data compression; hash; characteristic value; fuzzy matching
CLC number:
TP391.1
Abstract:
To improve the segmentation speed and storage efficiency of Chinese word segmentation, this paper proposes a characteristic matching algorithm based on pair coding. Characteristic values are extracted from the character set used in segmentation and from the adjacency relations between characters, and these values are then used for fast matching against the segmentation dictionary. Because the characteristic values are derived from the positional adjacency of characters, the method supports fuzzy matching, so multi-character words do not need to be matched separately, which saves matching time. Simulation experiments show that the algorithm reduces the storage space required for the characteristic values and improves both the accuracy and the efficiency of Chinese word segmentation.
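The abstract does not spell out the encoding, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that a "characteristic value" can be modeled as a fixed-width bitmask in which each pair of adjacent characters sets one bit, and that dictionary matching uses this bitmask as a cheap prefilter before an exact check. The names `pair_signature`, `build_index`, `may_contain`, `candidates`, the 64-bit width, and the pair hash are all hypothetical choices made for this example.

```python
from typing import Dict, List

NUM_BITS = 64  # assumed signature width; not taken from the paper


def pair_signature(word: str, num_bits: int = NUM_BITS) -> int:
    """Build a bitmask with one bit set per adjacent character pair."""
    sig = 0
    for a, b in zip(word, word[1:]):
        # Deterministic toy hash of the pair, computed from code points.
        bit = (ord(a) * 31 + ord(b)) % num_bits
        sig |= 1 << bit
    return sig


def build_index(dictionary: List[str]) -> Dict[str, int]:
    """Precompute the signature of every dictionary word."""
    return {word: pair_signature(word) for word in dictionary}


def may_contain(text_sig: int, word_sig: int) -> bool:
    """Prefilter: every pair bit of the word must also be set for the text.

    This can give false positives (hash collisions) but never false
    negatives, because a substring's adjacent pairs all occur in the text.
    """
    return text_sig & word_sig == word_sig


def candidates(text: str, index: Dict[str, int]) -> List[str]:
    """Return dictionary words that occur in `text`.

    The bitmask test discards most non-matching words cheaply; the
    exact substring check removes the remaining false positives.
    """
    text_sig = pair_signature(text)
    return [
        word
        for word, sig in index.items()
        if may_contain(text_sig, sig) and word in text
    ]


if __name__ == "__main__":
    index = build_index(["中文", "分词", "中文分词", "算法", "词典"])
    print(candidates("中文分词算法", index))
    # ['中文', '分词', '中文分词', '算法']
```

In this sketch a single bitwise AND decides whether a dictionary word can possibly occur in the text, which is one way a compact adjacency-based value could reduce both matching time and storage. A single-character word has no adjacent pairs, so its signature is 0 and it always passes the prefilter; only the exact check decides.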



Memo:
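Received: 2014-05-06; Revised: 2014-07-26
Funding: National "973" Program of China (2013CB329301); National Natural Science Foundation of China (61373165); Open Fund of the Civil Aviation Information Technology Research Base of China (CAAC-ITRB-201209)
About the author: Zhang Bingyi (1974-), male, associate professor; main research interest: data mining. E-mail: byzhang@tju.edu.cn.
Citation format: Zhang Bingyi, Wei Bo, Chen Jiancheng, et al. Chinese word segmentation algorithm based on pair coding[J]. Journal of Nanjing University of Science and Technology, 2014, 38(4): 526-530.
Submission website: http://zrxuebao.njust.edu.cn
Last Update: 2014-08-31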