[1] Ma Yang, Cai Bing. Performance optimization method of Lucene in big data[J]. Journal of Nanjing University of Science and Technology (Natural Science Edition), 2015, 39(03): 260.

Performance optimization method of Lucene in big data

Journal of Nanjing University of Science and Technology (Natural Science Edition) [ISSN: 1005-9830 / CN: 32-1397/N]

Volume: 39
Issue: 2015, No. 03
Page: 260
Publication date: 2015-06-30

Article Info

Title:
Performance optimization method of Lucene in big data
Author(s):
Ma Yang, Cai Bing
Jiangsu Branch of National Computer Network Emergency Response Technical Team/Coordination Center of China, Nanjing 210003, China
Keywords:
big data; Lucene; memory computing; batch updating; inverted index; posting list; cache; random access memory index; disk index; multiple block inverted structure
CLC number:
TP392
Abstract:
To improve the efficiency of data query and analysis in big data environments, this paper combines memory computing with batch updating to propose an optimized inverted index method, the RAM FS directory (RFDirectory). A posting-list management technique that combines random access memory (RAM) and disk is implemented on top of Lucene. New data are first written into a cache and then written periodically into the disk index structure, which improves the write performance of the inverted index. Query results are returned efficiently to users by integrating the multiple block inverted structures held on disk and in RAM. Experimental results show that, in a big data environment, the index construction time of RFDirectory is reduced to 50% of that of the disk index (FSDirectory) and memory index (RAMDirectory) methods, and the time to return the search results for one keyword is reduced by nearly 15%.
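
The buffering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' RFDirectory implementation and does not use the Lucene API: the class name `HybridInvertedIndex`, the `flush_every` parameter, and the JSON segment files are all hypothetical, and flushing is triggered by a document count rather than by the periodic timer the abstract mentions. The sketch only shows the general pattern of writing new postings to memory, batching them to disk as immutable blocks, and merging both sources at query time.

```python
import json
import os
import tempfile
from collections import defaultdict


class HybridInvertedIndex:
    """Toy inverted index: an in-memory write buffer plus on-disk blocks.

    Hypothetical illustration only; not the paper's RFDirectory.
    """

    def __init__(self, directory, flush_every=1000):
        self.directory = directory
        self.flush_every = flush_every   # assumed batch size (documents)
        self.buffer = defaultdict(list)  # term -> doc ids not yet on disk
        self.buffered_docs = 0
        self.segments = []               # paths of flushed block files

    def add(self, doc_id, text):
        # New documents only touch the in-memory posting lists.
        for term in set(text.lower().split()):
            self.buffer[term].append(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.flush_every:
            self.flush()

    def flush(self):
        # Write the whole buffer to disk as one immutable block.
        if not self.buffer:
            return
        name = f"segment_{len(self.segments)}.json"
        path = os.path.join(self.directory, name)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.buffer, f)
        self.segments.append(path)
        self.buffer = defaultdict(list)
        self.buffered_docs = 0

    def search(self, term):
        # Merge postings from every disk block with those still in memory.
        term = term.lower()
        hits = set(self.buffer.get(term, []))
        for path in self.segments:
            with open(path, encoding="utf-8") as f:
                hits.update(json.load(f).get(term, []))
        return sorted(hits)


with tempfile.TemporaryDirectory() as tmp:
    index = HybridInvertedIndex(tmp, flush_every=2)
    index.add(1, "Lucene inverted index")
    index.add(2, "memory index")   # second document triggers a flush
    index.add(3, "disk index")     # stays in the memory buffer
    print(index.search("index"))   # [1, 2, 3]
```

In this sketch a write never rewrites an existing disk block, which is the property that makes batched flushing cheap. The cost is that a query has to consult every block, so a real system would also need to merge or cache blocks.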

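Memo:
Received: 2014-04-18; Revised: 2014-05-28
Author biography: Ma Yang (1980-), male, master's student; research interests: network and information security, big data processing; E-mail: mayang@jsca.gov.cn.
Citation format: Ma Yang, Cai Bing. Performance optimization method of Lucene in big data[J]. Journal of Nanjing University of Science and Technology, 2015, 39(3): 260-265.
Submission website: http://zrxuebao.njust.edu.cn
Last Update: 2015-06-30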