
Performance optimization method of Lucene in big data

Journal of Nanjing University of Science and Technology (Natural Science Edition) [ISSN: 1005-9830 / CN: 32-1397/N]

Issue:
2015, No. 03
Page:
260-
Research Field:
Publishing date:

Info

Title:
Performance optimization method of Lucene in big data
Author(s):
Ma Yang, Cai Bing
Jiangsu Branch of National Computer Network Emergency Response Technical Team/Coordination Center of China, Nanjing 210003, China
Keywords:
big data; Lucene; memory computing; batch processing; inverted index; post-list; cache; random access memory index; disk index; multiple block inverted structure
CLC number:
TP392
DOI:
-
Abstract:
To improve data query efficiency in big data, an optimized inverted index method, RAM FS directory (RFDirectory), is proposed based on memory computing and batch processing techniques. A post-list management technique combining random access memory (RAM) and disk is implemented on top of Lucene. New data are first written into a cache and then written into a disk index periodically, which improves the write performance of the inverted index. Query results are returned efficiently to consumers by integrating the multiple block inverted structures of the disk and the RAM. Experimental results show that the index construction time of RFDirectory is 50% of that of FSDirectory or RAMDirectory, and that the time needed to return the index result for one keyword is reduced by 15% in big data.
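
The write path and query path described in the abstract can be illustrated with a minimal sketch. It is not the authors' RFDirectory implementation and does not use the Lucene API. The class name `BufferedInvertedIndex`, the `flush_threshold` parameter, and the count-based flush trigger are all hypothetical (the abstract only says data are written to disk periodically), and the flushed blocks are held in a Python list that merely stands in for on-disk index blocks.

```python
from collections import defaultdict


class BufferedInvertedIndex:
    """Toy inverted index: a RAM write buffer plus immutable flushed blocks.

    Hypothetical illustration only; the flushed blocks are kept in a Python
    list here and merely stand in for on-disk index blocks.
    """

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # docs buffered before a flush
        self.ram_postings = defaultdict(list)   # term -> doc ids (RAM cache)
        self.buffered_docs = 0
        self.disk_blocks = []                   # each block: term -> doc ids

    def add_document(self, doc_id, text):
        # New data goes to the RAM cache first, so writes stay cheap.
        for term in set(text.lower().split()):
            self.ram_postings[term].append(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Batch step: move the whole buffer into a new immutable block.
        if self.buffered_docs == 0:
            return
        self.disk_blocks.append(dict(self.ram_postings))
        self.ram_postings = defaultdict(list)
        self.buffered_docs = 0

    def search(self, term):
        # A query merges postings from every flushed block and the RAM cache.
        term = term.lower()
        hits = []
        for block in self.disk_blocks:
            hits.extend(block.get(term, []))
        hits.extend(self.ram_postings.get(term, []))
        return sorted(hits)


index = BufferedInvertedIndex(flush_threshold=2)
index.add_document(1, "Lucene inverted index")
index.add_document(2, "inverted index on disk")   # triggers a flush
index.add_document(3, "Lucene in memory")         # stays in the RAM cache
result = index.search("lucene")
```

Here `result` combines document 1 from the flushed block with document 3 from the RAM cache, so data that has not yet been flushed is still visible to queries. A real system would presumably also merge small blocks and persist them to disk, which this sketch leaves out.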


Memo

Memo:
-
Last Update: 2015-06-30