[1] Zhang Hanlong, Shen Beijun, Wang Yongjian. Illegal website identification method based on template detection[J]. Journal of Nanjing University of Science and Technology (Natural Science Edition), 2015, 39(03): 266.

Illegal website identification method based on template detection

Journal of Nanjing University of Science and Technology (Natural Science Edition) [ISSN: 1005-9830 / CN: 32-1397/N]

Volume: 39
Issue: 2015, No. 03
Page: 266
Column:
Publication date: 2015-06-30

Article Info

Title:
Illegal website identification method based on template detection
Author(s):
Zhang Hanlong (1), Shen Beijun (1), Wang Yongjian (2)
1. School of Software, Shanghai Jiao Tong University, Shanghai 200240, China; 2. The Third Research Institute of Ministry of Public Security, Shanghai 200031, China
Keywords:
template detection; illegal website identification; similarity degree; clustering; graph mining; gambling websites
CLC number:
TP311
Abstract:
This paper proposes a new method for identifying illegal websites efficiently. Feature values are extracted from HTTP POST requests and hashed; the similarity between websites is computed from matching hash values; the websites in a large uncategorized corpus are clustered, and templates of illegal websites are extracted from the clusters and used to classify unknown websites. Graph mining is applied to filter out legal websites, which improves identification efficiency. Taking gambling websites as an example, the method was evaluated in large-scale experiments in a real environment. The results show that: the method detects gambling websites with a precision of 1; compared with URL, HTML, and semantic features, HTTP POST features achieve the best F-Measure; and filtering legal websites with graph mining is effective and improves the running efficiency of the whole pipeline by 20%.
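
The abstract gives the pipeline only in outline, so the following Python sketch is one possible reading of it, not the authors' implementation. It assumes that the features are the parameter names of a site's HTTP POST requests, that they are hashed with SHA-256, that similarity is the Jaccard index over hash sets, that clustering is a greedy single pass against each cluster's first member, and that a template is the set of hashes shared by all members of a cluster that an analyst has labeled illegal. All function names, the example host names, and the 0.6 threshold are hypothetical. The graph-mining filter is not covered.

```python
import hashlib
from typing import Dict, FrozenSet, Iterable, List


def post_features(param_names: Iterable[str]) -> FrozenSet[str]:
    """Hash each HTTP POST parameter name (assumed feature choice)."""
    return frozenset(
        hashlib.sha256(name.encode("utf-8")).hexdigest() for name in param_names
    )


def similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard index of two hash sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sites(
    sites: Dict[str, FrozenSet[str]], threshold: float
) -> List[List[str]]:
    """Greedy single-pass clustering: a site joins the first cluster whose
    first member is at least `threshold` similar, else it starts a new one."""
    clusters: List[List[str]] = []
    for name, feats in sites.items():
        for members in clusters:
            if similarity(feats, sites[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def extract_template(
    members: List[str], sites: Dict[str, FrozenSet[str]]
) -> FrozenSet[str]:
    """Template = hashes shared by every site in the cluster (assumption)."""
    return frozenset.intersection(*(sites[m] for m in members))


def matches_template(
    feats: FrozenSet[str], template: FrozenSet[str], threshold: float
) -> bool:
    """Flag a site whose features are close enough to the template."""
    return similarity(feats, template) >= threshold


# Hypothetical corpus: two sites built from one template, one unrelated site.
sites = {
    "casino-a.example": post_features(["username", "password", "bet_amount", "game_id"]),
    "casino-b.example": post_features(["username", "password", "bet_amount", "game_id", "promo"]),
    "shop.example": post_features(["cart_id", "sku", "qty"]),
}
clusters = cluster_sites(sites, threshold=0.6)

# Assume an analyst has labeled the first cluster as gambling sites.
template = extract_template(clusters[0], sites)
unknown = post_features(["username", "password", "bet_amount", "game_id", "lang"])
print(clusters)
print(matches_template(unknown, template, threshold=0.6))
```

Here the two casino sites share four of five hashed parameters (similarity 0.8) and fall into one cluster, while the shop forms its own. The unknown site matches the extracted template with similarity 0.8. Comparing against only the first cluster member keeps the sketch short; a real system would more likely compare against a centroid or all members.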

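Memo

Received: 2014-09-19; revised: 2015-01-07
Funding: National Natural Science Foundation of China (61472242); Open Fund of the Third Research Institute of Ministry of Public Security (C13610)
About the authors: Zhang Hanlong (b. 1989), male, M.S.; main research interests: data mining, graph mining, and illegal website identification; E-mail: hanlongzhang@foxmail.com. Corresponding author: Shen Beijun (b. 1969), female, Ph.D., associate professor; main research interests: software engineering and data mining; E-mail: bjshen@sjtu.edu.cn.
Citation format: Zhang Hanlong, Shen Beijun, Wang Yongjian. Illegal website identification method based on template detection[J]. Journal of Nanjing University of Science and Technology, 2015, 39(3): 266-271.
Submission website: http://zrxuebao.njust.edu.cn
Last Update: 2015-06-30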