Illegal website identification method based on template detection


Zhang Hanlong1, Shen Beijun1, Wang Yongjian2
1. School of Software, Shanghai Jiao Tong University, Shanghai 200240, China; 2. The Third Research Institute of Ministry of Public Security, Shanghai 200031, China
Keywords: template detection; illegal website identification; similarity degree; clustering; graph mining; gambling websites
A new method is proposed for identifying illegal websites efficiently. Essential information extracted from HTTP POST requests is hashed, and the similarity between websites is measured by the degree to which their hash values match. Illegal website templates are extracted from a large uncategorized corpus by clustering, and unknown websites are then classified against these templates. Identification efficiency is further improved by filtering out legal websites using graph mining. The method was tested at large scale on gambling websites in a real environment. The results show that the method identifies gambling websites with a precision of 1; that HTTP POST features yield the best F-measure compared with URL, HTML, and semantic features; and that graph mining filters legal websites effectively, improving operational efficiency by 20%.
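
The hashing, similarity, and clustering steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify which POST fields are hashed, how the match degree is computed, or which clustering algorithm is used, so everything below is an assumption made for illustration: the fingerprint is a SHA-1 hash of the request path plus the sorted form-field names, the similarity is the Jaccard index over the two sites' hash sets, and the clustering is a simple single-link grouping with a threshold. The function names (`post_fingerprint`, `site_signature`, `similarity`, `cluster_sites`) and the example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl


def post_fingerprint(url: str, body: str) -> str:
    """Hash the structural part of one POST request.

    Assumption: the request path and the sorted form-field names stand in
    for the "essential information"; field values and the host are ignored.
    """
    path = urlsplit(url).path
    fields = sorted({name for name, _ in parse_qsl(body, keep_blank_values=True)})
    key = path + "?" + "&".join(fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def site_signature(posts):
    """Collect the fingerprints of all observed (url, body) POSTs of one site."""
    return frozenset(post_fingerprint(url, body) for url, body in posts)


def similarity(sig_a, sig_b) -> float:
    """Jaccard index of two hash sets (an assumed measure of hash-match degree)."""
    if not sig_a or not sig_b:
        return 0.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


def cluster_sites(signatures: dict, threshold: float = 0.5):
    """Single-link clustering: sites whose similarity reaches the threshold
    end up in the same group. Implemented with a small union-find."""
    parent = {name: name for name in signatures}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    names = list(signatures)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(signatures[a], signatures[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


if __name__ == "__main__":
    # Hypothetical traffic: sites "a" and "b" share one template, "c" does not.
    sigs = {
        "a": site_signature([("http://a.example/login.php", "user=x&pwd=y"),
                             ("http://a.example/bet.php", "amount=1&game=2")]),
        "b": site_signature([("http://b.example/login.php", "pwd=1&user=2"),
                             ("http://b.example/bet.php", "game=9&amount=5")]),
        "c": site_signature([("http://c.example/search", "q=hello")]),
    }
    print(cluster_sites(sigs))
```

In this sketch a cluster of mutually similar sites plays the role of a template: an unknown site would be assigned to a template when its signature's similarity to a cluster member reaches the threshold. Ignoring the host and the field values is a deliberate choice here, since sites generated from the same template would be expected to differ in domain and submitted data but not in form structure.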


