Computer Systems Applications (English version): 2010, 19(3): 49-52
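Research and Implementation of a Crawler Search Strategy for Topic-Specific Search Engines
(Information Research Institute, Beijing University of Chemical Technology, Beijing 100029)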
Search Strategy and Achieve of the Topic Search Engine Spider
Received: June 6, 2009
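Abstract (translated from the Chinese): Based on the structural characteristics of web pages, this paper proposes a method that predicts a page's topical relevance by passing topic information from page to page, which resolves the channel-blocking and missed-page problems of topic crawlers. First, a relevance information value is passed along according to the anchor text: if the anchor text indicates that the link is relevant, the relevance threshold is passed on directly; if it indicates that the link is irrelevant, the value is multiplied by the genetic inheritance ratio before being passed on. If a relevant page is encountered during propagation, the link's relevance information value is restored to its initial value. Finally, the recall and precision of the algorithm are verified against experimental results; recall improves significantly.
Keywords (translated from the Chinese): web crawler; search engine; topic relevance; inheritance; crawling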
Abstract: Based on the structural characteristics of web pages, this paper proposes a method that predicts the topical relevance of a page by propagating topic information between pages, and that addresses the problems of channel blocking and missed pages in topic crawlers. First, a relevance information value is passed along each link according to its anchor text. If the anchor text indicates that the link is relevant, the relevance threshold is passed on directly. Otherwise, it is multiplied by the genetic ratio before being passed on. During propagation, the relevance information value is reset to its initial value whenever a relevant web page is encountered. Finally, the experimental results show that the recall ratio is greatly improved.
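The propagation rule described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `propagate` and `should_follow`, the constants `INITIAL_VALUE`, `GENETIC_RATIO`, and `MIN_VALUE`, and their numeric values are all assumptions made here. The abstract does not say how a cutoff is applied, nor whether the reset on a relevant page happens before or after the anchor-text rule; the sketch assumes the reset comes first.

```python
# Hypothetical constants; the abstract gives no concrete values.
INITIAL_VALUE = 1.0   # assumed initial relevance information value
GENETIC_RATIO = 0.5   # assumed "genetic ratio" applied to irrelevant anchors
MIN_VALUE = 0.2       # assumed cutoff below which a link is no longer followed


def propagate(parent_value: float, page_relevant: bool, anchor_relevant: bool) -> float:
    """Return the relevance value carried by one outgoing link.

    parent_value    -- value inherited by the current page from its in-link
    page_relevant   -- whether the current page itself is judged on-topic
    anchor_relevant -- whether the link's anchor text is judged on-topic
    """
    # A relevant page restores the value to the initial value (assumed order).
    base = INITIAL_VALUE if page_relevant else parent_value
    # A relevant anchor passes the value on unchanged; otherwise it decays.
    return base if anchor_relevant else base * GENETIC_RATIO


def should_follow(value: float) -> bool:
    """Assumed stopping rule: follow a link only while its value is high enough."""
    return value >= MIN_VALUE


if __name__ == "__main__":
    value = INITIAL_VALUE
    # Walk through a chain of off-topic pages with off-topic anchors.
    for depth in range(1, 4):
        value = propagate(value, page_relevant=False, anchor_relevant=False)
        print(depth, value, should_follow(value))
```

Under these assumed values, a crawler can pass through two consecutive off-topic pages before the link value drops below the cutoff, which is how a decaying value can let the crawler tunnel through irrelevant pages to reach relevant ones without crawling indefinitely.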
Citation:
LIU Shu-Mei, XIA Liang, XU Nan-Shan. Search Strategy and Achieve of the Topic Search Engine Spider. COMPUTER SYSTEMS APPLICATIONS, 2010, 19(3): 49-52