Smart Crawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces


As the web grows at a very fast pace, there has been increased interest in techniques
that help efficiently locate deep-web interfaces. The deep web, i.e., content hidden
behind HTML forms, has long been recognized as a significant gap in search engine
coverage. Because it represents a large portion of the structured data on the web,
accessing deep-web content has been a long-standing challenge for the database
community [1]. The rapid growth of the World Wide Web poses unprecedented scaling
challenges for general-purpose crawlers and web search engines. In particular, due to
the large volume of web resources and the dynamic nature of the deep web, achieving
wide coverage and high efficiency is a challenging problem. We propose a
two-stage framework, namely Smart Crawler, for efficiently harvesting deep-web
interfaces; each stage performs a different procedure [2]. In the first stage,
Smart Crawler performs site-based searching for center pages with the help of search
engines, which avoids visiting a large number of pages. To achieve more accurate
results for a focused crawl, Smart Crawler ranks websites to prioritize those highly
relevant to the topic given by the user. In the second stage, Smart Crawler achieves
fast in-site searching by extracting the most relevant links with adaptive link
ranking [3]. To eliminate bias toward visiting some highly relevant links in hidden
web directories, we design a link tree data structure to achieve wider coverage of a
given website or URL.
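
The idea behind the link tree can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: it assumes a simplified one-level tree keyed by the first directory of each URL path, and the names build_link_tree and select_balanced, as well as the example.com URLs, are hypothetical. Links are taken one per branch in turn, so a directory with many relevant links cannot crowd out the others.

```python
from collections import OrderedDict
from itertools import zip_longest
from urllib.parse import urlparse


def build_link_tree(urls):
    """Group URLs by their first path directory (a one-level 'link tree')."""
    tree = OrderedDict()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # A URL with at most one path segment is filed under the root branch "".
        branch = segments[0] if len(segments) > 1 else ""
        tree.setdefault(branch, []).append(url)
    return tree


def select_balanced(tree, k):
    """Pick up to k links, taking one link per branch in turn (round-robin)."""
    if k <= 0:
        return []
    picked = []
    # zip_longest pads shorter branches with None, which is skipped below.
    for layer in zip_longest(*tree.values()):
        for url in layer:
            if url is not None:
                picked.append(url)
                if len(picked) == k:
                    return picked
    return picked


# Hypothetical in-site links: three under /books, one under /music, one at the root.
links = [
    "http://example.com/books/a.html",
    "http://example.com/books/b.html",
    "http://example.com/books/c.html",
    "http://example.com/music/x.html",
    "http://example.com/about.html",
]
chosen = select_balanced(build_link_tree(links), 3)
```

With a budget of three links, the sketch selects one link from each of the three branches instead of all three links from /books. A fuller version would nest branches by every path level and combine this balancing with the link-ranking score.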

Our results on a set of representative domains show the agility and accuracy of the
proposed crawler framework. Smart Crawler efficiently retrieves deep-web interfaces
from large-scale sites and achieves higher harvest rates than other crawlers.

Keywords: Smart Crawler, site locating, in-site exploring, classification, ranking.

