
Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment

Formal Metadata

Title
Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment
Title of Series
Number of Parts
155
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Data enrichment is the act of extending a local database with new attributes from external data sources. In this paper, we study a novel problem: how to progressively crawl the deep web (i.e., a hidden database) through a keyword-search API to enrich a local database in an effective way. This is challenging because these interfaces often limit the data access by enforcing the top-k constraint or limiting the number of queries that can be issued within a time window. In response, we propose SmartCrawl, a new framework to collect results effectively. Given a query budget b, SmartCrawl first constructs a query pool based on the local database, and then iteratively issues a set of most beneficial queries to the hidden database such that the union of the query results can cover the maximum number of local records. The key technical challenge is how to estimate query benefit, i.e., the number of local records that can be covered by a given query. A simple approach is to estimate it as the query frequency in the local database. We find that this is ineffective due to i) the impact of |ΔD|, where |ΔD| represents the number of local records that cannot be found in the hidden database, and ii) the top-k constraint enforced by the hidden database. We study how to mitigate the negative impacts of the two factors and propose effective optimization techniques to improve performance. The experimental results show that on both simulated and real-world hidden databases, SmartCrawl significantly increases coverage over the local database as compared to the baselines.
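
The crawl loop described in the abstract can be sketched roughly as follows. This is a minimal illustration of the simple frequency-based baseline only, not SmartCrawl's actual estimators, which the abstract does not spell out. All names (`build_query_pool`, `search_hidden`, `naive_crawl`) are hypothetical, the hidden database is simulated as an in-memory list, and a local record is assumed to count as covered only when an identical string comes back in the results.

```python
from itertools import combinations


def tokenize(record):
    """Lower-cased keyword set of a record (assumption: whitespace tokenization)."""
    return frozenset(record.lower().split())


def build_query_pool(local_db, max_keywords=2):
    """Collect every keyword combination (up to max_keywords) seen in a local record."""
    pool = set()
    for record in local_db:
        words = sorted(tokenize(record))
        for n in range(1, max_keywords + 1):
            pool.update(frozenset(c) for c in combinations(words, n))
    return pool


def matches(query, record):
    """Conjunctive keyword match: every query keyword appears in the record."""
    return query <= tokenize(record)


def search_hidden(hidden_db, query, k):
    """Simulated keyword-search API that returns at most k matches (top-k constraint)."""
    return [r for r in hidden_db if matches(query, r)][:k]


def naive_crawl(local_db, hidden_db, budget, k):
    """Greedy crawl that estimates query benefit as frequency in the local database."""
    pool = build_query_pool(local_db)
    uncovered = set(local_db)
    covered = set()
    for _ in range(budget):
        if not pool or not uncovered:
            break
        # Naive benefit estimate: number of still-uncovered local records that
        # contain the query. Ties are broken alphabetically for determinism.
        best = max(
            pool,
            key=lambda q: (sum(matches(q, r) for r in uncovered), tuple(sorted(q))),
        )
        pool.discard(best)
        results = set(search_hidden(hidden_db, best, k))
        newly_covered = uncovered & results
        covered |= newly_covered
        uncovered -= newly_covered
    return covered


local_db = ["thai noodle house", "thai pepper", "burger palace", "closed diner"]
hidden_db = ["thai noodle house", "thai pepper", "thai garden",
             "burger palace", "sushi bar"]

print(sorted(naive_crawl(local_db, hidden_db, budget=2, k=2)))
```

In this toy data, "closed diner" plays the role of a record in ΔD: it exists locally but not in the hidden database, so any query spent on it is wasted, and the frequency estimate has no way of knowing that. Lowering `k` shows the second effect: a frequent query such as "thai" can match more hidden records than the interface returns, so its real coverage falls below its local frequency.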