论文标题

网页上的变更率估计和最佳新鲜度

Change Rate Estimation and Optimal Freshness in Web Page Crawling

论文作者

Avrachenkov, Konstantin, Patil, Kishor, Thoppe, Gugan

论文摘要

为了提供快速准确的结果,搜索引擎可维护整个网络的本地快照。而且,为了使此本地高速缓存更新,它采用了一个爬行者来跟踪各种网页上的更改。但是,有限的带宽可用性和服务器限制对爬行频率施加了一些限制。因此,理想的爬行率是最大化当地缓存的新鲜度并尊重上述约束的速率。 Azar等。 2018年最近提出了一种可解决的算法来解决此优化问题。但是,他们假设对确切的页面变更率的了解,这在实践中是不现实的。我们在这里解决这个问题。具体而言,我们提供了两种新型方案,用于在线估计页面变更率。这两个方案都只需要有关页面更改过程的部分信息,即,他们只需要知道自上次爬行实例以来页面是否已更改。对于这两种方案,我们都证明了收敛性,并且也得出了它们的收敛速率。最后,我们提供了一些数值实验,以将我们提出的估计器的性能与现有估计值(例如MLE)进行比较。

For providing quick and accurate results, a search engine maintains a local snapshot of the entire web. And, to keep this local cache fresh, it employs a crawler for tracking changes across various web pages. However, finite bandwidth availability and server restrictions impose some constraints on the crawling frequency. Consequently, the ideal crawling rates are the ones that maximise the freshness of the local cache and also respect the above constraints. Azar et al. 2018 recently proposed a tractable algorithm to solve this optimisation problem. However, they assume the knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide two novel schemes for online estimation of page change rates. Both schemes only need partial information about the page change process, i.e., they only need to know if the page has changed or not since the last crawled instance. For both these schemes, we prove convergence and, also, derive their convergence rates. Finally, we provide some numerical experiments to compare the performance of our proposed estimators with the existing ones (e.g., MLE).

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源