UniCrawl: A Practical Geographically Distributed Web Crawler
Do Le Quoc, Christof Fetzer, Pascal Felber, Etienne Rivière, Valerio Schiavoni & Pierre Sutra
Abstract |
As the wealth of information available on the web keeps growing,
being able to harvest massive amounts of data has become a major
challenge. Web crawlers are the core components to retrieve such
vast collections of publicly available data. The key limiting
factor of any crawler architecture is, however, its large
infrastructure cost. To reduce this cost, and in particular the
high upfront investments, we present in this paper a
geo-distributed crawler solution, UniCrawl. UniCrawl orchestrates
several geographically distributed sites. Each site operates an
independent crawler and relies on well-established techniques for
fetching and parsing the content of the web. UniCrawl splits the
crawled domain space across the sites and federates their storage
and computing resources, while minimizing the inter-site
communication cost. To assess our design choices, we evaluate
UniCrawl in a controlled environment using the ClueWeb12 dataset,
and in the wild when deployed over several remote locations. We
conducted several experiments over 3 sites spread across Germany.
When compared to a centralized architecture with a crawler simply
stretched over several locations, UniCrawl shows a performance
improvement of 93.6% in terms of network bandwidth consumption, and
a speedup factor of 1.75. |
Keywords |
web crawler, geo-distributed system, cloud federation, storage, map-reduce. |
Citation | D. Le Quoc, et al., "UniCrawl: A Practical Geographically Distributed Web Crawler," in IEEE Cloud 2015: 8th IEEE International Conference on Cloud Computing, New York, USA, 2015. |
Type | Conference proceedings (English) |
Conference name | IEEE Cloud 2015: 8th IEEE International Conference on Cloud Computing (New York, USA) |
Conference date | 27-6-2015 |
Publisher | IEEE |
URL | http://www.thecloudcomputing.org/2015/ |
Related project | LEADS: Large-Scale Elastic Architecture for Data as a Ser... |