LEADS: Large-Scale Elastic Architecture for Data as a Service
Description
LEADS is a research project funded by the European Commission under the Seventh Framework Programme (FP7). Three universities and four companies team up to build and demonstrate a novel cloud service model, named Data-as-a-Service, on top of an innovative infrastructure based on multiple energy-conscious micro-clouds.
LEADS addresses the demand of companies wishing to exploit the wealth of public data available on today's Internet. Large IT-oriented companies can crawl, store, and query large amounts of data on their own premises. Small companies, and companies that are not in the Internet analytics business, may still want to:
- Analyze the Web graph to extract business intelligence;
- Match company-specific private data to public data;
- Monitor the evolution of public data in real time to detect trends, identify opinion leaders, etc.;
- Propose novel data-enabled services: complex graph analytics, real-time data aggregation, and more.
Yet many companies may be unable or unwilling to crawl, store, and process such data in-house, because of the associated high cost and complexity, or simply because they lack the necessary expertise. Can these companies rely on larger ones for access to and processing of public content? Querying capabilities, data freshness, and comprehensiveness would all depend on the provider's good will, and there are few guarantees of confidentiality. Data is power, and power is seldom willingly shared.
LEADS proposes Data-as-a-Service as a solution to the need of small actors to take advantage of big public data: it mutualizes the costs of extracting, storing, and processing public data, while offering rich and extensible capabilities, including privacy-protecting querying over public and private data, data updated in real time, and more.
Principal investigator
Status
Completed
Start date
1 October 2013
End date
30 September 2015
Organisations
Project website
Internal identifier
27796
identifier
6 Results
Showing items 1 - 6 of 6
- Publication (metadata only): TOPiCo: Detecting Most Frequent Items from Multiple High-Rate Event Streams (ACM, 2015-06-29)
  Matos, Miguel; Oliveira, Rui
  Systems such as social networks, search engines or trading platforms operate geographically distant sites that continuously generate streams of events at high rate. Such events can be access logs to web servers, feeds of messages from participants of a social network, or financial data, among others. The ability to detect trends and popularity variations in a timely fashion is of paramount importance in such systems. In particular, determining the most popular events across all sites allows capturing the most relevant information in near real time and quickly adapting the system to the load. This paper presents TOPiCo, a protocol that computes the most popular events across geo-distributed sites in a low-cost, bandwidth-efficient and timely manner. TOPiCo starts by building the set of most popular events locally at each site. Then, it disseminates only events that have a chance to be among the most popular ones across all sites, significantly reducing the required bandwidth. We give a correctness proof of our algorithm and evaluate TOPiCo using a real-world trace of more than 240 million events spread across 32 sites. Our empirical results show that (i) TOPiCo is timely and cost-efficient for detecting popular events in a large-scale setting, (ii) it adapts dynamically to the distribution of the events, and (iii) our protocol is particularly efficient for skewed distributions.
- Publication (metadata only): A Practical Distributed Universal Construction with Unknown Participants (2014-12-18)
  Modern distributed systems employ atomic read-modify-write primitives to coordinate concurrent operations. Such primitives are typically built on top of a central server, or rely on an agreement protocol. Both approaches provide a universal construction, that is, a general mechanism to construct atomic and responsive objects. These two techniques are however known to be inherently costly. As a consequence, they may result in bottlenecks in applications using them for coordination. In this paper, we investigate another direction to implement a universal construction. Our idea is to delegate the implementation of the universal construction to the clients, and solely implement a distributed shared atomic memory on the server side. The construction we propose is obstruction-free. It can be implemented in a purely asynchronous manner, and it does not assume knowledge of the participants. It is built on top of grafarius and racing objects, two novel shared abstractions that we introduce in detail. To assess the benefits of our approach, we present a prototype implementation on top of the Cassandra data store, and compare it empirically to the ZooKeeper coordination service.
- Publication (metadata only): Construction universelle d'objets partagés sans connaissance des participants [Universal construction of shared objects without knowledge of the participants] (2015-06-02)
  A universal construction is an algorithm that allows a set of concurrent processes to access a shared object as if it were available locally. We present an algorithm implementing such a construction in a shared-memory system. Our construction is lock-free and, unlike previously proposed approaches, does not require the processes accessing the shared object to be known in advance. Moreover, it is adaptive: with n the total number of processes in the system and k < n the number of processes that use the shared object, every process performs Θ(k) computation steps in the absence of contention.
- Publication (metadata only): ZooFence: Principled Service Partitioning and Application to the ZooKeeper Coordination Service
  Cloud computing infrastructures leverage fault-tolerant and geographically distributed services in order to meet the requirements of modern applications. Each service deals with a large number of clients that compete for the resources it offers. When the load increases, the service needs to scale. In this paper, we investigate a scalability solution which consists in partitioning the service state. We formulate specific conditions under which a service is partitionable. Then, we present a general algorithm to build a dependable and consistent partitioned service. To assess the practicability of our approach, we implement and evaluate the ZooFence coordination service. ZooFence orchestrates several instances of ZooKeeper and presents the exact same API and semantics to its clients. It automatically splits the coordination service state among ZooKeeper instances while remaining transparent to the application. By reducing the convoy effect on operations and leveraging workload locality, our approach allows a coordination service with greater scalability than a single ZooKeeper instance.
The evaluation of ZooFence assesses this claim for two benchmarks, a synthetic service of concurrent queues and the BookKeeper distributed logging engine.
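The key-routing idea behind the ZooFence entry above can be illustrated with a small sketch. The class below is a hypothetical stand-in, not ZooFence's actual algorithm: it routes each key to one of several backing stores by hashing the key's top-level path prefix, so that operations on independent subtrees (for example, different queues) never contend on the same backend.

```python
import hashlib

class PartitionedStore:
    """Hypothetical sketch of service-state partitioning: each key's
    top-level prefix (e.g. "queues" in "/queues/q1") determines which
    backing partition holds it, so independent subtrees never contend.
    Real ZooFence orchestrates ZooKeeper instances and additionally
    handles operations that span partitions."""

    def __init__(self, n_partitions):
        # Plain dicts stand in for independent ZooKeeper-like instances.
        self.partitions = [dict() for _ in range(n_partitions)]

    def _pick(self, key):
        # Hash only the top-level prefix so a whole subtree is co-located.
        prefix = key.split("/", 2)[1]          # "/queues/q1" -> "queues"
        digest = hashlib.sha256(prefix.encode()).digest()
        return self.partitions[digest[0] % len(self.partitions)]

    def put(self, key, value):
        self._pick(key)[key] = value

    def get(self, key):
        return self._pick(key).get(key)

store = PartitionedStore(3)
store.put("/queues/q1", "job-42")
store.put("/locks/l1", "owner-7")
print(store.get("/queues/q1"))  # job-42
```

Because the prefix, not the full key, is hashed, all keys of one subtree land on the same partition, which preserves per-subtree ordering while spreading unrelated subtrees across backends.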
- Publication (metadata only): On the Support of Versioning in Distributed Key-Value Stores (IEEE, 2014-10-06)
  Coehlo, Fábio; Oliveira, Rui; Matos, Miguel; Vilaça, Ricardo
  The ability to access and query data stored in multiple versions is an important asset for many applications, such as Web graph analysis, collaborative editing platforms, data forensics, or correlation mining. The storage and retrieval of versioned data requires a specific API and support from the storage layer. The choice of the data structures used to maintain versioned data has a fundamental impact on the performance of insertions and queries. The appropriate data structure also depends on the nature of the versioned data and the nature of the access patterns. In this paper we study the design and implementation space for providing versioning support on top of a distributed key-value store (KVS). We define an API for versioned data access supporting multiple writers and show that a plain KVS does not offer the necessary synchronization power for implementing this API. We leverage the support for listeners at the KVS level and propose a general construction for implementing arbitrary types of data structures for storing and querying versioned data. We explore the design space of versioned data storage ranging from a flat data structure to a distributed sharded index. The resulting system, ALEPH, is implemented on top of an industrial-grade open-source KVS, Infinispan. Our evaluation, based on real-world Wikipedia access logs, studies the performance of each versioning mechanism in terms of load balancing, latency and storage overhead in the context of different access scenarios.
- Publication (metadata only): UniCrawl: A Practical Geographically Distributed Web Crawler (IEEE, 2015-06-27)
  Le Quoc, Do; Fetzer, Christof
  As the wealth of information available on the web keeps growing, being able to harvest massive amounts of data has become a major challenge. Web crawlers are the core components used to retrieve such vast collections of publicly available data. The key limiting factor of any crawler architecture is however its large infrastructure cost. To reduce this cost, and in particular the high upfront investments, we present in this paper a geo-distributed crawler solution, UniCrawl. UniCrawl orchestrates several geographically distributed sites. Each site operates an independent crawler and relies on well-established techniques for fetching and parsing the content of the web. UniCrawl splits the crawled domain space across the sites and federates their storage and computing resources, while minimizing the inter-site communication cost. To assess our design choices, we evaluate UniCrawl in a controlled environment using the ClueWeb12 dataset, and in the wild when deployed over several remote locations. We conducted several experiments over 3 sites spread across Germany. When compared to a centralized architecture with a crawler simply stretched over several locations, UniCrawl shows a performance improvement of 93.6% in terms of network bandwidth consumption, and a speedup factor of 1.75.
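The domain-space split described in the UniCrawl entry above can be sketched in a few lines. This is an illustrative assumption, not UniCrawl's published assignment scheme (and the site names are made up): every URL of a given domain is deterministically mapped to one crawl site by hashing its domain, so sites crawl disjoint slices of the web and need not exchange URLs of domains they do not own.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical site names standing in for geographically distributed sites.
SITES = ["site-de-1", "site-de-2", "site-de-3"]

def owner_site(url, sites=SITES):
    """Map a URL to the site responsible for crawling its domain.
    Hashing the domain (not the full URL) keeps every page of a
    domain on one site, so per-domain politeness rules and link
    frontiers stay local to that site."""
    domain = urlparse(url).netloc
    digest = hashlib.sha256(domain.encode()).digest()
    return sites[int.from_bytes(digest[:4], "big") % len(sites)]

# All URLs under one domain route to the same site:
a = owner_site("https://example.org/a")
b = owner_site("https://example.org/b/c")
print(a == b)  # True
```

A discovered link is simply forwarded to its owner site, which is the only inter-site traffic this scheme requires; the real system additionally federates the sites' storage and computation.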