Showing items 1 - 10 of 24
  • Publication
    Metadata only
    BRISA: Combining Efficiency and Reliability in Epidemic Data Dissemination
    (2012-5-21)
    Matos, Miguel; Oliveira, Rui
    There is an increasing demand for efficient and robust systems able to cope with today's global needs for intensive data dissemination, e.g., media content or news feeds. Unfortunately, traditional approaches tend to focus on one end of the efficiency/robustness design spectrum, by either leveraging rigid structures such as trees to achieve efficient distribution, or using loosely-coupled epidemic protocols to obtain robustness. In this paper we present BRISA, a hybrid approach combining the robustness of epidemic-based dissemination with the efficiency of tree-based structured approaches. This is achieved by having dissemination structures such as trees implicitly emerge from an underlying epidemic substrate through a judicious selection of links. These links are chosen with local knowledge only and in such a way that the completeness of data dissemination is not compromised, i.e., the resulting structure covers all nodes. Failures are treated as an integral part of the system, as the dissemination structures can be promptly compensated and repaired thanks to the underlying epidemic substrate. Besides presenting the protocol design, we conduct an extensive evaluation in a real environment, analyzing the effectiveness of the structure creation mechanism and its robustness under faults and churn. Results confirm BRISA as an efficient and robust approach to data dissemination at large scale.
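    As a rough illustration of the idea described above, the following sketch (not the authors' code; the node structure, the PUSH/PRUNE message names and the transport layer are assumptions) shows how a tree can emerge from an epidemic neighbor set by keeping the link on which a message first arrives, pruning redundant links, and falling back to the full neighbor set for repair.

```python
# Hedged sketch, not the authors' code: one way a dissemination tree can
# emerge from an epidemic substrate. Each node keeps its epidemic neighbor
# set, treats the first sender of a message as a tree link, and asks
# redundant senders to prune. Names such as Node, PUSH and PRUNE are
# illustrative assumptions.

class Node:
    def __init__(self, node_id, neighbors):
        self.id = node_id
        self.neighbors = set(neighbors)   # links of the epidemic substrate
        self.parent = {}                  # message id -> neighbor kept as parent
        self.pruned = set()               # neighbors asked not to push to us

    def on_push(self, msg_id, payload, sender):
        if msg_id not in self.parent:
            # First copy received: this link becomes a tree edge.
            self.parent[msg_id] = sender
            self.deliver(msg_id, payload)
            for n in self.neighbors - {sender} - self.pruned:
                self.send(n, ("PUSH", msg_id, payload))
        else:
            # Redundant copy: ask the sender to stop pushing to us. The
            # epidemic neighbor set is kept so the tree can be repaired.
            self.send(sender, ("PRUNE",))

    def on_prune(self, sender):
        self.pruned.add(sender)

    def on_neighbor_failure(self, failed):
        # Repair: fall back to the epidemic substrate and reactivate links.
        self.neighbors.discard(failed)
        self.pruned.clear()

    def deliver(self, msg_id, payload):
        print(f"{self.id} delivered message {msg_id}")

    def send(self, dest, message):
        pass  # transport is abstracted away in this sketch
```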
  • Publication
    Metadata only
    TOPiCo: Detecting Most Frequent Items from Multiple High-Rate Event Streams
    (ACM, 2015-6-29)
    Matos, Miguel; Oliveira, Rui
    Systems such as social networks, search engines or trading platforms operate geographically distant sites that continuously generate streams of events at high rate. Such events can be access logs to web servers, feeds of messages from participants of a social network, or financial data, among others. The ability to timely detect trends and popularity variations is of paramount importance in such systems. In particular, determining which events are the most popular across all sites makes it possible to capture the most relevant information in near real-time and to quickly adapt the system to the load. This paper presents TOPiCo, a protocol that computes the most popular events across geo-distributed sites in a low-cost, bandwidth-efficient and timely manner. TOPiCo starts by building the set of most popular events locally at each site. Then, it disseminates only events that have a chance to be among the most popular ones across all sites, significantly reducing the required bandwidth. We give a correctness proof of our algorithm and evaluate TOPiCo using a real-world trace of more than 240 million events spread across 32 sites. Our empirical results show that (i) TOPiCo is timely and cost-efficient for detecting popular events in a large-scale setting, (ii) it adapts dynamically to the distribution of the events, and (iii) our protocol is particularly efficient for skewed distributions.
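    A minimal sketch of the general approach described in the abstract, assuming a simple coordinator-style merge (the actual protocol is decentralized and more refined): each site computes its local top-k, and only those candidates are exchanged, which is also why the merged counts can slightly underestimate the true totals.

```python
# Hedged sketch, not the TOPiCo implementation: compute a local top-k per
# site and exchange only those candidates, so that items with no chance of
# being globally popular never consume bandwidth. The coordinator-style
# merge below is an illustrative simplification.

from collections import Counter
from heapq import nlargest

def local_top_k(events, k):
    """Count the events seen at one site and keep only the k most frequent."""
    counts = Counter(events)
    return dict(nlargest(k, counts.items(), key=lambda kv: kv[1]))

def global_top_k(per_site_candidates, k):
    """Merge the candidate sets disseminated by all sites."""
    merged = Counter()
    for candidates in per_site_candidates:
        merged.update(candidates)
    return nlargest(k, merged.items(), key=lambda kv: kv[1])

# Example with three sites and k = 2.
sites = [
    ["a", "a", "a", "b", "b", "c"],
    ["b", "b", "b", "a", "a", "d"],
    ["a", "a", "a", "c", "c", "b"],
]
candidates = [local_top_k(site, 2) for site in sites]
print(global_top_k(candidates, 2))
# [('a', 8), ('b', 5)] -- approximate, since counts below a site's top-k are dropped
```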
  • Publication
    Metadata only
    Lightweight, Efficient, Robust Epidemic Dissemination
    (2013-1-13)
    Matos, Miguel; Oliveira, Rui
    Gossip-based protocols provide a simple, scalable, and robust way to disseminate messages in large-scale systems. In such protocols, messages are spread in an epidemic manner. Gossiping may take place between nodes using push, pull, or a combination of both. Push-based systems achieve reasonable latency and high resilience to failures, but may impose an unnecessarily large redundancy and overhead on the system. At the other extreme, pull-based protocols impose a lower overhead on the network at the price of increased latencies. A few hybrid approaches have been proposed, typically pushing control messages and pulling data, to avoid the redundancy of high-volume content and single-source streams. Yet, to the best of our knowledge, no other system intermingles push and pull in a multiple-senders scenario in such a way that the data messages of one dissemination help carry the control messages of another and adaptively adjust its rate of operation, further reducing the overall cost and improving both delays and robustness. In this paper, we propose an efficient generic push-pull dissemination protocol, Pulp, which combines the best of both worlds. Pulp exploits the efficiency of push approaches, while limiting redundant messages and therefore imposing a low overhead, as pull protocols do. Pulp leverages the dissemination of multiple messages from diverse sources: by exploiting the push phase of messages to transmit information about other disseminations, Pulp enables an efficient pulling of other messages, which themselves help in turn with the dissemination of pending messages. We deployed Pulp on a cluster and on PlanetLab. Our results demonstrate that Pulp achieves an appealing trade-off between coverage, message redundancy, and propagation delay.
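    A hedged sketch of the push-pull combination described above (not the Pulp implementation; the constants, method names and transport layer are assumptions): pushes carry the identifiers of other known messages, so receivers can later pull anything they are missing.

```python
# Hedged sketch, not the Pulp code: a new message is pushed for a few hops
# and piggybacks the identifiers of other messages the sender holds; nodes
# periodically pull the identifiers they learned about but never received.
# PUSH_TTL, FANOUT and the transport layer are illustrative assumptions.

import random

PUSH_TTL = 2          # how far a message is pushed before pull takes over
FANOUT = 3            # neighbors contacted per gossip round

class PulpNode:
    def __init__(self, node_id, neighbors):
        self.id = node_id
        self.neighbors = list(neighbors)
        self.store = {}           # msg_id -> payload
        self.known_ids = set()    # ids learned through piggybacking

    def publish(self, msg_id, payload):
        self.store[msg_id] = payload
        self._push(msg_id, payload, PUSH_TTL)

    def _push(self, msg_id, payload, ttl):
        for n in random.sample(self.neighbors, min(FANOUT, len(self.neighbors))):
            # Piggyback the ids of all messages we hold so peers can pull them.
            self.send(n, ("PUSH", msg_id, payload, ttl, set(self.store)))

    def on_push(self, msg_id, payload, ttl, advertised_ids, sender):
        if msg_id not in self.store:
            self.store[msg_id] = payload
            if ttl > 0:
                self._push(msg_id, payload, ttl - 1)
        self.known_ids |= advertised_ids

    def pull_round(self):
        # Periodically ask a random neighbor for messages we know of but miss.
        missing = self.known_ids - set(self.store)
        if missing and self.neighbors:
            self.send(random.choice(self.neighbors), ("PULL", missing))

    def send(self, dest, message):
        pass  # transport is abstracted away in this sketch
```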
  • Publication
    Metadata only
    CoFeed: privacy-preserving Web search recommendation based on collaborative aggregation of interest feedback
    (2013-1-13)
    Leonini, Lorenzo; Luu, Toan; Rajman, Martin; Valerio, José
    Search engines essentially rely on the structure of the graph of hyperlinks. Although accurate for mainstream queries, this is not effective when a query is ambiguous. Leveraging semantic information by means of interest matching makes it possible to propose complementary results that are tailored to the user's expectations. This paper proposes a collaborative search companion system, CoFeed, that collects user search queries and considers feedback to build user-centric and document-centric profiling information. Over time, the system constructs ranked collections of elements that maintain the required information diversity and enhance the user search experience by presenting additional results tailored to the user's interest space. This collaborative search companion requires a supporting architecture adapted to large user populations generating high request loads. To that end, it integrates mechanisms for ensuring scalability and load balancing of the service under varying loads and user interest distributions. Moreover, collecting the recommendation data raises the problem of users' privacy and of the bias a single peer can introduce into the system by sending fake recommendations. To address this, CoFeed ensures both publisher anonymity and rate limitation. With the former, the origin of the data is never known by the server that processes it, even if several servers collude to spy on a user. The latter, combined with decoupled authentication, minimizes the influence of cheating peers sending fake recommendations. Experiments with a deployed prototype highlight the efficiency of the system by analyzing improvement in search relevance, computational cost, scalability and load balancing.
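    The following sketch only illustrates the general idea of turning interest feedback into ranked, per-query collections; the decay factor, data layout, and API are assumptions and not CoFeed's actual design.

```python
# Hedged sketch: aggregate interest feedback per query and keep a ranked
# collection of documents, decayed over time so the ranking tracks current
# interests. All names and parameters here are illustrative assumptions.

from collections import defaultdict

class FeedbackIndex:
    def __init__(self, decay=0.95):
        self.scores = defaultdict(lambda: defaultdict(float))  # query -> doc -> score
        self.decay = decay

    def record_feedback(self, query, doc_id, weight=1.0):
        """A user signalled interest in doc_id for this query."""
        self.scores[query][doc_id] += weight

    def age(self):
        """Periodically decay scores so old feedback loses influence."""
        for docs in self.scores.values():
            for doc_id in docs:
                docs[doc_id] *= self.decay

    def recommend(self, query, k=5):
        """Return the k documents with the highest accumulated interest."""
        docs = self.scores.get(query, {})
        return sorted(docs, key=docs.get, reverse=True)[:k]

idx = FeedbackIndex()
idx.record_feedback("jaguar", "wiki/Jaguar_(animal)")
idx.record_feedback("jaguar", "wiki/Jaguar_Cars", weight=2.0)
print(idx.recommend("jaguar"))   # ['wiki/Jaguar_Cars', 'wiki/Jaguar_(animal)']
```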
  • Publication
    Metadata only
    UniCrawl: A Practical Geographically Distributed Web Crawler
    (IEEE, 2015-6-27)
    Le Quoc, Do; Fetzer, Christof
    As the wealth of information available on the web keeps growing, being able to harvest massive amounts of data has become a major challenge. Web crawlers are the core components used to retrieve such vast collections of publicly available data. However, the key limiting factor of any crawler architecture is its large infrastructure cost. To reduce this cost, and in particular the high upfront investments, we present in this paper a geo-distributed crawler solution, UniCrawl. UniCrawl orchestrates several geographically distributed sites. Each site operates an independent crawler and relies on well-established techniques for fetching and parsing the content of the web. UniCrawl splits the crawled domain space across the sites and federates their storage and computing resources, while minimizing the inter-site communication cost. To assess our design choices, we evaluate UniCrawl in a controlled environment using the ClueWeb12 dataset, and in the wild when deployed over several remote locations. We conducted several experiments over 3 sites spread across Germany. When compared to a centralized architecture with a crawler simply stretched over several locations, UniCrawl shows a performance improvement of 93.6% in terms of network bandwidth consumption, and a speedup factor of 1.75.
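    A minimal sketch of splitting the crawled domain space across sites, as described above; the hash-based assignment and the site names are assumptions used purely for illustration.

```python
# Hedged sketch, not UniCrawl's code: assign each web domain to exactly one
# crawling site so that sites never fetch the same domain twice and inter-site
# traffic stays low. Hash-based partitioning and SITES are assumptions.

import hashlib
from urllib.parse import urlparse

SITES = ["site-de-1", "site-de-2", "site-de-3"]   # hypothetical site names

def owning_site(url, sites=SITES):
    """Assign the whole domain of a URL to one site, keeping a domain local."""
    domain = urlparse(url).netloc
    digest = hashlib.sha1(domain.encode()).hexdigest()
    return sites[int(digest, 16) % len(sites)]

def route(frontier_urls):
    """Partition newly discovered URLs so each site crawls only its share."""
    per_site = {s: [] for s in SITES}
    for url in frontier_urls:
        per_site[owning_site(url)].append(url)
    return per_site

print(route(["https://example.org/a", "https://example.org/b",
             "https://example.com/"]))
```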
  • Publication
    Open access
  • Publication
    Metadata only
    WHISPER: Middleware for Confidential Communication in Large-Scale Networks
    A wide range of distributed applications requires some form of confidential communication between groups of users. In particular, the messages exchanged between the users and the identity of group members should not be visible to external observers. Classical approaches to confidential group communication rely upon centralized servers, which limit scalability and represent single points of failure. In this paper, we present WHISPER, a fully decentralized middleware that supports confidential communications within groups of nodes in large-scale systems. It builds upon a peer sampling service that takes into account network limitations such as NATs and firewalls. WHISPER implements confidentiality in two ways: it protects the content of messages exchanged between the members of a group, and it keeps the group memberships secret from external observers. The use of multi-hop paths allows these guarantees to hold even when attackers can observe the link between two nodes or act as content relays for NAT bypassing. Evaluation in real-world settings indicates that the price of confidentiality remains reasonable in terms of network load and processing costs.
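    A conceptual sketch of how multi-hop paths can hide both the content and the group membership from intermediate nodes; this is not WHISPER's protocol, and encrypt/decrypt below are placeholders, not real cryptography.

```python
# Conceptual sketch, not WHISPER's protocol: the sender wraps the message in
# one layer per relay, so each relay learns only the next hop and never the
# content. encrypt/decrypt are placeholders standing in for a real cipher.

def encrypt(key, plaintext):
    return ("enc", key, plaintext)        # placeholder, NOT real encryption

def decrypt(key, ciphertext):
    tag, used_key, plaintext = ciphertext
    assert tag == "enc" and used_key == key
    return plaintext

def build_onion(message, path):
    """path lists (node_id, key) pairs from the first relay to the destination."""
    _dest_id, dest_key = path[-1]
    packet = encrypt(dest_key, ("deliver", message))
    # Wrap outwards: each relay's layer names only the next hop.
    for (node_id, key), (next_id, _) in zip(reversed(path[:-1]), reversed(path[1:])):
        packet = encrypt(key, (next_id, packet))
    return packet

path = [("relay-1", "k1"), ("relay-2", "k2"), ("dest", "k3")]
keys = dict(path)
packet = build_onion("hello group", path)
hop = "relay-1"
while True:
    label, inner = decrypt(keys[hop], packet)
    if label == "deliver":
        print("destination received:", inner)   # content visible only here
        break
    hop, packet = label, inner                   # a relay learns only the next hop
```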
  • Publication
    Metadata only
    SplayNet: Distributed User-Space Topology Emulation
    Network emulation allows researchers to test distributed applications on diverse topologies with fine control over key properties such as delays, bandwidth, congestion, or packet loss. Current approaches to network emulation require dedicated machines and low-level operating system support. They are generally limited to one user deploying a single topology on a given set of nodes, and they require complex management. These constraints restrict the scope and impair the uptake of network emulation by designers of distributed applications. We propose a set of novel techniques for network emulation that operate only in user space, without specific operating system support. Multiple users can simultaneously deploy several topologies on shared physical nodes with minimal setup complexity. A modular network model allows emulating complex topologies, including congestion at inner routers and links, without any centralized orchestration or dedicated machines. We implement our user-space network emulation mechanisms in SplayNet, as an extension of an open-source distributed testbed. Our evaluation with a representative set of applications and topologies shows that SplayNet provides accuracy comparable to that of low-level systems based on dedicated machines, while offering better scalability and ease of use.
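    A hedged sketch of the user-space idea: instead of kernel-level traffic shaping, each message is delayed according to the emulated path. The topology and the simple latency-plus-transmission-time model below are assumptions made for illustration, not SplayNet's actual network model.

```python
# Hedged sketch, not SplayNet's model: emulate a topology purely in user
# space by delaying deliveries according to per-hop latency and bandwidth.
# LINKS, ROUTES and the delay model are illustrative assumptions.

import time

# Emulated links: (src, dst) -> (latency in seconds, bandwidth in bytes/s)
LINKS = {
    ("A", "R"): (0.010, 1_000_000),
    ("R", "B"): (0.020, 500_000),
}
ROUTES = {("A", "B"): ["A", "R", "B"]}     # static routing for the example

def emulated_delay(src, dst, size_bytes):
    """Sum latency and transmission time over every hop of the emulated path."""
    path = ROUTES[(src, dst)]
    delay = 0.0
    for a, b in zip(path, path[1:]):
        latency, bandwidth = LINKS[(a, b)]
        delay += latency + size_bytes / bandwidth
    return delay

def send(src, dst, payload):
    # A real system would schedule delivery asynchronously; sleeping is enough
    # to show where the user-space emulation hooks in.
    time.sleep(emulated_delay(src, dst, len(payload)))
    print(f"{dst} received {len(payload)} bytes from {src}")

send("A", "B", b"x" * 10_000)   # roughly 0.06 s of emulated delay on this path
```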
  • Publication
    Metadata only
    A Performance Evaluation of Erasure Coding Libraries for Cloud-Based Data Stores
    Erasure codes have been widely used over the last decade to implement reliable data stores. They offer interesting trade-offs between efficiency, reliability, and storage overhead. Indeed, a distributed data store holding encoded data blocks can tolerate the failure of multiple nodes while requiring only a fraction of the space necessary for plain replication, albeit at an increased encoding and decoding cost. There exist nowadays a number of libraries implementing several variations of erasure codes, which notably differ in terms of complexity and implementation-specific optimizations. Seven years ago, Plank et al. [14] conducted a comprehensive performance evaluation of the open-source erasure coding libraries available at the time, to compare their raw performance and measure the impact of different parameter configurations. In the present experimental study, we take a fresh look at the state of the art of erasure coding libraries. Not only do we cover a wider set of libraries running on modern hardware, but we also consider their efficiency when used in realistic settings for cloud-based storage, namely when deployed across several nodes in a data centre. Our measurements therefore account for the end-to-end costs of data accesses over several distributed nodes, including the encoding and decoding costs, and shed light on the performance one can expect from the various libraries when deployed in a real system. Our results reveal important differences in the efficiency of the different libraries, notably due to the type of coding algorithm and the use of hardware-specific optimizations.
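    For illustration only, a toy single-parity erasure code (XOR across k data blocks) shows the storage/reliability trade-off such libraries expose; the libraries evaluated in the paper implement far more general codes such as Reed-Solomon, with hardware-specific optimizations.

```python
# Hedged toy example: a single-parity code tolerating the loss of any one
# block, with a storage overhead of 1/k instead of full replication. This is
# only an illustration of the trade-off, not one of the evaluated libraries.

def encode(data_blocks):
    """Return the data blocks plus one XOR parity block."""
    parity = bytes(len(data_blocks[0]))
    for block in data_blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return list(data_blocks) + [parity]

def decode(blocks):
    """blocks is the encoded list with at most one entry set to None."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if not missing:
        return blocks[:-1]
    # XOR of all surviving blocks reconstructs the missing one.
    recovered = bytes(len(next(b for b in blocks if b is not None)))
    for b in blocks:
        if b is not None:
            recovered = bytes(x ^ y for x, y in zip(recovered, b))
    blocks[missing[0]] = recovered
    return blocks[:-1]

data = [b"abcd", b"efgh", b"ijkl"]   # k = 3 data blocks
stored = encode(data)                # 4 blocks stored, ~33% storage overhead
stored[1] = None                     # simulate one failed node
print(decode(stored))                # [b'abcd', b'efgh', b'ijkl']
```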