Options
Kropf, Peter
Résultat de la recherche
Using multiperspective observations to improve data quality in distributed systems
2023, Gkikopoulos, Panagiotis, Schiavoni, Valerio, Kropf, Peter, Josef Spillner
Les systèmes pilotés par les données deviennent rapidement un paradigme de premier plan, l’avènement de l’IA et des systèmes cyber-physiques intelligents devenant une caractéristique déterminante de l’époque moderne. On dit souvent que la qualité des données qui entrent dans ces systèmes est le principal déterminant du comportement et des décisions qu’ils produisent. Il est donc primordial de fournir de meilleures stratégies de gestion et d’amélioration de la qualité des données pour les systèmes autonomes. Un autre attribut de l’infrastructure moderne basée sur les données est sa nature hautement distribuée. En effet, les déploiements de cloud, d’IoT et de continuum/fog sont presque omniprésents dans la pratique actuelle. Dans de nombreux cas, les systèmes pilotés par les données susmentionnés sont déployés sur ce type d’infrastructure en premier lieu. Nous voyons une opportunité de tirer parti de l’ubiquité des ressources de calcul distribuées pour ajouter une couche d’assurance qualité aux données. Notre travail s’inspire de deux sources principales. D’une part, la nécessité de collecter de manière cohérente des données brutes de haute qualité et des informations et mesures dérivées, en présence d’erreurs, de problèmes et d’autres interférences. D’autre part, nous nous inspirons de la fusion de données, une méthodologie couramment utilisée pour combiner des données provenant de différentes sources afin d’obtenir des informations de meilleure qualité que la somme de leurs parties. Nous envisageons une généralisation de la fusion de données à tous les formats d’ensembles de données, en particulier lorsqu’elles sont obtenues en doublons redondants par des observateurs indépendants. Nous appelons ce type d’observation une "observation multiperspective". Notre méthodologie de base consiste à concevoir, mettre en oeuvre et évaluer ce concept d’observation multi-perspective. La première partie est un système d’observateurs indépendants, représentés comme des noeuds dans une architecture distribuée, des collaborateurs dans un projet crowdsourcé ou même simplement des capteurs matériels dans une configuration traditionnelle de fusion de capteurs. Nous commençons par présenter la première partie de cette stratégie d’observation, l’acquisition de données. Nous présentons notre premier scénario motivant, qui consiste à suivre l’évolution de l’écosystème cloud-native dans un observatoire distribué "démocratique". Nous fournissons ensuite notre implémentation de cet observatoire et présentons son utilisation pour comprendre et améliorer le support matériel dans les images Docker. En outre, nous discutons de l’intégration de notre système d’acquisition de données avec des outils de reproductibilité et de preuve des données centrés sur la science des données. Ce travail nous permet également d’étudier les limites et les défis liés à l’obtention de données de la part d’observateurs indépendants. Nous utilisons nos résultats pour développer nos méthodologies dans la partie suivante de ce travail. Nous présentons ensuite la solution que nous proposons pour résoudre les problèmes de qualité des données découverts : Consensus centré sur les données (DCC). En utilisant notre système d’acquisition de données et les données que nous avons obtenues, nous développons une architecture de système pour fusionner les observations en une vue commune de la vérité, sur laquelle tous les observateurs sont d’accord. Nous étudions ensuite les algorithmes que nous pouvons utiliser pour y parvenir, ainsi que les implications de notre système en termes de performances. Enfin, nous nous concentrons sur les algorithmes eux-mêmes et présentons notre propre contribution à l’espace des électeurs définis par logiciel, l’algorithme de vote AVOC, et VDX, une spécification générique pour décrire les électeurs définis par logiciel. Nous évaluons notre contribution à la fois par rapport et en conjonction avec l’état de l’art dans un exemple de fusion de capteurs pour montrer que les algorithmes de vote peuvent en effet augmenter à la fois la qualité des résultats et la performance lorsqu’ils sont correctement combinés avec des approches de fusion de données standard. Dans l’ensemble, cette thèse présente la stratégie de l’observation multi-perspective dans une approche de bout en bout, de l’acquisition des données à la réconciliation des conflits et à la détermination des gains de qualité des données. Nous montrons où cette méthodologie est applicable et fournissons une mise en oeuvre du modèle ainsi qu’une évaluation de ses performances et de ses limites. ABSTRACT Data-driven systems are quickly becoming a prominent paradigm, with the advent of AI and smart cyber-physical systems becoming a defining characteristic of the modern day. It is often stated that the quality of data going into such systems is the primary determinant of the behaviour and decisions they produce. It is thus paramount to provide better strategies for managing and improving data quality for autonomic systems. Another attribute of modern data-driven infrastructure is its highly distributed nature. Indeed cloud, IoT and continuum/fog deployments are almost ubiquitous in current practice. In many cases, the above-mentioned data-driven systems are deployed to such infrastructure in the first place. We see an opportunity to leverage the ubiquity of distributed compute resources to add a layer of quality assurance to data. Our work is inspired by two main sources. On one hand, the need to consistently collect high-quality raw data and derived insights and metrics, in the presence of faults, issues and other interference. On the other hand, we draw inspiration from data fusion, a methodology commonly used to combine data from different sources to obtain insights of higher quality than the sum of its parts. We envision a generalisation of data fusion to all formats of datasets, particularly when obtained in redundant duplicates from independent observers. We call such an observation, a ’multiperspective observation’. Our core methodology is to design, implement and evaluate this concept of multiperspective observation. The first part is a system of independent observers, represented as nodes in a distributed architecture, collaborators in a crowdsourced project or even just hardware sensors in a traditional sensor-fusion setup. We begin by presenting the first part of this observation strategy, data acquisition. We show our first motivating scenario, tracking the evolution of the cloud-native ecosystem in a ’democratic’ distributed observatory. We then provide our implementation of this observatory and present its use in understanding and improving hardware support in Docker images. Further, we discuss the integration of our data acquisition system with data science-centric reproducibility and data provenance tooling. This work also serves to help us study the limitations and challenges of obtaining data from independent observers. We use our findings to develop our methodologies for the next part of this work. Then, we introduce our proposed solution to the discovered issues in data quality: Data-Centric Consensus (DCC). Using our data acquisition system and the data we obtained, we develop a system architecture to merge the observations into a common view of the truth, that all observers agree to. We then investigate the algorithms we can use to achieve this, and the performance implications of our system. Finally, we focus on the algorithms themselves, and present our own contribution to the space of software-defined voters, the AVOC voting algorithm, and VDX, a generic specification for describing software-defined voters. We evaluate our contribution both against and in conjunction with the state-of-the-art in a sensor fusion example to show that voting algorithms can indeed augment both output quality and performance when correctly combined with standard data fusion approaches. All in all, this thesis presents the strategy of multiperspective observation in an end-to-end approach, from acquiring data, to reconciling conflicts to determining the gains in data quality. We show where this methodology is applicable and provide model implementation and an evaluation of both its performance and limitations.
THUNDERSTORM: A Tool to Evaluate Dynamic Network Topologies on Distributed Systems
2019-10-1, Liechti, Luca, Gouveia, Paulo, Neves, João, Kropf, Peter, Matos, Miguel, Schiavoni, Valerio
Abstract—Network dynamics, such as sudden changes in latency or available bandwidth, have a significant impact on the performance of distributed systems. While such dynamics are common, especially in WAN deployments, existing tools lack the capabilities to systematically evaluate the impact of such changes in real systems. We present THUNDERSTORM, a tool to evaluate the impact of dynamic network topologies on the performance of large-scale distributed systems. THUNDERSTORM is a fully functional tool that integrates with Kubernetes and can be used to evaluate off-the-shelf applications. THUNDERSTORM defines an easy-to-use language to describe arbitrarily complex network topologies and dynamic events used to enrich the default container composition descriptors. Our evaluation, using micro- and macro-benchmarks, as well as off-the-shelf unmodified systems (e.g., Apache Cassandra, MariaDB) shows that THUNDERSTORM is easy to use, accurate in reproducing dynamic behaviours and that it can help researchers uncover unexpected behaviours otherwise very costly to reproduce in real deployments typically captured only during malfunctioning periods.
Integrating hydrological modelling, data assimilation and cloud computing for real-time management of water resources
2017-7-1, Kropf, Peter, Kurtz, Wolfgang, Lapin, Andrei, Tang, Qi, Schilling, Oliver, Schiller, Eryk, Braun, Torsten, Hunkeler, Daniel, Vereecken, Harry, Sudicky, Edward, Franssen, Harrie-Jan Hendricks, Brunner, Philip
Online data acquisition, data assimilation and integrated hydrological modelling have become more and more important in hydrological science. In this study, we explore cloud computing for integrating field data acquisition and stochastic, physically-based hydrological modelling in a data assimilation and optimisation framework as a service to water resources management. For this purpose, we developed an ensemble Kalman filter-based data assimilation system for the fully-coupled, physically-based hydrological model HydroGeoSphere, which is able to run in a cloud computing environment. A synthetic data assimilation experiment based on the widely used tilted V-catchment problem showed that the computational overhead for the application of the data assimilation platform in a cloud computing environment is minimal, which makes it well-suited for practical water management problems. Advantages of the cloud-based implementation comprise the independence from computational infrastructure and the straightforward integration of cloud-based observation databases with the modelling and data assimilation platform.
Lessons Learned from Applying Big Data Paradigms to a Large Scale Scientific Workflow
2016-11-14, Kropf, Peter
The increasing amount of data related to the execution of scientific workflows has raised awareness of their shift towards parallel data-intensive problems. In this paper, we deliver our experience with combining the traditional high-performance computing and grid-based approaches for scientific workflows, with Big Data analytics paradigms. Our goal was to assess and discuss the suitability of such data-intensive-oriented mechanisms for production-ready workflows, especially in terms of scalability, focusing on a key element in the Big Data ecosystem: the data-centric programming model. Hence, we reproduced the functionality of a MPI-based iterative workflow from the hydrology domain, EnKF-HGS, using the Spark data analysis framework. We conducted experiments on a local cluster, and we relied on our results to discuss promising directions for further research.
Applying big data paradigms to a large scale scientific workflow: Lessons learned and future directions
2020-6-1, Kropf, Peter, Lapin, Andrei, Carretero, Jesus, CaÃno-Lores, Silvina
The increasing amounts of data related to the execution of scientific workflows has raised awareness of their shift towards parallel data-intensive problems. In this paper, we deliver our experience combining the traditional high-performance computing and grid-based approaches with Big Data analytics paradigms, in the context of scientific ensemble workflows. Our goal was to assess and discuss the suitability of such data-oriented mechanisms for production-ready workflows, especially in terms of scalability. We focused on two key elements in the Big Data ecosystem: the data-centric programming model, and the underlying infrastructure that integrates storage and computation in each node. We experimented with a representative MPI-based iterative workflow from the hydrology domain, EnKFHGS, which we re-implemented using the Spark data analysis framework. We conducted experiments on a local cluster, a private cloud running OpenNebula, and the Amazon Elastic Compute Cloud (AmazonEC2). The results we obtained were analysed to synthesize the lessons we learned from this experience, while discussing promising directions for further research.
A Roadmap for Research in Sustainable Ultrascale Systems
2018, Sousa, Leonel, Kropf, Peter, Kuonen, Pierre, Prodan, Radu, Trinh, Tuan Anh, Carreto, Jesus
The COST Action IC1305 (NESUS) proposes in this research roadmap research objectives and twelve associated recommendations, which in combination, can help bring about the notable changes required to make true the existence of sustainable ultrascale computing systems. Moreover, they are useful for industry and stakeholders to define a path towards ultrascale systems.
A LRAAM-based Partial Order Function for Ontology Matching in the Context of Service Discovery
2017-6-14, Ludolph, Hendrik, Babin, Gilbert, Kropf, Peter
The demand for Software as a Service is heavily increasing in the era of Cloud. With this demand comes a proliferation of third-party service offerings to fulfill it. It thus becomes crucial for organizations to find and select the right services to be integrated into their existing tool landscapes. Ideally, this is done automatically and continuously. The objective is to always provide the best possible support to changing business needs. In this paper, we explore an artificial neural network implementation, an LRAAM, as the specific oracle to control the selection process. We implemented a proof of concept and conducted experiments to explore the validity of the approach. We show that our implementation of the LRAAM performs correctly under specific parameters. We also identify limitations in using LRAAM in this context.
Efficient Broadcasting Algorithm in Harary-like Networks}
2017-8-1, Bhabak, Puspal, Harutyunyan, Hovhannes, Kropf, Peter
In this paper, we analyze the properties of Harary graphs and some derivatives with respect to the achievable performance of communication within network structures based on these graphs. In particular we defined Cordal-Haray graphs on n nodes which can be constructed for any even n for any odd degree between 3 and 2[log n] - 1. We also present a simple algorithm for fast message broadcasting in this network. Our analysis show that when nodes of a Cordal-Harary Graph have logarithmic degree then the broadcasting time will be as small as [log n] which is the minimum possible value for a network on n nodes. All this properties show that Cordal-Harary is a very good network architecture for parallel processing.
Methodological Approach to Data-Centric Cloudific- ation of Scientific Iterative Workflows
2016-12-14, Kropf, Peter
The computational complexity and the constantly increas- ing amount of input data for scientific computing models is threatening their scalability. In addition, this is leading towards more data-intensive scientific computing, thus rising the need to combine techniques and in- frastructures from the HPC and big data worlds. This paper presents a methodological approach to cloudify generalist iterative scientific work- flows, with a focus on improving data locality and preserving perfor- mance. To evaluate this methodology, it was applied to an hydrologi- cal simulator, EnKF-HGS. The design was implemented using Apache Spark, and assessed in a local cluster and in Amazon Elastic Compute Cloud (EC2) against the original version to evaluate performance and scalability.