Applying big data paradigms to a large scale scientific workflow: Lessons learned and future directions
Author(s)
Carretero, Jesus
Caíno-Lores, Silvina
Publication date
2020-6-1
In
Future Gener. Comput. Syst.
Vol.
110
First page
440
Last page
452
Peer reviewed
1
Abstract
The increasing amounts of data related to the execution of scientific workflows have raised awareness
of their shift towards parallel data-intensive problems. In this paper, we report our experience combining
the traditional high-performance computing and grid-based approaches with Big Data analytics
paradigms, in the context of scientific ensemble workflows. Our goal was to assess and discuss the
suitability of such data-oriented mechanisms for production-ready workflows, especially in terms of
scalability. We focused on two key elements in the Big Data ecosystem: the data-centric programming
model, and the underlying infrastructure that integrates storage and computation in each node. We
experimented with a representative MPI-based iterative workflow from the hydrology domain, EnKFHGS,
which we re-implemented using the Spark data analysis framework. We conducted experiments on
a local cluster, a private cloud running OpenNebula, and the Amazon Elastic Compute Cloud (Amazon EC2).
The results we obtained were analysed to synthesize the lessons we learned from this experience, while
discussing promising directions for further research.
Identifiers
Publication type
journal article