DARE Platform

The core components that will be extended, improved and ultimately incorporated into the DARE hyper-platform are:

dispel4py, a high-level streaming dataflow specification API and library. Designed for the requirements that arise when working with huge datasets and demanding computation, dispel4py enables controlled collection of the data-lineage information used by DARE’s tools. It can be taken to production on a wide variety of platforms: large multi-core shared-memory architectures, HPC clusters and clouds running data-intensive systems. Automatically optimised mappings for shared memory, Apache Storm/Spark, Exareme and MPI deliver production performance.
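
To make the idea of a streaming dataflow with lineage collection concrete, here is a minimal sketch in plain Python. It deliberately does not use the dispel4py API: the `source`, `step` and `run_pipeline` names and the `(value, lineage)` record layout are assumptions made for illustration only.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical record type: (value, lineage), where lineage lists the
# names of the processing steps the value has passed through so far.
Record = Tuple[float, List[str]]


def source(values: Iterable[float]) -> Iterator[Record]:
    """Emit each input value as a record with an initial lineage entry."""
    for value in values:
        yield value, ["source"]


def step(name: str, fn: Callable[[float], float],
         stream: Iterator[Record]) -> Iterator[Record]:
    """Lazily apply fn to each record and append the step name to its lineage."""
    for value, lineage in stream:
        yield fn(value), lineage + [name]


def run_pipeline(values: Iterable[float]) -> List[Record]:
    """Chain two steps; records flow through the chain one at a time."""
    stream = source(values)
    stream = step("scale", lambda x: x * 2.0, stream)
    stream = step("offset", lambda x: x + 1.0, stream)
    return list(stream)


results = run_pipeline([1.0, 2.0, 3.0])
for value, lineage in results:
    print(value, " <- ", " -> ".join(lineage))
```

Because each step is a generator, no intermediate dataset is materialised. A real engine would additionally map the steps onto processes or cluster nodes, which this sketch does not attempt.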

S-ProvFlow, a set of components in support of Reproducibility as a Service (RaaS). It includes a NoSQL document store for provenance and lineage metadata, a service layer in the form of a Web API and a suite of interactive provenance access tools. Its data model, S-PROV, specialises the W3C PROV recommendation for data-intensive applications. RaaS addresses the limitations of grids and computational infrastructures in terms of flexible services and tools for managing lineage metadata, from their acquisition and representation to their rapid exploitation. Data-lineage information, stored and accessible through the RaaS layer, can be used at any stage of the cycle.
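
As an illustration of how lineage metadata in a document store can be exploited, the sketch below builds a toy provenance document and walks it backwards. The relation names (`entity`, `activity`, `used`, `wasGeneratedBy`) are borrowed from the W3C PROV vocabulary, but the flat layout, the file names and the `ancestors` helper are assumptions; this is not the S-PROV schema or the S-ProvFlow Web API.

```python
from typing import Any, Dict, Set

# Toy provenance document; the layout is a simplification for illustration.
prov: Dict[str, Any] = {
    "entity": {"raw.dat": {}, "filtered.dat": {}, "plot.png": {}},
    "activity": {"filter": {}, "render": {}},
    "used": [
        {"activity": "filter", "entity": "raw.dat"},
        {"activity": "render", "entity": "filtered.dat"},
    ],
    "wasGeneratedBy": [
        {"entity": "filtered.dat", "activity": "filter"},
        {"entity": "plot.png", "activity": "render"},
    ],
}


def ancestors(doc: Dict[str, Any], entity: str) -> Set[str]:
    """Return every entity that the given entity was derived from."""
    found: Set[str] = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        # Find the activities that generated the current entity ...
        for gen in doc["wasGeneratedBy"]:
            if gen["entity"] != current:
                continue
            # ... and collect the entities those activities used as input.
            for use in doc["used"]:
                if use["activity"] == gen["activity"] and use["entity"] not in found:
                    found.add(use["entity"])
                    frontier.append(use["entity"])
    return found


lineage = ancestors(prov, "plot.png")
print(sorted(lineage))
```

In a document store, the same traversal would typically be expressed as queries over the stored relations rather than as loops over an in-memory dictionary.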

Exareme, a system for large-scale dataflow processing on the cloud. It offers users a declarative language. Exareme is highly configurable, and new functionality is added frequently, including federation, stream processing, compatibility with Apache Spark, lossy and lossless streaming compression, and privacy-preserving data mining.

Semagrow, for semantics and linked-data support. The Semagrow query engine acts as a query federator between heterogeneous linked-data sources, enabling complex queries. Semagrow features sophisticated source selection and query optimization to decide which subqueries must be generated and to which underlying data sources they must be sent. Semagrow also tackles semantic heterogeneity, that is, the fact that sources use different vocabularies to express the relations among their data. It will support the resolution of data and resources based on high-level queries and metadata.
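
The source-selection step of a query federator can be sketched as follows. This is a deliberately naive illustration, not Semagrow's algorithm: the endpoint URLs, the `ex:` predicates, the `SOURCES` catalogue and the `select_sources` function are all hypothetical, and the sketch only matches predicates against a static catalogue, without any cost-based optimization or vocabulary alignment.

```python
from typing import Dict, List, Set, Tuple

# A triple pattern is (subject, predicate, object); "?x" marks a variable.
TriplePattern = Tuple[str, str, str]

# Hypothetical catalogue: which predicates each endpoint can answer.
SOURCES: Dict[str, Set[str]] = {
    "http://example.org/sparql/crops": {"ex:cropName", "ex:grownIn"},
    "http://example.org/sparql/regions": {"ex:regionName", "ex:rainfall"},
}


def select_sources(patterns: List[TriplePattern]) -> Dict[str, List[TriplePattern]]:
    """Group triple patterns into one subquery per endpoint that can answer them."""
    plan: Dict[str, List[TriplePattern]] = {}
    for pattern in patterns:
        predicate = pattern[1]
        for endpoint, predicates in SOURCES.items():
            if predicate in predicates:
                plan.setdefault(endpoint, []).append(pattern)
    return plan


query: List[TriplePattern] = [
    ("?crop", "ex:cropName", "?name"),
    ("?crop", "ex:grownIn", "?region"),
    ("?region", "ex:rainfall", "?mm"),
]

plan = select_sources(query)
for endpoint, subquery in plan.items():
    print(endpoint, subquery)
```

The federator would then send each subquery to its endpoint and join the partial results on the shared variables (here `?region`), a step this sketch leaves out.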

The BigDataEurope Integrator platform (BDI), a customised, cloud-ready and modular integrator platform that brings together production-ready commercial and research components for big-data analytics. It contains tools that help with composing and configuring BDE instances and environments, and with monitoring them during deployment, test and production. BDI is based on the Docker platform because of its flexibility, reusability, popularity and compatibility with all major cloud provisioning services. BDI will provide data management and analytics functionality, and it will be extended to form the basis on which DARE components will be deployed.