Linked Water Dataspace: Real-time pipeline

Waternomics demo

Real-time data ingestion is one of the important requirements for the design of the Waternomics platform. A design choice has been made in favour of following the design principle of  Lambda Architecture for sensor data ingestion.

In this post, we refer to the data flowing from the lowest layer of entire Waternomics platform (i.e, hardware layer) to the data (i,e. data layer) and part of the software layers (i.e., event processing in support services layer).

The dataspace is designed with respect to the lambda architecture that covers both batch and real-time event processing. We use an event processing engine for managing live sensor events, and deploy a scalable message oriented middleware for passing data between real-time sources and applications. We use middleware based on published/subscribe pattern, such as Apache Kafka, and Apache Spark Streaming for real-time aggregation jobs.


Waternomics demo


The real-time aspect of the dataspace is realized via the following elements:

  • A RESTful API adapter for the sensor.
  • The Kafka middleware which forms the backbone of events distribution and reliability guarantee for production/consumption staging.
  • The Spark Streaming map/reduce jobs which do the first shot of aggregation and processing.
  • The Druid real-time node, which moves the dimensional aggregated stream data from the Kafka bus into the Druid deep storage and make it available for further querying.
  • Other components in the dataspace which could consume from the Kafka nodes on the fly such as enrichment of events, RDFization, Collider nodes for matching, etc.

Waternomics demo


In the following is a demo video of the real-time pipline that shows data flowing from sensors to the Waternomics dataspace as well as a demo of querying aggregated data. For more details, we refer the readers to our deliverable D3.1.1 and D3.1.2.