Federated Data Platform for Analytics Applications / Integrated Law Enforcement (D2D CRC)

The Integrated Law Enforcement Project, part of the D2D CRC, aims to develop an open architecture for federated data access over heterogeneous data sources and front-end analytic applications. The project will systematically integrate (meta-)data across different tool chains and will enable analytic processes that leverage tools from multiple vendors. The architecture will support an array of analytic processes, with an initial focus on Identity Resolution. To achieve this, the project will address data and software heterogeneity, the capture and storage of meta-data (including provenance), and the definition and execution of distributed analytic pipelines.

Project Leader: Professor Markus Stumptner
Supervisors: Dr. Wolfgang Mayer, Dr. Georg Grossmann
PhD Students: John Wondoh, Marianne Rieckmann, Hossein Mobasseri, Ruth Frimpong
Research Assistants and Fellows: Dr. Wenhao Li, Dr. Zaiwen Feng, Amir Kashefi


Overview

As part of the Data to Decisions Cooperative Research Centre (D2D CRC), the Integrated Law Enforcement Project is developing an open architecture for federated data access. The architecture will sit atop a heterogeneous ecosystem of existing data sources and front-end analytic applications, translating best practices from Enterprise Application Integration into “Big Data” analytic pipelines. It covers three main areas:

  • data and software heterogeneity;
  • capture and storage of meta-data, including provenance; and
  • definition and execution of distributed analytic pipelines.

The project will systematically integrate data and meta-data across different tool chains, without requiring that the data be ingested. The initial application focus is on Identity Resolution; however, the architecture will also support additional analytic processes by allowing distributed analytic processes to leverage tools from multiple vendors.

This project consists of a number of interrelated sub-projects, each focusing on a different aspect of the solution within the overall architecture; the largest of these is the federated data platform. Some of the sub-projects are carried out by post-doctoral fellows and PhD candidates associated with the KSE lab.

Base Architecture

Underlying the project’s different aspects is an architecture that will provide a comprehensive data management framework. This framework incorporates and relies on a well-defined shared data and meta-data model, supported by vendor-agnostic interfaces for data access and for process pipelines that combine the analytic services of different tools.

Data Integration

The work on data integration involves the creation of executable mappings between the architecture’s (meta-)data model and the data models and APIs of individual systems. Moving beyond present ETL (Extract, Transform, Load) and data access approaches, this work will develop bi-directional mappings. Such mappings will allow the propagation of updates to and from the federated knowledge hub. Model-driven approaches will be applied to the development of meta-models and mappings, allowing for the early detection and semi-automated resolution of mapping problems: issues that are typically left for manual resolution by software engineers.
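The idea of a bi-directional mapping with early problem detection can be illustrated with a minimal Python sketch. It is not the project's implementation: the class names, the field names (SURNAME, family_name and so on), and the round-trip check are all hypothetical, chosen only to show how a declarative mapping can be executed in both directions and checked for information loss before it is deployed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FieldMapping:
    """One declarative rule linking a source field to a hub field (names are hypothetical)."""
    source_field: str
    hub_field: str
    to_hub: Callable[[str], str]      # conversion applied when reading from the source
    to_source: Callable[[str], str]   # conversion applied when writing back to the source


class BidirectionalMapping:
    def __init__(self, fields: List[FieldMapping]) -> None:
        self.fields = list(fields)

    def forward(self, record: Dict[str, str]) -> Dict[str, str]:
        """Translate a source record into the shared hub model."""
        return {f.hub_field: f.to_hub(record[f.source_field])
                for f in self.fields if f.source_field in record}

    def backward(self, entity: Dict[str, str]) -> Dict[str, str]:
        """Translate a hub entity back into the source system's model."""
        return {f.source_field: f.to_source(entity[f.hub_field])
                for f in self.fields if f.hub_field in entity}

    def round_trip_problems(self, record: Dict[str, str]) -> List[str]:
        """Return the source fields whose values are lost or altered by a round trip.

        Fields with no mapping rule are reported too, since they cannot be
        propagated through the hub.
        """
        restored = self.backward(self.forward(record))
        return [name for name, value in record.items() if restored.get(name) != value]


# Hypothetical mapping: the source stores upper-case surnames, the hub stores title case.
PERSON_MAPPING = BidirectionalMapping([
    FieldMapping("SURNAME", "family_name", str.title, str.upper),
    FieldMapping("DOB", "date_of_birth", str, str),
])

hub_entity = PERSON_MAPPING.forward({"SURNAME": "SMITH", "DOB": "1980-01-02"})
clean = PERSON_MAPPING.round_trip_problems({"SURNAME": "SMITH", "DOB": "1980-01-02"})
lossy = PERSON_MAPPING.round_trip_problems({"SURNAME": "McDonald", "DOB": "1980-01-02"})
```

In this sketch the first record survives the round trip, while the mixed-case surname does not, so the mapping problem is reported by the check instead of being discovered later during manual debugging.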

Meta-data, Security, and Provenance

The federated architecture will involve a modular meta-data model supporting the capture of provenance, security, confidence, and other meta-data related to entities and links. The meta-layer will present a Knowledge Graph-like view of the federated data hub that will form the foundation for information use, governance, data quality protocols, analytic pipelines, and the exploration and justification of results.
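As a rough illustration of meta-data attached to links in a graph-like view, the following Python sketch stores provenance, confidence, and a security label with every link and uses them to justify a result. The structure, the attribute names, and the source system names are assumptions made for the example; the project's actual meta-data model is not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Meta:
    """Meta-data carried by a link (attribute names are hypothetical)."""
    source: str          # system that asserted the link (provenance)
    confidence: float    # 0.0 to 1.0
    classification: str  # security label


@dataclass(frozen=True)
class Link:
    subject: str
    predicate: str
    obj: str
    meta: Meta


class MetaGraph:
    """A tiny knowledge-graph-like store in which every link carries meta-data."""

    def __init__(self) -> None:
        self._links: List[Link] = []

    def add(self, subject: str, predicate: str, obj: str, meta: Meta) -> None:
        self._links.append(Link(subject, predicate, obj, meta))

    def justify(self, subject: str, predicate: str) -> List[Tuple[str, str, float]]:
        """List (value, source, confidence) for matching links, most confident first."""
        matches = [l for l in self._links
                   if l.subject == subject and l.predicate == predicate]
        matches.sort(key=lambda l: l.meta.confidence, reverse=True)
        return [(l.obj, l.meta.source, l.meta.confidence) for l in matches]


graph = MetaGraph()
graph.add("person:1", "lives_at", "12 Example St", Meta("system_a", 0.6, "open"))
graph.add("person:1", "lives_at", "7 Sample Rd", Meta("system_b", 0.9, "restricted"))

evidence = graph.justify("person:1", "lives_at")
```

Here `evidence` lists both conflicting addresses together with the system that asserted each one, which is the kind of information a user would need in order to explore and justify a result.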

Analytic Processes

The open federated architecture will allow users to construct complex analytic pipelines that seamlessly blend services offered by different tools and vendors. This will give users access to multiple specialised algorithms where appropriate, supporting distributed workflows and the special needs of end-user groups. The resulting orchestration platform will make use of the meta-data layer to support security mechanisms and access controls for individual data sources. Moreover, the platform will allow users to work with familiar languages and tools based on their experience and needs, while avoiding vendor lock-in and providing broader capabilities through a diverse array of tools.
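A minimal Python sketch of such a pipeline is shown below. It chains two steps that stand in for services from different vendors and filters the input by security label before any step runs. All names (the vendors, the steps, the classification field) are hypothetical, and the simple exact-match merge is only a stand-in for a real Identity Resolution service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Record = Dict[str, str]


@dataclass(frozen=True)
class Step:
    """One pipeline stage wrapping a service from some vendor (names are hypothetical)."""
    name: str
    vendor: str
    run: Callable[[List[Record]], List[Record]]


def normalise_names(records: List[Record]) -> List[Record]:
    """Collapse repeated whitespace and lower-case the name field."""
    return [{**r, "name": " ".join(r["name"].split()).lower()} for r in records]


def merge_duplicates(records: List[Record]) -> List[Record]:
    """Keep the first record for each (name, dob) pair; a naive stand-in for identity resolution."""
    seen: Dict[tuple, Record] = {}
    for r in records:
        seen.setdefault((r["name"], r["dob"]), r)
    return list(seen.values())


def execute(steps: List[Step], records: List[Record], clearances: Set[str]) -> List[Record]:
    """Apply access control using the records' meta-data, then run each step in order."""
    visible = [r for r in records if r["classification"] in clearances]
    for step in steps:
        visible = step.run(visible)
    return visible


PIPELINE = [
    Step("normalise", "vendor_a", normalise_names),
    Step("merge", "vendor_b", merge_duplicates),
]

RECORDS = [
    {"name": "Jane  Doe", "dob": "1980-01-02", "classification": "open"},
    {"name": "jane doe", "dob": "1980-01-02", "classification": "open"},
    {"name": "John Roe", "dob": "1975-05-05", "classification": "restricted"},
]

open_view = execute(PIPELINE, RECORDS, {"open"})
full_view = execute(PIPELINE, RECORDS, {"open", "restricted"})
```

A user cleared only for open data sees a single merged record, while a user with both clearances sees two. Because each step only depends on the shared record shape, either vendor's service could be swapped out without changing the rest of the pipeline.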