
Knowledge graph

LBruyndonckx edited this page Jan 25, 2022 · 4 revisions

Module Status

Status
Module Owner: Bart Maertens
Module Status: Basics in place, needs cleanup and refactoring
Jira board

Module Scope

Problem Definition
Problem statement: Build a modular system to onboard metadata about a Hop project, its execution logging, infrastructure and connected systems (git, database schemas, etc.) into a Neo4j database.
Impact of problem: Allow project teams to follow how data flows through an organization and to perform impact analysis.
Brief description of the solution: A modular framework needs to be built to allow data from a variety of sources to be connected in one overarching Neo4j graph. This graph needs to be updated frequently, ideally on each commit. A number of queries for anomaly detection and trend analysis need to be run against it. Based on the results of these queries, one or more alarms may need to be raised.
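
The modular idea above can be sketched in plain Python, without a running Neo4j instance. This is only an illustration, not the module's actual design: the `MetadataSource` interface, the `LineageGraph` class, the node labels, the relationship types and all sample values (file names, table names, the commit id) are assumptions made up for the example. Each source contributes edges, and the graph deduplicates them so that repeated loads leave the result unchanged, similar in spirit to Cypher's `MERGE`.

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Node:
    label: str  # hypothetical label, e.g. "Pipeline" or "Table"
    key: str    # unique name within the label


@dataclass(frozen=True)
class Edge:
    source: Node
    rel: str    # hypothetical relationship type, e.g. "READS_FROM"
    target: Node


class MetadataSource(Protocol):
    """One pluggable module: yields the edges it knows about."""

    def edges(self) -> Iterable[Edge]: ...


@dataclass
class LineageGraph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def load(self, source: MetadataSource) -> None:
        # Sets deduplicate, so loading the same source twice changes nothing.
        for edge in source.edges():
            self.nodes.update((edge.source, edge.target))
            self.edges.add(edge)


class HopProjectSource:
    """Stand-in for a module that reads Hop project metadata (sample data)."""

    def edges(self):
        pipeline = Node("Pipeline", "load_customers.hpl")
        yield Edge(pipeline, "READS_FROM", Node("Table", "crm.customers"))


class GitSource:
    """Stand-in for a module that reads git history (sample data)."""

    def edges(self):
        commit = Node("Commit", "abc123")
        yield Edge(commit, "CHANGES", Node("Pipeline", "load_customers.hpl"))


graph = LineageGraph()
# GitSource is loaded twice on purpose to show that reloading is harmless.
for source in (HopProjectSource(), GitSource(), GitSource()):
    graph.load(source)

print(len(graph.nodes), "nodes,", len(graph.edges), "edges")
```

Both sources refer to the same `Pipeline` node, which is what connects their data in one graph. In a Neo4j-backed version, the `load` step would presumably write to the database instead of to in-memory sets.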

Concept

Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.[wikipedia]

  • Build lineage graph:
    • Read metadata from Hop projects, configuration, execution logs and connected systems like RDBMS, Active Directory, git, Jira, …
    • Configure the lineage code from other repositories; ideally, update the graph on every run of a workflow or pipeline, or on a schedule
  • Process lineage results, perform impact analysis:
    • e.g. what happens when a source database table/column is changed
    • raise alarms when needed, provide aggregated trend results
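
The impact-analysis step above amounts to a graph traversal: start from the changed table or column and collect everything downstream of it. A minimal sketch, assuming a hypothetical in-memory edge map `FEEDS` with made-up names and an arbitrary alarm threshold. In the real module this would presumably be a query against the Neo4j graph.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the items listed in its value.
FEEDS = {
    "crm.customers.email": ["load_customers.hpl"],
    "crm.customers.phone": ["load_customers.hpl"],
    "load_customers.hpl": ["dwh.dim_customer.email"],
    "dwh.dim_customer.email": ["marketing_report.hpl"],
}


def downstream(start, feeds):
    """Return everything reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in feeds.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# What is affected when the source column crm.customers.email changes?
impacted = downstream("crm.customers.email", FEEDS)

# Arbitrary example rule: raise an alarm when more than two items are affected.
ALARM_THRESHOLD = 2
raise_alarm = len(impacted) > ALARM_THRESHOLD

print(impacted, raise_alarm)
```

The `seen` set keeps the traversal from looping if the lineage graph contains cycles.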