## Overview
A pipeline that ingests data and produces no visible audit of how that data flowed — which source fed which transformation, which transformation produced which dataset — is a black box. When an analyst receives an alert tied to a conflated record, the question is always the same: where did this data come from, and which step corrupted it? Without lineage, the answer requires manually tracing through logs, connector configurations, and normalisation code. With lineage, the full flow from raw connector extraction to normalised entity creation is recorded as a directed acyclic graph and displayed as a visual diagram on the pipeline detail page.
The Data Pipeline Lineage module implements the OpenLineage open standard (Linux Foundation, Apache 2.0) to capture START, COMPLETE, and FAIL events at each pipeline stage. Every event is persisted to PostgreSQL, scoped to the organisation, and assembled into a DAG on demand. A canvas-based DAG viewer in the data-pipelines interface shows nodes coloured by role (source, transform, sink), directed edges labelled with the job name and execution duration, and a status indicator on each node reflecting the most recent lifecycle event. Clicking a node exposes its role and status in a detail panel without navigating away from the pipeline view.
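The on-demand DAG assembly described above can be sketched in a few lines. The row shapes and the `assemble_dag` name are illustrative assumptions, not the platform's actual service code:

```python
def assemble_dag(event_rows, edge_rows):
    """Assemble the node/edge graph the viewer renders from persisted lineage rows.

    event_rows: one dict per dataset with its 'role' and the most recent 'event_type'.
    edge_rows:  one dict per derived edge with 'source', 'target', 'job_name', 'duration_ms'.
    """
    # Nodes carry the role (source/transform/sink) and latest lifecycle status.
    nodes = {
        row["dataset"]: {"role": row["role"], "status": row["event_type"]}
        for row in event_rows
    }
    # Edges are directed and labelled with job name and execution duration.
    edges = [
        {
            "from": row["source"],
            "to": row["target"],
            "label": f'{row["job_name"]} ({row["duration_ms"]} ms)',
        }
        for row in edge_rows
    ]
    return {"nodes": nodes, "edges": edges}
```

Given the latest event per dataset and the persisted edges, this yields the payload the viewer needs: a role and status per node, and a labelled directed edge per job.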
Diagram

```mermaid
graph LR
    A[Data Source] --> B[OL START event]
    B --> C[Connector Extraction]
    C --> D[OL COMPLETE event]
    D --> E[Transformation / Normalisation]
    E --> F[OL COMPLETE event]
    F --> G[Entity Creation]
    G --> H[PipelineLineageEvent table]
    H --> I[Lineage DAG Visualisation]
    C --> J[OL FAIL event]
    J --> H
```

Last Reviewed: 2026-04-14 · Last Updated: 2026-04-14
## Key Features
- OpenLineage Standard Events: Every pipeline step emits OpenLineage RunEvents conforming to the OpenLineage Specification (openlineage.io). START events are emitted before connector extraction begins; COMPLETE events are emitted after successful normalisation; FAIL events are emitted on any unhandled exception. All events include the job name, run UUID, input datasets, and output datasets in the standard OpenLineage schema.
- PostgreSQL Lineage Store: Lineage events are persisted to the `pipeline_lineage_events` table and derived edges to `pipeline_lineage_edges`. Both tables carry `organization_id` on every row, enforcing the platform's multi-tenant isolation requirement. Upsert semantics on the edges table ensure that re-runs update duration without duplicating edge records.
- DAG Query API: Two GraphQL queries expose lineage data to the frontend. `getPipelineLineageDag(runId)` assembles the full node/edge graph for a specific run by joining the events and edges tables. `listRecentPipelineRuns(limit)` returns the most recent lineage events for the current organisation, enabling a historical run list. Both queries are protected by `IsAuthenticated` and scoped to `organization_id`.
- Canvas DAG Viewer: The lineage viewer renders nodes and edges on an HTML Canvas element with no third-party visualisation dependency. Nodes are colour-coded by role: blue for source datasets, orange for transform steps, green for sink datasets. A status dot on each node reflects the most recent event type (green = COMPLETE, red = FAIL, blue = START). Edges show job names and execution duration in milliseconds.
- Non-Breaking Instrumentation: All OpenLineage emit calls in the ingestion pipeline are wrapped in isolated try/except blocks. A failure to write a lineage event never interrupts the ingestion flow; errors are logged at DEBUG level and the pipeline continues. This means the feature can be deployed without risk to existing data ingestion reliability.
- Organisation-Scoped Tenant Isolation: Every read and write operation in the lineage service includes `organization_id` in the SQL WHERE clause. The GraphQL resolvers extract `organization_id` from the authenticated user context before querying. Cross-tenant lineage data is structurally impossible to access through the API.
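The event shape and the failure-isolation rule from the list above can be sketched together. `safe_emit` and the in-memory `store` below are hypothetical stand-ins for the platform's PostgreSQL writer; the dict fields follow the OpenLineage RunEvent model (eventType, eventTime, run, job, inputs, outputs, producer), with facets omitted for brevity:

```python
import logging
import uuid
from datetime import datetime, timezone

log = logging.getLogger(__name__)

def build_run_event(event_type, job_name, run_id=None, inputs=(), outputs=()):
    """Minimal OpenLineage-style RunEvent; namespace and producer are placeholders."""
    return {
        "eventType": event_type,  # START | COMPLETE | FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(run_id or uuid.uuid4())},
        "job": {"namespace": "ingestion", "name": job_name},
        "inputs": [{"namespace": "ingestion", "name": n} for n in inputs],
        "outputs": [{"namespace": "ingestion", "name": n} for n in outputs],
        "producer": "https://example.org/pipeline",  # placeholder producer URI
    }

def safe_emit(store, event):
    """Never let lineage persistence break ingestion: log at DEBUG and continue."""
    try:
        store.append(event)  # stand-in for the PostgreSQL write
    except Exception:
        log.debug("lineage emit failed", exc_info=True)
```

The isolated try/except is what makes the instrumentation non-breaking: a lineage-store outage degrades only the audit trail, never the ingestion run itself.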
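The upsert semantics on the edges table can be demonstrated with SQLite standing in for PostgreSQL, since both support `INSERT ... ON CONFLICT ... DO UPDATE`; the column set and uniqueness key below are simplified assumptions about the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pipeline_lineage_edges (
        organization_id TEXT NOT NULL,
        source          TEXT NOT NULL,
        target          TEXT NOT NULL,
        job_name        TEXT NOT NULL,
        duration_ms     INTEGER,
        UNIQUE (organization_id, source, target, job_name)
    )
""")

# On re-run, the same edge key updates duration instead of inserting a duplicate.
upsert = """
    INSERT INTO pipeline_lineage_edges
        (organization_id, source, target, job_name, duration_ms)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (organization_id, source, target, job_name)
    DO UPDATE SET duration_ms = excluded.duration_ms
"""
conn.execute(upsert, ("org-1", "raw_feed", "entities", "normalise", 120))
conn.execute(upsert, ("org-1", "raw_feed", "entities", "normalise", 95))  # re-run

rows = conn.execute(
    "SELECT COUNT(*), MAX(duration_ms) FROM pipeline_lineage_edges"
).fetchall()
```

After the second run the table still holds a single edge row, now carrying the latest duration.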
## Use Cases
- Root-Cause Analysis: When a downstream alert or entity record is incorrect, analysts trace the lineage DAG to identify which connector extraction step produced the anomalous value and at which normalisation stage it entered the pipeline.
- Pipeline Audit: Compliance reviews require demonstration that data entered the platform from declared sources. The lineage event log provides a verifiable record of each ingestion run, its inputs, outputs, and outcome.
- Performance Monitoring: Edge duration values (in milliseconds) surface which pipeline steps are slowest, enabling targeted optimisation of connector extractors or transformation logic.
- Failure Investigation: FAIL events include the error message from the ingestion exception, giving operators immediate context about what went wrong without accessing server logs.
## Integration
- Data Pipelines: The lineage DAG viewer appears on the pipeline detail page immediately below the pipeline metadata card, showing lineage for the pipeline's run ID.
- Ingestion Pipeline: The `IngestionCoordinatorService` is instrumented with `OpenLineageService.emit_start`, `emit_complete`, and `emit_fail` calls wrapping the connector extraction and normalisation phases.
- OpenLineage Standard: Events conform to the OpenLineage Specification (https://openlineage.io/spec), an Apache 2.0 open standard governed by the Linux Foundation. The data model is compatible with future integration of any OpenLineage-compatible cataloguing or observability backend.
- PostgreSQL: Lineage events and edges are stored in dedicated tables (`pipeline_lineage_events`, `pipeline_lineage_edges`) in the primary PostgreSQL instance, consistent with the platform's PostgreSQL-first data architecture.
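The emit_start/emit_complete/emit_fail wrapping described above can be sketched as a context manager. The `lineage_run` helper and the service's exact signatures are assumptions based on the method names on this page, not the actual integration code:

```python
import uuid
from contextlib import contextmanager

@contextmanager
def lineage_run(service, job_name):
    """Emit START on entry, COMPLETE on success, FAIL (with the error message) on exception."""
    run_id = uuid.uuid4()
    service.emit_start(job_name, run_id)
    try:
        yield run_id
    except Exception as exc:
        service.emit_fail(job_name, run_id, error=str(exc))
        raise  # re-raise: lineage records the failure but never swallows it
    else:
        service.emit_complete(job_name, run_id)
```

A phase would then be wrapped as `with lineage_run(ol_service, "connector_extract"):`, which guarantees every run ends in exactly one terminal event, COMPLETE or FAIL.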
## Open Standards
- OpenLineage — https://openlineage.io/ — Linux Foundation project, Apache License 2.0