## Overview
A pipeline that ingests data and produces no visible audit of how that data flowed — which source fed which transformation, which transformation produced which dataset — is a black box. When an analyst receives an alert tied to a conflated record, the question is always the same: where did this data come from, and which step corrupted it? Without lineage, the answer requires manually tracing through logs, connector configurations, and normalisation code. With lineage, the full flow from raw connector extraction to normalised entity creation is recorded as a directed acyclic graph and displayed as a visual diagram on the pipeline detail page.
The Data Pipeline Lineage module implements the OpenLineage open standard (Linux Foundation, Apache 2.0) to capture START, COMPLETE, and FAIL events at each pipeline stage. Every event is persisted to PostgreSQL, scoped to the organisation, and assembled into a DAG on demand. A canvas-based DAG viewer in the data-pipelines interface shows nodes coloured by role (source, transform, sink), directed edges labelled with the job name and execution duration, and a status indicator on each node reflecting the most recent lifecycle event. Clicking a node exposes its role and status in a detail panel without navigating away from the pipeline view.
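The on-demand DAG assembly described above can be sketched in a few lines. The row shapes and the `assemble_dag` name are illustrative assumptions, not the platform's actual service code:

```python
def assemble_dag(event_rows, edge_rows):
    """Assemble the node/edge graph the viewer renders from persisted lineage rows.

    event_rows: one dict per dataset with its 'role' and the most recent 'event_type'.
    edge_rows:  one dict per derived edge with 'source', 'target', 'job_name', 'duration_ms'.
    """
    # Nodes carry the role (source/transform/sink) and latest lifecycle status.
    nodes = {
        row["dataset"]: {"role": row["role"], "status": row["event_type"]}
        for row in event_rows
    }
    # Edges are directed and labelled with job name and execution duration.
    edges = [
        {
            "from": row["source"],
            "to": row["target"],
            "label": f'{row["job_name"]} ({row["duration_ms"]} ms)',
        }
        for row in edge_rows
    ]
    return {"nodes": nodes, "edges": edges}
```

Given the latest event per dataset and the persisted edges, this yields the payload the viewer needs: a role and status per node, and a labelled directed edge per job.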
Diagram

```mermaid
graph LR
    A[Data Source] --> B[OL START event]
    B --> C[Connector Extraction]
    C --> D[OL COMPLETE event]
    D --> E[Transformation / Normalisation]
    E --> F[OL COMPLETE event]
    F --> G[Entity Creation]
    G --> H[PipelineLineageEvent table]
    H --> I[Lineage DAG Visualisation]
    C --> J[OL FAIL event]
    J --> H
```

Last Reviewed: 2026-04-14 · Last Updated: 2026-04-14
## Key Features
- OpenLineage Standard Events: Every pipeline step emits OpenLineage RunEvents conforming to the OpenLineage Specification (openlineage.io). START events are emitted before connector extraction begins; COMPLETE events are emitted after successful normalisation; FAIL events are emitted on any unhandled exception. All events include the job name, run UUID, input datasets, and output datasets in the standard OpenLineage schema.
- PostgreSQL Lineage Store: Lineage events are persisted to the `pipeline_lineage_events` table and derived edges to `pipeline_lineage_edges`. Both tables carry `organization_id` on every row, enforcing the platform's multi-tenant isolation requirement. Upsert semantics on the edges table ensure that re-runs update duration without duplicating edge records.
- DAG Query API: Two GraphQL queries expose lineage data to the frontend. `getPipelineLineageDag(runId)` assembles the full node/edge graph for a specific run by joining the events and edges tables. `listRecentPipelineRuns(limit)` returns the most recent lineage events for the current organisation, enabling a historical run list. Both queries are protected by `IsAuthenticated` and scoped to `organization_id`.
- Canvas DAG Viewer: The lineage viewer renders nodes and edges on an HTML Canvas element with no third-party visualisation dependency. Nodes are colour-coded by role: blue for source datasets, orange for transform steps, green for sink datasets. A status dot on each node reflects the most recent event type (green = COMPLETE, red = FAIL, blue = START). Edges show job names and execution duration in milliseconds.
- Non-Breaking Instrumentation: All OpenLineage emit calls in the ingestion pipeline are wrapped in isolated try/except blocks. A failure to write a lineage event never interrupts the ingestion flow; errors are logged at DEBUG level and the pipeline continues. This means the feature can be deployed without risk to existing data ingestion reliability.
- Organisation-Scoped Tenant Isolation: Every read and write operation in the lineage service includes `organization_id` in the SQL WHERE clause. The GraphQL resolvers extract `organization_id` from the authenticated user context before querying. Cross-tenant lineage data is structurally impossible to access through the API.
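The event shape and the failure-isolation rule from the list above can be sketched together. `safe_emit` and the in-memory `store` below are hypothetical stand-ins for the platform's PostgreSQL writer; the dict fields follow the OpenLineage RunEvent model (eventType, eventTime, run, job, inputs, outputs, producer), with facets omitted for brevity:

```python
import logging
import uuid
from datetime import datetime, timezone

log = logging.getLogger(__name__)

def build_run_event(event_type, job_name, run_id=None, inputs=(), outputs=()):
    """Minimal OpenLineage-style RunEvent; namespace and producer are placeholders."""
    return {
        "eventType": event_type,  # START | COMPLETE | FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(run_id or uuid.uuid4())},
        "job": {"namespace": "ingestion", "name": job_name},
        "inputs": [{"namespace": "ingestion", "name": n} for n in inputs],
        "outputs": [{"namespace": "ingestion", "name": n} for n in outputs],
        "producer": "https://example.org/pipeline",  # placeholder producer URI
    }

def safe_emit(store, event):
    """Never let lineage persistence break ingestion: log at DEBUG and continue."""
    try:
        store.append(event)  # stand-in for the PostgreSQL write
    except Exception:
        log.debug("lineage emit failed", exc_info=True)
```

The isolated try/except is what makes the instrumentation non-breaking: a lineage-store outage degrades only the audit trail, never the ingestion run itself.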
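The upsert semantics on the edges table can be demonstrated with SQLite standing in for PostgreSQL, since both support `INSERT ... ON CONFLICT ... DO UPDATE`; the column set and uniqueness key below are simplified assumptions about the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pipeline_lineage_edges (
        organization_id TEXT NOT NULL,
        source          TEXT NOT NULL,
        target          TEXT NOT NULL,
        job_name        TEXT NOT NULL,
        duration_ms     INTEGER,
        UNIQUE (organization_id, source, target, job_name)
    )
""")

# On re-run, the same edge key updates duration instead of inserting a duplicate.
upsert = """
    INSERT INTO pipeline_lineage_edges
        (organization_id, source, target, job_name, duration_ms)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (organization_id, source, target, job_name)
    DO UPDATE SET duration_ms = excluded.duration_ms
"""
conn.execute(upsert, ("org-1", "raw_feed", "entities", "normalise", 120))
conn.execute(upsert, ("org-1", "raw_feed", "entities", "normalise", 95))  # re-run

rows = conn.execute(
    "SELECT COUNT(*), MAX(duration_ms) FROM pipeline_lineage_edges"
).fetchall()
```

After the second run the table still holds a single edge row, now carrying the latest duration.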
## Use Cases
- Root-Cause Analysis: When a downstream alert or entity record is incorrect, analysts trace the lineage DAG to identify which connector extraction step produced the anomalous value and at which normalisation stage it entered the pipeline.
- Pipeline Audit: Compliance reviews require demonstration that data entered the platform from declared sources. The lineage event log provides a verifiable record of each ingestion run, its inputs, outputs, and outcome.
- Performance Monitoring: Edge duration values (in milliseconds) surface which pipeline steps are slowest, enabling targeted optimisation of connector extractors or transformation logic.
- Failure Investigation: FAIL events include the error message from the ingestion exception, giving operators immediate context about what went wrong without accessing server logs.
## Integration
- Data Pipelines: The lineage DAG viewer appears on the pipeline detail page immediately below the pipeline metadata card, showing lineage for the pipeline's run ID.
- Ingestion Pipeline: The `IngestionCoordinatorService` is instrumented with `OpenLineageService.emit_start`, `emit_complete`, and `emit_fail` calls wrapping the connector extraction and normalisation phases.
- OpenLineage Standard: Events conform to the OpenLineage Specification (https://openlineage.io/spec), an Apache 2.0 open standard governed by the Linux Foundation. The data model is compatible with future integration of any OpenLineage-compatible cataloguing or observability backend.
- PostgreSQL: Lineage events and edges are stored in dedicated tables (`pipeline_lineage_events`, `pipeline_lineage_edges`) in the primary PostgreSQL instance, consistent with the platform's PostgreSQL-first data architecture.
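The emit_start/emit_complete/emit_fail wrapping described above can be sketched as a context manager. The `lineage_run` helper and the service's exact signatures are assumptions based on the method names on this page, not the actual integration code:

```python
import uuid
from contextlib import contextmanager

@contextmanager
def lineage_run(service, job_name):
    """Emit START on entry, COMPLETE on success, FAIL (with the error message) on exception."""
    run_id = uuid.uuid4()
    service.emit_start(job_name, run_id)
    try:
        yield run_id
    except Exception as exc:
        service.emit_fail(job_name, run_id, error=str(exc))
        raise  # re-raise: lineage records the failure but never swallows it
    else:
        service.emit_complete(job_name, run_id)
```

A phase would then be wrapped as `with lineage_run(ol_service, "connector_extract"):`, which guarantees every run ends in exactly one terminal event, COMPLETE or FAIL.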
## Open Standards
- OpenLineage — https://openlineage.io/ — Linux Foundation project, Apache License 2.0