Data Lineage

Overview

An analyst flags that a risk score in a case file looks wrong. The score came from an automated enrichment pipeline, which drew from three separate data sources, one of which was updated by a batch import two weeks ago. Without lineage tracking, finding the root cause means manually tracing back through pipeline logs, schema histories, and source system change records, a process that typically takes hours and sometimes turns up nothing conclusive. With lineage tracking, the same investigation takes minutes: follow the dependency graph from the symptom back to the source, and the problem is visible.

Data Lineage in Argus captures the end-to-end movement of data through the platform: from source connectors through ingestion pipelines, transformation stages, and enrichment steps, and into the platform record store and the entity graph. It tracks column-level dependencies, surfaces impact analysis before schema changes are made, and generates the compliance documentation that intelligence agencies, financial crime units, healthcare data controllers, and government registries need for regulatory audit responses. When data moves across systems, this module knows where it came from, what happened to it, and where it went.

Key Features

Automated Lineage Capture: Automatically discover and track data lineage across databases, ETL tools, BI dashboards, notebooks, and ML workflows without manual documentation. Lineage is built as data flows, not after the fact.
Multi-Level Visualisation: Explore data flows at the dataset, column, transformation, and field level with interactive graph views, filtering, and drill-down navigation.
Impact Analysis: Simulate the effects of schema changes, query modifications, or pipeline deletions to identify all affected downstream systems before making changes. Stakeholders are notified automatically.
Root Cause Tracing: Trace data quality issues from symptoms in reports back to their source. What would take hours of manual investigation is completed in minutes using the dependency graph.
Compliance and Regulatory Lineage: Generate automated audit trails and lineage documentation for GDPR, HIPAA, SOX, and sector-specific requirements. Data subject rights tracking (right of access, right to erasure) is supported through full lifecycle visibility.
Lineage-Based Access Control: Apply fine-grained access controls based on data sensitivity that automatically propagate through the lineage graph, ensuring consistent protection from source to consumption.
Time Travel: View historical lineage snapshots to understand how data flows have changed over time. Useful for post-incident investigation and regulatory review.
Stakeholder Notification: Automatically notify owners of downstream systems when upstream changes are proposed, with integrated approval workflows before changes are deployed.
Sensitive Data Tracking: Tag and trace PII and sensitive data fields through their entire lifecycle, tracking encryption, masking, and cross-border transfers at the field level.
Quality Monitoring Checkpoints: Place quality checks along lineage paths to detect issues early and prevent propagation to downstream consumers.
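
Impact Analysis and Root Cause Tracing are both traversals of the same dependency graph, in opposite directions. The following is a minimal sketch of that idea, not the Argus API: the `LineageGraph` class, its method names, and the column identifiers are all hypothetical, chosen to mirror the risk-score scenario in the Overview.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy column-level lineage graph (hypothetical, not the Argus API)."""

    def __init__(self):
        self.downstream = defaultdict(set)  # column -> columns derived from it
        self.upstream = defaultdict(set)    # column -> columns it is derived from

    def add_edge(self, source, target):
        """Record that `target` is computed from `source`."""
        self.downstream[source].add(target)
        self.upstream[target].add(source)

    def _reach(self, start, adjacency):
        """Breadth-first search returning every node reachable from `start`."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)  # only relevant if the graph contains a cycle
        return seen

    def impact_of(self, column):
        """Impact analysis: everything downstream of `column`."""
        return self._reach(column, self.downstream)

    def root_causes_of(self, column):
        """Root cause tracing: upstream columns that have no parents of their own."""
        ancestors = self._reach(column, self.upstream)
        return {n for n in ancestors if not self.upstream.get(n)}


graph = LineageGraph()
graph.add_edge("sanctions_feed.name", "enriched.match_score")
graph.add_edge("registry_import.address", "enriched.match_score")
graph.add_edge("transactions.amount", "enriched.velocity")
graph.add_edge("enriched.match_score", "case_file.risk_score")
graph.add_edge("enriched.velocity", "case_file.risk_score")
graph.add_edge("case_file.risk_score", "dashboard.high_risk_cases")

print(sorted(graph.root_causes_of("case_file.risk_score")))
# ['registry_import.address', 'sanctions_feed.name', 'transactions.amount']
print(sorted(graph.impact_of("registry_import.address")))
# ['case_file.risk_score', 'dashboard.high_risk_cases', 'enriched.match_score']
```

Keeping both adjacency maps makes each direction a single search, at the cost of storing every edge twice. A production lineage store would presumably also attach transformation metadata to each edge, which this sketch omits.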

Use Cases

Regulatory Audit Response: Respond to audits quickly by generating complete data lineage reports showing how data flows from source to destination, with full transformation history and access controls documented automatically. What once took days of manual evidence gathering is produced on demand.
Safe Schema Evolution: Before modifying database schemas or transformation logic, run impact analysis to identify exactly which dashboards, reports, ML models, and downstream pipelines will be affected, and notify all stakeholders before the change is made.
Data Quality Investigation: When a report shows incorrect data, trace the lineage backwards through every transformation and source to pinpoint the root cause in minutes, rather than after hours of log trawling.
Privacy Compliance: Track personal data across all systems to fulfil GDPR right-to-access and right-to-erasure requests, with automated deletion propagation verification across every storage location.
Change Management: Establish governance workflows where proposed data changes are reviewed with full impact analysis, stakeholder sign-off, and post-change monitoring before being applied to production.
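
The Privacy Compliance use case depends on one check: every storage location that personal data has reached must confirm deletion. A minimal sketch of that verification follows, assuming lineage is available as a list of (source, destination) store pairs. The function names and store identifiers are hypothetical and do not reflect the platform's actual interface.

```python
def reachable_stores(edges, origin):
    """Return every store that data from `origin` may have flowed to, including `origin`."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    seen, stack = {origin}, [origin]
    while stack:  # depth-first walk over downstream stores
        for nxt in adjacency.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def verify_erasure(edges, origin, confirmed_deleted):
    """Return the stores that still need to confirm deletion (empty list = complete)."""
    required = reachable_stores(edges, origin)
    return sorted(required - set(confirmed_deleted))


edges = [
    ("crm.customers", "warehouse.customers"),
    ("warehouse.customers", "ml.features"),
    ("warehouse.customers", "bi.extract"),
    ("crm.orders", "warehouse.orders"),  # unrelated flow, must not be flagged
]

outstanding = verify_erasure(
    edges,
    origin="crm.customers",
    confirmed_deleted={"crm.customers", "warehouse.customers", "ml.features"},
)
print(outstanding)  # ['bi.extract']
```

An empty result is the evidence an auditor would want: no sink downstream of the origin is left unconfirmed. The sketch treats lineage at store level only; field-level tracking would narrow the set of stores that actually hold the data subject's record.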

Integration

Data Lineage integrates with the platform's ETL pipelines, transformation engine, data quality module, and entity graph to capture lineage automatically as data flows through the system. Lineage metadata is stored in the platform record store alongside operational data. The module connects with more than 40 data tools, including data warehouses, ETL platforms, BI tools, and ML frameworks, providing unified lineage visibility regardless of technology stack.
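
One common way to capture lineage as data flows, rather than documenting it afterwards, is to wrap each transformation step so that running it records its inputs and outputs. The sketch below illustrates that pattern only. `LineageRecorder`, `tracked`, and the dataset names are hypothetical, and nothing here is taken from the platform's own integration layer.

```python
import functools
from datetime import datetime, timezone


class LineageRecorder:
    """Collects one lineage record per successful run of a tracked step (hypothetical)."""

    def __init__(self):
        self.events = []

    def tracked(self, inputs, outputs):
        """Decorator factory: declare which datasets a step reads and writes."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                result = func(*args, **kwargs)
                # Reached only if the step did not raise.
                self.events.append({
                    "step": func.__name__,
                    "inputs": list(inputs),
                    "outputs": list(outputs),
                    "recorded_at": datetime.now(timezone.utc).isoformat(),
                })
                return result
            return wrapper
        return decorator


recorder = LineageRecorder()


@recorder.tracked(inputs=["raw.transactions"], outputs=["clean.transactions"])
def drop_negative_amounts(rows):
    """Example transformation: discard rows with a negative amount."""
    return [r for r in rows if r["amount"] >= 0]


cleaned = drop_negative_amounts([{"amount": 10}, {"amount": -5}])
print(cleaned)                     # [{'amount': 10}]
print(recorder.events[0]["step"])  # drop_negative_amounts
```

The step author declares inputs and outputs once, and every run then yields a lineage record without further effort. Recording failed runs as well would need a `try`/`except` around the call, which is left out here for brevity.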

Open Standards

OpenLineage Specification: Pipeline lineage events are emitted using the OpenLineage Job/Run/Dataset/RunEvent model (Linux Foundation project, https://openlineage.io/spec), with START, COMPLETE, FAIL, and ABORT lifecycle transitions persisted per pipeline job run.
W3C PROV-DM (Provenance Data Model, 2013): Entity-level provenance records implement the core W3C PROV-DM concepts of prov:Entity, prov:Activity, and prov:Agent, including PROV_WAS_GENERATED_BY and PROV_WAS_DERIVED_FROM relationship types for dependency graph traversal.
W3C PROV-JSON / PROV-JSON-LD: Provenance chains are exportable as W3C PROV-JSON or PROV-JSON-LD serialisations (see https://www.w3.org/TR/prov-json/), enabling interoperability with external provenance-aware tools and audit consumers.
OAuth 2.0 and JWT Bearer Token: Token-based authentication protects typed, auditable read and write workflows across the platform.
ISO 8601: All lineage event timestamps, provenance node times, and pipeline run durations are recorded and returned as ISO 8601 strings, ensuring consistent temporal representation across audit exports and regulatory reports.
GDPR (Regulation (EU) 2016/679), Articles 17 and 20: The lineage graph underpins right-to-erasure (Article 17) compliance by tracking every storage location personal data has flowed to, enabling automated propagation of deletion requests and verification that erasure is complete across all sinks.
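
To make the OpenLineage and ISO 8601 entries concrete, the sketch below builds a pair of run events as plain dictionaries using the top-level fields of the OpenLineage RunEvent model (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`). It does not use any client library. The producer URI, namespace, job name, and dataset names are hypothetical, the `schemaURL` version is illustrative and should be checked against the spec release in use, and the set of accepted event types is limited to the four transitions named above.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/argus-lineage-sketch"  # hypothetical producer URI
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent"  # illustrative version
LIFECYCLE = {"START", "COMPLETE", "FAIL", "ABORT"}  # transitions named in this document


def run_event(event_type, run_id, job_name, inputs=(), outputs=(), namespace="argus-demo"):
    """Build an OpenLineage-style RunEvent dictionary with an ISO 8601 UTC timestamp."""
    if event_type not in LIFECYCLE:
        raise ValueError(f"unsupported event type: {event_type}")
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


# One run of a job shares a single runId across its lifecycle events.
run_id = str(uuid.uuid4())
start = run_event("START", run_id, "enrich_risk_scores", inputs=["raw.cases"])
complete = run_event(
    "COMPLETE", run_id, "enrich_risk_scores",
    inputs=["raw.cases"], outputs=["case_file.risk_score"],
)

print(json.dumps(complete, indent=2))
```

Because both events carry the same `runId`, a consumer can pair the START and COMPLETE records and derive the run duration from the two `eventTime` values.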

Last Reviewed: 2026-02-23 Last Updated: 2026-04-14