Data Quality Monitoring

Overview

Silent feed degradation is one of the costliest failure modes in intelligence operations: bad batches contaminate the investigation graph long before any analyst notices. Data Quality Monitoring runs statistical drift analysis after every connector ingestion batch, comparing the incoming data against a stable reference baseline and raising a Human-in-the-Loop review request the moment distributional change exceeds a configurable threshold. The result is a continuous, automated quality gate that catches upstream schema changes, missing-value surges, and structural anomalies at the ingestion boundary, before they can affect investigation quality.

Key Features

Per-Run Drift Computation: After every successful ingestion batch the platform computes a full distributional drift report comparing the batch against the connector's reference baseline. The report covers every column in the feed, applying Population Stability Index (PSI), Jensen-Shannon divergence, or Wasserstein distance according to column type. A single share-of-drifted-columns score summarises each run at a glance.
Data Quality Metrics: Alongside drift detection, the platform evaluates data completeness, flagging missing-value rates, constant columns, and out-of-range values. Alert counts are stored next to the drift score so analysts can distinguish distributional shift from structural data problems.
Automatic Baseline Capture: The first ingestion run for a new connector automatically establishes a reference baseline. All subsequent runs compare against it. Authorised users can reset the baseline through the administration interface after confirming that a legitimate upstream data-model change has occurred.
Human-in-the-Loop Pausing on Threshold Breach: When a drift score exceeds the configured threshold (default 0.3), the platform creates a review request in the Human-in-the-Loop queue and holds the connector until an analyst approves resumption or rejects the batch. Severity is graded: HIGH for scores above 0.6 and MEDIUM for scores above 0.3 up to and including 0.6.
Interactive HTML Report Access: Full drift and quality reports are stored in object storage after each run. Analysts access them through time-limited download links directly from the administration dashboard, with no requirement for data-science tooling or backend access. Links remain valid for seven days.
Fault-Tolerant Background Execution: Quality monitoring runs as a non-blocking background task. If the quality analysis library is unavailable or encounters an unexpected error, the ingestion job still completes normally. A failure in monitoring itself therefore degrades gracefully and never blocks intelligence feeds; only a detected threshold breach holds a connector.
Per-Tenant Data Sovereignty: Every quality report and drift baseline is strictly scoped to the requesting organisation and is never accessible across tenant boundaries.
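
The scoring and severity rules in the list above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: it assumes the drift score is the share-of-drifted-columns figure, that a score exactly equal to a boundary falls into the lower grade, and all function and constant names are hypothetical.

```python
from typing import Mapping, Optional

DRIFT_THRESHOLD = 0.3  # default review threshold stated in the feature list
HIGH_SEVERITY = 0.6    # scores above this value are graded HIGH


def drift_share(column_drifted: Mapping[str, bool]) -> float:
    """Return the share of columns flagged as drifted, in the range 0 to 1."""
    if not column_drifted:
        return 0.0
    return sum(column_drifted.values()) / len(column_drifted)


def grade(score: float, threshold: float = DRIFT_THRESHOLD) -> Optional[str]:
    """Return None when no review is needed, otherwise 'MEDIUM' or 'HIGH'.

    Boundary handling (<= threshold means no review) is an assumption.
    """
    if score <= threshold:
        return None
    return "HIGH" if score > HIGH_SEVERITY else "MEDIUM"


# Hypothetical per-column drift flags for one ingestion batch.
flags = {"name": True, "country": False, "created_at": True, "notes": False}
score = drift_share(flags)
severity = grade(score)
```

With two of four columns flagged, the score is 0.5, which breaches the default threshold and would be graded MEDIUM under these assumptions.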

Use Cases

Detecting Feed Degradation Before Graph Contamination

Run automated drift detection on every OSINT connector batch so that structural changes in upstream data models are caught at the ingestion boundary, before malformed entities or broken relationships reach the investigation graph and corrupt analyst work products.

Monitoring Entity Extraction Quality Over Time

Track quality-score trends across weeks of ingestion for connectors that produce entity descriptions, notes, and text fields. A gradual increase in missing values or a spike in constant-column alerts can indicate that an upstream provider has changed their output schema or begun throttling content.

Confirming Legitimate Data Model Changes

When an upstream OSINT provider publishes a schema update, analysts use the HTML drift report to verify that the observed change is attributable to the known update rather than an underlying data quality problem. After confirmation, they reset the baseline so future runs compare against the new structure.

Regulatory Evidence for Data Provenance

The per-run drift scores and quality metrics stored with ISO 8601 timestamps provide a complete, timestamped record of ingestion data quality for each connector. This supports audit-trail and data-lineage requirements for intelligence provenance under applicable compliance frameworks.

Integration

Customers and developers interact with Data Quality Monitoring entirely through the platform's typed integration layer, authenticated with OAuth 2.0 bearer tokens. No direct database or storage access is required.

Querying run reports: The data-quality query namespace exposes per-connector run history and an aggregated quality dashboard showing average drift score and trend direction (STABLE, IMPROVING, or DEGRADING) over a configurable time window. Both operations require the data_quality:read permission scope.
Resetting the baseline: A dedicated write workflow allows authorised users (requiring data_quality:admin scope) to replace a connector's reference baseline with the most recent ingestion data, confirming that an observed drift represents an acceptable upstream change.
Human-in-the-Loop queue: Threshold breaches automatically create review requests in the shared HITL queue, which surfaces alongside changepoint and causal-discovery alerts in the administration review interface. No additional wiring is needed.
Object storage report links: The API returns time-limited presigned URLs for each HTML report. Integrating systems can surface these links in their own dashboards or embed them in analyst notification workflows without any additional storage credentials.
Connector registry: Quality monitoring uses the connector and run identifiers already present in the connector registry. There is no separate enrolment step; every connector begins receiving quality monitoring from its first ingestion run.
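
The dashboard aggregation described above (average drift score plus a trend direction over a time window) could be approximated as follows. The source does not say how the trend is derived, so the half-window comparison, the `tolerance` value, and the `summarise_runs` name are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import fmean


def summarise_runs(runs, window_start, window_end, tolerance=0.05):
    """Summarise (iso_timestamp, drift_score) pairs inside a time window.

    Assumed rule: compare the mean score of the later half of the window
    with the earlier half; a rise beyond `tolerance` is DEGRADING, a fall
    beyond it is IMPROVING, anything else is STABLE.
    """
    in_window = sorted(
        (datetime.fromisoformat(ts), score)
        for ts, score in runs
        if window_start <= datetime.fromisoformat(ts) <= window_end
    )
    scores = [score for _, score in in_window]
    if not scores:
        return None  # no runs in the window, nothing to report
    average = fmean(scores)
    if len(scores) < 2:
        return {"average_drift": average, "trend": "STABLE"}
    mid = len(scores) // 2
    delta = fmean(scores[mid:]) - fmean(scores[:mid])
    if delta > tolerance:
        trend = "DEGRADING"
    elif delta < -tolerance:
        trend = "IMPROVING"
    else:
        trend = "STABLE"
    return {"average_drift": average, "trend": trend}


# Hypothetical run history with ISO 8601 timestamps.
runs = [
    ("2026-04-01T00:00:00+00:00", 0.1),
    ("2026-04-02T00:00:00+00:00", 0.1),
    ("2026-04-03T00:00:00+00:00", 0.4),
    ("2026-04-04T00:00:00+00:00", 0.4),
    ("2026-05-01T00:00:00+00:00", 0.9),  # outside the window below
]
start = datetime.fromisoformat("2026-04-01T00:00:00+00:00")
end = datetime.fromisoformat("2026-04-30T00:00:00+00:00")
summary = summarise_runs(runs, start, end)
```

Higher drift scores are treated as worse here, so a rising average maps to DEGRADING.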

Open Standards

Evidently AI (Apache 2.0): Open-source data and ML pipeline monitoring framework that provides the distributional drift and data quality metric computations, including the HTML report format consumed by analysts.
Population Stability Index (PSI): Symmetrised Kullback-Leibler divergence computed over binned distributions, widely used in data monitoring to measure distributional shift between a reference population and a current batch; applied to categorical and low-cardinality numerical columns.
Jensen-Shannon Divergence: Symmetric divergence measure derived from KL divergence, bounded between 0 and 1 when base-2 logarithms are used; applied to numerical columns where PSI is not appropriate.
Wasserstein Distance (Earth Mover's Distance): Measures the minimum transport cost between two probability distributions; applied to continuous numerical columns to capture how far values have moved, which bin-based divergence measures can miss.
ISO 8601: All run timestamps and time-window query parameters use ISO 8601 date-time format, enabling unambiguous interoperability with downstream SIEM, SOAR, and audit systems.
STANAG 4774 (Confidentiality Metadata Label Standard): Post-ingestion classification metadata is applied to each ingestion event before quality monitoring records are persisted, ensuring that data-quality provenance records carry the same classification labels as the underlying intelligence data.
W3C PROV-DM (Provenance Data Model): Ingestion events are recorded as provenance entities under the W3C PROV Data Model, linking each quality run report to the originating ingestion activity for audit-trail and data-lineage traceability.
OAuth 2.0 (RFC 6749) and JWT (RFC 7519): Programmatic access is authenticated with OAuth 2.0 bearer tokens in JSON Web Token format, protecting typed, auditable read and write workflows across the platform; permission scopes (data_quality:read, data_quality:admin) are enforced on every workflow handler.
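
For readers unfamiliar with the three drift measures listed above, the following standard-library sketch shows their textbook definitions. It is not the code path used by Evidently or the platform; binning, smoothing (`eps`), and any per-column thresholds are assumptions left to the caller.

```python
import math
from bisect import bisect_right


def psi(expected, actual, eps=1e-6):
    """Population Stability Index over matching bin proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result is in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(u, v):
    """1-D Wasserstein distance: area between the two empirical CDFs."""
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


# Illustrative inputs only: bin proportions for PSI and JS, raw samples
# for Wasserstein.
reference_bins = [0.5, 0.5]
current_bins = [0.9, 0.1]
psi_value = psi(reference_bins, current_bins)
js_value = js_divergence(reference_bins, current_bins)
shift = wasserstein_1d([0, 1, 2], [1, 2, 3])
```

The last call illustrates the point about Wasserstein distance: shifting every sample by one unit yields a distance of one unit, whereas a bin-based measure only sees that mass moved between bins, not how far.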

Security and Compliance

All data quality records are organisation-scoped at the database level, satisfying per-tenant data sovereignty requirements. Permission checks are enforced on every governed workflow independently of authentication, so a valid token without the appropriate scope cannot read or mutate quality data belonging to any organisation. HTML report links are time-limited and generated on demand; no persistent public URLs are created. The fault-tolerant execution model ensures that a quality monitoring failure never causes data to be silently skipped: the ingestion run completes, but the absence of a quality record for that run is itself a detectable signal for operational monitoring.
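
Because a missing quality record is described as a detectable signal, an operational check could compare completed ingestion runs against the runs that have a quality record. The sketch below assumes both sets of run identifiers have already been fetched; the function name and identifier format are hypothetical.

```python
def runs_missing_quality_record(completed_run_ids, quality_record_run_ids):
    """Return completed run IDs that have no quality record, in input order."""
    recorded = set(quality_record_run_ids)
    return [run_id for run_id in completed_run_ids if run_id not in recorded]


# Hypothetical identifiers for illustration.
completed = ["run-001", "run-002", "run-003"]
with_quality = ["run-001", "run-003"]
missing = runs_missing_quality_record(completed, with_quality)
```

Any identifier returned here marks a run where monitoring failed or was skipped and is worth raising to operations.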

Last Reviewed: 2026-04-14 Last Updated: 2026-04-14
