{"id":"data-quality-monitoring","slug":"data-quality-monitoring","title":"Data Quality Monitoring","description":"An OSINT intelligence feed looks healthy right up until it starts producing garbage. A social media connector that silently shifts from returning structured entity objects to returning free-text blobs, a flight tracking ","category":"data-integration","tags":["data-integration","ai","compliance"],"lastModified":"2026-04-14","source_ref":"content/modules/data-quality-monitoring.md","url":"/developers/data-quality-monitoring","htmlPath":"/developers/data-quality-monitoring","jsonPath":"/api/docs/modules/data-quality-monitoring","markdownPath":"/api/docs/modules/data-quality-monitoring?format=markdown","checksum":"29b60704283b778661b3ae3e136d978274ab5ff64092bcfdbcbda0f6927e40df","headings":[{"id":"overview","text":"Overview","level":2},{"id":"key-features","text":"Key Features","level":2},{"id":"use-cases","text":"Use Cases","level":2},{"id":"integration","text":"Integration","level":2},{"id":"open-standards","text":"Open Standards","level":2}],"markdown":"# Data Quality Monitoring\n\n## Overview\n\nAn OSINT intelligence feed looks healthy right up until it starts producing garbage. A social media connector that silently shifts from returning structured entity objects to returning free-text blobs, a flight tracking feed that starts omitting tail numbers, or a corporate registry connector whose upstream data model changed three months ago without notice: each of these degrades investigation quality in ways that are invisible to a human analyst until the contamination has already spread through the graph. By the time a false link appears in a briefing package, tracing it back to a bad data batch is slow and expensive.\n\nData Quality Monitoring addresses this by running statistical drift analysis after every connector ingestion batch. 
It compares the current batch against a stable reference baseline using Evidently AI, computing distributional distance metrics across every column in the feed. When the measured drift exceeds a configurable threshold, it creates a Human-in-the-Loop review request and pauses the connector until an analyst confirms the change is expected or rejects it and triggers remediation. Analysts can inspect the full Evidently HTML report stored in R2 without writing a line of code.\n\n```mermaid\ngraph LR\n    A[Connector Ingestion Run] --> B[Evidently Report]\n    B --> C1[DataDriftPreset]\n    B --> C2[DataQualityPreset]\n    C1 --> D[Drift Score]\n    C2 --> D\n    D --> E{Score vs Threshold}\n    E -- Below threshold --> F[Continue Ingestion]\n    E -- Above threshold --> G[HITL Review Request]\n    G --> H[Connector Paused]\n    H --> I[Analyst Reviews HTML Report]\n    I --> J{Decision}\n    J -- Approved --> K[Resume Connector]\n    J -- Rejected --> L[Block Feed]\n    K --> M[Update Baseline Optional]\n```\n\n**Last Reviewed:** 2026-04-14\n**Last Updated:** 2026-04-14\n\n## Key Features\n\n- **Per-Run Drift Computation**: After each successful connector ingestion batch, Evidently AI computes a full DataDriftPreset report comparing the batch against the connector's reference baseline. The report covers every column in the feed, measuring PSI (Population Stability Index), Jensen-Shannon divergence, and Wasserstein distance depending on column type. A single share-of-drifted-columns score summarises the run.\n\n- **Data Quality Metrics**: The DataQualityPreset runs alongside drift detection and flags missing value rates, constant columns, and out-of-range values. Alert counts are stored alongside the drift score so analysts can distinguish between distributional shift and structural data problems.\n\n- **Automatic Baseline Capture**: The first ingestion run for a new connector automatically establishes a reference baseline stored in PostgreSQL. 
Subsequent runs compare against this baseline. Analysts can reset the baseline through the admin interface after confirming that a legitimate upstream data model change has taken place.\n\n- **HITL Pausing on Threshold Breach**: When the drift score exceeds the configured threshold (default 0.3), the service creates a HITL review request and records a paused process identifier tied to the connector. The connector ingestion job is held until an analyst approves resumption or rejects the batch. Severity is HIGH for drift scores above 0.6 and MEDIUM for scores above 0.3 and up to 0.6.\n\n- **R2 HTML Report Storage**: Full Evidently HTML reports are uploaded to Cloudflare R2 after each run. Analysts open presigned download URLs directly from the admin dashboard without needing access to the backend or any data science tooling. Report links remain valid for seven days.\n\n- **Fire-and-Forget Architecture**: Quality monitoring runs as a background asyncio task. If Evidently or pandas are not installed, or if the monitoring service encounters any error, the ingestion job completes normally without disruption. Data quality monitoring degrades gracefully rather than blocking intelligence feeds.\n\n- **Organisation-Scoped Isolation**: Every query to the data quality tables includes an organization_id filter. Drift baselines and run reports are never accessible across tenant boundaries. This satisfies EDF/PESCO Golden Rule 14 data sovereignty requirements.\n\n## Use Cases\n\n- **Detecting Feed Degradation Before Graph Contamination**: Run automated drift detection on every OSINT connector batch so that structural changes in upstream data models are caught at the ingestion boundary, before malformed entities or broken relationships reach the investigation graph and corrupt analyst work.\n\n- **Monitoring Entity Extraction Quality Over Time**: Track quality score trends across weeks of ingestion for connectors that produce entity descriptions, notes, and text fields. 
A gradual increase in missing values or a spike in constant-column alerts can indicate that an upstream provider has changed their output schema or begun throttling content.\n\n- **Confirming Legitimate Data Model Changes**: When an upstream OSINT provider publishes a schema update, analysts use the Evidently HTML report to verify that the drift is attributable to the known change rather than a data quality problem. After confirmation, they reset the baseline so future runs compare against the new structure.\n\n- **Regulatory Evidence for Data Provenance**: The per-run drift scores and quality metrics stored in PostgreSQL provide a timestamped record of ingestion data quality for each connector. This supports audit trail requirements for intelligence provenance under EDF/PESCO compliance frameworks.\n\n## Integration\n\n- **Connector Registry**: Each connector ingestion run is identified by connector_id and run_id. The quality service uses these identifiers to correlate drift reports with the connector metadata stored in the registry.\n- **HITL Approval Service**: Drift threshold breaches create review requests in the existing HITL queue, which appears in the Review Queue admin page alongside changepoint and causal discovery alerts.\n- **Cloudflare R2**: Full Evidently HTML reports are stored per-run in R2 under the key pattern `quality-reports/{org_id}/{connector_id}/{run_id}.html`. Presigned GET URLs are generated on demand with a seven-day expiry.\n- **Ingestion Pipeline**: Quality monitoring hooks into the IngestionCoordinatorService as an optional `quality_service` dependency injected at startup. The hook runs as a fire-and-forget asyncio task after the STANAG 4774 classification step completes.\n\n## Open Standards\n\n- **Evidently AI** (Apache 2.0, github.com/evidentlyai/evidently): Open-source ML model and data pipeline monitoring framework. 
Provides DataDriftPreset and DataQualityPreset metric implementations with HTML report generation.\n- **PSI (Population Stability Index)**: The symmetrised form of Kullback-Leibler divergence (the sum of the KL divergences taken in both directions), widely used in credit risk and data monitoring to measure distributional shift between reference and current populations. Evidently offers PSI as a selectable drift method for both categorical and numerical columns; it is not one of Evidently's automatic defaults, so it applies only to columns where it has been configured.\n- **Jensen-Shannon Divergence**: A symmetric divergence measure derived from KL divergence, bounded between 0 and 1 when computed with base-2 logarithms. Evidently's default drift method for categorical columns and low-cardinality numerical columns when the reference dataset holds more than 1,000 rows.\n- **Wasserstein Distance**: Also known as Earth Mover's Distance. Measures the minimum transport cost to move one probability distribution to another. Evidently's default drift method (in a normalised form) for continuous numerical columns when the reference dataset holds more than 1,000 rows, capturing shape differences that KL divergence can miss.\n"}