Data Deduplication Engine

Overview

An intelligence analyst searching for everything known about a person of interest should not have to decide whether "MOHAMMED AL-RASHID" and "Mohamed AlRashid" are the same individual. That determination should have been made before the record ever reached the analyst's screen. In practice, data flowing into Argus comes from dozens of sources, each with different name formats, transliteration conventions, and data entry standards. Without systematic deduplication, the same entity accumulates multiple records that fragment the picture rather than complete it.

The Deduplication Engine identifies, scores, and merges duplicate records across large datasets. It combines multiple fuzzy matching algorithms with ML-powered confidence scoring and configurable merge strategies so that match accuracy holds up in production environments. It operates in two modes: real-time deduplication for incoming records, and batch processing for historical data cleanup. For financial crime units, government data registries, healthcare data controllers, and intelligence agencies, accurate entity resolution is foundational to everything else the platform does.

Key Features

Multi-Algorithm Fuzzy Matching: Combine phonetic matching, token-based similarity, and edit distance algorithms for comprehensive duplicate detection that handles name variations, typos, transliteration differences, and formatting inconsistencies across languages and scripts.
ML-Powered Confidence Scoring: Score potential matches using a multi-factor model that weighs algorithm results, field importance, historical accuracy, and data quality for precise duplicate identification. Scores are calibrated against verified match outcomes.
Golden Record Management: Automatically create and maintain authoritative master records by merging the best data from duplicate sources into a single, high-quality record. Downstream consumers always reference the golden record.
Configurable Merge Strategies: Choose from most-recent, most-complete, most-trusted, or custom rule-based strategies to determine how conflicting field values are resolved during merging. Different field types can use different strategies.
Real-Time Deduplication: Detect and handle duplicates as records are ingested, preventing new duplicates from entering the system before they propagate to downstream consumers.
Batch Processing: Run high-throughput deduplication jobs across historical datasets for large-scale cleanup initiatives. Useful after data migrations or when onboarding a new data source.
Configurable Thresholds: Set match confidence thresholds for automatic merging, manual review, and rejection to balance precision with recall for your specific data characteristics and risk tolerance.
Data Normalisation: Standardise formats, clean data, and extract features before matching to improve algorithm accuracy. Normalisation rules are configurable per field type and data domain.
Complete Audit Trail: Track every match decision, merge action, and golden record update with full lineage for compliance, debugging, and regulatory review.
Feedback Learning: Improve matching accuracy over time by incorporating manual verification feedback into the confidence scoring model. The engine learns from analyst decisions.
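
As a rough illustration of how multi-algorithm matching and configurable thresholds could fit together, the sketch below combines an edit-distance score, a character-bigram score, and a crude phonetic key into one weighted score, then routes the pair to auto-merge, manual review, or rejection. Everything named here is an assumption made for the example: the function names, the weights, the threshold values, and the phonetic key (a simple vowel-dropping stand-in, not Soundex or Metaphone). The fixed weighted sum stands in for the ML confidence model, whose details the documentation does not specify.

```python
import re
import unicodedata

# Hypothetical weights and thresholds; in practice these would be tuned
# against verified match outcomes.
WEIGHTS = {"edit": 0.4, "bigram": 0.3, "phonetic": 0.3}
AUTO_MERGE_AT = 0.90
REVIEW_AT = 0.70


def normalise(name: str) -> str:
    """Strip accents, lowercase, and drop everything except letters and digits."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]", "", no_marks.lower())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance normalised to the range 0..1 (1 means identical)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a or not grams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def phonetic_key(text: str) -> str:
    """Crude phonetic key: keep the first letter, drop later vowels and
    h/w/y, and collapse repeated letters."""
    key = []
    for i, ch in enumerate(text):
        if i > 0 and ch in "aeiouhwy":
            continue
        if key and key[-1] == ch:
            continue
        key.append(ch)
    return "".join(key)


def match_score(left: str, right: str) -> float:
    """Weighted combination of the three similarity signals."""
    a, b = normalise(left), normalise(right)
    parts = {
        "edit": edit_similarity(a, b),
        "bigram": bigram_dice(a, b),
        "phonetic": 1.0 if phonetic_key(a) == phonetic_key(b) else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def route(score: float) -> str:
    """Map a score onto the three threshold bands."""
    if score >= AUTO_MERGE_AT:
        return "auto_merge"
    if score >= REVIEW_AT:
        return "manual_review"
    return "reject"


print(route(match_score("MOHAMMED AL-RASHID", "Mohamed AlRashid")))
```

With these assumed weights, the two spellings from the Overview score roughly 0.96 and fall in the auto-merge band. Raising AUTO_MERGE_AT trades recall for precision by pushing more pairs into manual review.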

Use Cases

Entity Unification Across Sources: Merge duplicate entity records created across different data sources, systems, or partner feeds into a single golden record that provides a complete picture of each individual or organisation.
Data Migration Cleanup: Before or after migrating data from legacy systems, identify and resolve duplicates to ensure the target system starts with clean, deduplicated records and analysts are not chasing redundant entries.
Ongoing Data Quality Maintenance: Run real-time deduplication on incoming records to prevent new duplicates, combined with periodic batch jobs to catch any that enter through channels without real-time coverage.
Master Data Management: Maintain authoritative master records for entities such as persons, organisations, vessels, or infrastructure by continuously merging updates from multiple source systems using trust-based resolution.
Compliance and Reporting Accuracy: Ensure accurate counts, metrics, and regulatory reports by eliminating duplicate records that would otherwise inflate numbers and compromise data quality in submissions to oversight bodies.
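
To make the idea of per-field merge strategies concrete, here is a minimal sketch of how a golden record might be assembled from duplicates using most-trusted, most-recent, and most-complete resolution. The record layout, source names, trust scores, and the field-to-strategy mapping are all hypothetical, and "most complete" is approximated here as the longest non-empty value; the engine's actual rules may differ.

```python
from datetime import date

# Hypothetical trust scores per source system.
SOURCE_TRUST = {"registry": 0.9, "partner_feed": 0.6, "legacy_crm": 0.3}


def most_recent(candidates):
    """Pick the value from the most recently updated record."""
    return max(candidates, key=lambda c: c["updated_at"])["value"]


def most_trusted(candidates):
    """Pick the value from the source with the highest trust score."""
    return max(candidates, key=lambda c: SOURCE_TRUST.get(c["source"], 0.0))["value"]


def most_complete(candidates):
    """Pick the longest value, as a rough proxy for completeness."""
    return max(candidates, key=lambda c: len(str(c["value"])))["value"]


# Hypothetical mapping: different field types use different strategies.
FIELD_STRATEGY = {"name": most_trusted, "address": most_recent, "phone": most_complete}


def build_golden_record(records):
    """Resolve each field independently across all duplicate records."""
    golden = {}
    field_names = {name for record in records for name in record["fields"]}
    for name in sorted(field_names):
        candidates = [
            {
                "value": record["fields"][name],
                "source": record["source"],
                "updated_at": record["updated_at"],
            }
            for record in records
            if record["fields"].get(name) not in (None, "")
        ]
        if candidates:
            strategy = FIELD_STRATEGY.get(name, most_recent)
            golden[name] = strategy(candidates)
    return golden


duplicates = [
    {"source": "registry", "updated_at": date(2023, 1, 10),
     "fields": {"name": "Mohammed Al-Rashid", "address": "12 Old Road", "phone": ""}},
    {"source": "partner_feed", "updated_at": date(2024, 6, 1),
     "fields": {"name": "Mohamed AlRashid", "address": "7 New Street", "phone": "555-0100"}},
    {"source": "legacy_crm", "updated_at": date(2022, 3, 15),
     "fields": {"name": "M. Rashid", "phone": "+1 555-0100"}},
]

GOLDEN = build_golden_record(duplicates)
print(GOLDEN)
```

In this example the name comes from the most trusted source, the address from the most recent record, and the phone number from the longest non-empty value, so no single source record wins outright.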

Integration

The Deduplication Engine integrates with Argus ingestion pipelines for real-time duplicate prevention and connects with the platform's data quality, golden record management, and audit trail systems. All deduplication decisions and golden record updates are stored in the platform record store with full organisation scoping and are linked to the audit trail for traceability.
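
As one illustration of what an audit-linked merge decision could look like, the sketch below builds a provenance document in the PROV-JSON layout that the Open Standards section names for merge operations. The identifiers, the `ex` namespace URL, and the function name are assumptions made for the example; the engine's actual record schema is not documented here.

```python
import json
from datetime import datetime, timezone


def merge_provenance(golden_id, source_ids, agent_id, merged_at):
    """Build a PROV-JSON document describing one merge operation.

    The golden record is an entity generated by a merge activity, derived
    from each source record, with the activity associated with an agent.
    """
    golden = f"ex:{golden_id}"
    activity = f"ex:merge-{golden_id}"
    agent = f"ex:{agent_id}"
    entities = {golden: {}}
    entities.update({f"ex:{source}": {} for source in source_ids})
    return {
        # Hypothetical namespace; a real deployment would use its own.
        "prefix": {"ex": "https://example.org/argus/"},
        "entity": entities,
        "activity": {activity: {"prov:endTime": merged_at.isoformat()}},
        "agent": {agent: {}},
        "wasGeneratedBy": {
            "_:gen1": {"prov:entity": golden, "prov:activity": activity}
        },
        "wasDerivedFrom": {
            f"_:der{index}": {
                "prov:generatedEntity": golden,
                "prov:usedEntity": f"ex:{source}",
            }
            for index, source in enumerate(source_ids, start=1)
        },
        "wasAssociatedWith": {
            "_:assoc1": {"prov:activity": activity, "prov:agent": agent}
        },
    }


doc = merge_provenance(
    golden_id="golden-1",
    source_ids=["rec-17", "rec-42"],
    agent_id="dedup-engine",
    merged_at=datetime(2026, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(json.dumps(doc, indent=2))
```

Keeping one wasDerivedFrom entry per source record means the lineage of a golden record can be traced back to every contributing record, not only to the merge activity as a whole.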

Open Standards

Privacy-Preserving Record Linkage / Bloom Filter Encoding (Schnell et al. 2009, DOI:10.1186/1472-6947-9-41): The engine's PPRL component encodes entity fields as Bloom filters using bigram decomposition with SHA-256 hashing, then computes Dice coefficient similarity, following the Bloom filter approach described in this open-access paper. This enables privacy-safe matching without exposing raw field values.
W3C PROV-DM (Provenance Data Model, W3C Recommendation 2013): Every entity merge operation writes a structured provenance record using the W3C PROV-DM concepts prov:Entity, prov:Activity, and prov:Agent, with wasGeneratedBy, wasDerivedFrom, and wasAssociatedWith relationships, serialised as PROV-JSON.
OAuth 2.0 and JWT Bearer Token: Token-based authentication protects typed, auditable read and write workflows across the platform.
JSON (ECMA-404 / RFC 8259): Entity attributes, merge metadata, algorithm detail payloads, and PROV-DM records are all stored and exchanged as JSON, which maps directly onto the JSONB columns of the platform record store that back the golden record store.
OAuth 2.0 (RFC 6749): Every authorised workflow handler enforces the IsAuthenticated permission gate, and downstream RBAC checks validate scoped permissions (ENTITY_RESOLUTION_READ, ENTITY_RESOLUTION_WRITE) derived from the platform's OAuth 2.0 token context.
Levenshtein Edit Distance (ISO/IEC 14651 collation-adjacent): Normalised Levenshtein distance is applied field-by-field as the classical string-similarity component of the hybrid scoring pipeline, providing language-agnostic edit-distance matching for names, addresses, and identifiers.
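
The Bloom filter encoding described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the engine's implementation: the filter size, the number of hash functions, the padding character, the shared secret, and the use of HMAC to derive several SHA-256-based hash functions are all choices made for the example.

```python
import hashlib
import hmac

# Hypothetical parameters; both parties must agree on them in advance.
FILTER_SIZE = 1000
NUM_HASHES = 20
SHARED_SECRET = b"example-shared-secret"


def bigrams(value: str) -> set:
    """Split a padded, lowercased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str) -> set:
    """Return the set of bit positions that the field value sets to 1."""
    positions = set()
    for gram in bigrams(value):
        for k in range(NUM_HASHES):
            # Prefixing the hash index k yields NUM_HASHES keyed hash
            # functions from a single SHA-256 HMAC.
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(SHARED_SECRET, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % FILTER_SIZE)
    return positions


def dice(bits_a: set, bits_b: set) -> float:
    """Dice coefficient over the set bits of two Bloom filters."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


close = dice(bloom_encode("Rashid"), bloom_encode("Rashed"))
far = dice(bloom_encode("Rashid"), bloom_encode("Smith"))
print(f"similar names: {close:.2f}, different names: {far:.2f}")
```

Only the bit positions are compared, so two parties holding the same secret can measure similarity without exchanging the underlying field values. Names that share most of their bigrams share most of their set bits, which is why the Dice score tolerates small spelling differences.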

Last Reviewed: 2026-02-05 Last Updated: 2026-04-14