Entity Resolution Domain

Overview

A national criminal intelligence service ingests records from twelve regional police forces, two customs systems, and a financial intelligence unit feed. The same individual appears under different name spellings, partial date-of-birth entries, and inconsistent address formats across these sources. Without resolution, analysts work from fragmented profiles and miss the complete picture of a subject's activities and associations. The Entity Resolution domain identifies these overlapping records, scores their similarity using multiple algorithms, and consolidates them into canonical, unique entities while maintaining full data lineage back to each source system.

Data quality is not a one-time problem: new records arrive continuously. Entity Resolution applies configurable matching strategies on an ongoing basis, automatically merging high-confidence duplicates and routing borderline cases to a human reviewer. The result is a master record that grows more complete as more data arrives, rather than more fragmented.

Key Features

Duplicate Detection: Combines machine learning embeddings with classical string matching algorithms to identify potential duplicate records across data sources
Multi-Algorithm Similarity Scoring: Calculates similarity using vector analysis, fuzzy string matching, phonetic matching, and set-based overlap methods for accurate results
Configurable Match Thresholds: Adjustable confidence thresholds (default 0.75) control the sensitivity of duplicate detection to match different data quality levels
Multiple Merge Strategies: Supports keep-newer, keep-older, combine-all, and manual field-by-field merge strategies for flexible deduplication
Human-in-the-Loop Review: Match candidates can be routed for human review with pending, confirmed, and rejected status workflows
Auto-Merge for High Confidence: Matches exceeding the high-confidence threshold are merged automatically without manual review
Full Merge History: Complete audit trail tracks every merge operation including source entities, target entity, strategy used, and timestamp
Reversible Merges: Merged entities can be split back to their original records when merges are determined to be incorrect
Multi-Entity Type Support: Resolves duplicates across persons, organisations, locations, assets, and events
Attribute-Level Scoring: Individual similarity scores for names, emails, phone numbers, and addresses provide transparency into match decisions
Source System Mapping: Maintains bidirectional links between canonical entities and their original source records for data lineage
Resolution Statistics: Dashboard metrics track total entities, potential duplicates, confirmed merges, rejection rates, and average confidence scores
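
The threshold and review features in this list can be illustrated with a minimal sketch, assuming per-attribute scores are combined by a weighted mean. Only the 0.75 default match threshold comes from this page. The 0.95 auto-merge threshold, the attribute weights, the boundary handling, and the names `overall_confidence` and `route` are hypothetical and chosen purely for illustration.

```python
from enum import Enum

MATCH_THRESHOLD = 0.75       # default match threshold stated in the feature list
AUTO_MERGE_THRESHOLD = 0.95  # hypothetical value; the real setting is not documented here


class Decision(Enum):
    AUTO_MERGE = "auto_merge"
    PENDING_REVIEW = "pending"
    NO_MATCH = "no_match"


def overall_confidence(attribute_scores: dict, weights: dict) -> float:
    """Weighted mean of per-attribute similarity scores, each in [0, 1].

    Attributes without a score are skipped, so records with missing
    fields are compared only on what both sides provide.
    """
    used = {name: w for name, w in weights.items() if name in attribute_scores}
    total_weight = sum(used.values())
    if total_weight == 0:
        return 0.0
    return sum(attribute_scores[name] * w for name, w in used.items()) / total_weight


def route(confidence: float,
          match_threshold: float = MATCH_THRESHOLD,
          auto_merge_threshold: float = AUTO_MERGE_THRESHOLD) -> Decision:
    """Decide what happens to a match candidate with the given confidence."""
    if confidence > auto_merge_threshold:   # "exceeding" is read as strictly greater
        return Decision.AUTO_MERGE
    if confidence >= match_threshold:       # assumed inclusive lower bound
        return Decision.PENDING_REVIEW
    return Decision.NO_MATCH


# Hypothetical weights and scores for a person-to-person comparison.
weights = {"name": 0.4, "email": 0.3, "phone": 0.2, "address": 0.1}
scores = {"name": 0.9, "email": 1.0, "phone": 0.5, "address": 0.8}
decision = route(overall_confidence(scores, weights))  # Decision.PENDING_REVIEW
```

Keeping the per-attribute scores alongside the overall confidence is what lets a reviewer see why a candidate landed in the pending queue rather than being merged automatically.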

Use Cases

Organisations importing data from multiple source systems use entity resolution to identify and merge duplicate person records, creating a single canonical record with a complete view of each individual.
Data stewards review pending match candidates using the human-in-the-loop workflow, confirming true duplicates and rejecting false matches to continuously improve data quality.
Automated deduplication workflows merge high-confidence matches while routing borderline cases for human review, balancing efficiency with accuracy.
Investigators trace entity records back to their original source systems using the source mapping capability, understanding the provenance and completeness of each data point.
Data quality teams monitor resolution statistics to track deduplication progress, identify data quality issues, and measure the effectiveness of matching algorithms.

Industry Context

National criminal intelligence databases routinely reconcile records from regional forces where name transliteration, legacy formats, and operator entry variation produce hundreds of duplicate variants per week. Border management agencies deduplicate passport and visa applications against watchlists where phonetic name variants are a deliberate evasion tactic. Financial regulators merge customer records across banking subsidiaries to establish true beneficial ownership. Defence intelligence analysts resolve person entities across classified and open-source datasets, each with different identifier conventions and completeness levels.

Integration

The Entity Resolution domain works with data source connectors, HR platforms, CRM systems, and external intelligence feeds to import and normalise records. It connects to the master data layer to maintain the canonical record in the platform record store, and it provides bidirectional source mapping for data lineage across all integrated systems. After each confirmed merge, the graph analysis layer is updated to reflect the resolved identities.
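
One way to picture bidirectional source mapping with reversible merges is the in-memory sketch below. It is an illustration under stated assumptions, not the platform's data model: the `SourceMap` and `MergeRecord` names, the source system labels, and the snapshot-based split are all hypothetical, and the split shown is only safe when no later merge has touched the same entities.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import uuid4


@dataclass
class MergeRecord:
    merge_id: str         # version-4 UUID
    source_ids: list      # canonical ids folded into the target
    target_id: str
    strategy: str         # e.g. "keep-newer", "combine-all"
    merged_at: str        # ISO 8601 timestamp
    previous_links: dict  # canonical id -> source refs held before the merge


class SourceMap:
    def __init__(self):
        self.to_sources = defaultdict(set)  # canonical id -> {(system, record_id)}
        self.to_canonical = {}              # (system, record_id) -> canonical id
        self.history = []                   # audit trail of MergeRecord entries

    def link(self, canonical_id, system, record_id):
        ref = (system, record_id)
        self.to_sources[canonical_id].add(ref)
        self.to_canonical[ref] = canonical_id

    def merge(self, source_ids, target_id, strategy):
        # Snapshot the links first so the merge can be undone later.
        previous = {cid: set(self.to_sources[cid]) for cid in [*source_ids, target_id]}
        for cid in source_ids:
            if cid == target_id:
                continue
            for ref in self.to_sources.pop(cid, set()):
                self.to_sources[target_id].add(ref)
                self.to_canonical[ref] = target_id
        record = MergeRecord(str(uuid4()), list(source_ids), target_id, strategy,
                             datetime.now(timezone.utc).isoformat(), previous)
        self.history.append(record)
        return record

    def split(self, record):
        # Restore both directions of the mapping from the snapshot.
        for cid, refs in record.previous_links.items():
            self.to_sources[cid] = set(refs)
            for ref in refs:
                self.to_canonical[ref] = cid


source_map = SourceMap()
source_map.link("A", "force_07", "r1")  # hypothetical system and record ids
source_map.link("B", "customs", "r9")
merge_record = source_map.merge(["B"], "A", "keep-newer")
```

After the merge, a lookup from the customs record resolves to canonical entity "A", and calling `source_map.split(merge_record)` restores the original two-entity state.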

Open Standards

Privacy-Preserving Record Linkage (PPRL), Schnell et al. 2009, DOI:10.1186/1472-6947-9-41: The primary matching component encodes entity attributes as Bloom filter bit-vectors using bigram decomposition and SHA-256 hashing, then computes Dice-coefficient similarity without exposing raw personal data.
W3C PROV-DM (Provenance Data Model): Every confirmed entity merge is recorded via the PROV-DM provenance service, capturing the source entities, target entity, merge strategy, and actor to satisfy full data-lineage requirements.
OAuth 2.0 and JWT Bearer Token: Token-based authentication protects typed, auditable read and write workflows across the platform.
RFC 4122 (UUID): Every entity record, match candidate, and merge-history entry is assigned a version-4 universally unique identifier, ensuring stable cross-system references and idempotent upserts.
ISO 8601: All date-of-birth, created_at, updated_at, and merge-timestamp fields are serialised as ISO 8601 strings, ensuring consistent interchange with source systems using different regional date conventions.
Levenshtein Edit Distance: Classical per-character edit-distance scoring is applied as a weighted component (30%) of name, address, and identifier attribute comparison, and as the fallback when embedding vectors are unavailable.
Jaccard Similarity: Set-intersection-over-union scoring is used for multi-valued and tokenised attributes (such as address components) to complement Levenshtein scoring in the classical matching path.
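
The three similarity measures named in this list can be sketched as follows. This is a simplified illustration rather than the platform's implementation: the filter size, the number of hash functions, the padding character, and the way the hash index is mixed into the SHA-256 input are assumptions, and a production PPRL encoding would normally use secret keys so that the bit patterns cannot be reproduced by an outsider.

```python
import hashlib


def bigrams(value: str) -> set:
    """Split a normalised, padded string into character bigrams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, size: int = 1000, num_hashes: int = 10) -> set:
    """Return the set of bit positions that would be set in the Bloom filter."""
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            # Prefix the hash index so each k acts as a different hash function.
            digest = hashlib.sha256(bytes([k]) + gram.encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice(a: set, b: set) -> float:
    """Dice coefficient over set bit positions: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalise the edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def jaccard(a: set, b: set) -> float:
    """Intersection over union for tokenised, multi-valued attributes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Name variants compared without exchanging the raw strings.
close = dice(bloom_encode("Smith"), bloom_encode("Smyth"))
far = dice(bloom_encode("Smith"), bloom_encode("Jones"))

# Classical path: levenshtein("kitten", "sitting") is 3.
address_overlap = jaccard({"12", "high", "street"}, {"12", "high", "st"})  # 0.5
```

Only the bit positions leave each data holder in the Bloom filter path, which is why the Dice comparison can run without exposing the underlying names. The Levenshtein and Jaccard functions work on plain values and suit the classical matching path.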

Last Reviewed: 2026-02-05
Last Updated: 2026-04-14