Transformer Entity Matching

Overview

Determining whether two records in an investigation refer to the same real-world person, organisation, or asset is harder than it looks. Names are spelled inconsistently, phone numbers appear in different formats, and the same individual may appear under a legal name in one source and an alias in another. Classical string-distance metrics catch obvious cases but break down when variation comes from transliteration, data-entry differences, or deliberate obfuscation.

Argus uses a cross-encoder transformer model to score entity pairs jointly. Instead of comparing fields in isolation, the model reads both entity records as one input sequence and produces a single similarity score that reflects context across all fields at once. This approach was introduced by Li et al. (2021) in the Ditto paper, which reports that it outperforms earlier matchers on standard entity matching benchmarks.

The model can be applied zero-shot, or fine-tuned on Argus-specific labelled entity pairs with as few as 100 analyst-verified examples. This makes it practical to improve accuracy for a specific data source without a large labelling effort.

Last Reviewed: 2026-04-14
Last Updated: 2026-04-14

Key Features

Cross-Encoder Joint Scoring: Both entity records are concatenated into a single serialized sequence before being passed to the model. This lets the model attend across all fields of both records simultaneously, capturing interactions that a bi-encoder or field-by-field comparison would miss.
Ditto Serialization Format: Each entity is serialized as a sequence of COL/VAL token pairs (for example, "COL name VAL John Smith COL email VAL john@example.com"). Priority fields (name, email, phone, address, description) appear first to give the model the highest-signal attributes at the start of the sequence.
Lazy Model Loading via Model Registry: The cross-encoder weights are not bundled with the service. They are loaded on first use from object storage via the model registry, keeping the service image small and allowing model updates without redeployment.
Sigmoid Score Normalisation: Raw cross-encoder logits are unbounded, so they can fall outside the 0-1 interval. A sigmoid function maps them to a consistent probability-like range, so a configurable threshold (default 0.75) can be applied uniformly regardless of which model variant is in use.
Batch Scoring: The match-finding and batch-scoring operations process entity pairs in batches of 32 to bound GPU and CPU memory usage. This allows scoring against a pool of several hundred candidates in a single governed submission.
Analyst Verdict and HITL Workflow: Every scored pair above the threshold is persisted to a dedicated match-results table. Through the analyst verdict workflow, analysts record either a confirmed match or a confirmed non-match verdict for each pair. Records that are still pending appear in the analyst review queue.
Graceful Fallback: If the model is unavailable (for example, during cold start or when sentence-transformers is not installed), all scoring methods return 0.0 or empty lists rather than raising errors. The service remains available and logs a warning.
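
The serialization, normalisation, batching, and fallback behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the Argus implementation: the function names, the `logit_fn` callable standing in for the loaded cross-encoder, the alphabetical ordering of non-priority fields, and the inclusive threshold comparison are all assumptions made for the example. Only the COL/VAL format, the priority field order, the batch size of 32, and the default threshold of 0.75 come from the feature list.

```python
import logging
import math
from typing import Callable, Dict, Iterator, List, Optional, Sequence, Tuple

log = logging.getLogger(__name__)

# Values taken from the feature list; every name in this sketch is illustrative.
PRIORITY_FIELDS = ("name", "email", "phone", "address", "description")
BATCH_SIZE = 32
DEFAULT_THRESHOLD = 0.75

Entity = Dict[str, str]
Pair = Tuple[str, str]
# Hypothetical stand-in for the model: one raw logit per (left, right) text pair.
LogitFn = Callable[[Sequence[Pair]], Sequence[float]]


def serialize(entity: Entity) -> str:
    """Render an entity as 'COL <field> VAL <value>' pairs, priority fields first."""
    ordered = [f for f in PRIORITY_FIELDS if entity.get(f)]
    # Assumption: remaining non-empty fields follow in alphabetical order.
    ordered += sorted(f for f in entity if f not in PRIORITY_FIELDS and entity[f])
    return " ".join(f"COL {f} VAL {entity[f]}" for f in ordered)


def sigmoid(logit: float) -> float:
    """Map a raw logit into the 0-1 range without overflowing math.exp."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def batches(items: Sequence[Pair], size: int = BATCH_SIZE) -> Iterator[Sequence[Pair]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_pair(a: Entity, b: Entity, logit_fn: Optional[LogitFn]) -> float:
    """Score one pair; return 0.0 with a warning when no model is available."""
    if logit_fn is None:
        log.warning("cross-encoder unavailable, returning 0.0")
        return 0.0
    return sigmoid(logit_fn([(serialize(a), serialize(b))])[0])


def find_matches(
    query: Entity,
    candidates: Sequence[Entity],
    logit_fn: Optional[LogitFn],
    threshold: float = DEFAULT_THRESHOLD,
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) for every candidate that clears the threshold."""
    if logit_fn is None:
        log.warning("cross-encoder unavailable, returning no matches")
        return []
    left = serialize(query)
    pairs = [(left, serialize(c)) for c in candidates]
    matches: List[Tuple[int, float]] = []
    offset = 0
    for batch in batches(pairs):
        for i, logit in enumerate(logit_fn(batch)):
            score = sigmoid(logit)
            if score >= threshold:  # assumption: the comparison is inclusive
                matches.append((offset + i, score))
        offset += len(batch)
    return matches
```

In this sketch, passing `None` as `logit_fn` plays the role of the unavailable model, so callers always receive `0.0` or an empty list instead of an exception.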

Use Cases

Investigation Deduplication: Surface likely duplicate person or organisation profiles within an investigation so analysts can merge them into a canonical record.
Cross-Source Entity Linking: Match an entity imported from an external data source against existing profiles to prevent record fragmentation across connectors.
Alias and Transliteration Matching: Identify entities whose names appear in multiple spellings or scripts, which string-distance metrics treat as unrelated.
Fine-Tuned Vertical Matching: Fine-tune the model on 100 or more verified positive and negative pairs from a specific data source to increase precision for that connector.

Integration

Entity Resolution Service: The Ditto service complements the existing hybrid pipeline (PPRL Bloom filter, embedding similarity, Levenshtein) and can be invoked as an additional scoring stage.
Review Queue: Pending analyst verdicts surface in the platform review queue alongside other HITL items.
Graph Intelligence: Confirmed matches feed into the investigation graph, merging duplicate nodes and updating edge references.
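
One way to picture how confirmed matches could merge duplicate nodes and update edge references is a union-find pass over the confirmed pairs, followed by an edge rewrite. The sketch below is an assumption about the general technique, not a description of the Argus graph service: the function names, the choice of the lexicographically smallest identifier as the canonical node, and the decision to drop self-loops and duplicate edges after merging are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (source id, target id, label)


def canonical_ids(confirmed: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every node seen in a confirmed match to one canonical id per cluster."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in confirmed:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: the smaller id survives, which makes the result deterministic.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep
    return {node: find(node) for node in list(parent)}


def rewrite_edges(edges: Iterable[Edge], canon: Dict[str, str]) -> List[Edge]:
    """Point edges at canonical nodes, dropping self-loops and duplicates."""
    seen = set()
    result: List[Edge] = []
    for src, dst, label in edges:
        edge = (canon.get(src, src), canon.get(dst, dst), label)
        if edge[0] == edge[1] or edge in seen:
            continue
        seen.add(edge)
        result.append(edge)
    return result
```

Because union-find treats confirmed matches as transitive, confirming A = B and B = C places all three records in one cluster. Whether that transitivity is desirable is a policy decision this sketch does not settle.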

Open Standards

W3C OWL 2 Web Ontology Language: defines the owl:sameAs construct used as the canonical representation for asserting that two entity records refer to the same real-world individual or organisation.
W3C PROV-O (Provenance Ontology): provides the data model for recording analyst verdicts, match lineage, and the chain of evidence behind any confirmed entity merge.
W3C RDF 1.1: the graph data model used to represent entities as nodes and to express co-reference links between matched records.
Unicode Standard (Unicode Consortium): defines the normalisation forms, case folding, and character-level canonicalisation that underpin consistent entity serialisation across multilingual sources. Script transliteration rules are published separately by the Consortium in its CLDR data.
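
Two of these standards are easy to illustrate with the Python standard library. The sketch below normalises a name with Unicode NFKC and case folding, and writes a co-reference link as an N-Triples line using the real `owl:sameAs` IRI. The function names are hypothetical, the entity IRIs used with it would be placeholders, and nothing here is meant to describe how Argus itself stores these links.

```python
import unicodedata

# The owl:sameAs property IRI as defined by the W3C OWL vocabulary.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize_name(raw: str) -> str:
    """Apply NFKC normalisation, case folding, and whitespace collapsing."""
    composed = unicodedata.normalize("NFKC", raw)
    return " ".join(composed.casefold().split())


def same_as_ntriple(left_iri: str, right_iri: str) -> str:
    """Serialise an owl:sameAs assertion between two entity IRIs as one N-Triples line."""
    return f"<{left_iri}> <{OWL_SAME_AS}> <{right_iri}> ."
```

NFKC folds compatibility variants such as full-width Latin letters into their ordinary forms and composes base letters with combining marks, so visually identical names serialise to the same string before they reach the model.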