Overview
Intelligence operations frequently involve the same real-world person, organisation, or location appearing under multiple identifiers across disparate data sources: a subject of interest recorded with a misspelled name in a customs manifest, a transliteration variant in a signals database, and a partial address in a financial transaction log. Resolving these records to a single canonical entity without leaking protected identifiers to untrusted processing infrastructure is a core OSINT challenge.
The Privacy-Preserving Entity Resolution module links records across datasets using Bloom filter encoding (Schnell et al. 2009) as the primary matching primitive. Field values are encoded into binary Bloom filters before comparison, meaning the plaintext identifiers are never transmitted to the scoring pipeline. Cosine similarity over LLM embeddings and normalised Levenshtein distance contribute additional signal. The three scores are combined with fixed weights into a single confidence value that the analyst interface presents according to each analyst's configurable display threshold.
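The fixed-weight combination can be sketched as follows. This is a minimal illustration, not the module's actual code: it uses the weights stated on this page (0.30 for PPRL, 0.40 for embeddings, 0.30 for Levenshtein) and assumes each component score has already been scaled to [0, 1]. The function and constant names are hypothetical.

```python
# Weights as stated on this page; the names are hypothetical.
PPRL_WEIGHT = 0.30
EMBEDDING_WEIGHT = 0.40
LEVENSHTEIN_WEIGHT = 0.30


def weighted_score(pprl_score: float, embedding_score: float, levenshtein_score: float) -> float:
    """Combine three component scores into one confidence value.

    Assumption: every component is already in [0, 1]; out-of-range input is rejected.
    """
    components = {
        "pprl": pprl_score,
        "embedding": embedding_score,
        "levenshtein": levenshtein_score,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return (
        pprl_score * PPRL_WEIGHT
        + embedding_score * EMBEDDING_WEIGHT
        + levenshtein_score * LEVENSHTEIN_WEIGHT
    )
```

Because the weights sum to 1.0, the aggregate stays in [0, 1] whenever the inputs do, which is what allows it to be compared directly against a display threshold.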
Diagram
graph LR
A[Entity A Attributes] --> B[Bloom Filter Encode\nPPRL, Schnell et al. 2009]
A --> C[LLM Embedding\nCloudflare Workers AI]
A --> D[Levenshtein Distance\nNormalised String Similarity]
E[Entity B Attributes] --> B
E --> C
E --> D
B --> F[PPRL Score\n× 0.30]
C --> G[Cosine Score\n× 0.40]
D --> H[String Score\n× 0.30]
F --> I[Weighted Aggregate]
G --> I
H --> I
I --> J[Similarity Score]
J --> K[Analyst UI\nthreshold preference applied\nfor display only]
Last Reviewed: 2026-04-14
Last Updated: 2026-04-14
Key Features
- PPRL Bloom Filter Encoding: String field values (name, email, phone, address) are encoded as binary Bloom filters using 2-gram decomposition and k=20 independent SHA-256 hash functions over a 1024-bit filter. Similarity is computed via the Dice coefficient, which yields a score in [0.0, 1.0] without comparing plaintext identifiers. This approach is defined in the open-access literature (Schnell et al. 2009, DOI: 10.1186/1472-6947-9-41).
- LLM Embedding Cosine Similarity: Each entity field is embedded using Cloudflare Workers AI, producing a dense vector representation that captures semantic similarity beyond surface-level string matching. The embedding score accounts for transliterations, abbreviations, and language variants that purely lexical methods miss. Embedding vectors are stored in the entity embeddings table for reuse across queries.
- Levenshtein String Distance: Normalised edit distance provides a fast, deterministic baseline score for field-level character similarity. It acts as a safety net for cases where both Bloom filter hashing and embeddings are inconclusive, and is used as the sole scorer when embedding infrastructure is unavailable.
- Fixed-Weight Hybrid Scoring: The three component scores are combined as:
  weighted_score = (pprl_score × 0.30) + (embedding_score × 0.40) + (levenshtein_score × 0.30)
  These weights are algorithm constants hard-coded in EntityResolutionService._calculate_weighted_score. They are not read from the database at runtime, which keeps scoring deterministic and reproducible across sessions.
- Analyst Similarity Preferences: Each analyst can configure a personal confidence_display_threshold (default 0.75) that controls which candidate matches the interface presents above the fold. This preference is stored in the analyst_similarity_preferences table as a UX personalisation setting. It has no effect on the algorithm's computed scores; it only determines the cut-off below which results are collapsed in the analyst view. The preferred algorithm weights stored alongside it are informational metadata for the UI tooltip and audit trail.
- Entity Merge and Split Audit: All merge operations are recorded in entity_merge_history with full provenance, including source entity IDs, merge strategy, and analyst identity. Merges can be reversed (split) until the history record is explicitly closed, supporting correction of erroneous automated matches.
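The Bloom filter encoding and Dice comparison described in the first feature can be sketched as below. The filter length (1024 bits), k=20, and 2-gram decomposition come from this page. Everything else is an assumption of the sketch: deriving the k hash functions by salting SHA-256 with an index, padding each value with a space on both sides, lowercasing, and all function names. The module's actual construction may differ.

```python
from __future__ import annotations

import hashlib

FILTER_BITS = 1024  # filter length stated above
NUM_HASHES = 20     # k stated above
Q = 2               # 2-gram decomposition stated above


def qgrams(value: str, q: int = Q) -> set[str]:
    """Split a lowercased, space-padded string into its set of q-grams.

    Assumption: normalisation and padding are choices made for this sketch.
    """
    padded = f" {value.strip().lower()} "
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_encode(value: str) -> set[int]:
    """Return the bit positions set in the Bloom filter for value."""
    positions: set[int] = set()
    for gram in qgrams(value):
        for i in range(NUM_HASHES):
            # Assumption: the i-th hash function is SHA-256 salted with i.
            digest = hashlib.sha256(f"{i}:{gram}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest, "big") % FILTER_BITS)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over set bit positions."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 2 * len(a & b) / total


# Usage: only the encoded filters are compared, never the names themselves.
similar = dice(bloom_encode("Jonathan Smith"), bloom_encode("Jonathon Smith"))
different = dice(bloom_encode("Jonathan Smith"), bloom_encode("Maria Garcia"))
```

Representing the filter as a set of bit positions keeps the sketch short; a production encoder would more likely pack the bits into a fixed-width bit array.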
Use Cases
- OSINT Identity Resolution: Resolve person entities across open-source datasets where names appear in multiple transliterations, with typos, or with partial information, without exposing raw PII to intermediate processing.
- Cross-Source Organisation Deduplication: Identify when the same legal entity appears under abbreviated names, trading names, or registration number variants across financial and corporate datasets.
- Location Disambiguation: Link location references recorded with varying levels of geographic precision or naming conventions to canonical geographic identifiers.
- Multi-Domain Merge Review: Present candidate duplicate pairs to analysts above a configurable confidence threshold for manual confirmation before committing to a canonical merge.
- Embedding-Augmented Matching: Use semantic vector similarity to surface near-duplicate entities whose field values have low lexical overlap but strong semantic correspondence (e.g. an organisation's full legal name versus its common abbreviation).
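As a rough illustration of the embedding comparison behind the last use case, cosine similarity between two vectors can be computed as below. This is a generic sketch: the vectors would come from the embedding service, the function name is hypothetical, and returning 0.0 for a zero-length vector is an assumption made here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector carries no signal, so report no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

Cosine similarity can be negative for general vectors, so a pipeline that expects a score in [0, 1] would need to clip or rescale it. How the module handles that case is not stated on this page.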
Integration
- Entity Embedding Service: Provides the LLM embedding vectors consumed by the cosine similarity component.
- Entity Resolution Repository: Persists canonical entities, match records, and merge history with full tenant isolation.
- RBAC Service: All resolution operations require ENTITY_RESOLUTION_READ or ENTITY_RESOLUTION_WRITE permissions.
- Audit Trail: Every match computation, merge, and split is logged to the platform audit trail with userId, organizationId, action, timestamp, and resourceId for EDF/PESCO compliance.
- Analyst UI: The confidence_display_threshold from analyst_similarity_preferences is read at presentation time to filter the result list shown in the interface.
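The permission check and audit record described above might fit together as in the following sketch. The five field names and the two permission names come from this page. The class, the functions, the "merge" action string, and the choice to require ENTITY_RESOLUTION_WRITE for a merge are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """Audit record carrying the five fields listed on this page."""
    userId: str
    organizationId: str
    action: str
    timestamp: str
    resourceId: str


def require_permission(granted: set[str], required: str) -> None:
    """Raise PermissionError unless the caller holds the required permission."""
    if required not in granted:
        raise PermissionError(f"missing permission: {required}")


def record_merge(granted: set[str], user_id: str, organization_id: str, entity_id: str) -> dict:
    """Check write access, then build the audit record for a merge."""
    # Assumption: merges need the write permission; the action name is hypothetical.
    require_permission(granted, "ENTITY_RESOLUTION_WRITE")
    event = AuditEvent(
        userId=user_id,
        organizationId=organization_id,
        action="merge",
        timestamp=datetime.now(timezone.utc).isoformat(),
        resourceId=entity_id,
    )
    return asdict(event)
```

Checking the permission before building the record means a rejected request produces no merge entry in this sketch; whether the platform also logs denied attempts is not covered here.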
Open Standards Compliance
The matching algorithm is implemented exclusively on open-access and open-source foundations:
- PPRL Bloom filter encoding: Schnell, R., Bachteler, T., Reiher, J. (2009). Privacy-preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making, 9(1), 41. DOI: 10.1186/1472-6947-9-41. Open access. The approach encodes field values into binary Bloom filters before comparison, so plaintext identifiers are never transmitted to the scoring pipeline. Reference implementations are available in the open-source PPRL ecosystem (e.g. PPRL-Toolbox, FEBRL).
- W3C Linked Data entity identity: Entity IDs follow W3C Linked Data principles for stable, dereferenceable identifiers. Each canonical entity record is the authoritative node for that entity in the platform graph.
- Dice coefficient similarity: Standard information-retrieval metric with no proprietary claim.
- Levenshtein distance: Classical edit distance algorithm in the public domain.
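For reference, a normalised Levenshtein similarity can be sketched as follows. The edit distance is the classical dynamic-programming algorithm. Dividing by the length of the longer string is an assumption of this sketch, since this page does not say which normalisation the module uses, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classical edit distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity in [0, 1].

    Assumption: normalise by the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

With this normalisation, "kitten" and "sitting" (edit distance 3, longer length 7) score 1 - 3/7, roughly 0.57.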
The analyst_similarity_preferences table stores per-analyst UX personalisation settings. These records are architecturally separate from the matching algorithm: no scoring weight or algorithm parameter is read from this table at runtime. The confidence display threshold is applied at the presentation layer, after the algorithm has completed, solely to determine which results are shown above the fold to the analyst.
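The separation between scoring and display can be illustrated with a small presentation-layer sketch. The default of 0.75 comes from this page. The Candidate type, the function name, and the choice to show a score exactly equal to the threshold are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass

DEFAULT_DISPLAY_THRESHOLD = 0.75  # default confidence_display_threshold stated on this page


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate match carrying an already-computed score."""
    entity_id: str
    score: float


def partition_for_display(
    candidates: list[Candidate],
    threshold: float = DEFAULT_DISPLAY_THRESHOLD,
) -> tuple[list[Candidate], list[Candidate]]:
    """Split scored candidates into (shown, collapsed) without changing any score."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    # Assumption: a score equal to the threshold is shown rather than collapsed.
    shown = [c for c in ranked if c.score >= threshold]
    collapsed = [c for c in ranked if c.score < threshold]
    return shown, collapsed
```

Two analysts with different thresholds would see the same scores for the same pair; only the split between the shown and collapsed lists differs.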