Overview
Intelligence operations frequently involve the same real-world person, organisation, or location appearing under multiple identifiers across disparate data sources: a subject of interest recorded with a misspelled name in a customs manifest, a transliteration variant in a signals database, and a partial address in a financial transaction log. Resolving these records to a single canonical entity without leaking protected identifiers to untrusted processing infrastructure is a core OSINT challenge.
The Privacy-Preserving Entity Resolution module links records across datasets using Bloom filter encoding (Schnell et al. 2009) as the primary matching primitive. Field values are encoded into binary Bloom filters before comparison, meaning the plaintext identifiers are never transmitted to the scoring pipeline. Cosine similarity over LLM embeddings and normalised Levenshtein distance contribute additional signal. The three scores are combined with fixed weights into a single confidence value that the analyst interface presents according to each analyst's configurable display threshold.
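The combination step described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the weights are the ones quoted under Key Features, while the function names, the choice to normalise edit distance by the longer string's length, and the clamping of negative cosine values to 0.0 are all assumptions made for the example.

```python
from __future__ import annotations

import math

# Fixed weights as quoted under Key Features.
W_PPRL, W_EMBEDDING, W_LEVENSHTEIN = 0.30, 0.40, 0.30


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumption: normalise by the longer string so the score lies in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity; negative values are clamped to 0.0 (an assumption)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return max(0.0, dot / norm) if norm else 0.0


def weighted_score(pprl: float, embedding: float, lev: float) -> float:
    """Combine the three component scores with the fixed weights."""
    return pprl * W_PPRL + embedding * W_EMBEDDING + lev * W_LEVENSHTEIN
```

With component scores of 0.8 (PPRL), 0.9 (embedding), and 0.5 (Levenshtein), the contributions are 0.24, 0.36, and 0.15, giving a combined confidence of about 0.75.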
Last Reviewed: 2026-04-14
Last Updated: 2026-04-14
Key Features
- PPRL Bloom Filter Encoding: String field values (name, email, phone, address) are encoded as binary Bloom filters using 2-gram decomposition and k=20 independent SHA-256 hash functions over a 1024-bit filter. Similarity is computed via the Dice coefficient, which yields a score in [0.0, 1.0] without comparing plaintext identifiers. This approach is defined in the open-access literature (Schnell et al. 2009, DOI: 10.1186/1472-6947-9-41).
- LLM Embedding Cosine Similarity: Each entity field is embedded using Cloudflare Workers AI, producing a dense vector representation that captures semantic similarity beyond surface-level string matching. The embedding score accounts for transliterations, abbreviations, and language variants that purely lexical methods miss. Embedding vectors are stored in the entity embeddings table for reuse across queries.
- Levenshtein String Distance: Normalised edit distance provides a fast, deterministic baseline score for field-level character similarity. It acts as a safety net for cases where both Bloom filter hashing and embeddings are inconclusive, and is used as the sole scorer when embedding infrastructure is unavailable.
- Fixed-Weight Hybrid Scoring: The three component scores are combined as weighted_score = (pprl_score × 0.30) + (embedding_score × 0.40) + (levenshtein_score × 0.30). These weights are algorithm constants hard-coded in the weighted scoring function. They are not read from the database at runtime, which keeps scoring deterministic and reproducible across sessions.
- Analyst Similarity Preferences: Each analyst can configure a personal confidence_display_threshold (default 0.75) that controls which candidate matches the interface presents above the fold. This preference is stored in a dedicated analyst preferences store as a UX personalisation setting. It has no effect on the algorithm's computed scores; it only determines the cut-off below which results are collapsed in the analyst view. The preferred algorithm weights stored alongside it are informational metadata for the UI tooltip and audit trail.
- Entity Merge and Split Audit: All merge operations are recorded in a dedicated merge history record with full provenance, including source entity IDs, merge strategy, and analyst identity. Merges can be reversed (split) until the history record is explicitly closed, supporting correction of erroneous automated matches.
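The Bloom filter encoding and Dice comparison described in the first feature can be sketched as below. The filter length (1024 bits), hash count (k=20), and 2-gram decomposition come from the feature description; everything else is an assumption for illustration, including the function names, the lowercase-and-pad normalisation, and the way the 20 hash functions are derived by prefixing an index to the 2-gram before hashing with SHA-256. A production deployment might well use keyed hashing instead.

```python
from __future__ import annotations

import hashlib

FILTER_BITS = 1024  # filter length, from the feature description
NUM_HASHES = 20     # k, from the feature description
NGRAM = 2           # 2-gram decomposition


def bigrams(value: str) -> set[str]:
    """Split a field value into 2-grams. Lowercasing and '_' padding are assumptions."""
    text = f"_{value.strip().lower()}_"
    return {text[i:i + NGRAM] for i in range(len(text) - NGRAM + 1)}


def bit_positions(gram: str) -> set[int]:
    """Map one 2-gram to up to k bit positions.

    Assumption: each of the k hash functions is SHA-256 over the gram
    prefixed with the function's index.
    """
    positions = set()
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{gram}".encode("utf-8")).digest()
        positions.add(int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return positions


def encode(value: str) -> int:
    """Encode a field value as a 1024-bit Bloom filter held in a Python int."""
    bits = 0
    for gram in bigrams(value):
        for pos in bit_positions(gram):
            bits |= 1 << pos
    return bits


def dice(a: int, b: int) -> float:
    """Dice coefficient over set bits: 2 * |A and B| / (|A| + |B|)."""
    ones_a = bin(a).count("1")
    ones_b = bin(b).count("1")
    if ones_a + ones_b == 0:
        return 0.0
    return 2 * bin(a & b).count("1") / (ones_a + ones_b)
```

Only the encoded filters are compared: `dice(encode("smith"), encode("smyth"))` scores the two spellings by their shared bit positions, and identical inputs always score 1.0.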
Use Cases
- OSINT Identity Resolution: Resolve person entities across open-source datasets where names appear in multiple transliterations, with typos, or with partial information, without exposing raw PII to intermediate processing.
- Cross-Source Organisation Deduplication: Identify when the same legal entity appears under abbreviated names, trading names, or registration number variants across financial and corporate datasets.
- Location Disambiguation: Link location references recorded with varying levels of geographic precision or naming conventions to canonical geographic identifiers.
- Multi-Domain Merge Review: Present candidate duplicate pairs that score above a configurable confidence threshold to analysts for manual confirmation before committing to a canonical merge.
- Embedding-Augmented Matching: Use semantic vector similarity to surface near-duplicate entities whose field values have low lexical overlap but strong semantic correspondence (e.g. an organisation's full legal name versus its common abbreviation).
Integration
- Entity Embedding Service: Provides the LLM embedding vectors consumed by the cosine similarity component.
- Entity Resolution Repository: Persists canonical entities, match records, and merge history with full tenant isolation.
- RBAC Service: All resolution operations require ENTITY_RESOLUTION_READ or ENTITY_RESOLUTION_WRITE permissions.
- Audit Trail: Every match computation, merge, and split is logged to the platform audit trail with userId, organizationId, action, timestamp, and resourceId for EDF/PESCO compliance.
- Analyst UI: The confidence_display_threshold is read from the analyst preferences store at presentation time to filter the result list shown in the interface.
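The presentation-time role of confidence_display_threshold can be illustrated with a small sketch. The `Candidate` type and `partition_for_display` function are hypothetical names invented for this example, and sorting by descending score is an assumption. The point it shows is that the threshold only splits already-scored candidates into a visible and a collapsed group, leaving the scores untouched.

```python
from __future__ import annotations

from dataclasses import dataclass

DEFAULT_THRESHOLD = 0.75  # documented default for confidence_display_threshold


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view model for one candidate match."""
    entity_id: str
    weighted_score: float


def partition_for_display(
    candidates: list[Candidate],
    threshold: float = DEFAULT_THRESHOLD,
) -> tuple[list[Candidate], list[Candidate]]:
    """Split candidates into (shown, collapsed) without altering any score.

    Candidates scoring below the threshold are collapsed; the rest are shown.
    Ordering by descending score is an assumption.
    """
    ranked = sorted(candidates, key=lambda c: c.weighted_score, reverse=True)
    shown = [c for c in ranked if c.weighted_score >= threshold]
    collapsed = [c for c in ranked if c.weighted_score < threshold]
    return shown, collapsed
```

Two analysts with different thresholds would see the same candidates with the same scores, just grouped differently between the visible and collapsed sections.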
Open Standards
- W3C Linked Data (RDF 1.1): Entity identifiers follow W3C Linked Data principles for stable, dereferenceable URIs, ensuring canonical entity records are interoperable across graph-based platforms.
- W3C PROV-O (Provenance Ontology): Merge, split, and match audit records align with PROV-O concepts of entity, activity, and agent, supporting reproducible provenance chains for every resolution decision.
- ISO/IEC 29101:2018 (Privacy Architecture Framework): The privacy-by-design approach to field encoding, ensuring plaintext identifiers are not transmitted to the scoring pipeline, follows the architectural principles set out in this standard.
- ISO/IEC 29134:2017 (Guidelines for Privacy Impact Assessment): Cross-dataset record linkage involving personal identifiers is assessed against the risk and mitigation guidance in this standard before production deployment.
- NIST SP 800-188 (De-Identifying Government Datasets): Bloom filter encoding of personal field values before comparison aligns with the de-identification and pseudonymisation techniques described in this guidance.