# Cross-Investigation Semantic Search

## Overview

An analyst working across multiple investigations may know that a person of interest in one investigation also appears in a separate case under a different alias, but connecting those dots manually requires reading every investigation one by one. Cross-investigation semantic search closes that gap: it uses text embeddings and vector similarity to surface entities that are meaningfully similar to the entities in a current investigation, regardless of how they are named or described.

The Cross-Investigation Semantic Search module indexes entity text representations as high-dimensional vectors using Cloudflare Workers AI and stores them in Cloudflare Vectorize. When an analyst opens an investigation or entity detail, the system retrieves the most semantically similar entities across all other investigations in the same organisation and presents the related investigations ranked by how many matching entities they contain. Cross-investigation search is strictly scoped to `organisation_id`: data from one tenant is never exposed to another.

```mermaid
graph LR
    A[Entity] --> B[Text Representation\ntype: name | key attributes]
    B --> C[CF Workers AI Embeddings\nBGE-large-en-v1.5\n1024-dim]
    C --> D[Cloudflare Vectorize Index\none index per organisation]
    D --> E[Cosine Similarity Search\nquery vector = avg of entity embeddings]
    E --> F[Similar Entities\nacross investigations, same org only]
    F --> G[Related Investigations Panel\nranked by matching entity count]
    F --> H[Similar Entities Panel\nin entity detail view]
```
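The first step in the diagram, turning an entity into embeddable text, can be sketched as follows. This is a minimal illustration that assumes the `type: name | key attributes` layout shown in the diagram; the `Entity` class, its field names, the `key=value` attribute format, and the `max_attributes` cap are all hypothetical and are not taken from the actual service.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical entity shape; the real model is not described here."""
    entity_type: str
    name: str
    attributes: dict[str, str] = field(default_factory=dict)


def text_representation(entity: Entity, max_attributes: int = 5) -> str:
    """Render an entity as 'type: name | key attributes' for embedding."""
    head = f"{entity.entity_type}: {entity.name}"
    # Sort attribute keys so the same entity always yields the same text,
    # and therefore the same embedding input.
    pairs = [f"{key}={entity.attributes[key]}" for key in sorted(entity.attributes)]
    if not pairs:
        return head
    return head + " | " + ", ".join(pairs[:max_attributes])


suspect = Entity("person", "J. Doe", {"nationality": "GB", "alias": "Johnny D"})
rendered = text_representation(suspect)
# rendered == "person: J. Doe | alias=Johnny D, nationality=GB"
```

Keeping the rendering deterministic matters because any change in the text changes the embedding, which would make re-indexed entities drift relative to older vectors.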

**Last Reviewed:** 2026-04-14
**Last Updated:** 2026-04-14

## Key Features

- **Cloudflare Vectorize Indexing**: Each organisation has a dedicated Cloudflare Vectorize index (named `argus-entities-{org-slug}`). Entity text representations are embedded using the Cloudflare Workers AI BGE-large-en-v1.5 model (1024-dimensional vectors) and upserted into the index. A PostgreSQL `entity_embedding_records` table serves as the source-of-truth index manifest, recording which entities are indexed, when, and with which embedding model.

- **Incremental Background Indexing**: A scheduled background job (APScheduler, hourly) scans for entities without an embedding record and indexes them incrementally, up to 50 per organisation per run. The job acquires a distributed lock to prevent duplicate execution across multiple service instances. No separate backfill is required: each run works through the next batch of unindexed entities, so an existing backlog drains over successive runs, after which runs are no-ops until new entities appear.

- **Cosine Similarity Search**: Query-time search averages the embedding vectors of the supplied entity IDs to produce a single query vector, then performs cosine similarity search in Vectorize with a metadata filter on `org_id`. Results exclude the query entities themselves and are sorted by similarity score descending.

- **Related Investigations Panel**: A panel in the case workspace sidebar lists related investigations, ranked by number of matching entities. Each entry shows the investigation title, matching entity count, average similarity score, and up to three top matching entity labels. Clicking an entry navigates directly to the related investigation.

- **Similar Entities Panel**: A panel in the entity detail view shows the top semantically similar entities found across other investigations. Each result shows the entity name, type, how many investigations it appears in, and its similarity score displayed as a percentage badge.

- **Strict Organisation Scoping**: Cross-investigation search is scoped strictly to `organisation_id`. The Vectorize index is per-organisation, every query includes an `org_id` metadata filter as a defence-in-depth measure, and every PostgreSQL query in the service includes `organisation_id` in the WHERE clause. Because vectors for different tenants never share an index, a similarity query has no path to another tenant's entities.
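The query-time steps and the panel aggregation described in the features above can be sketched in plain Python. This is an illustration only: the `index` dictionary is an in-memory stand-in for the Vectorize index, the function names and record fields are hypothetical, entity IDs stand in for entity labels, and breaking ties on average score is an assumption rather than documented behaviour.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Average equal-length vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(index, query_ids, org_id, top_k=10):
    """Return (entity_id, score) pairs for one organisation, best first."""
    query = mean_vector([index[eid]["vector"] for eid in query_ids])
    excluded = set(query_ids)  # results never include the query entities
    scored = [
        (eid, cosine(query, record["vector"]))
        for eid, record in index.items()
        if record["org_id"] == org_id and eid not in excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def related_investigations(index, matches):
    """Group matches by investigation and rank by matching entity count."""
    grouped = defaultdict(list)
    for eid, score in matches:
        grouped[index[eid]["investigation_id"]].append((eid, score))
    rows = []
    for inv_id, hits in grouped.items():
        hits.sort(key=lambda hit: hit[1], reverse=True)
        rows.append({
            "investigation_id": inv_id,
            "match_count": len(hits),
            "avg_score": sum(score for _, score in hits) / len(hits),
            "top_entities": [eid for eid, _ in hits[:3]],
        })
    # Assumption: ties on match count are broken by average score.
    rows.sort(key=lambda r: (r["match_count"], r["avg_score"]), reverse=True)
    return rows


index = {
    "e1": {"vector": [1.0, 0.0], "org_id": "org-a", "investigation_id": "inv-1"},
    "e2": {"vector": [0.9, 0.1], "org_id": "org-a", "investigation_id": "inv-2"},
    "e3": {"vector": [0.8, 0.2], "org_id": "org-a", "investigation_id": "inv-2"},
    "e4": {"vector": [0.0, 1.0], "org_id": "org-a", "investigation_id": "inv-3"},
    "e5": {"vector": [1.0, 0.0], "org_id": "org-b", "investigation_id": "inv-9"},
}
matches = find_similar(index, ["e1"], "org-a")
panel = related_investigations(index, matches)
```

In this toy data `e5` is identical to the query vector but belongs to another organisation, so it never appears in `matches`. In the real service that separation comes from the per-organisation index plus the `org_id` filter rather than from an in-process check.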

## Use Cases

- **Cross-Case Entity Correlation**: An analyst discovers a suspect in a current investigation and immediately sees that semantically similar entities appear in two other investigations — surfacing a potential network that spans multiple cases.
- **Alias and Variant Detection**: Entities with the same real-world identity but different spellings, aliases, or descriptions are surfaced as similar because the embedding model captures semantic meaning, not just string similarity.
- **Investigation Triage**: A newly created investigation is automatically checked against the existing investigation corpus; if strong matches exist, analysts are alerted before duplicate work begins.
- **Entity Intelligence Enrichment**: An analyst viewing an entity profile sees at a glance how many other investigations contain semantically similar entities, supporting threat actor attribution and network mapping.

## Open Standards

- **Cloudflare Vectorize**: Cloudflare's published vector database product, accessed via the standard Cloudflare REST API. No proprietary orchestration layer.
- **Cloudflare Workers AI**: Cloudflare's published AI inference service. The BGE-large-en-v1.5 model is an open-weights BAAI model served via Cloudflare's edge AI infrastructure.
- **Cosine Similarity**: Standard linear algebra metric. The similarity score is the cosine of the angle between two embedding vectors. It ranges from -1 (opposite direction) through 0 (orthogonal, unrelated) to 1 (identical direction, maximally similar); scores between text embeddings typically fall between 0 and 1.
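As a worked example of the metric, the sketch below computes cosine similarity from its definition and formats a score the way a percentage badge might. The function names are hypothetical, and clamping negative scores to 0% is an assumption about display, not documented behaviour (cosine can in general be negative for vectors pointing in opposite directions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def as_percent_badge(score: float) -> str:
    """Format a score for display; negative scores are clamped to 0%."""
    return f"{round(max(score, 0.0) * 100)}%"


# Parallel vectors: magnitude differs, direction does not, so the score is ~1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors share no direction, so the score is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Vectors 45 degrees apart score 1/sqrt(2), roughly 0.707.
partial = cosine_similarity([1.0, 0.0], [1.0, 1.0])
badge = as_percent_badge(partial)  # "71%"
```

Because the metric depends only on direction, two text representations of different lengths can still score close to 1 if the model maps them to similar directions in the embedding space.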

## Integration

- **Entity Resolution**: Shares the Cloudflare Workers AI embedding infrastructure with the within-investigation entity deduplication module. Cross-investigation search adds a per-organisation Vectorize index layer on top.
- **Investigation Workspace**: The Related Investigations panel appears in the case workspace sidebar alongside the export and disclosure panels.
- **Entity Detail View**: The Similar Entities panel appears below the entity profile viewer on entity detail pages.
- **Background Scheduler**: The incremental indexing job is registered with the APScheduler service and runs hourly with distributed locking to prevent duplicate execution.