Investigation-Scoped News Ingestion

Overview

An intelligence analyst tracking a suspected cross-border smuggling network does not want every News story that mentions the word "smuggling" landing in their platform. They want stories that are relevant to their active investigations, their named entities, and the keywords they have chosen for their alerts, and they want those stories correlated against the rest of the intelligence record. They also do not want the platform reaching out to News from every service instance, with no central rate control, and no ability to audit what was pulled.

The Investigation-Scoped News Ingestion module gives the platform a single, bounded, investigation-driven ingestion path for News stories. A Cloudflare Worker runs on a fixed schedule, asks the middleware for the current ingestion plan (active investigations, their precise search terms, and optional News interest identifiers), performs the outbound calls from a single egress point, and ships the matched story payloads back to the middleware. The middleware is the system of record: it builds the plan, deduplicates, normalises the stories into the platform's existing news schema, scores investigation relevance, creates alerts where an investigation matches, and writes entity links into the news graph.
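The run cycle described above can be sketched as follows. This is a minimal illustration, not the deployed Worker: it is written in Python for readability, and `PlanEntry`, its field names, and the three injected callables (`fetch_plan`, `search_news`, `push_stories`) are hypothetical stand-ins for the middleware plan endpoint, the outbound News call, and the push back to the middleware.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class PlanEntry:
    """One investigation's slice of the ingestion plan (field names are assumptions)."""
    tenant_id: str
    investigation_id: str
    terms: list[str]
    interest_ids: list[str] = field(default_factory=list)


def run_once(
    fetch_plan: Callable[[], Iterable[PlanEntry]],
    search_news: Callable[[PlanEntry], list[dict]],
    push_stories: Callable[[PlanEntry, list[dict]], None],
) -> int:
    """Run one scheduled cycle and return the number of stories pushed."""
    pushed = 0
    for entry in fetch_plan():
        # All outbound calls happen here, from the single egress point.
        stories = search_news(entry)
        if not stories:
            continue
        # Stories are submitted under the tenant and investigation that produced them.
        push_stories(entry, stories)
        pushed += len(stories)
    return pushed
```

Passing the three operations in as callables keeps the cycle testable without network access; deduplication, normalisation, and scoring stay on the middleware side and are deliberately absent from this loop.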

Last Reviewed: 2026-04-17
Last Updated: 2026-04-17

Key Features

Investigation-Scoped Search Plan: The middleware builds an ingestion plan every time the Worker wakes up. For each active investigation (status ACTIVE, OPEN, ONGOING, or IN_PROGRESS) the plan lists precise terms extracted from the investigation title, summary, linked entity identifiers, and configured alert keywords, plus any News interest identifiers the investigation has pinned. The plan is tenant-tagged so the Worker submits each investigation under its own organisation.
Bounded Collection: The plan's term extraction filters stopwords, UUIDs, and common platform terms so the Worker does not issue generic queries such as "investigation" or "report". Term weights let the plan prioritise investigation titles and linked entities over weaker signals. This keeps collection investigation-relevant rather than a keyword trawl.
Tenant-Scoped Ingestion: Every story pushed back to the middleware carries the investigation identifier and the tenant identifier that generated it. All downstream writes (NewsStory, NewsArticle, NewsStoryInvestigationLink, NewsAlertArticleMatch) include organisation scope in the WHERE clause, satisfying the platform's data sovereignty requirement.
Deduplication and Correlation: Stories are deduplicated by content hash and URL, then correlated against existing NewsStory records so a single real-world event covered by multiple outlets rolls up into one story with multiple articles rather than many isolated records.
Entity Linking: The ingestion service passes normalised story text through the platform's entity extraction pipeline so NewsEntityLink rows are created linking each story to the people, organisations, and locations it mentions. Entity links feed the existing News Intelligence correlation graph.
Investigation Trending Score: Each story that matches an investigation updates the InvestigationTrendingScore for that investigation, giving the analyst a running measure of how much media activity is occurring around their case without computing it at view time.
Alert Notification: Where an investigation has configured a news alert and an ingested story matches the alert's terms, a NewsAlertArticleMatch is written and a NewsAlertNotification is dispatched through the analyst's configured channels.
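The Bounded Collection filtering and weighting can be sketched as below. This is an illustration under stated assumptions: the stopword and platform-term lists, the weight values, the three-character minimum, and the `extract_terms` name are all hypothetical, since this page does not give the middleware's actual lists or weights.

```python
import re

# Assumed lists; the real stopword and platform-term lists are not given here.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "into", "for", "to"}
PLATFORM_TERMS = {"investigation", "report", "case", "alert"}

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)

# Assumed weights: titles and linked entities outrank weaker signals.
SOURCE_WEIGHTS = {"title": 3.0, "entity": 3.0, "alert_keyword": 2.0, "summary": 1.0}


def extract_terms(fields: dict[str, list[str]]) -> dict[str, float]:
    """Return {term: weight}, keeping the highest weight seen for each term."""
    terms: dict[str, float] = {}
    for source, values in fields.items():
        weight = SOURCE_WEIGHTS.get(source, 1.0)
        for value in values:
            if source in ("entity", "alert_keyword"):
                # Names and alert keywords stay whole phrases.
                candidates = [value.strip().lower()]
            else:
                # Free text is split into lower-cased word tokens.
                candidates = re.findall(r"[a-z0-9][a-z0-9'-]*", value.lower())
            for term in candidates:
                if len(term) < 3 or term in STOPWORDS or term in PLATFORM_TERMS:
                    continue
                if UUID_RE.match(term):
                    continue  # raw identifiers make poor search terms
                terms[term] = max(weight, terms.get(term, 0.0))
    return terms
```

Keeping the maximum weight per term means a word that appears in both the title and the summary is ranked by its strongest source.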
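Deduplication by content hash and URL might look like the following sketch. The normalisation rules (lower-casing the host, dropping the fragment, collapsing whitespace before hashing), the choice of SHA-256, and the `url` and `text` field names are assumptions for illustration; the middleware's actual rules are not specified on this page.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url: str) -> str:
    """Lower-case scheme and host, drop the fragment, trim a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def content_hash(text: str) -> str:
    """SHA-256 over lower-cased, whitespace-collapsed text."""
    collapsed = " ".join(text.lower().split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def deduplicate(stories: list[dict]) -> list[dict]:
    """Keep the first story for each distinct URL or content hash, in input order."""
    seen_urls: set[str] = set()
    seen_hashes: set[str] = set()
    unique = []
    for story in stories:
        url = normalise_url(story["url"])
        digest = content_hash(story["text"])
        if url in seen_urls or digest in seen_hashes:
            continue  # same address or same body as an earlier story
        seen_urls.add(url)
        seen_hashes.add(digest)
        unique.append(story)
    return unique
```

A story is treated as a duplicate if either key matches, so a syndicated copy under a different URL and a re-fetched page with the same URL are both caught. Correlating distinct articles about the same event into one NewsStory is a separate step that this sketch does not attempt.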

Use Cases

Active Investigation Media Tracking: An analyst opens a new investigation into a suspected fuel smuggling network. As soon as the investigation is marked ACTIVE, the next scheduled News Worker run includes the investigation's terms and linked entities. Within six hours the analyst sees News stories relevant to the case in their investigation workspace, with trending score and entity links already populated.
Keyword Alert with Low Noise: A counter-terrorism analyst configures a news alert for a specific vessel name. The search plan includes the vessel name as a weighted term. When a News story mentions the vessel, an alert notification is raised. The plan's stopword filter prevents false positives on generic terms.
Regional Situational Awareness: An operations centre tracks named locations and organisations in an active investigation. The places and topics on the plan direct the Worker to News interest identifiers that cover the region. Stories outside the region do not reach the analyst.
Scheduled Audit: A compliance review asks what was pulled from News during a given week. Because every Worker run is plan-driven and every ingested story is tagged with the originating investigation and tenant, the audit is answered by a SQL query over the existing news tables rather than a log trawl.
Rate-Limited Backfill: During onboarding, a tenant wants to backfill a quarter of News coverage for their investigations. The NEWS_MAX_STORIES_PER_RUN and NEWS_QUERY_LIMIT environment variables bound the backfill so it does not exceed platform budgets or News rate limits.
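One way the two limits in the backfill case could bound a run is sketched below. The semantics are assumed rather than documented here: `NEWS_QUERY_LIMIT` is treated as the maximum results requested per query and `NEWS_MAX_STORIES_PER_RUN` as the total cap for one run. The default values and the `read_limit` and `collect_bounded` helpers are hypothetical.

```python
import os
from typing import Callable, Iterable, Mapping


def read_limit(env: Mapping[str, str], name: str, default: int) -> int:
    """Parse a positive integer limit; fall back to the default if unset or invalid."""
    try:
        value = int(env.get(name, ""))
    except ValueError:
        return default
    return value if value > 0 else default


def collect_bounded(
    queries: Iterable[str],
    search: Callable[[str, int], list],
    env: Mapping[str, str] = os.environ,
) -> list:
    """Run queries in order until the per-run story budget is spent."""
    per_query = read_limit(env, "NEWS_QUERY_LIMIT", 25)        # assumed default
    per_run = read_limit(env, "NEWS_MAX_STORIES_PER_RUN", 200)  # assumed default
    collected: list = []
    for query in queries:
        remaining = per_run - len(collected)
        if remaining <= 0:
            break  # budget exhausted; remaining queries wait for a later run
        limit = min(per_query, remaining)
        # Trim defensively in case the search returns more than requested.
        collected.extend(search(query, limit)[:limit])
    return collected
```

Under these assumptions a quarter-long backfill is spread across many scheduled runs instead of being pulled in one burst.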

Integration

News Intelligence Platform: News stories flow into the same NewsStory, NewsArticle, NewsEntityLink, and NewsCorrelation tables as RSS-sourced stories. Existing views, dashboards, and investigation panels pick them up without change.
Investigation Workspace: Trending scores and alert matches appear in the investigation workspace's news panel alongside other intelligence.
Alert Notification: Matched alerts are raised through the platform's existing notification channels (email, SMS, realtime websocket, push).
External Contract Changelog: The news ingestion surfaces are recorded in the external contract CHANGELOG so partner tenants and developer toolkits can track the additions.

Open Standards

News Web API: the Worker calls News' documented web API. The platform does not reverse-engineer proprietary News protocols.
Atom and RSS 2.0: sibling RSS feeds continue to flow through the same NewsSource/NewsArticle pipeline. Stories ingested through the Worker coexist with RSS-sourced stories in the same schema.
IPTC NewsCodes (https://iptc.org/standards/newscodes/): stories are classified against the IPTC NewsCodes taxonomy where topic codes are present, reusing the existing IPTC classification pathway.
Schema.org NewsArticle: the normalised story representation aligns with the Schema.org NewsArticle vocabulary so downstream export is straightforward.
W3C PROV-DM: every ingested story carries a provenance record (News Worker as agent, ingestion run as activity, investigation plan as input entity), enabling end-to-end lineage from plan to alert.
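The provenance record described in the W3C PROV-DM item could be shaped roughly as follows. This sketch loosely follows the PROV-JSON layout with simplified plain-string values; the `ex:` prefix, the identifier scheme, the type names, and the `build_provenance` helper are hypothetical, and the platform's stored format may differ.

```python
def build_provenance(story_id: str, run_id: str, plan_id: str, started_at: str) -> dict:
    """Build a PROV-style record linking a story to its run, plan, and agent."""
    story = f"ex:story/{story_id}"
    run = f"ex:run/{run_id}"
    plan = f"ex:plan/{plan_id}"
    worker = "ex:agent/news-worker"
    return {
        "prefix": {"ex": "https://example.invalid/prov/"},  # placeholder namespace
        # Entities: the ingested story and the investigation plan it came from.
        "entity": {
            story: {"prov:type": "ex:NewsStory"},
            plan: {"prov:type": "ex:InvestigationPlan"},
        },
        # Activity: the ingestion run.
        "activity": {run: {"prov:startTime": started_at}},
        # Agent: the News Worker.
        "agent": {worker: {"prov:type": "prov:SoftwareAgent"}},
        # Relations: the run used the plan, generated the story, and was run by the Worker.
        "used": {"_:u1": {"prov:activity": run, "prov:entity": plan}},
        "wasGeneratedBy": {"_:g1": {"prov:entity": story, "prov:activity": run}},
        "wasAssociatedWith": {"_:a1": {"prov:activity": run, "prov:agent": worker}},
    }
```

Lineage from an alert back to a plan can then be traced by following `wasGeneratedBy` from the story to the run and `used` from the run to the plan.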