AI Inference

Overview

During an active investigation, every millisecond of round-trip latency between an analyst's question and the AI's answer matters, and routing requests through a distant centralised API adds avoidable delay. The AI Inference module runs interactive chat, text generation, document analysis, and entity suggestions from the nearest Cloudflare edge location rather than through the centralised, typed middleware integration layer.

The platform supports multiple model tiers: fast models for interactive chat and reasoning models for complex analysis. Every request carries automatic usage tracking and billing telemetry, so organisations keep complete visibility into their AI consumption.

Key Features

Edge-Deployed AI Chat: Interactive chat sessions run at the edge with support for both streaming and non-streaming response modes. Streaming mode delivers tokens to the client as they are generated; non-streaming mode returns a complete response with full usage metadata. Each chat request returns the provider, model, token usage, and latency.
Multi-Model Selection: Choose between available models based on the task: fast models (compact fast model, fast language model) for low-latency interactive responses, and reasoning models (reasoning-capable language model, high-capacity reasoning model) for complex analytical tasks. Model selection is exposed as a provider dropdown in the AI Studio interface.
Document Analysis: Upload documents for AI-powered analysis through a dedicated endpoint. Documents are converted to markdown using Cloudflare's document-to-markdown service, then analysed to produce a summary, key insights, and extracted entities. Usage tracking captures both the document conversion and analysis token costs.
Text Generation: A generic text generation endpoint supports configurable model selection and returns streaming responses with usage headers (raw input tokens, raw output tokens, billable total units) for real-time cost visibility.
Entity Quick Suggestions: Context-aware entity suggestions generated at the edge for rapid inline assistance during investigation and analysis workflows.
Per-Request Usage Telemetry: Every AI request returns structured usage data including raw token counts (input and output), billable token units (with 1.5x multiplier), provider identifier, model name, model tier, and request latency in milliseconds. When the provider API does not return exact token counts, the system estimates usage from response character length.
Cache-Safe Headers: All AI response endpoints set Cache-Control: private, no-store to prevent sensitive AI-generated content from being cached by intermediaries or service workers.
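
The usage fields described under Per-Request Usage Telemetry can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: the names `UsageRecord`, `estimate_tokens`, and `build_usage` are hypothetical, the four-characters-per-token ratio is an assumed heuristic, estimating input tokens from the prompt is an assumption (the description only mentions response length), and applying the 1.5x multiplier to the sum of input and output tokens is also an assumption, since the description states the multiplier but not exactly what it is applied to.

```python
import math
from dataclasses import dataclass
from typing import Optional

BILLABLE_MULTIPLIER = 1.5  # multiplier stated in the feature description
CHARS_PER_TOKEN = 4        # assumption: rough heuristic, not from the source


@dataclass
class UsageRecord:
    provider: str
    model: str
    model_tier: str
    input_tokens: int
    output_tokens: int
    billable_units: int
    latency_ms: int
    estimated: bool  # True when a count was derived from character length


def estimate_tokens(text: str) -> int:
    """Estimate a token count from character length (assumed heuristic)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def build_usage(provider: str, model: str, model_tier: str,
                prompt: str, response: str, latency_ms: int,
                input_tokens: Optional[int] = None,
                output_tokens: Optional[int] = None) -> UsageRecord:
    """Build a usage record, falling back to estimates when counts are missing."""
    estimated = input_tokens is None or output_tokens is None
    if input_tokens is None:
        input_tokens = estimate_tokens(prompt)
    if output_tokens is None:
        output_tokens = estimate_tokens(response)
    # Assumption: the multiplier applies to the raw input + output total.
    billable = math.ceil((input_tokens + output_tokens) * BILLABLE_MULTIPLIER)
    return UsageRecord(provider, model, model_tier, input_tokens,
                       output_tokens, billable, latency_ms, estimated)
```

Keeping an `estimated` flag on the record is one way to let downstream billing reports distinguish exact provider counts from character-length fallbacks.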

Use Cases

Interactive Investigation Assistance: Analysts chat with AI during investigations to generate hypotheses, summarise evidence, and identify connections, with responses streaming in real time from the nearest Cloudflare edge location. Law enforcement and intelligence agency analysts benefit from sub-second first-token latency on routine queries.
Document Triage: Upload documents received during investigations for rapid AI-powered summarisation and entity extraction, reducing manual review time for large document sets. This is particularly valuable in financial crime and fraud investigations, where document volumes are high.
AI Studio Experimentation: Use the AI Studio interface to experiment with different models and prompts, comparing fast and reasoning model outputs for quality and cost before integrating AI capabilities into operational workflows.
Cost-Aware AI Usage: Per-request billable unit tracking enables organisations to monitor AI costs at the feature, user, and investigation level, supporting informed decisions about model selection and usage governance.
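
As a rough illustration of the cost-aware use case, per-request usage records can be rolled up along any dimension. The record keys used here (`feature`, `user_id`, `investigation_id`, `billable_units`) and the function name `total_billable_units` are hypothetical; the platform's actual telemetry schema is not specified in this document.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def total_billable_units(records: Iterable[Mapping[str, object]],
                         dimension: str) -> Dict[str, int]:
    """Sum billable units per distinct value of the chosen dimension."""
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[str(record[dimension])] += int(record["billable_units"])
    return dict(totals)


# Hypothetical usage records, one per AI request.
records = [
    {"feature": "chat", "user_id": "u1", "investigation_id": "inv-1", "billable_units": 225},
    {"feature": "chat", "user_id": "u2", "investigation_id": "inv-1", "billable_units": 90},
    {"feature": "document-analysis", "user_id": "u1", "investigation_id": "inv-2", "billable_units": 1200},
]

by_feature = total_billable_units(records, "feature")
by_user = total_billable_units(records, "user_id")
by_investigation = total_billable_units(records, "investigation_id")
```

The same function serves all three views because the dimension is passed as a key rather than hard-coded.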

Integration

The AI module connects to the token usage management system for cost tracking and quota enforcement, the AI Studio interface for interactive model experimentation, and the investigation and analysis workflows for contextual AI assistance. All AI endpoints are accessible through the platform's published service interface layer.

Open Standards

OpenAI Chat Completions API (messages format): All inference providers, including OpenAI GPT models, Anthropic Claude, xAI Grok, and Cloudflare Workers AI, are called through the same POST /chat/completions-style interface using the [{"role": "system"|"user"|"assistant", "content": "..."}] messages schema, enabling provider-agnostic orchestration.
JSON Web Token (JWT / RFC 7519): The edge AI proxy worker enforces RS256-signed JWT bearer token authentication, verifying audience and scope claims (middleware:ai:proxy) against a JWKS endpoint before executing any inference request.
JSON (RFC 8259): All AI inference request bodies, usage telemetry payloads, and structured entity-extraction responses are encoded as JSON over HTTPS.
Cache-Control: no-store (HTTP Caching, RFC 9111): All AI response endpoints explicitly set Cache-Control: private, no-store to prevent sensitive AI-generated content from being retained by intermediary caches or service workers.
ISO 639-1 Language Codes: The speech-to-text (Whisper) and multi-language inference paths accept and propagate ISO 639-1 two-letter language codes (for example "en", "es") to control transcription and response language.
Multipart Form Data (RFC 7578): Audio payloads submitted to the Whisper speech-to-text model are transmitted as multipart/form-data HTTP requests, matching the Cloudflare Workers AI Whisper endpoint's required encoding.
OAuth 2.0 and JWT Bearer Token: Token-based authentication protects typed, auditable read and write workflows across the platform.
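
The messages schema and bearer-token authentication listed above might be combined on the client side roughly as follows. This is a sketch, not the platform's client: the function names `build_chat_request` and `request_headers`, the model identifier, and the token value are placeholders, and the endpoint URL is deliberately omitted because this document does not specify it.

```python
import json
from typing import Dict, List

ALLOWED_ROLES = {"system", "user", "assistant"}


def build_chat_request(model: str, messages: List[Dict[str, str]],
                       stream: bool = False) -> str:
    """Validate messages against the role/content schema and encode as JSON."""
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("message content must be a string")
    return json.dumps({"model": model, "messages": messages, "stream": stream})


def request_headers(token: str) -> Dict[str, str]:
    """Headers for a JSON request authenticated with a JWT bearer token."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


body = build_chat_request(
    "example-fast-model",  # placeholder model name
    [
        {"role": "system", "content": "You are an investigation assistant."},
        {"role": "user", "content": "Summarise the attached evidence."},
    ],
    stream=True,
)
headers = request_headers("example-jwt")  # placeholder token
```

Because every provider is addressed through the same messages format, switching providers in this sketch would only change the `model` value, not the shape of the request body.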

Last Reviewed: 2026-04-02
Last Updated: 2026-04-14