## Overview

The AI Inference module provides edge-deployed AI capabilities, enabling low-latency AI chat, text generation, document analysis, and entity suggestion directly from the platform's edge network. By running AI inference at the edge rather than routing through the centralised middleware GraphQL layer, the module reduces round-trip latency and enables direct streaming of responses to the client.
The platform supports multiple model tiers including fast models for interactive chat and reasoning models for complex analysis, with automatic usage tracking and billing telemetry on every request.
## Key Features
- Edge-Deployed AI Chat - Interactive chat sessions run at the edge with support for both streaming and non-streaming response modes. Streaming mode delivers tokens to the client as they are generated; non-streaming mode returns a complete response with full usage metadata. Each chat request returns the provider, model, token usage, and latency.
- Multi-Model Selection - Choose between available models based on the task: fast models (Llama 3.2 3B, Llama 3.1 8B Fast) for low-latency interactive responses, and reasoning models (Qwen3 30B, Llama 3.3 70B FP8) for complex analytical tasks. Model selection is exposed as a provider dropdown in the AI Studio interface.
- Document Analysis - Upload documents for AI-powered analysis through a dedicated endpoint. Documents are converted to markdown using Cloudflare's document-to-markdown service, then analysed to produce a summary, key insights, and extracted entities. Usage tracking captures both the document conversion and analysis token costs.
- Text Generation - A generic text generation endpoint supports configurable model selection and returns streaming responses with usage headers (raw input tokens, raw output tokens, billable total units) for real-time cost visibility.
- Entity Quick Suggestions - Context-aware entity suggestions generated at the edge for rapid inline assistance during investigation and analysis workflows.
- Per-Request Usage Telemetry - Every AI request returns structured usage data including raw token counts (input and output), billable token units (raw tokens with a 1.5x multiplier), provider identifier, model name, model tier, and request latency in milliseconds. When the provider API does not return exact token counts, the system estimates usage from response character length.
- Cache-Safe Headers - All AI response endpoints set `Cache-Control: no-store` to prevent sensitive AI-generated content from being cached by intermediaries or service workers.
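The streaming chat mode described above delivers tokens as they are generated. As a rough illustration of what a caller does with such a response, the sketch below concatenates a streamed UTF-8 body back into the full text; the endpoint path and wire framing are not documented here, so this only shows the generic stream-consumption pattern, not the module's actual protocol.

```typescript
// Concatenate a streamed UTF-8 body into the complete response text.
// This mirrors how a client would consume the streaming chat mode.
async function collectStream(body: ReadableStream<Uint8Array>): Promise<string> {
  const decoder = new TextDecoder();
  const reader = body.getReader();
  let text = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // `stream: true` handles multi-byte characters split across chunks.
    text += decoder.decode(value, { stream: true });
  }
  return text + decoder.decode();
}

// Helper to simulate a streamed body from pre-split token chunks,
// standing in for a real fetch() response body.
function chunkedStream(chunks: string[]): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    start(controller) {
      for (const c of chunks) controller.enqueue(encoder.encode(c));
      controller.close();
    },
  });
}
```

In non-streaming mode the same body arrives as a single chunk alongside full usage metadata, so the collection step collapses to one read.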
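The tier split between fast and reasoning models can be pictured as a small catalogue keyed by tier. The tier names and model families come from the feature list above, but the string identifiers below are illustrative placeholders, not confirmed platform model IDs.

```typescript
// Hypothetical model catalogue: tiers are documented, identifiers are
// illustrative placeholders rather than real API model IDs.
type ModelTier = "fast" | "reasoning";

const MODEL_CATALOG: Record<ModelTier, string[]> = {
  // Fast models for low-latency interactive chat.
  fast: ["llama-3.2-3b", "llama-3.1-8b-fast"],
  // Reasoning models for complex analytical tasks.
  reasoning: ["qwen3-30b", "llama-3.3-70b-fp8"],
};

// Pick the first model in the requested tier, falling back to the
// fast tier when the tier is unknown.
function selectModel(tier: ModelTier): string {
  const models = MODEL_CATALOG[tier] ?? MODEL_CATALOG.fast;
  return models[0];
}
```

In the AI Studio interface this choice surfaces as the provider dropdown rather than an explicit tier parameter.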
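The usage-telemetry and cache-safety behaviour above can be sketched together. The 1.5x billable multiplier and the `Cache-Control: no-store` header are stated in the docs; the ~4 characters-per-token fallback ratio and the `X-Usage-*` header names are assumptions introduced here for illustration.

```typescript
// Fallback when the provider API omits exact token counts: estimate
// from response length. The ~4 chars/token ratio is an assumption,
// not a documented platform constant.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Billable units apply a 1.5x multiplier to raw tokens, rounded up.
function billableUnits(inputTokens: number, outputTokens: number): number {
  return Math.ceil((inputTokens + outputTokens) * 1.5);
}

// Response headers for a generation endpoint. Cache-Control: no-store
// is documented; the X-Usage-* header names are hypothetical.
function usageHeaders(inputTokens: number, outputTokens: number): Record<string, string> {
  return {
    "Cache-Control": "no-store", // keep AI output out of intermediary caches
    "X-Usage-Input-Tokens": String(inputTokens),
    "X-Usage-Output-Tokens": String(outputTokens),
    "X-Usage-Billable-Units": String(billableUnits(inputTokens, outputTokens)),
  };
}
```

Attaching billable units to every response is what lets downstream quota enforcement and per-feature cost reporting work without a separate metering call.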
## Use Cases
- Interactive Investigation Assistance - Analysts chat with AI during investigations to generate hypotheses, summarise evidence, and identify connections, with responses streaming in real time from the nearest Cloudflare edge location.
- Document Triage - Upload documents received during investigations for rapid AI-powered summarisation and entity extraction, reducing manual review time for large document sets.
- AI Studio Experimentation - Use the AI Studio interface to experiment with different models and prompts, comparing fast and reasoning model outputs for quality and cost before integrating AI capabilities into operational workflows.
- Cost-Aware AI Usage - Per-request billable unit tracking enables organisations to monitor AI costs at the feature, user, and investigation level, supporting informed decisions about model selection and usage governance.
## Integration
The AI module connects to the token usage management system for cost tracking and quota enforcement, the AI Studio interface for interactive model experimentation, and the investigation and analysis workflows for contextual AI assistance. All AI endpoints are accessible through the platform's REST API layer.
Last Reviewed: 2026-04-02