AI Document Analysis

Overview

When an analyst receives a 200-page financial disclosure during an active investigation, spending an hour reading it cover-to-cover is a luxury they rarely have. The AI Document Analysis module changes that equation. Upload the document, and within seconds receive a concise summary, a ranked list of key findings, and a structured list of every entity the document contains, ready to cross-reference against the investigation graph.

The module handles the complete pipeline from file upload through conversion, AI analysis, and structured response delivery, with full usage tracking for cost management.

Key Features

File Upload and Conversion: Accepts document uploads via multipart form data and converts them to clean markdown using a document processing service. The conversion preserves document structure including headings, lists, tables, and emphasis while stripping formatting that would consume unnecessary tokens during analysis.
AI-Powered Analysis: The converted markdown is analysed by a language model to produce three structured outputs: a concise summary of the document's content, a list of key insights and findings, and a list of extracted entities covering people, organisations, locations, dates, and other significant identifiers found in the document.
Token Usage Tracking: Every analysis request tracks token consumption across both the document conversion and analysis stages. The response includes raw token counts, billable units (with the standard 1.5x multiplier), and request latency, enabling organisations to monitor document analysis costs alongside other AI usage.
Structured Response Format: Analysis results are returned in a consistent JSON structure with summary, insights array, and entities array fields, making it straightforward for consuming applications to display results without parsing unstructured text.
Multi-Application Availability: The document analysis endpoint is available across both the main web application and the investigations application, enabling document analysis from any context where an analyst encounters a document needing rapid assessment.
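
As an illustration of the response shape and the billing arithmetic described above, the following Python sketch assembles a result with the summary, insights, and entities fields and derives billable units from raw token counts. The class names, the layout of the usage object, and the round-up rule for fractional units are assumptions made for the example; only the three top-level field names, the 1.5x multiplier, and the {text, label, start, end} entity shape come from this page.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Entity:
    # Field names follow the {text, label, start, end} shape listed under Open Standards.
    text: str
    label: str
    start: int
    end: int


@dataclass
class Usage:
    # Hypothetical usage record: tokens per pipeline stage plus request latency.
    conversion_tokens: int
    analysis_tokens: int
    latency_ms: int

    @property
    def raw_tokens(self) -> int:
        return self.conversion_tokens + self.analysis_tokens

    @property
    def billable_units(self) -> int:
        # 1.5x multiplier in integer arithmetic; rounding up is an assumption.
        return (self.raw_tokens * 3 + 1) // 2


def build_response(summary: str, insights: list[str],
                   entities: list[Entity], usage: Usage) -> str:
    """Serialise an analysis result as JSON with summary, insights and entities."""
    payload = {
        "summary": summary,
        "insights": insights,
        "entities": [asdict(entity) for entity in entities],
        # The "usage" key and its inner names are illustrative, not documented.
        "usage": {
            "raw_tokens": usage.raw_tokens,
            "billable_units": usage.billable_units,
            "latency_ms": usage.latency_ms,
        },
    }
    return json.dumps(payload)
```

Under these assumptions, a request that consumed 400 conversion tokens and 801 analysis tokens reports 1,201 raw tokens and 1,802 billable units.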

Use Cases

Evidence Triage: Upload documents received during investigations for rapid AI-powered summarisation and entity extraction, reducing the time needed to assess document relevance and identify key information.
Intelligence Report Processing: Analyse incoming intelligence reports to extract mentioned entities, identify key findings, and generate summaries for inclusion in operational briefings. Particularly valuable for law enforcement agencies and defence organisations processing high volumes of source material.
Document-Heavy Investigations: Process large volumes of documents such as financial records, communications, and regulatory filings to identify patterns and extract entities that can be cross-referenced against the knowledge graph. Financial crime units running complex money-laundering investigations benefit directly from this pipeline.
Quick Assessment: Analysts upload an unfamiliar document and receive a summary and entity list within seconds, enabling rapid triage decisions about whether the document warrants detailed manual review.

Integration

The AI Document Analysis module connects to the Cloudflare Workers AI infrastructure for document conversion and language model inference, the token usage management system for cost tracking, and the investigation and evidence management workflows for contextual document analysis. Analysis results feed extracted entities into the entity resolution and knowledge graph systems.
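
The wiring between these systems can be pictured with a small orchestration sketch. Everything named here is hypothetical: the four injected callables stand in for the conversion service, the language model, the token usage system, and the entity resolution feed, and the usage dictionary keys are chosen only for illustration. The sketch shows the order of operations, not the module's actual code.

```python
import time
from typing import Callable


def analyse_document(
    raw: bytes,
    *,
    convert: Callable[[bytes], tuple[str, int]],
    analyse: Callable[[str], tuple[dict, int]],
    record_usage: Callable[[dict], None],
    publish_entities: Callable[[list], None],
) -> dict:
    """Run conversion, analysis, usage tracking and entity hand-off in order."""
    started = time.perf_counter()

    # Stage 1: convert the uploaded bytes to markdown (returns text and tokens used).
    markdown, conversion_tokens = convert(raw)

    # Stage 2: ask the model for summary, insights and entities.
    analysis, analysis_tokens = analyse(markdown)

    latency_ms = round((time.perf_counter() - started) * 1000)
    usage = {
        "conversion_tokens": conversion_tokens,
        "analysis_tokens": analysis_tokens,
        "latency_ms": latency_ms,
    }

    # Stage 3: report consumption to the usage tracker.
    record_usage(usage)

    # Stage 4: hand extracted entities to entity resolution / the knowledge graph.
    publish_entities(analysis["entities"])

    return {**analysis, "usage": usage}
```

Passing the collaborators in as arguments keeps the sketch independent of any particular provider and makes each stage easy to stub in tests.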

Open Standards

RFC 7578 (Multipart Form Data): Document uploads are ingested via multipart/form-data HTTP requests, following the RFC 7578 encoding for file attachments submitted to the analysis endpoint.
FIPS 180-4 / SHA-256: A SHA-256 digest of the raw file bytes is computed at the start of every pipeline run, providing a cryptographic fingerprint for deduplication and chain-of-custody audit records.
IANA Media Types (RFC 2046): MIME type identifiers such as application/pdf, application/vnd.openxmlformats-officedocument.wordprocessingml.document, and image/tiff govern which text-extraction path (Apache Tika or Tesseract OCR) is invoked and which metadata-stripping routine is applied at intake.
ECMA-376 / ISO/IEC 29500 (Office Open XML): DOCX and XLSX files are parsed by traversing their OOXML ZIP archive structure directly, extracting text from word/document.xml and xl/sharedStrings.xml without requiring a proprietary Office runtime.
ISO 639-1: The language-detection stage returns a two-letter ISO 639-1 language code (e.g. en, fr, de) alongside every analysis result, enabling downstream consumers to apply locale-aware processing.
JSON (RFC 8259): All analysis results (the summary, the key-insights array, and the entities array) are returned as well-formed JSON, and entity records are persisted in a JSONB column adhering to a documented schema of {text, label, start, end} objects.
ISO 19005 (PDF/A): Documents produced or exported through the evidence and case workflows that feed the analysis pipeline may be archived at any PDF/A conformance level from PDF/A-1b through PDF/A-4f, the ISO series of standards for long-term preservation of electronic documents.
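
Two of the standards above lend themselves to a short illustration. The Python sketch below computes a SHA-256 fingerprint of raw file bytes and extracts paragraph text from word/document.xml inside a DOCX archive using only the standard library. The function names are hypothetical and the sketch is not the module's own implementation; it also ignores details such as tabs, line breaks, and nested text boxes that a production extractor would need to handle.

```python
import hashlib
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace defined by ECMA-376 / ISO/IEC 29500.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def sha256_fingerprint(raw: bytes) -> str:
    """Return the hex SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(raw).hexdigest()


def extract_docx_text(raw: bytes) -> str:
    """Read word/document.xml from the OOXML ZIP and join text per paragraph."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(f"{W_NS}p"):
        # Each w:t element holds one run of text; concatenate runs in order.
        runs = [node.text or "" for node in paragraph.iter(f"{W_NS}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Because the digest is taken over the unmodified upload, two byte-identical files always produce the same fingerprint, which is what makes it usable for deduplication.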

Last Reviewed: 2026-04-02
Last Updated: 2026-04-14