Document Intelligence

Intelligence investigations accumulate documents rapidly: scanned forms, photographed receipts, exported bank statements, email chains, contracts, and court filings arrive in dozens of formats from dozens of sources.


Overview

Intelligence investigations accumulate documents rapidly: scanned forms, photographed receipts, exported bank statements, email chains, contracts, and court filings arrive in dozens of formats from dozens of sources. Extracting structured intelligence from that document corpus (identifying the entities mentioned, determining what each document is, and producing a concise summary an analyst can act on) has traditionally required either manual review or purpose-built machine learning models trained on labeled document datasets. Both approaches carry significant costs in analyst time or infrastructure investment.

Argus implements document intelligence using open-source text extraction tools combined with zero-shot LLM inference. Apache Tika (Apache License 2.0) extracts text from over a thousand file formats. Tesseract OCR (Apache License 2.0) handles scanned images and image-heavy PDFs. spaCy (MIT License) applies pre-trained Named Entity Recognition models to identify persons, organisations, locations, and other entities in the extracted text. An AI language model API performs zero-shot classification and summarization: the LLM classifies and summarises the document without any Argus-owned fine-tuning, custom model training, or labeled datasets. Every processing step is recorded in an audit-ready provenance trail stored alongside the result.
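
The pipeline above can be sketched as a small orchestration function; the function and step names below are illustrative stand-ins for the Tika, Tesseract, spaCy, and LLM calls, not Argus's actual API:

```python
import hashlib

def analyse_document(file_bytes, extract, ocr, ner, classify, summarise):
    """Sketch of the pipeline described above: extract, ocr, ner, classify
    and summarise are injected callables standing in for the real services."""
    steps = []
    text = extract(file_bytes)
    steps.append("tika_extract")
    if not text.strip():  # no extractable text: treat as scanned, fall back to OCR
        text = ocr(file_bytes)
        steps.append("tesseract_ocr")
    return {
        "sha256": hashlib.sha256(file_bytes).hexdigest(),  # chain-of-custody hash
        "entities": ner(text),
        "classification": classify(text),
        "summary": summarise(text),
        "processing_steps": steps,  # ordered provenance trail
    }
```

Injecting the stages as callables keeps the control flow testable with stubs before wiring in the real services.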

Diagram

graph LR
    A[Uploaded Document\nPDF / DOCX / Image\n1000+ formats] --> B[Apache Tika\nText Extraction\nApache License 2.0]
    A --> C[Tesseract OCR\nImage / Scanned PDF\nApache License 2.0]
    B & C --> D[spaCy NER\nPre-trained Models\nMIT License]
    D --> E[AI Language Model API\nZero-Shot\nClassification]
    E --> F[AI Language Model API\nZero-Shot\nSummarization]
    F --> G[Intelligence Record\nClassification + Entities\n+ Summary + Provenance]
    G --> H[PostgreSQL\nRLS-Protected\ndocument_analysis_records]

Last Reviewed: 2026-04-14 Last Updated: 2026-04-14

Key Features

  • Apache Tika Text Extraction (Apache License 2.0): Apache Tika is an open-source content analysis toolkit maintained by the Apache Software Foundation that detects and extracts text and metadata from over a thousand file formats including PDF, DOCX, XLSX, PPTX, HTML, RTF, ODS, and many others. Tika is called via its REST server interface, keeping the extraction pipeline language-agnostic and independently scalable.
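
Because Tika exposes a REST interface, the call is a plain HTTP PUT; a minimal standard-library sketch, assuming a tika-server instance on its default port 9998:

```python
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # tika-server's default port and endpoint

def build_tika_request(file_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    # PUT the raw bytes; Accept: text/plain asks Tika for extracted plain text
    return urllib.request.Request(
        url, data=file_bytes, method="PUT", headers={"Accept": "text/plain"}
    )

def extract_text(file_bytes: bytes, url: str = TIKA_URL) -> str:
    with urllib.request.urlopen(build_tika_request(file_bytes, url)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```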

  • Tesseract OCR (Apache License 2.0): Scanned image files and image-heavy PDFs that yield no text from Tika are processed by Tesseract, the open-source OCR engine originally developed by HP and now maintained as an Apache-licensed project at Google. Tesseract supports over 100 languages and handles JPEG, PNG, TIFF, and BMP inputs. The pipeline automatically detects when a PDF contains no extractable text and falls back to OCR.
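
The fallback decision and the OCR call can be sketched as follows; the 20-character threshold is an illustrative assumption, and ocr_image shells out to Tesseract's standard `tesseract <image> stdout` invocation:

```python
import subprocess

def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat a near-empty Tika result as a scanned document that needs OCR."""
    return len(extracted_text.strip()) < min_chars

def ocr_image(path: str, lang: str = "eng") -> str:
    """Run the tesseract CLI on an image file and return the recognised text."""
    result = subprocess.run(
        ["tesseract", path, "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```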

  • spaCy Pre-Trained NER (MIT License): After text extraction, spaCy applies pre-trained Named Entity Recognition models to identify persons, organisations, locations, dates, monetary values, and other entity types. Argus uses spaCy's published pre-trained models (en_core_web_trf, xx_ent_wiki_sm) without any custom model training. No labeled training data is created or maintained; the NER capability comes entirely from spaCy's published model weights.
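
spaCy exposes the recognised entities as doc.ents, each with .text and .label_ attributes; a small sketch that groups them by label (the helper names are ours, not Argus's):

```python
from collections import defaultdict

def group_entities(ents):
    """Group (text, label) pairs, such as spaCy's (ent.text, ent.label_), by label."""
    grouped = defaultdict(set)
    for text, label in ents:
        grouped[label].add(text)
    return {label: sorted(values) for label, values in grouped.items()}

def ner(text: str, model: str = "en_core_web_trf") -> dict:
    """Run one of the pre-trained pipelines named above; the model must be
    downloaded first (python -m spacy download en_core_web_trf)."""
    import spacy  # deferred so group_entities works without spaCy installed
    doc = spacy.load(model)(text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)
```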

  • AI Language Model API Zero-Shot Classification: Document classification is performed by the AI language model API using a zero-shot inference prompt. The LLM classifies the document as CONTRACT, INVOICE, REPORT, CORRESPONDENCE, LEGAL, FINANCIAL, or other labels based solely on the extracted text and the analyst-provided context string. No Argus fine-tuning, no custom training data, and no model retraining is involved. The classification prompt is constructed at runtime for each document.
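
A runtime-constructed classification prompt of the kind described might look like this; the label list comes from the bullet above, while the exact wording is an assumption:

```python
LABELS = ["CONTRACT", "INVOICE", "REPORT", "CORRESPONDENCE", "LEGAL", "FINANCIAL", "OTHER"]

def build_classification_prompt(text: str, context: str = "",
                                labels=LABELS, max_chars: int = 6000) -> str:
    """Assemble a zero-shot classification prompt at runtime for one document."""
    prompt = ("Classify the following document into exactly one of these labels: "
              + ", ".join(labels) + ".\n")
    if context:
        prompt += f"Investigation context: {context}\n"
    # truncate long documents so the prompt stays within the model's context window
    prompt += "Respond with the label only.\n\nDocument text:\n" + text[:max_chars]
    return prompt
```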

  • AI Language Model API Zero-Shot Summarization: The AI language model API produces an abstractive summary of the document's key content using zero-shot inference. The analyst can provide a context string (for example, "financial investigation: suspected fraud") that guides the LLM's focus without requiring any model adaptation.
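
The summarization prompt follows the same pattern, with the analyst's context string folded into the instruction (the wording is again illustrative):

```python
def build_summary_prompt(text: str, context: str = "", max_chars: int = 6000) -> str:
    """Assemble a zero-shot summarization prompt, optionally focused by an
    analyst-provided context string."""
    instruction = "Summarise the key content of the following document in 3-5 sentences."
    if context:
        instruction += f" Focus on aspects relevant to: {context}."
    return instruction + "\n\nDocument text:\n" + text[:max_chars]
```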

  • Document Provenance and Audit Trail: Every document analysis result records the SHA-256 hash of the original file bytes (for chain-of-custody and deduplication), the ordered list of processing steps applied, and the analyst who initiated the analysis. Results are stored in the document_analysis_records table with PostgreSQL RLS enforcement (XACML OrgIsolationPolicy), ensuring cross-tenant access to document intelligence records is architecturally impossible.
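
The provenance fields reduce to a small record; the key names here are illustrative rather than the actual document_analysis_records schema:

```python
import hashlib
from datetime import datetime, timezone

def provenance_record(file_bytes: bytes, steps: list, analyst_id: str) -> dict:
    """Build the chain-of-custody fields described above for one analysis."""
    return {
        "document_sha256": hashlib.sha256(file_bytes).hexdigest(),  # identity + dedup
        "processing_steps": list(steps),  # ordered list of applied steps
        "analyst_id": analyst_id,         # who initiated the analysis
        "analysed_at": datetime.now(timezone.utc).isoformat(),
    }
```

Hashing the original bytes (not the extracted text) means re-uploading the same file always yields the same identity, which is what makes deduplication and custody checks possible.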

  • Language Detection: The langdetect library (Apache License 2.0) identifies the document's primary language from the extracted text, enabling downstream routing to language-appropriate NER models and LLM prompts.
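
With langdetect, the routing step reduces to a language-code lookup; `detect` is langdetect's actual entry point, while the model-routing rule below is our assumption based on the two models named earlier:

```python
def pick_ner_model(lang_code: str) -> str:
    """Route English text to the English transformer pipeline and everything
    else to spaCy's multilingual model (routing rule is an assumption)."""
    return "en_core_web_trf" if lang_code == "en" else "xx_ent_wiki_sm"

def detect_language(text: str) -> str:
    """Detect the document's primary language (requires langdetect installed)."""
    from langdetect import detect  # deferred so pick_ner_model works standalone
    return detect(text)
```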

Use Cases

  • Financial Investigation Document Review: Scanned bank statements, invoices, and transfer receipts uploaded to an investigation are automatically classified, entity-extracted, and summarised, allowing analysts to quickly identify the key parties and amounts mentioned without reading every document.
  • Legal Document Triage: Contracts, court filings, and corporate records are classified and summarised on ingestion, with persons and organisations extracted as entities that can be linked to existing intelligence profiles.
  • Multi-Format Evidence Processing: Investigations routinely receive evidence in inconsistent formats from different sources. Tika's 1000+ format support ensures that unusual file types (legacy word processor formats, embedded attachments, compressed archives) are processed consistently.
  • Scanned Document OCR: Paper documents photographed in the field or scanned from seized materials are processed by Tesseract OCR, producing searchable text that spaCy and the AI language model can then analyse.
  • Chain-of-Custody Documentation: The SHA-256 hash, processing steps, and analyst identity recorded for every document analysis provide an auditable chain of custody for evidence admissibility purposes.

Integration

  • Evidence Management: Uploaded evidence files are automatically routed to the document intelligence pipeline on ingestion.
  • Intelligence Profiles: Entities extracted by spaCy are cross-referenced against existing entity profiles in the investigation workspace.
  • Investigation Workspace: Classification and summary results are attached to the investigation record and displayed in the document review panel.
  • Audit Service: Document analysis events are logged to the audit trail with userId, organizationId, action, documentHash, and timestamp (EDF/PESCO compliance).
  • XACML Access Control: The document_analysis_records table is protected by PostgreSQL RLS using the OrgIsolationPolicy (XACML argus:policy:org-isolation:v1).
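
An audit event carrying the five fields listed above might be assembled like this; the exact event shape is an assumption:

```python
from datetime import datetime, timezone

def audit_event(user_id: str, organization_id: str, document_hash: str,
                action: str = "DOCUMENT_ANALYSIS") -> dict:
    """Build an audit-trail event with the fields named above: userId,
    organizationId, action, documentHash, timestamp."""
    return {
        "userId": user_id,
        "organizationId": organization_id,
        "action": action,
        "documentHash": document_hash,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```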

Open Standards

| Standard | Description |
| --- | --- |
| Apache Tika (Apache License 2.0) | Open-source content analysis and text extraction framework. https://tika.apache.org/ |
| Tesseract OCR (Apache License 2.0) | Open-source OCR engine. https://github.com/tesseract-ocr/tesseract |
| spaCy (MIT License) | Open-source NLP library with pre-trained NER models. https://spacy.io/ |
| langdetect (Apache License 2.0) | Language detection library. https://github.com/Mimino666/langdetect |
| AI language model API | Zero-shot inference via a published API; no custom model training (see provider documentation). |
| SHA-256 (FIPS 180-4) | Cryptographic hash for document identity and chain-of-custody. |

Argus uses zero-shot LLM inference (no custom training) and Apache/MIT licensed open-source text extraction tools. No labeled document training dataset is created, maintained, or used.