Multimodal Analysis

Overview

Body camera footage, intercepted audio recordings, and photographs seized from suspects all carry intelligence that text-based tools cannot reach. The Multimodal Analysis module processes images, audio, and video directly through AI models, returning OCR output, transcripts, object detections, and forensic findings ready to attach to an investigation case.

Key Features

Image Analysis: AI-powered object detection, scene understanding, OCR text extraction, and visual content classification
Audio Transcription: Automated speech-to-text transcription with speaker identification and language detection
Video Analysis: Frame-by-frame analysis that combines visual and audio processing to interpret video content
Forensic Analysis: Specialised analysis for investigative use cases, including evidence examination
Native Multimodal Processing: Direct processing of images, audio, and video without separate preprocessing steps
High-Accuracy Analysis: Advanced AI models return results with confidence scores, and model usage is tracked

Use Cases

Relevant sectors include law enforcement, defence intelligence, and financial crime investigation.

Extracting text from images and documents during evidence processing
Transcribing audio recordings for investigation documentation
Analysing video footage for object identification and scene understanding
Processing multimedia evidence across investigation workflows

Integration

Connects with media storage for source file access
Integrates with document analysis for text-based content
Works with content summarisation for AI-generated summaries of analysed media

Open Standards

IANA Media Types (RFC 2046): The service identifies all ingested media by MIME type (image/jpeg, image/png, image/webp, audio/mpeg, etc.) and passes the detected type to the AI provider, following the IANA media-type registry.
Base64 (RFC 4648): All image, audio, and video payloads are transferred as Base64-encoded strings across the typed integration layer boundary, with the service decoding them before processing.
OAuth 2.0 (RFC 6749) and JWT (RFC 7519) Bearer Tokens: Token-based authentication protects typed, auditable read and write workflows across the platform.
JPEG (ISO/IEC 10918) / PNG (ISO/IEC 15948) / WebP / BMP: These raster image formats are detected by magic-byte inspection and submitted to computer-vision models. Forensic analysis tasks explicitly target the EXIF metadata embedded in JPEG and TIFF files.
MPEG-4 (ISO/IEC 14496) and other video containers (MP4, MOV, MKV, WebM, AVI): Video container formats are detected and decoded frame by frame via OpenCV for frame sampling, OCR, and AI analysis. Of these, only MP4 is defined by ISO/IEC 14496 (Part 14); MOV is Apple's QuickTime format, MKV and WebM are Matroska-based, and AVI is a RIFF container.
WAVE / RIFF PCM audio: Uncompressed PCM audio (WAV) is detected by its RIFF magic bytes, and the audio classification service processes raw 16-bit mono PCM buffers directly. MP3 (MPEG-1 Audio Layer III), FLAC, and Ogg Vorbis files are also identified and dispatched for transcription.
ISO 639 language codes: Audio transcription accepts a language parameter using ISO 639-1 two-letter codes (defaulting to "en"), which is forwarded to the speech-to-text API for language-specific decoding.
RFC 4122 (UUID): All persistent analysis records, batch run identifiers, user references, and organisation references are keyed by RFC 4122 version-4 UUIDs throughout the data model and API.
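
Several of the standards above meet at ingestion time: a Base64 payload is decoded, its media type is inferred from magic bytes, and the resulting record is keyed by a version-4 UUID with an ISO 639-1 language hint. The following is a minimal Python sketch of that flow, not the module's actual implementation. The names `detect_media_type`, `build_request`, and `AnalysisRequest` are hypothetical, the signature table is deliberately incomplete, and the MIME strings chosen for each signature are illustrative assumptions.

```python
import base64
import io
import uuid
import wave
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical signature table: (byte offset, magic bytes, MIME type).
# Incomplete on purpose; for example, MP3 files without an ID3 tag are not matched.
_SIGNATURES = [
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"BM", "image/bmp"),
    (0, b"ID3", "audio/mpeg"),
    (0, b"fLaC", "audio/flac"),
    (0, b"OggS", "audio/ogg"),
    (0, b"\x1a\x45\xdf\xa3", "video/x-matroska"),  # EBML header: MKV and WebM
    (4, b"ftyp", "video/mp4"),  # simplification: other ISO-BMFF files share this box
]

# RIFF containers share the first four bytes; the form type sits at bytes 8-11.
_RIFF_FORMS = {
    b"WAVE": "audio/vnd.wave",
    b"WEBP": "image/webp",
    b"AVI ": "video/x-msvideo",
}


def detect_media_type(data: bytes) -> Optional[str]:
    """Return a MIME type inferred from leading magic bytes, or None."""
    if data[:4] == b"RIFF" and len(data) >= 12:
        return _RIFF_FORMS.get(data[8:12])
    for offset, magic, mime in _SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            # QuickTime files announce the "qt  " brand after the ftyp tag.
            if magic == b"ftyp" and data[8:12] == b"qt  ":
                return "video/quicktime"
            return mime
    return None


@dataclass
class AnalysisRequest:
    """Hypothetical record shape: decoded payload plus routing metadata."""
    payload: bytes
    media_type: str
    language: str = "en"  # ISO 639-1 code, defaulting to English
    record_id: uuid.UUID = field(default_factory=uuid.uuid4)  # version-4 UUID


def build_request(b64_payload: str, language: str = "en") -> AnalysisRequest:
    """Decode a Base64 payload, detect its type, and wrap it in a record."""
    raw = base64.b64decode(b64_payload, validate=True)  # rejects non-alphabet input
    media_type = detect_media_type(raw)
    if media_type is None:
        raise ValueError("unrecognised media signature")
    # Shape check only: two ASCII letters, not a lookup in the ISO 639-1 list.
    if len(language) != 2 or not (language.isascii() and language.isalpha()):
        raise ValueError("language must be a two-letter ISO 639-1 code")
    return AnalysisRequest(raw, media_type, language.lower())


def make_silent_wav(n_frames: int = 160, rate: int = 16000) -> bytes:
    """Build a small 16-bit mono PCM WAV in memory for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


if __name__ == "__main__":
    encoded = base64.b64encode(make_silent_wav()).decode("ascii")
    request = build_request(encoded)
    print(request.media_type, request.language, request.record_id.version)
```

Checking the RIFF form type separately keeps WAV, WebP, and AVI apart even though all three begin with the same four bytes. A production detector would need a much fuller signature set and stricter validation than this sketch provides.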

Last Reviewed: 2026-02-05
Last Updated: 2026-04-14