Evidence OCR and HTR Provenance Review

Overview

Scanned records, handwritten notes, custody forms, field notebooks, and historical archives often contain the most important evidence in a case, but they are slow to review and easy to misread. Evidence OCR and HTR Provenance Review turns page images into searchable text while preserving the review trail that courts and oversight teams need.

The module supports selectable recognition engines, page-level review, handwritten text recognition readiness, duplicate detection, text revision history, and signed provenance. Reviewers can approve or reject extracted text, compare revisions, and trace every accepted span back to the source page image.

Key Features

Selectable Recognition Engines: Choose the recognition path that fits the evidence type, from fast printed-text extraction to higher-accuracy processing for difficult scans.
Handwritten Text Readiness: Prepare page images and review workflows for handwritten material, including integration with an external HTR service where an organisation uses one.
Page-Level Review: Reviewers work at page level, with source image, extracted text, confidence signals, and approval state shown together.
Immutable Text Revisions: Each approved or corrected text version is recorded as a new revision rather than overwriting earlier versions.
Span-Level Provenance: Extracted words and passages can be traced back to the document, page, and region that produced them.
Duplicate Detection: Repeated pages and near-duplicate documents are identified so reviewers do not waste time approving the same text multiple times.
Search and Entity Enrichment: Approved text becomes available for evidence search and controlled entity extraction without exposing raw unreviewed text as final evidence.
Reviewer Notifications: Users can see where each of their documents sits in the recognition and review queue.
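As a rough illustration of how immutable revisions and span-level provenance could fit together, the sketch below models a page history as an append-only chain in which each revision records the hash of its predecessor. The class names (`Span`, `Revision`, `PageHistory`), the field layout, and the sample values are assumptions made for this example and are not the module's actual data model.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    """A run of text tied to the page region that produced it (hypothetical)."""
    text: str
    document_id: str
    page: int
    region: tuple[int, int, int, int]  # x, y, width, height in image pixels


@dataclass(frozen=True)
class Revision:
    """One immutable version of a page's text (hypothetical)."""
    number: int
    spans: tuple[Span, ...]
    reviewer: str
    parent_sha256: str | None  # hash of the previous revision, if any

    @property
    def text(self) -> str:
        return " ".join(span.text for span in self.spans)

    @property
    def sha256(self) -> str:
        # Hash the revision together with its parent hash so that altering
        # an earlier revision changes every later hash in the chain.
        payload = f"{self.number}|{self.parent_sha256}|{self.reviewer}|{self.text}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class PageHistory:
    """Append-only list of revisions for one page (hypothetical)."""
    revisions: list[Revision] = field(default_factory=list)

    def append(self, spans: list[Span], reviewer: str) -> Revision:
        parent = self.revisions[-1].sha256 if self.revisions else None
        revision = Revision(len(self.revisions) + 1, tuple(spans), reviewer, parent)
        self.revisions.append(revision)
        return revision


# Example: the raw engine output is kept, and the correction becomes revision 2.
history = PageHistory()
box = (40, 120, 310, 28)
first = history.append([Span("Recieved 12 May", "DOC-1", 3, box)], "ocr-engine")
second = history.append([Span("Received 12 May", "DOC-1", 3, box)], "reviewer-a")
```

In this sketch the corrected span still points at the same document, page, and region as the original, so a reviewer can follow either version back to the source image.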

Use Cases

Historical Abuse Inquiry: A commission processes handwritten records and typed correspondence, approving text page by page before it becomes searchable.
Financial Crime Disclosure: Investigators run OCR on scanned invoices and ledgers, then preserve the link from every extracted field back to its source page.
Cold Case Review: Analysts make archived paper files searchable while retaining revision history for any corrected extraction.
Medical and Ambulance Records: Reviewers convert scanned clinical notes into searchable text while protecting PHI and recording every approval action.
Bulk Evidence Triage: A large document production is deduplicated, processed with OCR, and queued for reviewer approval in priority order.
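The bulk triage flow can be sketched in a few lines, assuming that duplicates are byte-identical page images and that each page carries a numeric priority where lower values are reviewed first. The function name `dedupe_and_queue`, the default priority of 100, and the sample data are hypothetical. Near-duplicate detection would need a similarity measure and is not shown here.

```python
import hashlib
import heapq


def dedupe_and_queue(pages, priority):
    """Return unique page ids in review order, plus a map of duplicates.

    pages: list of (page_id, image_bytes) tuples.
    priority: dict of page_id -> int, lower values first (assumed convention).
    """
    first_seen = {}  # sha256 digest -> page_id of the first copy
    duplicates = {}  # duplicate page_id -> page_id it repeats
    heap = []
    for order, (page_id, image_bytes) in enumerate(pages):
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in first_seen:
            duplicates[page_id] = first_seen[digest]
            continue
        first_seen[digest] = page_id
        # `order` breaks ties so pages with equal priority keep arrival order.
        heapq.heappush(heap, (priority.get(page_id, 100), order, page_id))
    queue = [heapq.heappop(heap)[2] for _ in range(len(heap))]
    return queue, duplicates


pages = [("p1", b"scan-A"), ("p2", b"scan-B"), ("p3", b"scan-A")]
queue, duplicates = dedupe_and_queue(pages, {"p2": 1, "p1": 5})
print(queue)       # ['p2', 'p1']
print(duplicates)  # {'p3': 'p1'}
```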

Integration

The module connects to evidence management, document preview, review queues, entity extraction, search, redaction, disclosure export, audit logging, and provenance reporting. Only approved text is treated as verified evidence text. Unreviewed or rejected recognition output remains controlled and cannot silently replace the source document.
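The approval rule above amounts to a filter between recognition output and anything downstream. A minimal sketch follows, assuming three review states and a hypothetical `PageText` record. None of these names come from the module itself.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class PageText:
    """Recognition output for one page with its review state (hypothetical)."""
    page_id: str
    text: str
    state: ReviewState


def verified_text(pages):
    """Yield (page_id, text) only for pages a reviewer has approved."""
    for page in pages:
        if page.state is ReviewState.APPROVED:
            yield page.page_id, page.text


pages = [
    PageText("p1", "Received 12 May", ReviewState.APPROVED),
    PageText("p2", "Rec?ived l2 M@y", ReviewState.PENDING),
    PageText("p3", "illegible", ReviewState.REJECTED),
]
search_index = dict(verified_text(pages))
print(search_index)  # {'p1': 'Received 12 May'}
```

Routing search, entity extraction, and export through a single filter of this kind is one way to keep pending or rejected output from reaching those consumers by accident.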

Open Standards

W3C PROV-DM: Text extraction, review, correction, and approval events are represented as provenance activities linked to source pages and reviewers.
IIIF Image API: Page images can be addressed and reviewed consistently across viewers that support region-based image review.
PAGE XML: Handwritten and printed text recognition layouts can align with the PAGE XML model used by many OCR and HTR toolchains.
ALTO XML: OCR layout and text regions can be exchanged using the ALTO standard where archival workflows require it.
RFC 8785, JSON Canonicalization Scheme (JCS): Deterministic serialisation supports reproducible signatures over text provenance records.
FIPS 180-4, SHA-256: Source pages, text revisions, and provenance manifests use cryptographic hashes for tamper evidence.
ISO 8601: Recognition, review, correction, and approval timestamps use standard date-time formatting.
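To show how several of these standards might combine, the sketch below builds an illustrative provenance record, addresses the source region with an IIIF Image API URL, stamps it with an ISO 8601 time, and hashes a deterministic serialisation with SHA-256. Several parts are assumptions: the record's field names borrow PROV vocabulary but are not a formal PROV serialisation, the image server URL is a placeholder, and `canonical_bytes` only approximates RFC 8785. It agrees with JCS for records limited to ASCII keys, strings, and small integers, so a conformant JCS library would be needed for floats or keys outside that range.

```python
import hashlib
import json
from datetime import datetime, timezone
from urllib.parse import quote


def iiif_region_url(base, identifier, x, y, w, h):
    """Build an IIIF Image API URL for a pixel region at maximum size."""
    region = f"{x},{y},{w},{h}"
    # Path order: {identifier}/{region}/{size}/{rotation}/{quality}.{format}
    return f"{base}/{quote(identifier, safe='')}/{region}/max/0/default.jpg"


def canonical_bytes(record):
    """Deterministic JSON: sorted keys, no extra whitespace, UTF-8.

    An approximation of RFC 8785, not a conformant implementation.
    """
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


page_bytes = b"example page image bytes"  # stand-in for a real scan
record = {
    "activity": "approval",
    "agent": "reviewer-a",
    "entity": {
        "document": "DOC-1",
        "page": 3,
        "sha256": hashlib.sha256(page_bytes).hexdigest(),
    },
    "endedAtTime": datetime(2026, 6, 26, 9, 30, tzinfo=timezone.utc).isoformat(),
    "region": iiif_region_url("https://images.example.org/iiif", "DOC-1/p3", 40, 120, 310, 28),
}

# The digest of the canonical bytes is the value a signature would cover.
digest = hashlib.sha256(canonical_bytes(record)).hexdigest()
print(record["endedAtTime"])  # 2026-06-26T09:30:00+00:00
print(record["region"])
```

Because the serialisation sorts keys and removes insignificant whitespace, two systems that hold the same record should derive the same digest regardless of the order in which they assembled its fields.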

Last Reviewed: 2026-06-26
Last Updated: 2026-06-26