Overview
An IT operations team can run a quarterly DR drill that automatically rehearses the full failover-and-failback sequence, verifies that canonical incident state survived intact, and generates a regulator-grade drill report with timing per phase, all from a single admin console.
The DR and HA Failover Drill module turns a previously manual, high-risk exercise into a repeatable, evidence-producing operational routine. It targets a recovery-time objective (RTO) of 15 minutes or less and a recovery-point objective (RPO) of 1 minute or less. The drill drains traffic from the primary DFB site, promotes the disaster-recovery site, runs smoke tests, restores traffic, and then proves through checksum comparison that the canonical incident state survived the cutover.
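The sequence described above can be illustrated with a minimal Python sketch. It is not the module's implementation: the names `run_drill`, `PhaseResult`, and `rto_breached` are hypothetical, each phase is stood in for by a plain callable, and comparing the total elapsed time of all phases with the RTO target is an assumption about how the timing is evaluated.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

RTO_SECONDS = 15 * 60  # 15-minute RTO target stated in the overview


@dataclass
class PhaseResult:
    name: str
    seconds: float
    ok: bool


def run_drill(
    phases: List[Tuple[str, Callable[[], bool]]],
    clock: Callable[[], float] = time.monotonic,
) -> List[PhaseResult]:
    """Run the phases in order, timing each one; stop at the first failure."""
    results: List[PhaseResult] = []
    for name, action in phases:
        start = clock()
        ok = bool(action())
        results.append(PhaseResult(name, clock() - start, ok))
        if not ok:
            break  # later phases are not attempted after a failed phase
    return results


def rto_breached(results: List[PhaseResult], rto_seconds: float = RTO_SECONDS) -> bool:
    """Assumption: the RTO is compared with the total elapsed time of all phases."""
    return sum(r.seconds for r in results) > rto_seconds


if __name__ == "__main__":
    # Placeholder phase actions; a real harness would call into its own runbooks.
    phases = [
        ("drain_traffic", lambda: True),
        ("promote_dr_site", lambda: True),
        ("smoke_tests", lambda: True),
        ("restore_traffic", lambda: True),
        ("verify_checksums", lambda: True),
    ]
    for result in run_drill(phases):
        print(f"{result.name}: {result.seconds:.3f}s ok={result.ok}")
```

Passing the clock in as a parameter keeps the timing logic testable without waiting in real time.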
Last Reviewed: 2026-05-05 Last Updated: 2026-05-05
Key Features
- Automated Drill Harness: A drill harness in the new DR drill domain executes the full failover-and-failback sequence so operators are not improvising the order of operations under pressure.
- RTO and RPO Targets Built In: Each phase is timed against the 15-minute RTO and 1-minute RPO targets, and any breach is flagged in the drill report.
- Traffic Drain and Restoration: The harness drains traffic away from the primary site before stand-down and restores it on the agreed path after the drill, removing manual cutover errors.
- DR Site Promotion: Promotion of the disaster-recovery site is scripted into the runbook engine so the same path is exercised every quarter rather than reinvented each time.
- Smoke-Test Orchestrator: A new smoke-test orchestrator hooked into the W12 production-proof toolkit confirms the promoted site behaves like production before traffic is restored.
- Canonical Incident Verification: Each drill is itself a canonical incident with source='dr_drill', and pre-drill and post-drill incident counts and unified-timeline checksums are compared to prove zero data loss across the cutover.
- Phase-Level Timeline: Every phase transition, every smoke-test outcome, and every anomaly is recorded on the drill timeline so the after-action picture is complete rather than inferred.
- Regulator-Grade Drill Report: The harness produces a signed report covering timing per phase, anomalies, and action items, suitable for evidence packs requested by auditors and regulators.
- Single Admin Console: A planned DrDrillConsole in the admin app gives operations a runner and a report viewer in one place rather than scattering DR work across spreadsheets and chat threads.
- Builds on Existing DR Foundations: The module extends the existing administrative disaster-recovery surface and the backup domain rather than replacing them.
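The canonical incident verification step compares incident counts and unified-timeline checksums taken before and after the drill. The Python sketch below shows one way such a comparison could work. Everything in it is an assumption made for illustration: the event shape (dicts with `incident_id` and `source` keys), the use of SHA-256 over canonical JSON, the names `take_snapshot` and `verify_parity`, and the choice to exclude the drill's own records (source='dr_drill') so the drill itself does not alter the result.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List

DRILL_SOURCE = "dr_drill"  # source value the drill's own incident carries


@dataclass(frozen=True)
class Snapshot:
    incident_count: int
    timeline_checksum: str


def take_snapshot(timeline_events: List[Dict]) -> Snapshot:
    """Summarise timeline events (assumed to arrive in a stable order)."""
    # Assumption: the drill's own records are excluded from the comparison.
    kept = [e for e in timeline_events if e.get("source") != DRILL_SOURCE]
    digest = hashlib.sha256()
    for event in kept:
        # Canonical JSON so that key order does not change the checksum.
        line = json.dumps(event, sort_keys=True, separators=(",", ":"))
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    incident_count = len({e["incident_id"] for e in kept})
    return Snapshot(incident_count, digest.hexdigest())


def verify_parity(pre: Snapshot, post: Snapshot) -> List[str]:
    """Return a list of mismatches; an empty list means parity held."""
    problems: List[str] = []
    if pre.incident_count != post.incident_count:
        problems.append(
            f"incident count changed: {pre.incident_count} -> {post.incident_count}"
        )
    if pre.timeline_checksum != post.timeline_checksum:
        problems.append("unified-timeline checksum changed")
    return problems
```

Returning a list of mismatches rather than a single boolean lets each discrepancy be recorded as an anomaly in the drill report.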
Use Cases
- Quarterly DFB DR Rehearsal: An IT operations lead schedules the quarterly drill, watches each phase complete inside the target window, and exports the signed report to the compliance owner.
- Regulator and Auditor Evidence: A compliance officer responds to an audit request with a recent signed drill report showing RTO, RPO, and verification outcomes per phase rather than narrative assurances.
- Change-Management Confidence: Before a major platform release, the team runs an additional DR drill to confirm the failover path is still healthy and unaffected by recent changes.
- Post-Incident Re-Validation: After a real-world disruption, the team can re-run the drill harness to confirm the primary and DR sites are back in their expected roles with verified data parity.
- Onboarding a New Operator: New operations staff observe and then run a drill with the harness driving the steps, learning the DR sequence in a controlled, scripted setting.
- Cross-Team Tabletop Anchor: Business continuity tabletops can be anchored to a real recent drill report rather than a hypothetical scenario, sharpening the discussion.
Integration
- Multi-Region PostgreSQL Replication: The drill exercises the existing primary-and-replica topology across the Irish primary site and the EU replica without changing the underlying replication arrangement.
- Neo4j Cluster: Graph state on the DR side is included in the verification phase so investigation graphs are confirmed alongside relational incident state.
- Cloudflare Jurisdiction Routing: Traffic drain and restoration use the existing jurisdiction-aware routing layer so the drill never sends traffic outside the agreed regions.
- Backup Domain: The harness integrates with the existing backup domain so backup posture is confirmed as part of the drill rather than treated as a separate concern.
- Existing Admin Disaster Recovery Module: The drill harness extends the existing administrative DR surface so operators see one coherent picture rather than two parallel views.
- W12 Production-Proof Toolkit: The smoke-test orchestrator reuses the production-proof toolkit so the same checks that gate releases also gate DR cutovers.
- Unified Incident Timeline: Drill phases land on the same canonical incident timeline used by operational incidents, so the verification step compares like with like.
- Admin Drill Console: The planned DrDrillConsole surfaces the runner, the live drill timeline, and the report viewer for operators and observers in one place.
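Because the drill runs over the existing primary-and-replica PostgreSQL topology, the 1-minute RPO target can be read as a bound on how far the replica may lag the primary at cutover. The Python sketch below shows that arithmetic only. How the two timestamps are obtained is left open, and the names `replication_gap` and `rpo_breached` are hypothetical rather than part of the module.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=1)  # 1-minute RPO target from the overview


def replication_gap(
    primary_last_commit: datetime, replica_last_replay: datetime
) -> timedelta:
    """How far the replica trails the primary, floored at zero."""
    return max(primary_last_commit - replica_last_replay, timedelta(0))


def rpo_breached(
    primary_last_commit: datetime,
    replica_last_replay: datetime,
    rpo: timedelta = RPO,
) -> bool:
    """True when the gap exceeds the target ("1 minute or less" is allowed)."""
    return replication_gap(primary_last_commit, replica_last_replay) > rpo


if __name__ == "__main__":
    cutover = datetime(2026, 5, 5, 12, 0, 0, tzinfo=timezone.utc)
    replayed = cutover - timedelta(seconds=20)
    print(replication_gap(cutover, replayed), rpo_breached(cutover, replayed))
```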
Open Standards
- ISO 22301: drill cadence, scope, and reporting align with the business continuity management system expectations for tested recovery arrangements.
- ISO 27001 A.5.29 and A.5.30: the drill exercises ICT readiness for business continuity and information security during disruption as called for by the current control set.
- ISO/IEC 27031: the harness follows the guidelines for ICT readiness for business continuity, including tested failover and verified recovery.
- NIST SP 800-34 r1: the contingency planning patterns from the federal information systems guide are used as a reference for drill structure and reporting.
- ANSI/TIA-942: the underlying primary and disaster-recovery sites are aligned with the telecommunications infrastructure standard for data centres so the drill exercises a recognised topology.
- HSE OoCIO DR Guidance: the drill cadence and reporting are aligned with the disaster-recovery guidance from the Office of the Chief Information Officer of the Irish health service.
- CloudEvents 1.0: drill lifecycle events are emitted as argus.dr.drill_started, argus.dr.failover_completed, argus.dr.failback_completed, and argus.dr.report_signed so downstream systems can react using a common event envelope.
- ITIL 4 Service Continuity Management: the drill process is mapped to the service continuity management practice so it integrates cleanly with broader service management activity.
- HTTPS / TLS: drill control traffic and report retrieval use the standard secure web transport baseline.
- JSON: drill reports and phase records are exchanged in a common structured format so they remain easy to archive and review.
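As an illustration of the CloudEvents 1.0 envelope mentioned above, the Python sketch below builds a drill lifecycle event as a JSON-serialisable dict. The four event types come from this page, and `specversion`, `id`, `source`, and `type` are the attributes CloudEvents 1.0 requires. The `source` value, the use of `subject` for a drill identifier, the payload shape, and the function name `make_drill_event` are assumptions made for the example.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, Optional

# Lifecycle event types listed on this page.
EVENT_TYPES = {
    "argus.dr.drill_started",
    "argus.dr.failover_completed",
    "argus.dr.failback_completed",
    "argus.dr.report_signed",
}


def make_drill_event(
    event_type: str,
    drill_id: str,
    data: Dict,
    now: Optional[datetime] = None,
) -> Dict:
    """Build a CloudEvents 1.0 envelope as a plain dict."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown drill event type: {event_type}")
    now = now or datetime.now(timezone.utc)
    return {
        # Required CloudEvents 1.0 attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "/argus/dr-drill",  # hypothetical source URI-reference
        "type": event_type,
        # Optional attributes.
        "subject": drill_id,  # assumption: subject carries the drill identifier
        "time": now.isoformat(),  # RFC 3339 timestamp for timezone-aware values
        "datacontenttype": "application/json",
        "data": data,  # hypothetical payload
    }


if __name__ == "__main__":
    event = make_drill_event("argus.dr.drill_started", "drill-1", {"phase": "drain"})
    print(json.dumps(event, indent=2))
```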