CAREFUL

Disaster Recovery Architecture Policy and Procedure

Version: 1.0
Effective: 22 September 2022
Last reviewed: 22 September 2022
Owner: Chief Technology Officer
Audience: Technical Staff, Leadership, Customers & Procurement
Category: Technical Operations

1. Introduction

A significant incident (SI) is any unplanned outage caused by infrastructure failure, service failure, software error, or data loss or breach.

Disaster recovery (DR) is the process by which data and service are restored following an SI.

This document sets out how CAREFUL protects customer and patient data by reducing the likelihood of a significant incident, and the procedure for restoring function if one occurs.

2. Overall Architecture and DR Approach

We believe that our customers value data integrity and the prevention of data loss more than 100% uptime. Protecting data is therefore our priority. CAREFUL's DR architecture is designed to prevent the loss of any committed database activity. Data that is in transit, or that has been entered on a client device but not yet committed, at the moment of failure is considered non-recoverable.

CAREFUL is hosted on Microsoft Azure, benefiting from a managed database-as-a-service cloud deployment that covers most infrastructure-level risks.

Redundancy within the Azure infrastructure guarantees 99.95% uptime, which corresponds to approximately 4.4 hours of downtime per year (0.05% of 8,760 hours). Our target is zero significant incidents per year.
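An availability percentage can be converted into an annual downtime allowance with simple arithmetic. The sketch below assumes a 365-day year; the function name is illustrative and is not part of any CAREFUL tooling. A 99.9% figure is included for comparison.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted by an availability figure."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


print(round(annual_downtime_hours(99.95), 2))  # 4.38
print(round(annual_downtime_hours(99.9), 2))   # 8.76
```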

3. Data Structure

The current state of the data is held in a relational (MS SQL) database. Transaction-level data is also written to an event log.

Together these allow a full point-in-time restore: the database is restored from a backup taken before the chosen point in time and then rolled forward by replaying the event log up to that point.
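The restore-and-roll-forward idea can be illustrated with a minimal sketch. This is not CAREFUL's restore procedure: the Event structure, the key/value model of state and the function name are hypothetical, chosen only to show how a snapshot and an event log combine to reach a chosen point in time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One committed change in a hypothetical event log."""
    committed_at: datetime
    key: str
    value: Optional[str]  # None marks a deletion


def restore_to_point(snapshot: Dict[str, str], snapshot_taken_at: datetime,
                     event_log: List[Event], target: datetime) -> Dict[str, str]:
    """Rebuild state as of `target` from a snapshot plus later events."""
    if target < snapshot_taken_at:
        raise ValueError("target precedes the snapshot; use an earlier snapshot")
    state = dict(snapshot)  # copy so the snapshot itself is left untouched
    # Replay in commit order, applying only events made after the
    # snapshot was taken and no later than the target time.
    for event in sorted(event_log, key=lambda e: e.committed_at):
        if snapshot_taken_at < event.committed_at <= target:
            if event.value is None:
                state.pop(event.key, None)
            else:
                state[event.key] = event.value
    return state


snapshot = {"record-1": "v1"}
taken = datetime(2022, 9, 22, 0, 0)
log = [
    Event(datetime(2022, 9, 22, 9, 0), "record-2", "v1"),
    Event(datetime(2022, 9, 22, 15, 0), "record-1", None),
]
restored = restore_to_point(snapshot, taken, log, datetime(2022, 9, 22, 12, 0))
print(restored)  # {'record-1': 'v1', 'record-2': 'v1'}
```

In this example the 15:00 deletion is not applied because it falls after the 12:00 target, so the restored state reflects only what had been committed by noon.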

a) Data Backup

To protect against failure or the unlikely event of a Microsoft data centre outage, we maintain two backup mechanisms:

  1. Local incremental snapshots within the same data centre at 24-hour intervals. This interval may decrease as transaction volumes increase.
  2. A passive mirror of both the full event log and the latest snapshot at a separate Microsoft data centre, geographically remote from the primary instance.
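One way to monitor the two mechanisms above is to check that neither copy has fallen behind the stated interval. The following is a minimal sketch rather than CAREFUL's monitoring code: the function name, the copy labels and the one-hour tolerance are assumptions, and only the 24-hour interval comes from this policy.

```python
from datetime import datetime, timedelta
from typing import Dict, List

SNAPSHOT_INTERVAL = timedelta(hours=24)  # interval stated in this policy
TOLERANCE = timedelta(hours=1)           # assumed allowance for job run time


def stale_backups(last_completed: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the names of backup copies older than the interval plus tolerance."""
    limit = SNAPSHOT_INTERVAL + TOLERANCE
    return sorted(name for name, finished in last_completed.items()
                  if now - finished > limit)


now = datetime(2022, 9, 22, 12, 0)
status = {
    "local-snapshot": datetime(2022, 9, 22, 1, 0),  # 11 hours old
    "remote-mirror": datetime(2022, 9, 21, 9, 0),   # 27 hours old
}
print(stale_backups(status, now))  # ['remote-mirror']
```

If the snapshot interval is shortened as transaction volumes grow, only the SNAPSHOT_INTERVAL constant would need to change.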

b) Container Automation and Backup

All code and database instances run in Docker containers orchestrated by Kubernetes. Container definitions are created automatically, and the containerisation scripts are maintained and backed up as code.

c) Code Protection

All code, along with deployment automation, is backed up in GitHub and in the Azure instance.

4. Responsibilities

Chief Executive Officer (CEO)

The CEO has ultimate responsibility and accountability for DR and the implementation of this policy.

Data Protection Officer (DPO)

The DPO is responsible for monitoring risks of significant incidents and reporting to the Risk Committee and the CEO. The DPO is also responsible for investigating any SIs and reporting findings to the CEO and the Risk Committee.

Risk Committee

The Risk Committee meets monthly, is chaired by the CEO, and is collectively responsible for monitoring and managing overall risks to the business, and for assuring that all actions and recommendations resulting from SIs are implemented.

Chief Technology Officer (CTO) and Lead Developer (LD)

The CTO and LD are jointly and severally responsible for ensuring that all DR processes are implemented and tested, and for reporting to the Risk Committee.

5. Managing a Significant Incident

Sections 6 to 11 set out the key process steps for DR.

6. Significant Incident Report

Any member of staff who suspects, or has evidence of, a failure must report it as an SI. The report may be made verbally to the CEO, DPO, CTO or LD.

7. Communication and Cascade

The receiver of the initial report must inform all members of the Risk Committee as soon as possible. An emergency online meeting of the Risk Committee should be called.

The Risk Committee must decide how to inform customers and other users of the incident and the likely recovery time.

8. Analysis and Planning

The CTO or LD must undertake an immediate assessment of the extent of the failure and attempt to identify the cause.

An outline recovery plan must be devised and reported to the Risk Committee, preferably via the CEO. The Risk Committee should approve the plan at the earliest opportunity.

9. Recovery

The CTO and LD must jointly lead the recovery by following the agreed plan. Updates and changes of plan must be passed to the Risk Committee as soon as practicable.

The Risk Committee should keep customers and users informed as far as is practicable.

10. Investigation

When DR is complete, the DPO must lead an investigation into the causes of the failure and provide the Risk Committee, within seven days, with a list of remedial actions to prevent recurrence.

11. Mitigation

The CTO and LD are responsible for implementing the mitigation plans recommended by the DPO's investigation.

12. Planning and Testing

The CTO and LD are responsible for devising, documenting and implementing all technical data protection and DR processes, and for testing these at least annually. Reports of these tests must be presented to the Risk Committee.