1. Introduction
A significant incident (SI) is any unplanned outage caused by infrastructure failure, service failure, or software error, or any loss or breach of data.
Disaster recovery (DR) is the process by which data and service are restored following an SI.
This document sets out how CAREFUL protects customer and patient data by reducing the likelihood of an SI. It also sets out the procedure for restoring service following an SI.
2. Overall architecture and DR approach
We believe that our customers value data integrity and the prevention of data loss more than the provision of 100% uptime, which carries significant costs. Protecting data is therefore our priority, and CAREFUL DR is designed to prevent the loss of committed database activity. Data in transit or entered into a client device is considered non-recoverable.
CAREFUL is hosted on Microsoft Azure and therefore benefits from a ‘database as a service’ cloud deployment, which mitigates most SI risks within the data centre.
Redundancy within the Azure infrastructure guarantees 99.95% uptime, which permits roughly four and a half hours of downtime in every year.
Our target maximum service outage after any single SI is eight hours, and our target is zero SIs within a year.
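As a quick check on these figures, the downtime allowed by a given uptime percentage can be computed directly. A minimal Python sketch (illustrative only):

    # Downtime permitted per year by a given uptime percentage.
    HOURS_PER_YEAR = 365 * 24  # 8,760 hours

    for uptime in (0.9995, 0.999):
        allowed = (1 - uptime) * HOURS_PER_YEAR
        print(f"{uptime:.2%} uptime allows about {allowed:.1f} hours of downtime per year")

    # Output: 99.95% allows ~4.4 hours; 99.90% allows ~8.8 hours.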
3. Data structure
The current state of the database is stored in a relational (Microsoft SQL Server) database. Transaction-level data is also stored in an ‘event log’, from which a full restore is possible up to and including a given time-point.
This allows a restore from backup at a particular time, followed by a roll-forward using the event log.
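The restore model can be sketched as follows. This is a minimal in-memory illustration in Python; the event shape and key/value state are hypothetical simplifications, not the production schema:

    from datetime import datetime

    def apply_event(state: dict, event: dict) -> None:
        # Hypothetical event shape: {"timestamp": datetime, "key": str, "value": ...}
        state[event["key"]] = event["value"]

    def restore(snapshot: dict, event_log: list, until: datetime) -> dict:
        """Rebuild state: start from the latest snapshot, then replay
        committed events in order, up to and including a time-point."""
        state = dict(snapshot)          # copy, so the snapshot itself is untouched
        for event in event_log:         # event_log is assumed ordered by timestamp
            if event["timestamp"] > until:
                break
            apply_event(state, event)
        return state

Roll-forward time grows with the number of events since the last snapshot, which is why the snapshot interval described below matters.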
a) Data backup
To protect against other failures, including the unlikely event of a Microsoft data centre failure, we ensure adequate backup through two mechanisms:
- Local, incremental snapshots within the same data centre, taken at 24-hour intervals. This interval may shorten as the number of transactions and users grows.
- A passive mirror of both the full event log and the latest snapshot (but not the current-state SQL database) at another Microsoft data centre, remote from the primary instance.
This structure allows for the most rapid recovery: restore the latest snapshot, then roll forward using the event log.
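Because roll-forward time depends on snapshot freshness, the snapshot interval is worth monitoring automatically. A minimal sketch of such a check (the directory layout and .bak naming are hypothetical):

    from datetime import datetime, timedelta, timezone
    from pathlib import Path

    MAX_AGE = timedelta(hours=24)  # the snapshot interval set out above

    def latest_snapshot_age(snapshot_dir: Path) -> timedelta:
        """Age of the newest snapshot file in a (hypothetical) backup directory."""
        newest = max(p.stat().st_mtime for p in snapshot_dir.glob("*.bak"))
        taken = datetime.fromtimestamp(newest, tz=timezone.utc)
        return datetime.now(tz=timezone.utc) - taken

    if latest_snapshot_age(Path("/backups/snapshots")) > MAX_AGE:
        print("ALERT: latest snapshot is older than the agreed 24-hour interval")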
b) Container automation and backup
To speed up DR after an SI, all code and database instances are containerised using Docker containers orchestrated by Kubernetes. The structure of these containers is created automatically, and the containerisation scripts are codified and backed up as code (see below).
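Because the container definitions are held as code, recovery can rebuild the service by reapplying them to a fresh cluster. A minimal sketch in Python (the k8s/ manifest directory is illustrative, and kubectl is assumed to be configured for the target cluster):

    import subprocess

    # Recreate the application and database containers by reapplying the
    # version-controlled Kubernetes manifests (directory name is illustrative).
    subprocess.run(["kubectl", "apply", "--recursive", "-f", "k8s/"], check=True)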
c) Code protection
All code, along with the necessary deployment automation, is backed up both in GitHub and in the Azure instance.
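As an illustration, the two copies can be kept in step by mirroring every branch and tag to a second remote. A sketch in Python; the remote name and URL are placeholders:

    import subprocess

    # Add a second remote and mirror all refs to it (name and URL are placeholders).
    subprocess.run(["git", "remote", "add", "azure", "<azure-repos-url>"], check=True)
    subprocess.run(["git", "push", "--mirror", "azure"], check=True)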
4. Responsibilities
d) Chief Executive Officer (CEO)
The CEO has ultimate responsibility and accountability for DR and for the implementation of this policy.
e) Data Protection Officer (DPO)
The DPO is responsible for monitoring the risks of SIs and reporting these to the Risk Committee and the CEO. The DPO is also responsible for investigating any SIs and reporting the results of these investigations to the CEO and the Risk Committee.
f) Risk Committee
The Risk Committee meets monthly and is chaired by the CEO. It is collectively responsible for monitoring and managing the overall risks to the business, and for assuring that all actions and recommendations resulting from SIs are implemented.
g) Chief Technology Officer (CTO) and Lead Developer (LD)
The CTO and the LD are jointly and severally responsible for assuring that all DR processes are implemented and tested and for reporting these to the Risk Committee.
5. Managing a Significant Incident
The following describes the key processes for DR.
6. Significant Incident report
Any member of staff who suspects or has evidence of a failure must report an SI. This can be done verbally to the CEO, DPO, CTO, or LD.
7. Communication and Cascade
The receiver of this report must inform all members of the Risk Committee as soon as possible, and an emergency online meeting of the Risk Committee should be convened.
The Risk Committee must decide how to inform customers and other users of the incident and the likely recovery time.
8. Analysis and planning
The CTO or LD must undertake an immediate assessment of the extent of the failure and try to identify the cause.
An outline plan must be devised and reported to the Risk Committee, preferably via the CEO. The Risk Committee should approve the plan at the earliest opportunity.
9. Recovery
The CTO and LD must jointly lead the recovery by following the plan agreed by the Risk Committee. Updates and changes of plan must be passed to the Risk Committee as soon as practicable.
The Risk Committee should keep customers and users informed as far as practicable.
10. Investigation
When DR is complete, the DPO must lead an investigation into the causes of the failure and present to the Risk Committee, within 7 days, a list of remedial actions that will prevent a recurrence.
11. Mitigation
The CTO and LD are responsible for implementing the mitigation plans recommended by the DPO's investigation.
12. Planning and testing
The CTO and LD are responsible for devising, documenting, and implementing all technical data protection and DR processes and testing these at least annually. Reports of these tests should be presented to the Risk Committee.