Analysis of Disaster Recovery and Business Continuity in the Financial System

I. Requirements for Disaster Recovery#

1. Requirements of Service Level Agreements (SLA)#

SLA requirements generally specify that service must be restored within a certain period of time. If the SLA cannot be met, compensation or an explanation to customers is usually required, so the associated risk is relatively controllable.

2. Requirements for Transaction Correctness#

We know that transactions have four properties: Atomicity, Consistency, Isolation, and Durability (ACID). In financial systems the atomicity requirement is the most stringent: all operations within a transaction must either all succeed or all be rolled back. If transaction correctness cannot be guaranteed, the system falls into serious disorder, so the risk is relatively high. For example, when withdrawing money from an ATM, if the deduction is recorded as successful but the ATM loses power before dispensing the cash, the system must be restored to the state before the withdrawal.
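
As a minimal sketch of this atomicity requirement (the SQLite schema, account name, and dispense_cash callback are invented for illustration, not part of any real ATM system), the debit and the cash dispensing can be wrapped in one transaction so that a failure anywhere inside it rolls everything back:

```python
import sqlite3

# Hypothetical single-table schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 1000)")
conn.commit()

def withdraw(conn, account_id, amount, dispense_cash):
    """Debit the account and dispense cash as one atomic unit."""
    with conn:  # transaction: commits on success, rolls back on any exception
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        if cur.rowcount != 1:
            raise RuntimeError("insufficient funds")
        # If dispensing fails (e.g. the ATM loses power and this call raises),
        # the debit above is rolled back and the balance is unchanged.
        dispense_cash(amount)

withdraw(conn, "alice", 200, dispense_cash=lambda amt: print("dispensing", amt))
print(conn.execute("SELECT balance FROM accounts WHERE id = 'alice'").fetchone())  # (800,)
```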

Disaster recovery requirements build on the two requirements above and must strike a cost-aware balance between the SLA level and the level of transaction correctness. For example, if a disaster recovery design can guarantee only the SLA but not transaction correctness, and it costs more than the compensation budget the SLA calls for, then that disaster recovery is effectively pointless; setting aside a portion of daily profit as a compensation fund would satisfy the requirement instead, as the rough comparison below illustrates. As another example, some financial businesses have low transaction volumes and few customers (though possibly large amounts per transaction), and their SLA requirements are loose, allowing recovery within days or even months; in that case the disaster recovery budget should be kept correspondingly low.
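
To make that trade-off concrete, here is a back-of-the-envelope comparison; every number below is a made-up placeholder, not a recommendation:

```python
# Rough sketch of the SLA-vs-disaster-recovery cost comparison described above.
annual_dr_cost = 2_000_000           # yearly cost of building and running the DR setup
outage_probability = 0.05            # estimated chance of an SLA-breaching outage per year
compensation_per_outage = 5_000_000  # compensation the SLA obliges us to pay per outage

expected_compensation = outage_probability * compensation_per_outage
if annual_dr_cost > expected_compensation:
    print("DR costs more than the expected SLA compensation;"
          " a compensation fund drawn from daily profit may be the better choice.")
else:
    print("DR is cheaper than the expected compensation; build it.")
```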

II. Disaster Recovery Monitoring#

Implementing disaster recovery requires monitoring. Financial institutions, especially banks, often use cross-data center disaster recovery monitoring to achieve this.

1. Local Asynchronous Backup of Monitoring Data#

Monitoring data is locally backed up asynchronously, and cross-data center monitoring systems monitor each other.

2. Real-time and Delayed Monitoring#

Push: The business system actively pushes business indicators, such as heartbeats, to the monitoring system, giving near-real-time visibility at the cost of more resources on the business side.

Pull: The monitoring system periodically pulls indicators from the business system. For example, AnyRobot, a log operations and maintenance tool from AISHU, can pull monitoring indicators over standard protocols such as Syslog, SNMP, SSH, and JDBC. Both collection models are sketched below.
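
A minimal sketch of the two collection models follows; the endpoint URLs, service name, and JSON payload are hypothetical, and AnyRobot's actual protocols (Syslog, SNMP, etc.) are not modelled here:

```python
import json
import time
import urllib.request

MONITOR_URL = "http://monitor.example.internal/metrics"   # hypothetical endpoint
BUSINESS_URL = "http://business.example.internal/health"  # hypothetical endpoint

def push_heartbeat():
    """Push model: the business system posts a heartbeat to the monitoring system."""
    payload = json.dumps({"service": "payments", "ts": time.time(), "status": "ok"}).encode()
    req = urllib.request.Request(MONITOR_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=3)

def pull_metrics():
    """Pull model: the monitoring system periodically polls the business system."""
    with urllib.request.urlopen(BUSINESS_URL, timeout=3) as resp:
        return json.loads(resp.read())

# The business process would call push_heartbeat() on a timer (near real time),
# while the monitoring system would call pull_metrics() every N seconds.
```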

3. Monitoring Itself Can Also Have Errors#

When monitoring detects an error, disaster recovery is initiated. But monitoring itself can be wrong, so false alarms must be accounted for before failover is triggered, for example as sketched below.
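
One simple guard against false alarms, sketched here under the assumption that the health probe is cheap to repeat, is to require several consecutive failed probes before declaring a real outage:

```python
import time

def should_failover(probe, attempts=3, interval_s=5):
    """Trigger failover only after `attempts` consecutive failed probes."""
    for i in range(attempts):
        if probe():          # probe() returns True when the primary looks healthy
            return False     # a single success clears the alarm
        if i < attempts - 1:
            time.sleep(interval_s)
    return True              # every probe failed: treat it as a real outage

# Usage: should_failover(lambda: ping_primary_datacenter())
```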

III. Disaster Recovery Processing#

1. Service Disaster Recovery Processing#

Stateful and stateless services are two different service architectures; the distinguishing criterion is whether two requests from the same client share context on the server side.

Requests to a stateless service can be sent to any server, and the service scales horizontally through load balancing and similar means. For example, the client can carry a token in a cookie with every request, so no server has to hold per-client context; a sketch follows the figure below.

[Figure omitted. Source: CSDN blogger ༺鲸落༻, https://blog.csdn.net/nsx_truth/article/details/108917851]
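
A minimal sketch of the stateless style (the signing key, token format, and function names are invented for illustration): the client carries a signed token in a cookie, so any server node can validate the request without shared session storage.

```python
import hashlib
import hmac

SECRET = b"shared-signing-key"  # placeholder; every server node holds the same key

def issue_token(user_id):
    """Create a signed token the client will send back in a cookie."""
    sig = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"{user_id}.{sig}"

def verify_token(token):
    """Any server node can verify the token; no server-side state is needed."""
    user_id, _, sig = token.partition(".")
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return user_id if hmac.compare_digest(sig, expected) else None

token = issue_token("alice")
print(verify_token(token))  # "alice", on whichever node received the request
```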

Transactional processing requires a stateful service. In e-commerce, for example, shopping is a series of operations, from adding items to the cart to filling out the order and paying, and a session is used to keep the logged-in user's context. Although the HTTP protocol itself is stateless, sessions turn a stateless HTTP service into a stateful one; a sketch follows the figure below.

[Figure omitted. Source: CSDN blogger ༺鲸落༻, https://blog.csdn.net/nsx_truth/article/details/108917851]
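
By contrast, a minimal sketch of the stateful style (the in-memory store and function names are invented for illustration): the shopping-cart context lives on the server, keyed by a session id the client keeps in a cookie, and that server-side state is exactly what must be protected when the node fails.

```python
import secrets

# In-memory session store: this is the server-side state that makes the service
# stateful, and the reason its disaster recovery must treat the node like a data
# node (the store has to be replicated or externalised, e.g. to a shared cache).
SESSIONS = {}

def login(user_id):
    """Create a session and return the id the client keeps in a cookie."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"user": user_id, "cart": []}
    return session_id

def add_to_cart(session_id, item):
    SESSIONS[session_id]["cart"].append(item)

def checkout(session_id):
    cart = SESSIONS[session_id]["cart"]
    print(f"placing order for {SESSIONS[session_id]['user']}: {cart}")

sid = login("alice")
add_to_cart(sid, "book")
add_to_cart(sid, "pen")
checkout(sid)  # works only on a node that holds this session's state
```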

Disaster recovery for stateless services is relatively simple: the load balancer's scheduling algorithm simply routes requests to the surviving nodes. Disaster recovery for stateful services treats each service node as a data node, because the state it holds must be protected the same way data is.

2. Data Disaster Recovery Processing#

1) Single-Node Disaster Recovery

Writing data on a single node passes through several layers: the application cache, the OS cache, the hardware controller cache, and finally disk storage. A power failure anywhere along this path can leave the write incomplete. Two common remedies are battery-backed hardware, which buys enough time after a power loss for cached writes to reach disk, and clustering with load balancing to keep the business highly available. (Note that clustering is meant here, not a distributed system; the difference between the two is discussed in the article "Difference and Benefits of Distributed and Clustered Systems".)
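
A minimal sketch of pushing a single write down that path (the file path and payload are placeholders): flush() moves data from the application buffer into the OS cache, and os.fsync() asks the OS to hand it on toward the storage hardware; anything not yet synced can be lost on power failure unless battery-backed hardware covers it.

```python
import os

def durable_write(path, data):
    """Write data and push it through the application and OS caches toward storage."""
    with open(path, "wb") as f:
        f.write(data)          # data sits in the application buffer
        f.flush()              # application buffer -> OS page cache
        os.fsync(f.fileno())   # OS page cache -> disk controller / platters
    # Without the fsync, a power failure can silently drop the write even
    # though the application saw the call "succeed".

durable_write("/tmp/ledger.bin", b"txn-0001: debit alice 200")
```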

2) Multi-Node Disaster Recovery

In banking systems we often hear of two sites with two centers, two sites with three centers, or three sites with five centers. Whatever the layout, a consensus algorithm is used, the best known being Raft. A Raft cluster consists of a leader and followers and needs at least two nodes; a single node cannot be considered distributed. Raft is usually deployed with three or five replicas, and more than half of the nodes must be online for the cluster to keep working.

Consider first two sites with two centers (three replica nodes), where one center hosts one node and the other hosts two. If a single node fails, the system still works, because the other two nodes are up. But if the center hosting two nodes fails, only one node remains, which is less than a majority, and the whole system stops. This design survives a single-node failure but not a center-level failure.

Next, consider two sites with three centers (three replica nodes), where each center hosts one node and two of the centers are in the same city. The center-level failure described above no longer hurts, because the remaining two nodes keep working. A city-level failure, however, say subway construction cutting the city's fiber optic cables, leaves only one node, short of the required majority, and the system stops. This design survives a center-level failure but not a city-level one. It is nonetheless what many banks run today, because regulators require it once a bank's assets exceed a certain threshold in order to keep banking systems stable. Note that two sites with three centers is not the same as an active-active architecture: the former distinguishes production centers from a disaster recovery center, while the latter lets the disaster recovery center also carry production traffic, which is technically more demanding.

Finally, consider three sites with five centers, where five nodes are spread across three different cities. This is one of the highest levels of disaster recovery architecture in financial systems and can survive a city-level failure; Ant Financial has showcased this architecture before. In practice it is rarely used: banks mostly run two sites with three centers, many securities firms run active-active setups, and most institutions lack the resources to build such a large multi-node disaster recovery system. The quorum arithmetic behind all three layouts is sketched below.
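
The majority rule behind all three layouts can be checked in a few lines; the node counts below mirror the scenarios described above and are not a deployment recommendation:

```python
def has_quorum(surviving_nodes, total_nodes):
    """Raft-style rule: more than half of all nodes must still be online."""
    return surviving_nodes > total_nodes // 2

# Two centers, 3 nodes split 1 + 2: losing the 2-node center leaves 1 of 3.
print(has_quorum(3 - 2, 3))   # False: a center-level failure is not survivable

# Two sites, three centers, 3 nodes split 1 + 1 + 1 (two centers in the same city):
print(has_quorum(3 - 1, 3))   # True: losing one center still leaves 2 of 3
print(has_quorum(3 - 2, 3))   # False: losing the whole city (2 centers) leaves 1 of 3

# Three sites, five centers, 5 nodes (at most 2 in any one city):
print(has_quorum(5 - 2, 5))   # True: even a city-level failure leaves 3 of 5
```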

IV. Disaster Recovery and Backup#

Disaster recovery and backup cannot simply be equated: taken together they form the broader category that contains both, and the purpose of backup is ultimately still disaster recovery. If backups succeed but restoring the system after a failure takes several days, far beyond the recovery time objective (RTO), then those backups are effectively meaningless. Likewise, if the data volume is so large that the backup window is too short, as with many bank database systems that cannot complete a full backup overnight, then even backup is not achievable, let alone disaster recovery.
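
A quick feasibility check of the backup window, with all figures as made-up placeholders:

```python
# Back-of-the-envelope check of whether a full backup fits in the nightly window.
data_volume_tb = 40            # size of the database to back up
throughput_tb_per_hour = 3     # sustained end-to-end backup throughput
window_hours = 8               # nightly backup window

backup_hours = data_volume_tb / throughput_tb_per_hour
print(f"full backup needs {backup_hours:.1f} h against an {window_hours} h window")
if backup_hours > window_hours:
    print("A nightly full backup is not feasible; incremental, CDP, or CDM approaches are needed.")
```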

I have previously written an article, "Overview of Disaster Recovery Technology", which introduces CDP backup and CDM backup. Both serve disaster recovery well: CDP's RTO is close to zero, and CDM's RTO is also short while additionally supporting an active-active setup that puts the backup copies to effective use. Both are good choices. AISHU, a leading disaster recovery vendor in China, applies scheduled backup, CDP backup, and CDM backup together in banks' two-site three-center disaster recovery.

[Figure omitted. Source: AISHU Academy]

References:#

Ren Jie, "How to Achieve Correct Cross-Data Center Real-Time Disaster Recovery"
