Multi-region database disaster recovery architecture for MySQL

Businesses expect extreme reliability from the database infrastructure that their applications depend on. Despite your best intentions and careful engineering, database failures do occur, whether the cause is a machine crash or a network partition. Good planning helps you get ahead of problems and recover faster when they arise.

This post presents an approach to deploying a database architecture that implements high availability and disaster recovery for MySQL on Compute Engine, using regional disks and load balancers.

Any database architecture must provide ways to tolerate errors and recover from them quickly without losing data. The targets are expressed as RTO (recovery time objective) and RPO (recovery point objective): RTO defines how long a service may be unavailable, and RPO defines how much recent data, measured as a window of time before the failure, may be lost. Both can be defined up front and then measured after an incident.

After a database error, a database should recover as quickly as possible with as small an RTO as possible, ideally within seconds. There should be as little data loss as possible, ideally none. The desired RPO is the last consistent database state.
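To make the two objectives concrete, here is a minimal sketch that computes the actual downtime and data-loss window of one incident and compares them with the targets. The `Incident` fields, the timestamps, and the target values are illustrative assumptions, not figures from a real deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps of one hypothetical outage (all names are illustrative)."""
    last_consistent_state: datetime  # last database state that can be recovered
    failure_time: datetime           # moment the database became unavailable
    service_restored: datetime       # moment the database accepted traffic again


def downtime(incident: Incident) -> timedelta:
    """Actual recovery time, to be compared with the RTO."""
    return incident.service_restored - incident.failure_time


def data_loss_window(incident: Incident) -> timedelta:
    """Span of writes that were lost, to be compared with the RPO."""
    return incident.failure_time - incident.last_consistent_state


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """True if both the downtime and the data-loss window are within target."""
    return downtime(incident) <= rto and data_loss_window(incident) <= rpo


incident = Incident(
    last_consistent_state=datetime(2024, 1, 1, 12, 0, 0),
    failure_time=datetime(2024, 1, 1, 12, 0, 0),       # no writes lost
    service_restored=datetime(2024, 1, 1, 12, 0, 45),  # back after 45 seconds
)
print(meets_objectives(incident, rto=timedelta(minutes=1), rpo=timedelta(0)))  # True
```

An RPO of zero, as in this example, corresponds to recovering from the last consistent database state with no lost writes.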

From a database architecture and deployment perspective, this can be accomplished with two distinct concepts: high availability and disaster recovery. Use both at the same time to create an architecture prepared for the widest range of errors or incidents.

 

Creating a resilient database architecture

A high availability database architecture has database instances in two or more zones. If a server in one zone goes down or the zone becomes inaccessible, instances in the other zones are available to continue processing. The figure below shows two instances, one in zone zn1 and one in zone zn2. The load balancer in front of them redirects traffic to a healthy database instance that is available for read and write requests.
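The routing idea can be sketched in a few lines: probe each instance and send traffic to the first one that responds. This is only an illustration of the concept, not how the managed load balancer is implemented; the host names are placeholders, and a bare TCP probe is an assumption that shows only that the port is open, not that MySQL can serve queries.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_probe(endpoint: Endpoint, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def pick_healthy(
    endpoints: Sequence[Endpoint],
    is_healthy: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first endpoint that passes its health check, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical endpoints; the host names are placeholders.
endpoints = [("mysql-zn1.internal", 3306), ("mysql-zn2.internal", 3306)]

# Simulate a failure of the zn1 instance instead of probing the network.
simulated_down = {("mysql-zn1.internal", 3306)}
print(pick_healthy(endpoints, is_healthy=lambda e: e not in simulated_down))
```

Passing the health check as a parameter keeps the selection logic testable without network access.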

A disaster recovery architecture adds a second high availability database configuration in a second region. If one of the regions becomes inaccessible or fails, the other region takes over. The figure below shows two regions, primary and DR. Data is replicated from the primary region to the DR region so that the DR region can take over from the last consistent database state. The load balancer in front of the regions directs traffic to the region that currently serves read and write traffic. Here is what this architecture looks like:

In addition to the database instances, a regional disk is deployed so that data is written to two zones simultaneously and survives a zone failure. This is a significant advantage of Google Cloud, because you do not need MySQL-level replication within a region: each write to the disk is performed in two zones synchronously. When the primary instance fails, the regional persistent disk is attached to a standby instance and the database service (MySQL) is started on that instance using the same data. For high availability, you therefore do not need to worry about replication lag or the state of the database.

In a disaster recovery process view, the following events occur over time during a failure situation:

  • Normal operation of the database in steady state.
  • Failure occurs and a region becomes unavailable or the database instance is inaccessible.
  • A decision must be made whether to fail over, because the region may become available again soon enough or the instance may become responsive again.
  • DNS is updated manually so that it redirects application traffic to the second region.
  • Failing back to the primary region once it is available again is optional, because the second region is a complete deployment in its own right.
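The decision step in the list above can be sketched as a small function. This is a simplified assumption rather than a prescribed policy: it compares the total expected outage with the RTO, and it ignores the time that the DNS update itself takes to propagate. The `Decision` names and the input estimates are hypothetical.

```python
from datetime import timedelta
from enum import Enum


class Decision(Enum):
    STAY = "stay on the primary region"
    WAIT = "wait for the primary region to recover"
    FAIL_OVER = "update DNS to point at the DR region"


def decide(
    primary_responsive: bool,
    outage_so_far: timedelta,
    expected_remaining: timedelta,
    rto: timedelta,
) -> Decision:
    """Choose an action for a regional outage based on the RTO."""
    if primary_responsive:
        # The instance or region came back on its own; nothing to do.
        return Decision.STAY
    if outage_so_far + expected_remaining <= rto:
        # Recovery is expected soon enough to stay within the objective.
        return Decision.WAIT
    return Decision.FAIL_OVER


print(decide(
    primary_responsive=False,
    outage_so_far=timedelta(minutes=2),
    expected_remaining=timedelta(minutes=30),
    rto=timedelta(minutes=5),
).value)
```

In practice the estimate of the remaining outage is a human judgment, which is one reason the DNS switch in this process is manual.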

 

In a high availability process view, the following events occur over time during a failure situation:

  • Normal operation of the database in steady state.
  • The database instance fails or becomes unavailable.
  • The standby instance is started.
  • The regional SSD is attached to the standby instance and mounted, and the database is started.
  • The load balancer automatically redirects application traffic to the standby instance.
  • Once the failed or unavailable instance becomes available again, a failback may or may not occur.
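The ordering of the high availability steps can be illustrated with a small in-memory simulation. It makes no cloud API calls; the `Instance` and `RegionalDisk` classes and all resource names are hypothetical stand-ins that exist only to show the sequence.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    name: str
    zone: str
    running: bool = False
    mysql_running: bool = False


@dataclass
class RegionalDisk:
    name: str
    attached_to: Optional[str] = None  # name of the instance holding the disk


def fail_over(failed: Instance, standby: Instance, disk: RegionalDisk) -> List[str]:
    """Simulate the zone-level failover sequence and return the steps taken."""
    steps = []

    # The primary is gone; make sure it no longer counts as a serving instance.
    failed.running = False
    failed.mysql_running = False
    steps.append(f"{failed.name} marked as failed")

    standby.running = True
    steps.append(f"{standby.name} started in {standby.zone}")

    # The same regional disk moves over, so no data is copied or replayed.
    disk.attached_to = standby.name
    steps.append(f"{disk.name} attached to {standby.name}")

    standby.mysql_running = True
    steps.append(f"MySQL started on {standby.name}")

    steps.append(f"load balancer now routes to {standby.name}")
    return steps


primary = Instance("mysql-primary", "zn1", running=True, mysql_running=True)
standby = Instance("mysql-standby", "zn2")
disk = RegionalDisk("mysql-data", attached_to=primary.name)

for step in fail_over(primary, standby, disk):
    print(step)
```

Because the disk is written synchronously in both zones, the standby starts from the same data the primary last wrote, which is why no replication catch-up step appears in the sequence.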

The database architecture presented illustrates a highly available architecture supporting disaster recovery. With regional disks and load balancers, it is simple to provide a resilient database deployment.

Learn more about load balancers and regional disks. Consult the general HA and DR processes and the steps detailed in the first part of the reference guide. Try it out to familiarize yourself with the architecture as well as the two main failover processes.
