VMware Cloud Disaster Recovery – Advanced Solutions Design

Previously I talked about VCDR (VMware Cloud Disaster Recovery) Solutions Validation, and how to properly run a proof of concept.  The reality is that while it would be nice if all applications were as straightforward as our WordPress example, many applications are far more complex and require more advanced designs for a disaster recovery plan to be successful.

External Network Dependencies

For many applications, even in modern datacenters, external network dependencies, or Virtual Machines which are too large for traditional replication solutions, can create challenges when building disaster recovery plans.

To solve for this, third-party partners may be used to provide array or host-based replication services.  This solution requires far more effort on the part of the managed services partner or the DR admin.  Since physical workloads, and workloads which are too large for VCDR, cannot be orchestrated through the traditional SaaS orchestrator, there is an additional requirement to both test and fail over those workloads manually.  A Layer 2 VPN between the VMware Cloud environment and the partner location provides the connectivity for the applications running in both environments.

Script VM

For the more complex VM-only environments, some scripts may need to be run during the test and recovery phases.  Similar to bootstrap scripts for operating system provisioning, these scripts may be used for basic or more advanced configuration changes.

The script VM should be an administrative or management server, and it should be the first VM started in either the test recovery plan or the production recovery plan.  Separate VMs may be designated for testing versus production recovery as well, enabling further isolated testing, one of the biggest value propositions of the solution.  Further documentation is available under Configure Script VM.
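The ordering described above can be sketched in a few lines of Python.  This is only an illustration, not the VCDR API: the function `build_plan`, the step names, and all VM and script names are hypothetical, and running the scripts after the other VMs are powered on is an assumption made for the example.

```python
def build_plan(vms, script_vm, scripts):
    """Return an ordered list of (action, target) steps.

    The script VM is powered on first, then the remaining VMs in their
    given order, then the scripts are run (hypothetical ordering).
    """
    if script_vm not in vms:
        raise ValueError(f"script VM {script_vm!r} is not in the plan")
    steps = [("power_on", script_vm)]
    steps += [("power_on", vm) for vm in vms if vm != script_vm]
    steps += [("run_script", script) for script in scripts]
    return steps


# Hypothetical inventory with separate script VMs for test and production.
vms = ["web-01", "db-01", "mgmt-test", "mgmt-prod"]

test_plan = build_plan(
    [vm for vm in vms if vm != "mgmt-prod"], "mgmt-test", ["update_dns_test.sh"]
)
prod_plan = build_plan(
    [vm for vm in vms if vm != "mgmt-test"], "mgmt-prod", ["update_dns_prod.sh"]
)
```

Keeping the test and production script VMs apart in this way means a test run never touches the scripts or credentials used for a real failover.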

Extending into VMC on AWS

With more complex architectures, it often makes sense to consider a mixed DR scenario.  In these cases, moving some of the applications to the VMC on AWS environment to run permanently, or leveraging another method for replicating outside the traditional VCDR SaaS orchestrator, may be warranted.  While this does present some risk, since these workloads are not tested with the rest of the VCDR environment, it does provide additional options.

With the recent addition of Cloud to Cloud DR, more options were made available for complex disaster recovery solutions.  Once an environment has been migrated full time to VMC on AWS, VCDR can be leveraged as a cost effective backup solution between regions without a need to refactor applications.

Even in advanced DR scenarios, the VCDR solution is one of the more cost effective and user friendly options available.  With the simplicity of the VMC on AWS cloud based interface, and policy based protection and recovery plans, even more complex environments can take advantage of the automated testing and low management overhead.  The best and most impactful DR solution is the one which is tested and which will successfully recover when it is needed.


VMware Cloud Disaster Recovery – Solution Overview

In November of 2020, I changed roles at VMware to join the VMware Cloud on AWS team as a Cloud Solutions Architect.  Going forward, I intend to work on a few posts related to the VMware Cloud product set, and cloud architecture.  I am a perpetual learner, so this is my way of sharing what I am working on. I welcome comments and feedback as I share. Many of the graphics in this post were taken from the VMware VCDR Documentation.

To start with, I wanted to focus on VCDR (VMware Cloud Disaster Recovery), which is based on the Datrium acquisition.  To be clear, this is not marketing fluff and it is not a corporate perspective; it is my personal opinion based on the time I have spent working with VMware customers on the VCDR product.  I admit this may sound like marketecture, but this article lays the important foundation for the next several.

The Problem

DR (Disaster Recovery) is not an exciting topic.  It is basically the equivalent of buying life insurance; you know you should do it, but usually it is a low priority, until it isn’t.  We often think of disasters as fire, flood, earthquake, and other natural disasters, but recently malware has become the largest problem requiring a good DR plan.

When a system, or systems, are compromised, it is likely that not only the file systems are compromised, but also the backups, mount points, and even the DR location.  Prevention is the best way to solve this issue, but assuming you are attacked, a good DR plan is critical to restoring services quickly and securely.  

The Overview

VCDR uniquely solves this problem with immutable (read only/unchangeable) backups and continuous automated compliance checking.   

The biggest challenge is that when we do backups we are generally appending changed blocks, which makes for a far more efficient backup solution. This lowers the cost and the time to back up while still providing a point in time recovery solution. When the backup is compromised, the best case with this approach is to go back to a point in time before the malware arrived, assuming that point is not itself infected somehow.

VCDR solves this problem by creating an immutable point in time copy of the data. Since each point in time copy is isolated from the others, malware cannot infect the previous points. The system can then pull together all the partial backups to make what appears to be a full backup at any given point. Since this is all being handled as a service, recovery is near instant, and the recovery admin can recover from as many points as needed to find the best point to restore to. 
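To make the idea concrete, here is a minimal Python sketch of read-only point in time copies being combined into what looks like a full backup.  This is a toy model, not how VCDR is implemented internally: the functions `make_snapshot` and `synthesize_full`, the block IDs, and the data values are all made up for illustration.

```python
from types import MappingProxyType


def make_snapshot(changed_blocks):
    """Freeze {block_id: data} into a read-only view so it cannot be edited later."""
    return MappingProxyType(dict(changed_blocks))


def synthesize_full(snapshots, point):
    """Merge snapshots 0..point (inclusive); later snapshots win for each block."""
    full = {}
    for snapshot in snapshots[: point + 1]:
        full.update(snapshot)
    return full


snapshots = [
    make_snapshot({0: "boot", 1: "app-v1", 2: "data-v1"}),  # initial copy
    make_snapshot({2: "data-v2"}),                          # only block 2 changed
    make_snapshot({1: "ENCRYPTED", 2: "ENCRYPTED"}),        # malware hits here
]

# Recover from the last point before the malware arrived.
clean = synthesize_full(snapshots, 1)
```

Because the earlier snapshots are read-only, the infected third snapshot cannot alter them, and the recovery admin can synthesize a full image from any earlier point to find the best one to restore.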

As a Service

The promise of everything as a service seems like a great idea, but in practice it can create some challenges. It requires that we trust the service, and that we regularly test the service. VCDR is no exception. Because VCDR is a part of the VMware Cloud portfolio, it sits adjacent to other VMware Cloud services, in particular VMware Cloud on AWS. Leveraging the Pilot Light service, some applications which are critical for recovery can be recovered directly to a cloud based service while the less critical services can be brought back online in the datacenter once the problems are mitigated.

Providing a warm DR location significantly reduces the costs, and with the “as a service” model, many of the lower value tasks such as patching and server management are handled by the service owner, VMware in this case.

Some cool details

Aside from the immutable backups, the SaaS orchestrator and the Scale-out Cloud File System provide a significant edge for many users. The SaaS orchestrator provides a simple web interface to configure protection groups.  Setting the protection groups by name patterns or exclusion lists gives DR admins a simple setup, with no need to recover an onsite system or log into a new site before doing recovery.
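As a rough illustration of how a name pattern plus an exclusion list can define a protection group, here is a short Python sketch.  It is not the VCDR orchestrator's own logic or syntax: the function `select_vms`, the use of shell-style wildcards, and the VM names are all assumptions for the example.

```python
from fnmatch import fnmatchcase


def select_vms(vm_names, include_patterns, exclude_names=()):
    """Return VMs matching any include pattern, minus explicitly excluded names."""
    excluded = set(exclude_names)
    return [
        name
        for name in vm_names
        if name not in excluded
        and any(fnmatchcase(name, pattern) for pattern in include_patterns)
    ]


# Hypothetical inventory and group definition.
inventory = ["prod-web-01", "prod-web-02", "prod-db-01", "test-web-01"]
group = select_vms(inventory, ["prod-*"], exclude_names=["prod-web-02"])
```

The appeal of pattern based groups is that a newly deployed VM which follows the naming convention is picked up without anyone editing the group by hand.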

The Scale-out Cloud File System is simply an object store which provides for far greater scale, as the name implies. For instant power on to test virtual machines, this cloud based file system will mitigate the need for the additional configuration during a declared disaster. Once the appropriate recovery point is identified, simply migrate the powered on Virtual Machine back to the datacenter, or run it in the Pilot Light environment in VMC on AWS while the host is being prepared to receive the recovered VM.

Moving forward I will explore test cases for the VCDR service, where it fits within the Backup, Site Recovery Manager/Site Recovery service continuum, and even dig into the VMware Cloud on AWS services. 
