VMware Cloud Disaster Recovery – Advanced Solutions Design

Previously I talked about VCDR (VMware Cloud Disaster Recovery) Solution Validation and how to properly run a proof of concept.  The reality is that while it would be nice if all applications were as straightforward as our WordPress example, there are often far more complex applications requiring far more advanced designs for a disaster recovery plan to be successful.

External Network Dependencies

For many applications, even in modern datacenters, external network dependencies or virtual machines too large for traditional replication solutions can create challenges when building disaster recovery plans.

To solve for this, third-party partners may be used to provide array-based or other host-based replication services.  This approach requires far more effort on the part of the managed services partner or the DR admin.  Since physical workloads, or workloads too large for VCDR, cannot be orchestrated through the SaaS orchestrator, they must be tested and failed over manually.  A Layer 2 VPN between the VMware Cloud environment and the partner location provides connectivity for the applications running in both environments.

Script VM

For more complex VM-only environments, scripts may need to be run during the test and recovery phases.  Similar to bootstrap scripts for operating system provisioning, these scripts may be used for basic or even advanced configuration changes.

The script VM should be an administrative or management server which is the first VM started in the test or production recovery plan.  Separate VMs may be designated for testing versus production recovery as well, enabling further isolated testing, one of the biggest value propositions of the solution.  For specifics, see the documentation: Configure Script VM.
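
As an illustration, here is a minimal Python sketch of the kind of reconfiguration a script VM might run during a test versus a production recovery. The config path, the db_host key, and the endpoint hostnames are hypothetical placeholders, not part of the VCDR product; adjust them to whatever your application actually needs.

#!/usr/bin/env python3
"""Hypothetical post-recovery script a script VM might invoke.

Assumptions (not part of the VCDR product): the recovered app reads its
database endpoint from /etc/myapp/app.conf, and the test bubble uses a
different DB hostname than production. Adjust names and paths to your app.
"""
import argparse
import fileinput
import sys

# Hypothetical endpoints for illustration only.
ENDPOINTS = {
    "test": "db.test-bubble.example.local",
    "production": "db.prod.example.local",
}

def repoint_database(conf_path: str, mode: str) -> None:
    """Rewrite the db_host= line so the app targets the right database."""
    target = ENDPOINTS[mode]
    for line in fileinput.input(conf_path, inplace=True):
        if line.startswith("db_host="):
            sys.stdout.write(f"db_host={target}\n")
        else:
            sys.stdout.write(line)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Repoint app after recovery")
    parser.add_argument("--mode", choices=ENDPOINTS, required=True)
    parser.add_argument("--conf", default="/etc/myapp/app.conf")
    args = parser.parse_args()
    repoint_database(args.conf, args.mode)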

Extending into VMC on AWS

With more complex architectures, it often makes sense to consider a mixed DR scenario.  In these cases, moving some of the applications to the VMC on AWS environment to run permanently, or leveraging another replication method outside the VCDR SaaS orchestrator, may be warranted.  While this does present some risk, since those workloads are not tested with the rest of the VCDR environment, it does provide additional options.

With the recent addition of cloud-to-cloud DR, more options were made available for complex disaster recovery solutions.  Once an environment has been migrated full time to VMC on AWS, VCDR can be leveraged as a cost-effective backup solution between regions without the need to refactor applications.

Even in advanced DR scenarios, VCDR is one of the more cost-effective and user-friendly solutions available.  With the simplicity of the cloud-based interface, policy-based protection, and recovery plans, even complex environments can take advantage of the automated testing and low management overhead.  The best and most impactful DR solution is the one which is tested and which will successfully recover in the event it is needed.


VMware Cloud Disaster Recovery – Solution Validation

In the previous post, I talked about the VCDR (VMware Cloud Disaster Recovery) solution overview.  The best way to determine if a solution is the right fit is to validate it in an environment similar to the one where it might run in production. For this post, the focus is on validating the solution and ensuring a successful POC (Proof of Concept) or Pilot.  As always, please contact me or your local VMware team if you would like to hear more, or have an opportunity to test this out in your environment.

Successful Validation Plans

The key to any successful validation plan is proper planning.  It is always best to limit testing to two or three use cases at most.  In the case of VCDR, the following tests generally make the most sense.

  • Successfully recover one or two Windows or Linux web servers – Web servers are usually fairly simple to test initially.  Linux servers are generally faster to build, and the open-source licensing makes them a good test case.
  • Successfully recover a 3-tier app – Two to three Linux VMs running a web server, an app server, and a database server (something such as WordPress) make a good candidate, since the stack is simple to set up yet produces a set of virtual machines which must be connected or the app will not work properly (a reachability sketch follows this list).
  • An addition to, or alternative for, the 3-tier app would be any similar internal application, such as a copy of production or a development system, which could be leveraged for testing.
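
To make the 3-tier test concrete, the sketch below checks that each recovered tier answers on its expected port inside the test bubble. The hostnames and ports are placeholders assumed for a WordPress-style stack; substitute the addresses your recovery plan assigns.

"""Minimal sketch: verify the recovered 3-tier app's tiers can be reached.

The hostnames and ports below are placeholders for a WordPress-style stack
(web, app, database); substitute the addresses used in your test bubble.
"""
import socket

TIERS = {
    "web": ("wordpress-web.test.local", 80),    # placeholder
    "app": ("wordpress-app.test.local", 8080),  # placeholder
    "db":  ("wordpress-db.test.local", 3306),   # placeholder (MySQL default)
}

def check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for tier, (host, port) in TIERS.items():
        status = "reachable" if check(host, port) else "UNREACHABLE"
        print(f"{tier:>3} tier {host}:{port} -> {status}")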

The purpose of the test is to demonstrate replication of the virtual machines into the DR (Disaster Recovery) environment; the actual application is less relevant than validating the functionality of the solution.

Setting up the “on premises” environment

It is critical for a POC never to connect to production. POCs are meant to be a demonstration of how things might work within a lab. The POC environment runs for a finite period, typically 14 days or less, just enough to demonstrate a few simple tests.

The lab setup for this should be very simple. A single vSphere host, or a very small isolated cluster, will suffice, with a vCenter instance and the test application installed. A small OVA will be installed in the environment as part of the POC, so there should be sufficient capacity for that as well.

One of the most critical prerequisites to address before beginning is network connectivity. For most POCs it is recommended to use a route-based VPN connection to isolate traffic, although a policy-based VPN could work.  This will generally require engaging the network and firewall teams to prepare the environment.
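
Before the POC kickoff, a simple pre-flight check can confirm that the firewall and VPN work requested from the network teams is actually in place. The sketch below attempts a TLS handshake from the lab to a list of endpoints; the hostnames are placeholders and 443 is shown only as a typical HTTPS example, so use the endpoint and port list your VMware team provides.

"""Sketch of a pre-flight check for the POC network prerequisites.

The endpoint list is a placeholder; use the hostnames and ports your
network and firewall teams approve for the VPN path (443 shown as a
typical HTTPS example, not an official requirement list).
"""
import socket
import ssl

ENDPOINTS = [
    ("vcdr-orchestrator.example.com", 443),  # placeholder hostname
    ("vmc-sddc-mgmt.example.com", 443),      # placeholder hostname
]

def tls_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Attempt a full TLS handshake to confirm the path and certificates."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    for host, port in ENDPOINTS:
        ok = tls_reachable(host, port)
        print(f"{host}:{port} -> {'ok' if ok else 'blocked or unreachable'}")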

Protecting the test workloads

The test cases above should be agreed upon.  The following is a formalized test plan that will be included in the POC test document.

  1. Demonstrate application protection and DR testing.  This will be accomplished by the following.
    1. Protect a single 3-tier application, such as WordPress, from the lab environment into the VCDR environment.
    2. Complete up to 2 Full DR Recovery Tests and demonstrate the application running in the VCDR Pilot Light Environment.

The POC is very straightforward.  Simply deploy the VCDR connector OVA to your lab vCenter, register the lab vCenter with the VCDR environment, and create the first protection group.

In the case of a POC, there will only be a single protection group.  We will add the three WordPress virtual machines to it using a name pattern based on how we named them.
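
Conceptually, the name-pattern matching works like the small sketch below: a wildcard pattern selects the demo VMs out of the vCenter inventory. The inventory names and pattern are made up for illustration; VCDR's own matching is configured in its UI, so this is only a model of the idea.

"""Illustration of how a name-pattern protection group might select VMs.

The inventory list and the pattern are made up; VCDR's matching rules are
configured in its UI, so this is only a conceptual sketch.
"""
from fnmatch import fnmatch

inventory = [
    "wordpress-web-01",
    "wordpress-app-01",
    "wordpress-db-01",
    "jumpbox-01",
    "dev-scratch-vm",
]

pattern = "wordpress-*"  # example naming convention for the demo app

protected = [vm for vm in inventory if fnmatch(vm, pattern)]
print("VMs selected for the protection group:", protected)
# -> ['wordpress-web-01', 'wordpress-app-01', 'wordpress-db-01']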

Creating a DR plan requires mapping resources from the protected site, the lab in this case, to resources in our cloud DR environment.

A key decision point is the virtual network mapping.  You can choose to use the same network mappings for failover and testing.  In the POC we can use the same networks, but for a production deployment we want to ensure they are separate, so we can run our tests in a bubble without impacting production workloads.
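
The distinction between the two mappings can be pictured as two simple lookup tables, as in this sketch. The network names are placeholders; in the product the mappings live inside the DR plan, but the point is that the test mapping should land VMs on an isolated test-bubble segment while the failover mapping targets the real production segment.

"""Conceptual sketch of separate test versus failover network mappings.

Network names are placeholders; in VCDR the mappings are defined in the
DR plan itself, so this only illustrates why the two maps should differ
in production (isolated test bubble) even if the POC uses one map.
"""
FAILOVER_MAPPING = {
    "lab-app-network": "sddc-app-network",      # production recovery target
}

TEST_MAPPING = {
    "lab-app-network": "sddc-app-test-bubble",  # isolated, no route to prod
}

def target_network(source: str, test_mode: bool) -> str:
    """Pick the recovery network for a protected VM's source network."""
    mapping = TEST_MAPPING if test_mode else FAILOVER_MAPPING
    return mapping[source]

print(target_network("lab-app-network", test_mode=True))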

Once we are all set up, the last thing to do is replicate the protection group, and then we can run our failover testing into the VMC on AWS nodes connected as VCDR Pilot Light nodes.
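
Once a failover test completes, a quick smoke test of the recovered application is a useful, repeatable success criterion. The sketch below simply checks that the WordPress front end answers over HTTP; the URL is a placeholder for whatever address the test network gives the web VM.

"""Post-failover smoke test sketch: confirm the recovered web front end
answers over HTTP. The URL is a placeholder for whatever address the
test-bubble network gives the WordPress web VM."""
from urllib.request import urlopen
from urllib.error import URLError

URL = "http://wordpress-web.test.local/"  # placeholder address

def smoke_test(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL responds with a non-error HTTP status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (URLError, OSError):
        return False

if __name__ == "__main__":
    print("recovered app responding:", smoke_test(URL))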

While this is fairly straightforward, the key to any successful POC is to have very specific success criteria.  Be sure to understand what you want to test and how you will show a successful outcome.  Provided the 3-tier app model fits your environment, this is a great use case to start with to validate the solution and get some hands-on experience.  For more hands-on experience, check out our Hands-on Lab, https://docs.hol.vmware.com/HOL-2021/hol-2193-01-ism_html_en/, and be sure to come back as we continue to look at VMC on AWS, the direction the cloud is going, and the future of VMware.


VMware Cloud Disaster Recovery – Solution Overview

In November of 2020, I changed roles at VMware to join the VMware Cloud on AWS team as a Cloud Solutions Architect.  Going forward, I intend to work on a few posts related to the VMware Cloud product set, and cloud architecture.  I am a perpetual learner, so this is my way of sharing what I am working on. I welcome comments and feedback as I share. Many of the graphics in this post were taken from the VMware VCDR Documentation.

To start with, I wanted to focus on VCDR (VMware Cloud Disaster Recovery), based on the Datrium acquisition.  To be clear, this is not marketing fluff or a corporate perspective; this is my personal opinion based on the time I have spent working with VMware customers on the VCDR product.  It may sound like marketecture, but this article lays the important foundation for the next several.

The Problem

DR (Disaster Recovery) is not an exciting topic.  It is basically the equivalent of buying life insurance; you know you should do it, but usually it is a low priority, until it isn’t.  We often think of disasters as fire, flood, earthquake, and other natural disasters, but recently malware has become the largest problem requiring a good DR plan.

When a system, or systems, are compromised, it is likely that not only the file systems are affected, but also the backups, mount points, and even the DR location.  Prevention is the best way to solve this issue, but assuming you are attacked, a good DR plan is critical to restoring services quickly and securely.

The Overview

VCDR uniquely solves this problem with immutable (read only/unchangeable) backups and continuous automated compliance checking.   

The biggest challenge is that traditional backups generally append changed blocks, which makes for a far more efficient backup solution. This lowers the cost and the time to back up while still providing point-in-time recovery. But when the backup chain is compromised, the best case is to roll back to a time before the malware arrived, assuming that copy is not somehow infected as well.

VCDR solves this problem by creating immutable point-in-time copies of the data. Since each point-in-time copy is isolated from the others, malware cannot infect previous points. The system can then pull together the partial backups to present what appears to be a full backup at any given point. Since this is all handled as a service, recovery is near instant, and the recovery admin can recover from as many points as needed to find the best point to restore to.
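
To see why isolated point-in-time copies behave differently from a single mutable backup chain, consider the toy model below. It is emphatically not how VCDR is implemented internally; it only illustrates the idea that write-once, content-addressed recovery points let you restore a clean point even after later data has been encrypted by malware.

"""Toy model (not VCDR internals) of immutable, content-addressed recovery
points: each snapshot records block hashes, and restoring any point simply
reassembles the blocks it referenced, untouched by later writes."""
import hashlib

class SnapshotStore:
    def __init__(self):
        self._blocks = {}     # hash -> block bytes (write-once)
        self._snapshots = []  # one list of block hashes per point in time

    def take_snapshot(self, blocks: list[bytes]) -> int:
        """Store only new blocks; return the snapshot's index."""
        refs = []
        for block in blocks:
            digest = hashlib.sha256(block).hexdigest()
            self._blocks.setdefault(digest, block)  # existing blocks never change
            refs.append(digest)
        self._snapshots.append(refs)
        return len(self._snapshots) - 1

    def restore(self, index: int) -> list[bytes]:
        """Reassemble the exact data captured at that point in time."""
        return [self._blocks[h] for h in self._snapshots[index]]

store = SnapshotStore()
clean = store.take_snapshot([b"config-v1", b"data-v1"])
store.take_snapshot([b"config-v1", b"ENCRYPTED-BY-MALWARE"])  # later, bad state
print(store.restore(clean))  # the clean point is still intact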

As a Service

The promise of everything as a service seems like a great idea, but in practice it can create some challenges. It requires that we trust the service and that we regularly test it. VCDR is no exception. Because it is part of the VMware Cloud portfolio, it sits adjacent to other VMware Cloud services, in particular VMware Cloud on AWS. Leveraging the Pilot Light capability, applications which are critical for recovery can be recovered directly to the cloud, while the less critical services can be brought back online in the datacenter once the problems are mitigated.

A warm DR location significantly mitigates costs, and with the “as a service” model, many of the lower value tasks such as patching and server management are handled by the service owner, VMware in this case.

Some cool details

Aside from the immutable backups, the SaaS orchestrator and the Scale-out Cloud File System provide a significant edge for many users. The SaaS orchestrator provides a simple web interface to configure protection groups.  Defining protection groups by name patterns or exclusion lists gives DR admins a simple setup, with no need to recover an onsite system or log into a new site before doing recovery.

The Scale-out Cloud File System is simply an object store which provides for far greater scale, as the name implies. Virtual machines can be powered on instantly from this cloud-based file system for testing, which removes the need for additional configuration during a declared disaster. Once the appropriate recovery point is identified, simply migrate the powered-on virtual machine back to the datacenter, or run it in the Pilot Light environment in VMC on AWS while the host is being prepared to receive the recovered VM.

Moving forward I will explore test cases for the VCDR service, where it fits within the Backup, Site Recovery Manager/Site Recovery service continuum, and even dig into the VMware Cloud on AWS services. 


Installing VMware Tanzu Basic on vSphere 7

With VMware Tanzu becoming more critical to the VMware strategy, I thought I would see what it is like to install it in my lab without any prior experience with this specific product. I plan to write a few more posts about the experience and how it relates to VMware Cloud strategy. As a disclaimer, this was done with nested virtualization, so this is not a performance test. William Lam wrote a post on an automated deployment, but I wanted to have a better understanding to share. To get myself started I watched Cormac Hogan’s video on the implementation.

Assuming the prerequisites are met, which is covered in several YouTube videos and other blogs, start by selecting “Workload Management” from the main menu in the vCenter web client. The initial choice allows you to select NSX-T, if installed; otherwise you will need to use HAProxy for the vCenter networking.

On the next screen, select your cluster, click next, and then choose your control plane size. For a lab deployment, Tiny should suffice, depending on how many workloads will be deployed in the environment. On the next screen choose your storage policy; in my lab I am using vSAN to simplify things.

For the Load Balancer section, you just need to give it a name (something simple works) and select HAProxy as the type. The Data Plane API Address is the IP of the HAProxy appliance you set up, with a default port of 5556. Enter the username and password you chose when setting up HAProxy. For the Virtual IP Address Range, pick something in the workload network, separate from the management network and outside the DHCP scope.

For the “Server Certificate Authority” field, you will need to SSH into the HAProxy VM and copy the output of “cat /etc/haproxy/ca.crt” into the field.

In the workload management section, select the management network being used for the deployment. Input the network information including the start of the IP range you want to use.

Under the workload network, select your workload network and fill in the information. This should be on a separate broadcast domain from the management network.

For the Service Network, pick something that does not conflict with your existing networks and is at least a /23 range. Add your workload network from the previous screen below.
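
Since most of my missteps were in the networking inputs, a small sanity check of the ranges before running the wizard can save a re-deploy. The sketch below encodes the rules described above; all the CIDRs are placeholders for my lab values.

"""Sketch checking the Tanzu networking inputs before running the wizard.

The CIDRs below are placeholders; the rules encoded are the ones described
above: the VIP range sits in the workload network but outside the DHCP
scope, the service CIDR is at least a /23, and nothing overlaps the
management or workload networks."""
import ipaddress

management = ipaddress.ip_network("192.168.10.0/24")   # placeholder
workload   = ipaddress.ip_network("192.168.20.0/24")   # placeholder
service    = ipaddress.ip_network("10.96.0.0/23")      # placeholder, /23 or larger
dhcp_scope = ipaddress.ip_network("192.168.20.128/25") # placeholder
vip_range  = ipaddress.ip_network("192.168.20.64/27")  # placeholder

assert service.prefixlen <= 23, "service network should be a /23 or larger"
assert not service.overlaps(management) and not service.overlaps(workload), \
    "service CIDR must not overlap existing networks"
assert vip_range.subnet_of(workload), "VIPs must come from the workload network"
assert not vip_range.overlaps(dhcp_scope), "keep VIPs outside the DHCP scope"
print("network inputs look consistent")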

Finally, select the content library you should have subscribed to already, and finish. It will take some time to provision, and you can then run Kubernetes workloads natively in the vSphere environment.

A couple of thoughts on this: the install wasn’t too bad, but it did take a while to understand the networking configuration and set everything up correctly. I had also assumed this would be a little more like VMware Integrated Containers. While I have some understanding of deploying workloads through Kubernetes, installing it involved a bit more learning. The next steps for me are to go through the deployment a few more times and then start testing some workloads.

For those of us coming from the infrastructure side of things, this is going to be a great learning opportunity, and if you are up for the challenge, Kube Academy is an exceptional, no-cost resource to learn from the experts. For those who do not have a home lab to work with, VMware also offers a Hands-on Lab for vSphere with Tanzu at no charge.


To home lab or not to home lab

As I often do, I am again debating my need for a home lab.  My job is highly technical: taking technology architecture and tying it all together with the strategic goals of my customers.  Keeping my technical skills up to date is a full-time job in and of itself, which raises the question: should I build out a home lab, or are my cloud-based labs sufficient?

One of the perks of working at a large company is the ability to use our internal lab systems.  This can also include my laptop with VMware Workstation or Fusion, which affords some limited testing capability, mostly due to memory constraints.  Most of the places I have worked have had great internal labs, demo gear, etc., which has been nice.  I have often maintained my own equipment as well, but to what end?  Keeping the equipment up to date becomes a full-time job, and adds little value to my daily work.

With the competition among cloud providers, many will offer low- or no-cost environments for testing.  While this is not always ideal, for the most part we are now able to run nested virtual systems, testing various hypervisors and other solutions.  Many companies are now providing virtual-appliance-based products which enable us to stay fairly up to date.

Of course, one of my favorites is VMware’s Hands-on Labs.  In fairness, I am a bit biased, working at VMware and with the Hands-on Labs team as often as I can.  Since a large majority of what I do centers around VMware’s technology, I will often run through the labs myself to stay sharp.

While the home lab will always have a special place in my heart, and while I am growing a rather large collection of Raspberry Pi devices, I think my home lab will be limited to smaller, lower-power devices for IoT testing for the moment.  While always subject to change, it is tough to justify the capital expenditure when there are so many good alternatives.


Enterprise Architecture – When is good enough, good enough?

In a conversation with a large customer recently, we were discussing their enterprise architecture.  A new CIO had come in and wanted to move them to a converged infrastructure.  I dug into what their environment was going to look like as they migrated, and why they wanted to make that move.  It came down to a good-enough design versus maximizing hardware efficiency.  Rather than trying to squeeze every bit of efficiency out of the systems, they were looking at how they could deploy a standard and get a high degree of efficiency, but the focus was more on time to market with new features.

My first foray into enterprise architecture was early in my career at a casino.  I moved from a DBA role to a storage engineer position vacated by my new manager.  I spent most of my time designing for performance to compensate for poorly coded applications.  As applications improved, I started to push application teams and vendors to fix the code on their side.  As I started to work on the virtualization infrastructure design for this and other companies, I took pride in driving CPU and memory as hard as I could, getting as close to maxing out the systems as possible while providing enough overhead for failover.  We kept putting more and more virtual systems into fewer and fewer servers.  In hindsight, we spent far more time designing, deploying, and managing our individual snowflake hosts and guests than what we were saving in capital costs.  We were masters of “straining the gnat to swallow the camel”.

Good enterprise design should always take advantage of new technologies.  Enterprise architects must be looking at roadmaps to prevent obsolescence.  With the increased rate of change (just think about unikernels vs. containers vs. virtual machines), we are moving faster than the hardware refresh cycles on all of our infrastructure.

This doesn’t mean that converged or hyper-converged infrastructure is better or worse; it is an option, but a restrictive one, since the vendor must certify your chosen hypervisor, management software, automation software, etc. with each part of the system they put together.  On the other hand, building your own requires you to do all of that yourself.

The best solution is going to come with compromises.  We cannot continue to look at virtual machines or services per physical host.  Time to market for new or updated features is the new infrastructure metric.  The application teams’ ability to deploy packaged or developed software is what matters.  For those of us who grew up as infrastructure engineers and architects, we need to change our thinking, change our focus, and continue to add value by being partners to our development and application admin brethren.  That is how we truly add business value.


There can be only one…or at least less than there are now.

Since the recent announcement of Dell acquiring EMC, there has been great speculation about the future of the storage industry.  In previous articles I have observed that small storage startups are eating the world of big storage.  I suspect that this trend had something to do with the position EMC found itself in recently.

Watching Nimble, Pure, and a few others IPO recently, one cannot help but notice there are still far more storage vendors standing, with new ones coming out regularly; the storage market has not consolidated as we thought it would.  During recent conversations with the sales teams for a couple of storage startups, we discussed what their act two was to be.  I was surprised to learn that for a number of them, it is simply more of the same: perhaps a less expensive solution to sell down market, perhaps some new features, but nothing really new.

Looking at the landscape, there has to be a “quickening” eventually.  With EMC being acquired, HP not doing a stellar job of marketing the 3PAR product they acquired, NetApp floundering, and Cisco killing their Whiptail acquisition, we are in a sea of storage vendors with no end in sight.  HP splitting into two companies bodes well for its storage division, but the biggest challenge for most of these vendors is that they are focused on hardware.

For most of the storage vendors, it is likely that a lack of customers will eventually drive them out of business when they finally run out of funding.  Some will survive, get acquired, or merge to create a larger storage company, and probably go away eventually anyway.  A few will continue to operate in their niche, but the ones who intend to have long-term viability are likely going to need a better act two: something akin to hyper-converged infrastructure, or more likely a move to a software approach.  While neither is a guarantee, both have higher margins and are more in line with where the industry is moving.

We are clearly at a point where hardware is becoming commoditized.  If your storage array can’t provide performance and most of the features we now assume to be standard, then you shouldn’t even bother coming to the table.  The differentiation has to be something else, something outside the norm.  Provide some additional value with the data, turn it into software, integrate it with other software, make it standards-based.  Having the best technology, the cheapest price, or simply being the biggest company doesn’t matter anymore.  Storage startups, watch out: your 800-lb gorilla of a nemesis being acquired might make you even bigger targets.  You had better come up with something now or your days are numbered.


The times they are a changin

Disclaimer: I am a VMware employee. This is my opinion, and has been my opinion for some time prior to joining the company. Anything I write on this blog may not be reflective of VMware’s strategy or their products.

With this week’s announcements from VMware, there has been a great deal of confusion about what made it into the release. So as not to add to it, I wanted to focus on something you likely missed if you weren’t watching closely. As I said in the disclaimer, this is not a VMware-specific post, but they do seem to be in the lead here.

For many years I was big on building management infrastructure. It was an easy gig as a consultant: it scales, and it is fairly similar from environment to environment. Looking back, it is a little funny to think about how hardware vendors did this. First they sell you servers, then they sell you more servers to manage the servers they sold you, plus some software to monitor it all. When we built out virtual environments we did the same thing. It was great, we used fewer physical servers, but the concept was the same.

If you pay close attention to the trends with the larger cloud providers, we are seeing a big push toward hybrid cloud. Now this is not remarkable unless we look closer at management. The biggest value of hybrid cloud used to be that we could burst workloads to the cloud. As more businesses move to some form of hybrid cloud, it seems that the larger value is not being locked into on-premises cloud management software.

At VMworld 2014, as well as during the launch this week, VMware touted their vCloud Air product. Whether you like the product or not, the thing that caught my eye is the outside model of management. Rather than standing up a management system inside the datacenter, simply lease the appropriate management systems and software. Don’t like your provider? Great, get another. Again, I want to point out that I am using VMware as my example here, but there are others doing the same thing, just not on the same scale yet.

While this is not going to be right for everyone, we need to start rethinking how we manage our environments.  The times they are a changin.


The universe is big. It’s vast and complicated and ridiculous.

As I was meeting with a customer recently, we got onto the topic of workload portability. It was interesting: we were discussing the various cloud providers, primarily AWS, Azure, and VMware’s vCloud Air, and how they, a VMware shop, could move workloads in and out of them.

Most industry analysts, and those of us on the front lines trying to make this all work, or help our customers make it work, will agree that we are in a transition phase. Many people smarter than I have talked at length about how virtualization and infrastructure as a service is a bridge to get us to a new way of application development and delivery, one where all applications are delivered from the cloud, and where development is constant and iterative. Imagine patch Tuesday every hour every day…

So how do we get there? Well, if virtualization is simply a bridge, that raises the question of workload portability, virtual machines in this case. Looking at the problem objectively, we started down that path with the Open Virtualization Format (OVF), but that requires a powered-off virtual machine which is exported, copied, and then imported to the new system, which creates the proper format as part of the import process. But why can’t we just live migrate workloads, without downtime, between disparate hypervisors and clouds?

From my perspective the answer is simple: it is coming, it has to, but the vendors will hold out as long as they can. For some companies, the hypervisor battle is still raging. I think it is safe to say we are seeing the commoditization of the hypervisor. Looking at VMware’s products, the company is moving beyond being a hypervisor company (nothing insider here, just review the expansion into cloud management, network and storage virtualization, application delivery, and so much more), and more and more they are able to manage other vendors’ hypervisors. We are seeing more focus on “Cloud Management Platforms”, and everyone wants to manage any hypervisor. It has to follow, then, that some standards emerge around the hypervisor, virtual hard drives, the whole stack, so we can start moving workloads within our own datacenters.

This does seem counterintuitive, but if we put it into perspective, there is very little advantage left in consolidation at this point. Most companies are as consolidated as they will get; we are now just working to get many of them through the final 10% or so. It is rare to find a company that is not virtualizing production workloads now, so we need to look at what is next. Standards must prevail as they have in the physical compute, network, and storage platforms. This doesn’t negate the value of the hypervisor, but it does provide for choice and differentiation around features and support.

I don’t suspect we will see this happen anytime soon, but it raises the question: why not? It would seem to be the logical progression.


EVO Rail, is technology really making things easier for us?

This week at VMworld, the announcement of what had been Project Marvin became official.  I wanted to add my voice to the debate on the use case for this, and where I believe the industry goes with products like this.  To answer the title question, EVO is a step in the right direction, but it is not the end of the evolution.  As always, I have no inside information and I am not speaking on behalf of VMware; this is my opinion on where the industry is going and what I think is cool and fun.

To understand this, we need to consider something my wife said recently.  As a teacher, she was a bit frustrated this week to return to school and find her laptop re-imaged and her printer not configured.  I tried to help her remotely, but it is something I will need to work on when I get back.  Her comment was, “Technology is supposed to make things easier”.  This stung for a moment, after all, technology is my life, but when I thought about her perspective, it struck me just how right she is.  Why, after all, shouldn’t the laptop have reached out, discovered a printer nearby, and been prepared to print to it?  After all, my iPhone and iPad can do that with no configuration on the device itself.

So what does this have to do with EVO?  If we look at EVO as a standalone product, it doesn’t quite add up.  It is essentially a faster way of implementing a product which is not too complicated to install.  I have personally installed thousands of vSphere nodes and hundreds of vCenters; it is pretty simple with a proper design.  The real value here, though, the trend, is simplification.  Just because I know how to build a computer doesn’t mean I want to.  Just because I can easily implement a massive vSphere environment doesn’t mean I want to go through the steps.  That is why scripting is so popular: it enables us to do repetitive tasks more efficiently.

The second part of this, though, really comes down to a vision of where we are going.  As an industry, we are moving to do more at the application layer in terms of high availability, disaster recovery, and performance.  We see this with the OpenStack movement, the cloud movement, Docker, and so many others.  At some point, we are going to stop worrying about highly available infrastructure.  At some point our applications will all work everywhere, and if the infrastructure fails, we will be redirected to infrastructure in another location without realizing it.

That is the future, but for now we have to find a way to hide the complexity from our users and still provide the infrastructure.  We need to scale faster, better, stronger, and more resiliently, without impacting legacy applications.  Someday we will all be free from our devices and use whatever is in our hand, or in front of us, or just get a chip in our brains; someday HA won’t be an infrastructure issue, but until then projects like EVO will help us bridge that gap.  Not perfect, arguably, but this is a bridge to get us a step closer to a better world.  At the end of the day, the more complexity we hide with software, the better off we are, provided that software is solid and we can continuously improve it.
