Business continuity has become one of the top issues facing enterprises globally. Data volumes are growing rapidly, and many enterprises now require uninterrupted access to data, 24/7.
A wide range of technical approaches can address business continuity. However, replicating data to a remote secondary site provides the most effective insurance against system downtime. Some of the questions that must be addressed when deciding on an approach for a remote DR site:
- What is the acceptable switchover time?
- What is the acceptable level of data loss, if any?
- What is the acceptable level of service functionality after switchover?
- How easy is it to maintain the DR site?
- What does it cost to set up and maintain a DR site?
- Can the remote site infrastructure be used while the primary is active?
- What is the bandwidth requirement between the sites?
- The primary site may be running a single server or a cluster of servers. The switchover should seamlessly maintain full or partial critical functionality.
- The remote DR site should ensure that the data is secured and that the functionality is preserved.
- The site may be running a locally highly available architecture (load balancers, Mithi Connect servers and shared storage), a set of individual functional servers with their own local storage, or both.
- When both sites are live, the DR site should take minimum effort to maintain.
Each element on the primary site should be replicated to the remote site so that there is flexibility in the setup on the primary site. This means that if shared storage exists, it should be replicated using the replication capability of the shared storage itself. The load balancers should replicate themselves, and the individual Mithi Connect servers should replicate themselves.
- Assuming a total disaster can strike the primary site, the DR site should be able to take over the entire load of the primary, with the entire data available.
- The switchover should not take more than 4 hours of activity.
- The DR site should be easy to maintain with minimum manual intervention.
- The switchover should seamlessly accommodate the different networks that might exist at the two sites.
Approach: Disaster recovery for functionality and data
This approach is for a site that has shared storage, load balancers, a set of servers under the load balancers, and individual functional servers. It is an active-passive setup, in which the DR site cannot function until it is activated using a defined procedure.
- All access to services and all mail routing should happen over host names instead of IPs. This allows transparent client access in case of a switchover. At a basic level, it also gives the IT team the flexibility to change the backend infrastructure without overhauling client configurations.
- High quality WAN links are available between the sites.
- The DNS server for the domain should be hosted at an ISP, away from both the primary and secondary sites.
- Local DNS servers may be required for name resolution by LAN/WAN users.
- The Primary site may have all or some of the following, depending on the design of the site.
- Internet and WAN links to connect remote servers/users and other company offices.
- Clustered load balancers
- Shared storage
- n functional Mithi Connect servers
- One Administration server (Master), which is standalone
- DNS server
- The remote site is a mirror image of the primary site.
- The DR site is set up as a mirror image of the primary site with identical components.
- The network setup is also a replica, so that remote clients and servers see no difference when a switchover happens.
- The shared storage systems are configured to replicate over their proprietary infrastructure.
- The load balancers are configured to replicate over their own replication platform.
- All the Mithi Connect Servers (Master/admin server and other non-clustered servers) replicate to their counterparts at the remote site using the DR setup of MCS (over DRBD).
The servers and the storage replicate continuously to their counterparts at the DR site over the WAN links.
The users and the administrators access the resources and services at the primary site using host names instead of IPs.
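Because the switchover depends on clients using host names, it can help to audit client and server settings for hard-coded addresses before a disaster happens. The following is a minimal sketch of such a check; the setting names, the `find_hardcoded_ips` helper and the example.com hosts are hypothetical and only illustrate the idea.

```python
import ipaddress


def find_hardcoded_ips(settings):
    """Return the setting names whose value is an IP literal, not a host name."""
    flagged = []
    for key, value in settings.items():
        # Strip a trailing ":port" from "host:port" values. Values with more
        # than one colon are treated as bare IPv6 literals.
        host = value.rsplit(":", 1)[0] if value.count(":") == 1 else value
        try:
            ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP literal, so it is a host name
        flagged.append(key)
    return flagged


# Hypothetical client configuration (documentation-only names and addresses).
client_settings = {
    "imap_server": "mail.example.com",
    "smtp_server": "192.0.2.10",
    "webmail": "connect.example.com:443",
}
print(find_hardcoded_ips(client_settings))  # ['smtp_server']
```

Any setting flagged this way would keep pointing at the primary site after a DNS change, so it should be converted to a host name in advance.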
- Software Upgrade/update: The servers at both sites need to be upgraded in tandem.
- Monitoring: Monitor the data replication and check the DR site infrastructure regularly to ensure that the DR site equipment is functioning properly.
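One way to automate the replication check is to parse the DRBD status output. The sketch below assumes the DRBD 8.x `/proc/drbd` layout, in which each resource line carries `cs:` (connection state), `ro:` (roles) and `ds:` (disk states) fields; other DRBD releases report status differently, so the pattern would need adjusting. The sample text and the `unhealthy_resources` helper are illustrative and not taken from an MCS installation.

```python
import re

# Matches resource lines such as:
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
STATUS_LINE = re.compile(
    r"^[ \t]*(\d+):[ \t]+cs:(\S+)[ \t]+ro:(\S+)[ \t]+ds:(\S+)", re.MULTILINE
)


def unhealthy_resources(proc_drbd_text):
    """Return (minor, connection_state, disk_states) for resources needing attention."""
    problems = []
    for minor, cs, _roles, ds in STATUS_LINE.findall(proc_drbd_text):
        # Healthy here means: peers connected and both disks up to date.
        if cs != "Connected" or ds != "UpToDate/UpToDate":
            problems.append((int(minor), cs, ds))
    return problems


# Illustrative status text; on a real server this would be read from /proc/drbd.
SAMPLE = """\
version: 8.4.11 (api:1/proto:86-101)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown A r-----
"""
print(unhealthy_resources(SAMPLE))  # [(1, 'WFConnection', 'UpToDate/DUnknown')]
```

A check like this could run on a schedule and raise an alert whenever the returned list is not empty. Note that a resource in the middle of a resync is also reported, which is usually what an administrator wants to know.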
On Primary failure
- Activate the Secondary for the Mithi Connect Servers. The resources now function over new IPs.
- Follow procedure to activate the Load balancers and the shared storage.
- Change the DNS (A records and MX records) to point to the new IPs. The host names insulate the users and servers from the change in network setup that occurs after switchover.
- Take the DR site live.
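Before taking the DR site live, it is worth confirming that the host names really resolve to the new IPs. The following is a minimal sketch of such a check using the standard library resolver; the host name, the addresses and the `resolves_to` helper are hypothetical. It covers A records only, so MX records would need a separate lookup, and cached answers may still show the old address until the record's TTL expires.

```python
import socket


def resolves_to(hostname, expected_ips, resolver=socket.gethostbyname_ex):
    """Return True if every address returned for hostname is in expected_ips."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:  # covers socket.gaierror (name not found, resolver down)
        return False
    return bool(addresses) and set(addresses) <= set(expected_ips)


# A stub resolver keeps the example runnable without network access.
def stub_resolver(hostname):
    return (hostname, [], ["198.51.100.20"])


DR_SITE_IPS = {"198.51.100.20", "198.51.100.21"}  # hypothetical DR addresses
print(resolves_to("mail.example.com", DR_SITE_IPS, resolver=stub_resolver))  # True
```

In real use the `resolver` argument would be left at its default, and the check repeated for every service host name published to users.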
Addressing the challenges for a business
What’s the acceptable switchover time?
Such an approach may take approximately 2-4 hours to switch over and be fully functional. This includes the service switchover and the DNS changes.
What’s the acceptable level of data loss, if any?
The level of data loss depends on the data replication technology deployed. In this setup DRBD replicates asynchronously: a write completes on the primary first, and the corresponding write to the secondary lags behind. The lag is a function of the network capacity available for replication between the sites. Our tests on the DRBD platform indicate that the lag can be minimised with a strong network. The replication technology used by the load balancers and the storage system needs to be investigated to understand the level of data loss if the primary site is disrupted.
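The relationship between write load, link capacity and replication lag can be illustrated with simple arithmetic. The sketch below is a simplified model, not a description of DRBD internals: it assumes steady write and link rates, and all figures and function names are hypothetical. Whatever backlog exists at the moment the primary fails is the data at risk.

```python
def backlog_after(write_mbps, link_mbps, seconds, start_backlog_mbit=0.0):
    """Unreplicated data (megabits) after `seconds` of sustained writes."""
    growth = (write_mbps - link_mbps) * seconds
    return max(0.0, start_backlog_mbit + growth)


def drain_time(backlog_mbit, write_mbps, link_mbps):
    """Seconds needed to clear the backlog, or None if the link never catches up."""
    spare = link_mbps - write_mbps
    if spare <= 0:
        return None if backlog_mbit > 0 else 0.0
    return backlog_mbit / spare


# A 10-minute burst of 12 Mbps writes over an 8 Mbps link...
backlog = backlog_after(write_mbps=12, link_mbps=8, seconds=600)
print(backlog)  # 2400.0 megabits, i.e. 300 MB not yet at the DR site

# ...followed by a quiet period of 2 Mbps writes.
print(drain_time(backlog, write_mbps=2, link_mbps=8))  # 400.0 seconds
```

The model shows why a stronger link reduces lag: the larger the spare capacity over the normal write rate, the smaller the backlog after a burst and the faster it drains.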
What’s the acceptable level of service functionality after switchover?
Using this approach, full functionality is restored.
How easy is it to maintain the DR site?
Regular monitoring is required to ensure that the replication is happening normally. Using a well-defined maintenance procedure, perform periodic checks on the infrastructure at the DR site.
What does it cost to set up and maintain a DR site?
The cost will be the same as the cost of setting up the primary site. One way to save cost is to build the DR site for a lower capacity, for example by reducing the number of active servers under the load balancer. This is not recommended, since the DR site has to be able to support the entire operation for an unknown period of time, and it is possible only if the primary site was oversized in the first place.
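Whether a reduced DR site is feasible can be estimated with a rough capacity check. The sketch below is only illustrative: the load figures, the per-server capacity, the headroom factor and the `servers_needed` helper are assumptions that would have to come from measurements at the primary site.

```python
import math


def servers_needed(peak_load, per_server_capacity, headroom=0.2):
    """Smallest server count that carries peak_load while keeping spare headroom."""
    usable = per_server_capacity * (1 - headroom)
    return math.ceil(peak_load / usable)


# Hypothetical figures: 9000 concurrent users at peak, 3000 users per server.
primary_servers = 6
minimum = servers_needed(peak_load=9000, per_server_capacity=3000)
print(minimum)                    # 4
print(minimum < primary_servers)  # True: the primary was oversized
```

If the minimum equals the number of servers at the primary site, there is no room to shrink the DR site without degrading service after a switchover.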
Can the remote site infrastructure be used while the primary is active?
One of the goals is to simplify the setup to prevent complications in maintenance and switchover. Setting up an active-active site would complicate matters, since data (mail, calendar, etc.) would be splintered across two locations. Attempting to partially use the infrastructure for a specific function, such as mail routing, could further complicate the setup. Keeping this in mind, we feel it is not viable to have any element of the DR site active during normal functioning.
What is the bandwidth requirement between the sites?
This would be a function of the number of elements at the sites and the overall size of the data to be replicated. Typically we would start with at least a 4-8 Mbps link.
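A first estimate of the link size can be derived from the volume of data that changes each day and from the total data that must be copied when the DR site is first seeded. The sketch below is a rough sizing aid under stated assumptions: decimal units (1 GB = 8000 megabits), an assumed 70 percent usable share of the nominal link speed, and hypothetical data volumes and function names.

```python
def required_mbps(changed_gb_per_day, replication_hours=24.0, efficiency=0.7):
    """Nominal link speed (Mbps) needed to ship one day's changes in the window."""
    megabits = changed_gb_per_day * 8 * 1000  # GB -> megabits, decimal units
    seconds = replication_hours * 3600
    return megabits / seconds / efficiency


def initial_sync_hours(total_gb, link_mbps, efficiency=0.7):
    """Hours needed to copy the full data set over the link for the first time."""
    megabits = total_gb * 8 * 1000
    return megabits / (link_mbps * efficiency) / 3600


# Hypothetical site: 20 GB of changed data per day, 500 GB in total.
print(round(required_mbps(20), 2))           # 2.65
print(round(initial_sync_hours(500, 8), 1))  # 198.4
```

The second figure suggests why the first full copy is often the harder constraint: a link that comfortably carries the daily changes may still need several days to seed the DR site.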
Disaster Recovery for only functionality
This approach is for a site that is typically used only for mail routing. In this type of setup there would be no shared storage, but all the other elements might be present. The approach is similar to the first approach above: each element in the infrastructure maintains its clone at the DR site using the appropriate replication platform.
This is achieved by having a parallel site, with additional servers, linked into the enterprise network. The servers at this DR site are an extension of the enterprise network, so the DR site is technically another site to which the master must replicate the directory. These additional servers are capable of routing mail, just like the servers at the primary site. The DNS is configured to point MX2/3 to this site so that if the primary site fails (even temporarily), mail continues to arrive at the DR site and is routed to the other regional sites. Mail meant for the servers at the primary site is queued until the primary comes back up. The server can also be configured to leave a copy of the mail on the DR site servers for access by the primary site users (they will, of course, not have access to their old mail).
Please note that access to the DR site will be via a different URL (one or more, depending on the type of setup at the DR site).
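The MX fallback described above relies on sending servers trying MX hosts in order of preference, lowest value first. The sketch below models that selection in a simplified way; the host names and preference values are hypothetical, and real mail servers add details such as retry queues and random ordering among hosts with equal preference.

```python
def delivery_order(mx_records):
    """Hosts in the order a sending server tries them (lowest preference first)."""
    return [host for _preference, host in sorted(mx_records)]


def first_reachable(mx_records, is_up):
    """Return the first MX host that accepts mail, or None if all are down."""
    for host in delivery_order(mx_records):
        if is_up(host):
            return host
    return None


# Hypothetical records: MX1 at the primary site, MX2 and MX3 at the DR site.
mx = [
    (20, "mx2.dr.example.com"),
    (10, "mx1.primary.example.com"),
    (30, "mx3.dr.example.com"),
]

down = {"mx1.primary.example.com"}  # simulate a primary site failure
print(first_reachable(mx, lambda host: host not in down))  # mx2.dr.example.com
```

With the primary site up, the same call returns the MX1 host, so the DR servers only receive mail when the primary is unreachable.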