One of our large customers, whose mail servers are hosted on a VMware infrastructure, recently performed maintenance on their shared storage and on the VMware platform. After this activity, it was found that all servers except the core mail server booted up fine. The core mail server was failing to bind its IP address to the Ethernet interface and would halt during boot. On one hand, the call was escalated to the VMware support desk for their inputs. On the other hand, when the call landed on Mithi’s Level 1 team, they got busy trying to boot the server in single-user mode, eliminating services from the boot stack one at a time. While the Level 1 team worked with the Level 2 and Level 3 teams to get the mail server booted up (Plan A), once a certain time had elapsed with no success, it was time to kick in Plan B: rebuilding the mail server on a fresh VM from the latest backup. For this, the coordinator of the activity from Mithi aligned the reconfiguration/deployment team to work on Plan B in parallel.
For teams locked into action on Plan A, it is difficult to think of or work on an alternative plan. In fact, they are likely to get deeper and deeper into the troubleshooting track they are on, and keep feeling that they are nearing a solution.
While handling emergencies, it’s important for one person or a separate team to observe the proceedings and ensure that work on Plan B kicks in on a fixed schedule. Plan B could well be an alternative troubleshooting track or a fallback to the DR plan.
This coordinator’s job is to ensure results, measured in terms of increased uptime for the customer.
So here are some suggestions on how to handle emergencies in a structured fashion:
- The first call is received by Level 1, who does basic diagnosis to revive the service/system, sometimes even resorting to a reboot if necessary.
- If Level 1 is unable to resolve the issue using basic diagnosis, or if a predefined time interval has elapsed without a solution, the Level 1 team escalates to the Level 2 team for advanced diagnostics.
- The Level 2 team may involve the Level 3 team after the advanced diagnostics are done, or if another predefined time interval has elapsed without a solution. At this stage, the Service Delivery Manager is also alerted.
- The Service Delivery Manager (SDM) takes stock of the situation, and formulates or refers to Plan B.
- The SDM aligns a separate team to work on Plan B, which is typically a recourse to a DR plan (on-site or off-site).
- The SDM now monitors the functioning of the teams on both plans.
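The time-boxed escalation ladder above can be sketched as a small routine that maps elapsed time to the team that should own the incident. This is only an illustrative sketch: the level names mirror the steps above, but the timeout values are assumptions for the example, not Mithi’s actual SLA numbers.

```python
from dataclasses import dataclass

# Illustrative escalation ladder: (team, time budget in minutes).
# The minute values are assumed for the example only.
ESCALATION_LADDER = [
    ("Level 1", 30),   # basic diagnosis, possibly a reboot
    ("Level 2", 45),   # advanced diagnostics
    ("Level 3", 60),   # deep troubleshooting; SDM alerted on entry
]

@dataclass
class Incident:
    elapsed_minutes: int

def current_owner(incident: Incident) -> str:
    """Return which team should own the incident based on elapsed
    time alone, escalating when each level's budget is exhausted."""
    budget = 0
    for level, minutes in ESCALATION_LADDER:
        budget += minutes
        if incident.elapsed_minutes < budget:
            return level
    # All budgets exhausted: the SDM aligns a separate team on
    # Plan B (typically the DR plan) while Plan A continues.
    return "Plan B (SDM + DR team)"
```

For example, `current_owner(Incident(10))` returns `"Level 1"`, while an incident that has run past the combined budgets (here, 135 minutes) returns the Plan B marker, capturing the point of this post: escalation to Plan B should be driven by the clock, not by how close Plan A feels to a fix.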
Do you have a Plan B, C, D for the upkeep of your collaboration infrastructure?
It is a good practice to have a DR strategy for your mail server setup. This could range from a simple server/site rebuild document, to an on-site hot-standby server or high-availability configuration for your servers, to a complete DR site.
It is also a good practice to perform DR drills at periodic intervals, which ensures that you have all the necessary tools and documents ready if and when they are needed.
P.S. At the time of publishing this, the Mithi teams and the HCL team (FMs on site) have successfully brought the mail server back online using Plan B and a lot of collaboration through the night. Well done, guys!