Mithi’s enterprise messaging servers are automatically monitored by external probes that scan the servers at close intervals for a response on all protocols. For any outage, the probes send out SMS alerts and also call designated mobile phones using an automated voice caller. At 2 AM on Monday morning, we received an emergency call announcing an outage on one of our mail servers. Our L1 team, who took the call, scanned the server and found the load average touching 250, with iowait hovering between 95% and 99%. This was clearly abnormal, and they at first suspected the hardware. When the issue was escalated to me, I got online, logged into the Mithi chat interface, created a group, and brought the Level 2, Level 3 and Reconfiguration teams into the group chat.
While the Reconfiguration team worked on drawing up and firming up a recovery Plan B, in case the server didn’t resume normal functionality, the other teams discussed advanced diagnostics to troubleshoot the issue in parallel.
As a base measure, we decided to stop all services and agents and kill all running or waiting application jobs and scripts, to rule out any software issue. Once this was done, the server returned to normal. A closer scan showed that the /var/log/messages log file had grown beyond 3 GB, and the agents and scripts that work on this log were reading it and loading up the I/O. The problem fed on itself: agents that were already running kept getting slower as more and more processes read the file. At a certain point this overloaded the server and brought it to a standstill. The underlying cause turned out to be an internal spam attack, whose rejection lines were being written to the log, so it kept growing.
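A check of the kind that might catch this earlier can be sketched in a few lines of Python. This is only an illustration, not the monitoring Mithi actually runs: the 1 GB warning threshold, the function names, and the "reject" search string are all assumptions chosen for the example. The file is read one line at a time, so the check itself does not add much to the I/O load it is meant to detect.

```python
import os

# Hypothetical warning threshold. The log in this incident had grown
# beyond 3 GB, so this sketch warns well before that point.
WARN_BYTES = 1 * 1024**3


def log_size_status(path, warn_bytes=WARN_BYTES):
    """Return (size_in_bytes, is_oversized) for the file at `path`."""
    size = os.path.getsize(path)
    return size, size >= warn_bytes


def count_matching_lines(path, needle):
    """Count lines containing `needle` (case-insensitive).

    The file is streamed line by line instead of being loaded whole,
    which keeps memory use flat even for multi-gigabyte logs.
    """
    needle = needle.lower()
    count = 0
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if needle in line.lower():
                count += 1
    return count
```

Calling log_size_status("/var/log/messages") from a scheduled job and alerting when the second value is true would flag runaway growth, and count_matching_lines with a string such as "reject" gives a rough idea of whether rejection lines are what is filling the file. The exact text of those lines depends on the mail server, so the search string would need adjusting.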
We got this under control in a couple of hours, thanks to close collaboration between the teams. Using chat allowed team members to perform tasks and copy and paste results and commands, while keeping their hands free from holding a phone. It also saved on phone bills.
Through the group chat, 4 different teams and a coordinator worked on the issue, with a clear, transparent flow of information keeping all team members up to date. So while Level 3 gave instructions on the diagnosis steps, Level 2 carried them out and asked Level 1 to monitor the parameters. During the discussion, many ideas were also put forward on how this could have been prevented, on the steps to diagnose it, and on forming an improved monitoring checklist. All this took only the one initial call and no more. What’s more, the chat discussion is now archived. It will serve as a base for the troubleshooting documents and for allocating preventive work.
Mithi’s Connect Xf product supports enterprise-grade secure chat.