I recently read the results of a survey of how more than 100 different companies managed IT service. The study covered a variety of subtopics, including questions on how IT personnel are notified of an urgent situation in the middle of the night. It turns out that manual dispatch is still a common method, and those companies that have moved beyond it by setting up an electronic dispatch system generally use set-and-forget methods that are passive and lack audit trails. Nobody can be sure the message was received, and no automatic action is taken when the dispatch isn’t accepted within a preconfigured interval.
Another startling revelation is that the vast majority of companies have never computed the cost of downtime. They know an IT outage is a major risk, and they have some idea of what industry analysts say about the cost of outages, but they’ve never quantified what such an occurrence means to their own bottom line.
Take, for example, the case where a batch process fails at 2 a.m. What is the cost of not getting the results on time and of having to run the same process again the next night? The answer to that question varies from company to company, so every company has to do the math for itself. And based on the cost you estimate, you will probably want to allocate money to minimize the chance that a failure will occur and to reduce the mean time to repair in the event something actually does go wrong.
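To make that math concrete, here is a minimal sketch in Python. The three cost components (late results, the rerun, and staff time) and every figure in it are my own illustrative assumptions, not numbers from the survey; each company would substitute its own.

```python
def failed_batch_cost(delay_hours, cost_per_delayed_hour,
                      rerun_hours, compute_cost_per_hour,
                      staff_hours, staff_hourly_rate):
    """Rough cost of one failed batch run (all inputs are hypothetical).

    Adds up three assumed components: the business cost of late results,
    the machine time to run the job again, and the staff time to fix it.
    """
    late_results = delay_hours * cost_per_delayed_hour
    rerun = rerun_hours * compute_cost_per_hour
    labor = staff_hours * staff_hourly_rate
    return late_results + rerun + labor


def expected_annual_cost(failures_per_year, cost_per_failure):
    """Expected yearly exposure: failure frequency times cost per failure."""
    return failures_per_year * cost_per_failure


# Example with made-up figures: results arrive 24 hours late, the rerun
# takes 3 machine hours, and one person spends 2 hours on the repair.
per_failure = failed_batch_cost(24, 500, 3, 200, 2, 80)
per_year = expected_annual_cost(6, per_failure)
print(per_failure, per_year)
```

Even a crude estimate like this gives you a ceiling on what it is worth spending to prevent failures and to shorten the time to repair.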
One group within IBM Global Services (IGS) in Germany has worked this out quite nicely. In their case, there’s a double whammy when a failure occurs. Because they rent mainframe time to other companies, when something goes wrong, not only is IGS impacted, there’s also a good chance their clients will feel it. For them, a failure could very easily result in lost revenue.
To minimize disruption, IGS has worked out that if a batch process crashes, they want their own IT personnel immediately notified. But if an application bombs, IGS wants to notify IT personnel from the client company that owns the application. In all cases, a precise log has to be kept so everybody knows what failed and when, and how much time it took to fix it.
As you can imagine, IGS’s requirements call for a robust system with a lot of configurable options. For different events, different groups of people need to be notified. Depending on the time of day, the day of the week and holiday schedules, a different individual within each group needs to be alerted. If the first person doesn’t respond after a certain amount of time, somebody else must be prompted. On top of that, each member of a group has his or her own preferences on how he or she receives notification. Some people want an email message, others want SMS, and still others want a voice communication. In all cases, IGS needs the notification to be reliable; that is, the sender has to know the message reached its destination and that the recipient read its contents.
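The routing rules described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names `Contact`, `Shift`, `EVENT_GROUPS`, and `on_call` are hypothetical, and nothing here reflects how the IGS system is actually built.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Contact:
    name: str
    channel: str  # preferred channel: "email", "sms", or "voice"


@dataclass(frozen=True)
class Shift:
    contact: Contact
    weekdays: frozenset  # 0 = Monday ... 6 = Sunday
    start: time
    end: time            # exclusive; start == end means round the clock
    covers_holidays: bool = False


# Hypothetical mapping from event type to the group that must be alerted.
EVENT_GROUPS = {
    "batch_failure": "igs_operations",
    "application_failure": "client_app_team",
}


def in_window(t, start, end):
    """True if t falls in [start, end); windows may wrap past midnight."""
    if start < end:
        return start <= t < end
    return t >= start or t < end


def on_call(shifts, when, holidays=frozenset()):
    """Return the Contact on duty at `when`, or None if nobody is scheduled.

    On a holiday only holiday shifts apply; otherwise the weekday of the
    alert's own calendar day is checked against each regular shift.
    """
    is_holiday = when.date() in holidays
    for shift in shifts:
        if is_holiday:
            eligible = shift.covers_holidays
        else:
            eligible = (not shift.covers_holidays
                        and when.weekday() in shift.weekdays)
        if eligible and in_window(when.time(), shift.start, shift.end):
            return shift.contact
    return None
```

A caller would look up the group for an event in `EVENT_GROUPS`, pass that group's shift list to `on_call`, and then send the message over the returned contact's `channel`.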
Being on top of things, IGS has implemented a neat solution to its requirements. They use a network management system to detect failures and to log the status of service requests. The management system interfaces with mobile middleware, which generates alarms to different employees based on a sophisticated set of rules.
When a problem occurs, such as a batch process failing, the appropriate IT person is alerted on his or her mobile device. That person must then accept or reject the request within a preconfigured amount of time. Once the problem is solved, information describing the resolution is entered so a record can be kept in the system log.
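The accept-or-reject step, combined with escalation and an audit trail, might look something like the following Python sketch. It is an assumption-laden illustration rather than a description of the IGS middleware: the `dispatch` function is hypothetical, and the `send` and `wait_for_reply` callables stand in for whatever messaging layer actually delivers the alert and collects the response.

```python
from datetime import datetime, timezone


def dispatch(event, chain, send, wait_for_reply, timeout_s=300, log=None):
    """Alert each person in `chain` in turn until one accepts the request.

    send(person, event) delivers the alert. wait_for_reply(person, timeout_s)
    returns "accepted", "rejected", or None if the timeout expired. Both are
    supplied by the caller. Every step is appended to `log` so there is an
    audit trail of who was notified and how each person responded.
    Returns (person_who_accepted_or_None, log).
    """
    log = [] if log is None else log

    def record(action, who):
        log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "action": action,
            "who": who,
        })

    for person in chain:
        send(person, event)
        record("notified", person)
        reply = wait_for_reply(person, timeout_s)
        if reply == "accepted":
            record("accepted", person)
            return person, log
        # A rejection or a timeout both escalate to the next person.
        record("rejected" if reply == "rejected" else "timed_out", person)

    record("unassigned", None)  # nobody took the request
    return None, log
```

Passing the delivery and reply functions in as parameters keeps the escalation logic separate from the channel, so the same loop works whether the alert goes out by email, SMS, or voice. A resolution note could be appended to the same log once the problem is fixed.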
While some unfortunate IT staffer might be alerted in the wee hours of the morning to fix a problem, I can assure you, everybody else in IBM Global Services now sleeps much better at night.