Mean Time to Recover (MTTR) is that great KPI that drives many Ops and ITSM people crazy. We need to reduce MTTR but we don’t have ________ (fill in the blank: people, interactions, culture, suppliers, tools). Our SLA says we will make it all OK by a certain time, and we do try, but hey, sometimes it just cannot be done. Then we complicate things with Priority 2 and 3, and I have even seen Priority 6.
The thing I love about DevOps is that it allows people to think differently. DevOps is the movement driving organisations to find new ways of collaborating, communicating, cooperating and improving the way they use technology to create and support services.
DevOps places the customer first, so what would happen if you treated ALL incidents as Priority 1? How would that change the way you think about why a service is in service, or who is supposed to do what, when and how? If you always thought in terms of getting it right and getting the service back to the customer, what would that do to their experience with you or, more importantly, their loyalty?
MTTR implies that all things are now back as they should be. DevOps looks at the flow of work and at how to obtain timely feedback so that the overall effort can be improved as far as it makes sense. Let’s consider this for a moment: the flow for MTTR is then Respond, Review, Repair, Restore, Recover, Return. You have to Respond to the alert or issue. You then need to Review the situation so that when you Repair it, the repair is done correctly. You might have to Restore information or other systems before you Recover the situation to normality. The last part is the most important: you then need to Return the service to the customer and ensure that it is as expected.
You can ask if they are satisfied if you wish. For most people, satisfaction means not having the event at all, so be careful how you couch the question. But their answer is feedback you need in order to improve. In fact, each step of MTTR needs to be measured and reviewed so that you can apply automation or better applications or infrastructure to reduce how often you need to do any MTTR at all.
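As a rough illustration of measuring each step, here is a minimal Python sketch. It assumes you can record a timestamp when the incident is detected and another as each of the six stages finishes; how you capture those timestamps is up to your own tooling. The function name `stage_durations` and the incident data are hypothetical, invented only for this example.

```python
from datetime import datetime, timedelta

# The six stages named in the post, in order.
STAGES = ["Respond", "Review", "Repair", "Restore", "Recover", "Return"]


def stage_durations(detected_at, completed_at):
    """Return how long each stage took.

    detected_at  -- when the alert or issue was raised
    completed_at -- dict mapping each stage name to the time it finished
    """
    durations = {}
    previous = detected_at
    for stage in STAGES:
        finished = completed_at[stage]
        if finished < previous:
            raise ValueError(f"{stage} finished before the previous stage")
        # A stage's duration runs from the end of the previous stage.
        durations[stage] = finished - previous
        previous = finished
    return durations


# Hypothetical incident, invented purely for illustration.
detected = datetime(2024, 1, 15, 9, 0)
completed = {
    "Respond": datetime(2024, 1, 15, 9, 5),
    "Review": datetime(2024, 1, 15, 9, 20),
    "Repair": datetime(2024, 1, 15, 10, 0),
    "Restore": datetime(2024, 1, 15, 10, 10),
    "Recover": datetime(2024, 1, 15, 10, 25),
    "Return": datetime(2024, 1, 15, 10, 30),
}

durations = stage_durations(detected, completed)
total = sum(durations.values(), timedelta())
slowest = max(durations, key=durations.get)

for stage in STAGES:
    minutes = durations[stage].total_seconds() / 60
    print(f"{stage:<8} {minutes:5.0f} min")
print(f"Total    {total.total_seconds() / 60:5.0f} min; slowest stage: {slowest}")
```

A per-stage breakdown like this gives a retrospective something concrete to discuss: in the invented incident, Repair takes 40 of the 90 minutes, so that is where automation or better tooling might pay off first.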
Throw away your SLAs. Make all events important. Brave thoughts, and maybe not practical right away, but what if you just began to think this way? Think of the cultural shift: would that not be benefit enough to continue the journey?
Let us know. Do you use MTTR? Can you measure each step and perform a retrospective? Ranger4 can help show you how.