The fear of the blank problem record
Posted: June 17th, 2009 | Author: admin | Filed under: Uncategorized | Tags: itil, problem management, rca, root cause analysis | 2 Comments »
Root cause analysis (RCA) is an intuitive activity that we are naturally inclined to do in daily life. We know the techniques and methods for making deep, sharp analyses of everyday events, from enumerating causal factors to generating recommendations and implementing changes that eliminate the underlying cause.
Why, then, is it so hard to bring that ability into a work environment?
At the IT management layer, the need to integrate problem management with the rest of the IT management processes is widely acknowledged, yet managers are reluctant when it comes to actually adopting it, and hold a bunch of unfounded reasons against it. And if management is reluctant to adopt problem management, it must be expected that the technical staff who have to perform it will be too, for two main reasons:
- The first effect they perceive is an increase in workload. A small or medium-sized company will avoid dedicating resources exclusively to RCA because it is expensive, and they are right: in the short term, putting aside any calculations of ROI, VOI and similar (even more ethereal) indicators, problem management does require an initial expense. Technical staff have to combine their daily routine activities with incident support plus the investigation of major and serious incidents and, once the root cause has been found, raise the requests for change needed to eradicate it.
- The average systems administrator has no idea where to start… nor where he must end.
The fear of the blank problem record
Let’s imagine: the Service Desk is suddenly flooded with calls from users reporting that their email is not working; the Major Incident procedure is triggered; the incident is escalated to third-level support; a systems administrator resolves it; users are informed that the service has been restored.
If the organization has adopted ITIL best practices, a major incident requires opening a problem ticket right after it is detected; and there we have the systems administrator, facing the immaculate white desert of the blank problem record with one assignment: fill it in. So, where to start?
First: identify the affected service and its components. This is no trivial task if Configuration Management has not been established or if the Service Catalogue (SC) is incomplete, poor in detail or rarely updated; in short, B-rated. Working out which components of a service were affected by an incident during the problem investigation has an advantage: the result serves as an input for Service Catalogue Management to improve the SC and make it useful.
Next: document the problem record the right way. It must contain an accurate account of the chronological sequence of the incident, from first notification to resolution. The most straightforward way is to write a journalistic chronicle, newspaper style, that first sets the scene (‘what, when, where, who’) and then gives the timeline of events. Once that is done, the analytic part begins, and it is worth answering a handful of questions which:
- will help us structure the investigation
- will keep us from leaving any loose ends
- will help us identify and formalize the improvements that will shape the permanent solution, because, after all, this is the main output expected from problem management.
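The newspaper-style record described above can be sketched as a small data structure. This is only an illustration, not a format prescribed by ITIL or by any ticketing tool: the class names `ProblemRecord` and `TimelineEvent`, their fields, and the sample date are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEvent:
    """One entry in the chronological sequence of the incident."""
    when: datetime
    description: str


@dataclass
class ProblemRecord:
    """Hypothetical problem record: the scenario first, then the timeline."""
    what: str        # what happened
    when: datetime   # when it was first reported
    where: str       # affected service or component
    who: str         # who reported or detected it
    timeline: List[TimelineEvent] = field(default_factory=list)

    def add_event(self, when: datetime, description: str) -> None:
        """Append an event; order of insertion does not matter."""
        self.timeline.append(TimelineEvent(when, description))

    def chronicle(self) -> str:
        """Render the record as text: scenario, then events sorted by time."""
        lines = [
            f"What: {self.what}",
            f"When: {self.when:%Y-%m-%d %H:%M}",
            f"Where: {self.where}",
            f"Who: {self.who}",
            "Timeline:",
        ]
        for event in sorted(self.timeline, key=lambda e: e.when):
            lines.append(f"  {event.when:%H:%M} {event.description}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Sample values are invented for illustration only.
    record = ProblemRecord(
        what="Users cannot open their mailboxes",
        when=datetime(2009, 6, 17, 10, 15),
        where="Email service",
        who="Service Desk",
    )
    record.add_event(datetime(2009, 6, 17, 10, 15), "First call received")
    print(record.chronicle())
```

Keeping the scenario fields mandatory and the timeline sortable means the record cannot be saved "blank": at minimum it states what, when, where and who before any analysis starts.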
Bringing the facts to light: the truth, the whole truth and nothing but the truth
An average problem ticket would begin something like: “At 10:15 am, Joaquin Bañez called the Service Desk; he reported an error window when trying to open MS Outlook to access his mailbox; the text in the error was…”.
The investigation begins, then, by gathering all the incident records related to the problem we are dealing with; the oldest one becomes the thread we will pull to unravel the problem. It will be helpful to answer questions such as:
- When was the incident detected?
- How was it detected?
- What was done to resolve it?
- What was the impact on the user base?
- Has it happened before? Is it a recurring problem?
- Is there any way this problem could have been detected earlier?
- Is there any way this incident could have been resolved earlier?
- Was this incident documented in the Knowledge Database? If so, was the work instruction adequate to resolve the issue?
- Did any other team or partner have to get involved?
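One way to make sure none of these questions is skipped is to treat the list as a checklist attached to the problem record. The sketch below is a minimal illustration under that assumption; the names `QUESTIONS` and `unanswered` are hypothetical and not part of any ITIL tool.

```python
from typing import Dict, List

# The investigation questions from the list above, kept in the same order.
QUESTIONS: List[str] = [
    "When was the incident detected?",
    "How was it detected?",
    "What was done to resolve it?",
    "What was the impact on the user base?",
    "Has it happened before? Is it a recurring problem?",
    "Could this problem have been detected earlier?",
    "Could this incident have been resolved earlier?",
    "Was it documented in the Knowledge Database, and was the work instruction adequate?",
    "Did any other team or partner have to get involved?",
]


def unanswered(answers: Dict[str, str]) -> List[str]:
    """Return the questions that have no answer, or only a blank one."""
    return [q for q in QUESTIONS if not answers.get(q, "").strip()]


if __name__ == "__main__":
    # A single answer filled in; the rest are reported as still open.
    answers = {QUESTIONS[0]: "At 10:15 am, on the first call to the Service Desk"}
    for question in unanswered(answers):
        print("Still open:", question)
```

A problem record would then only be considered complete once `unanswered` returns an empty list.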
In follow-up posts, I’ll try to give more details on how to document, argue and answer all those questions in a methodical, structured and, above all, documented manner.
One more question for the list in “Bringing the facts to light”:
Could the incident have been avoided?
And a couple of seconds reflecting on this question gives an answer to why Problem Management is by far the most difficult process to adopt in any environment.
And if you add to the situation that the environment is a Southern one, where nobody enjoys giving feedback to other teams and peers, let alone constructive self-criticism, where still in the 21st century organisations and employees struggle to understand that FEEDBACK is part of our responsibility as professionals…
A lot needs to change in order to prepare the field for an open and honest outcome. And the issue here is that any other outcome, i.e. not a completely open and honest one, destroys the very essence of Problem Management.
[...] The fear of the blank problem record. I like how this guy is always emphasizing getting exec/management buy-in so you can get budget and time. Without leadership, all the best practices in the world will be wasted cause no one will care… to do them. [...]