RCA: misunderstood or even a complete stranger
Posted: June 12th, 2009 | Author: Joaquin Baez | Filed under: Uncategorized | Tags: it, itil, problem management, rca, root cause analysis

- The Spanish Deming Cycle
I’ve been working in IT for about 8 years now, always as a member of support teams (except for a couple of years when I alone was the “support team”). Over the last 2 years, beyond the mega-hype of ITIL as a magic recipe to turn IT shops into perfect machines (as if that were enough), one of the most used expressions for selling and buying IT services has been “continual improvement”.
In January 2008 I joined a support team of 5 people, where I played the systems administrator role; the customer was a private holding company that bought a managed service package from my company, including the adoption of a few ITIL v2 processes: incident, problem and change management.
In all my years in the IT field I had never come across anything like problem management, whose goal is to find the underlying root cause of major or recurring incidents and then raise a request for change (to the infrastructure, the operating procedures, the documentation or whatever) to permanently eliminate that cause and so prevent such incidents from recurring.
Processes like these are widely accepted on industrial production lines, and have been described and optimized through methodologies such as Six Sigma and many others based on the Deming Cycle of continual improvement; but, as I found out later, they turn out to be too complex to implement successfully on planet IT. I thought it was due to the zeitgeist of our customer, but after speaking to some colleagues about it (and some googling), I found out that the major handicap to overcome is that customers, providers and technicians do not truly understand what root cause analysis (RCA) means and implies. They usually understand how important it can be for improving the infrastructure and the services (and reputation) of IT; some of them are even able to measure that importance and express it in terms of TCO or ROI (and those people also understand how powerful it is to present that data to Management). But they rarely have a deep understanding of RCA, they hardly have the vision needed to face it, and they don’t know what output to expect from it or how to effectively manage the information and knowledge derived from it.
As a result, an organization adopting ITIL is often unable to complete the 4-step Deming Cycle, shifting from “Plan, Do, Check, Act” to “Plan, Do, Stop”. Some of the handicaps that an IT shop must overcome (which, rather than real obstacles, are sometimes just the IT department’s resistance to change) show up as excuses such as:
- No time: the #1 reason, repeated over and over again: “we don’t have the time”; for those teams, their current situation seems to be an endless loop of fire-fighting without the time to effectively address the issues at hand. Similarly, some IT environments reward “heroes” in a fire-fighting environment and, as a result, implicitly encourage and perpetuate “cowboy” behavior which tends to reduce any available time for proactive activities.
- No money: a change almost always requires some investment; if, as expected, an RCA results in a request for change, the need for that investment must be taken into account (an investment which will be easy to justify if the RCA has been done in a consistent and well documented way). But let’s face reality: do IT managers have access to that money? Is that money already assigned to other projects, e.g. renewing network equipment, which may have been approved with fewer arguments and less business justification than the problem we are dealing with? Do those managers have a clear vision of the actual cost of incidents, both for their own department and for the whole company (costs that include the time technicians spend solving those incidents and the lost productivity of the affected users between the crash and its resolution)?
- Chasing which incidents: help-desk agents are more concerned with solving the incidents they receive than with collecting all sorts of data about them; moreover, they are often not keen to write down the actions they take to solve them. They must classify each incident manually, and data about incident frequency is collected and processed manually too; all of this makes it difficult to know which areas must be proactively addressed first. Getting there takes a very mature incident management process, one that includes a lot of automatic incident classification, data collection and reporting.
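To make the last two excuses a bit more concrete, here is a minimal sketch of how incident data could be turned into a cost figure per category, so you know where to start an RCA and have a number to show when asking for money. Everything in it is an assumption for illustration: the `Incident` fields, the hourly rates and the sample records are made up, and a real help-desk tool would export its own fields.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical hourly rates; replace them with your organization's own figures.
TECH_HOURLY_RATE = 40.0
USER_HOURLY_RATE = 30.0


@dataclass
class Incident:
    category: str          # e.g. assigned by the help desk when closing the ticket
    tech_hours: float      # technician time spent solving the incident
    affected_users: int    # users who could not work during the outage
    downtime_hours: float  # time from crash to resolution


def incident_cost(incident):
    """Estimate cost as technician time plus lost user productivity."""
    tech_cost = incident.tech_hours * TECH_HOURLY_RATE
    user_cost = incident.affected_users * incident.downtime_hours * USER_HOURLY_RATE
    return tech_cost + user_cost


def rank_categories(incidents):
    """Return (category, {"count", "cost"}) pairs, most expensive first."""
    totals = defaultdict(lambda: {"count": 0, "cost": 0.0})
    for inc in incidents:
        totals[inc.category]["count"] += 1
        totals[inc.category]["cost"] += incident_cost(inc)
    return sorted(totals.items(), key=lambda kv: kv[1]["cost"], reverse=True)


# Made-up sample data, only to show the shape of the input.
SAMPLE = [
    Incident("mail server", 2.0, 50, 1.0),
    Incident("mail server", 1.0, 50, 0.5),
    Incident("printer", 0.5, 3, 2.0),
]

if __name__ == "__main__":
    for category, stats in rank_categories(SAMPLE):
        print(f"{category}: {stats['count']} incidents, estimated cost {stats['cost']:.2f}")
```

The category at the top of the ranking is a reasonable first candidate for an RCA, and its accumulated cost is the figure to compare against the investment the resulting change would require.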
In the real world, the reasons I’ve laid out here merge into a single question that executives, CIOs and support personnel throw like a slap in the face at those who argue for adopting problem management and RCAs: “what for?”. That makes it clear not only that they don’t understand the need to discover the root cause that triggers major incidents at their companies, but also that they lack a clear vision of the impact those incidents, whether serious or of moderate importance, have on the organization’s productivity and, worse yet, on their own department.
I’ll talk about that last point in depth, because the behaviour I despise most in the IT world is encouraging fire-fighting among support personnel and hailing the fast solvers as heroes, promoting the cowboy way of life, so that system administrators ride the IT services as if in a rodeo. That stuff must really come to an end in Spain; it’s too 90’s…