Root Cause Failure Analysis Overview
Root Cause Failure Analysis (RCFA) is a systematic, structured process for identifying the most effective solutions to eliminate, or to manage to the satisfaction of the asset owner, the causes of failures, events or incidents that prevent an asset from achieving and sustaining its business objectives. RCFA is an important part of the continuous improvement process: the aim is not just to “put out fires” but to prevent them from happening or reoccurring.
The goal of an RCFA is to identify:
- What happened
- How it happened
- Why it happened
- Actions to prevent reoccurrence
There are many different root cause failure analysis methodologies. The “5 Whys” technique was first used during the development of Toyota’s manufacturing process in the 1950s. Other tools include:
- Barrier analysis
- Change analysis
- Defect elimination
- Cause and Effect analysis
- Fishbone (Ishikawa)
- Kepner-Tregoe
ACTOR supports both the informal "5 Whys" process and the formal RCFA process using cause and effect diagrams. The “Informal” approach is used when the effect of the event or failure falls inside the threshold or trigger criteria. It is normally used for on-the-job problem solving or when more detailed feedback is required for work management and control.
The Formal RCFA process should be used on critical equipment failures or failures with a complex
interaction of causes. This approach is used when the effect of the event or failure falls outside
the threshold or trigger criteria requiring a formal RCFA approach.
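The choice between the two approaches can be sketched as a simple decision rule. This is only an illustration and not how ACTOR implements it: the `FailureEvent` fields, the downtime measure and the 8-hour trigger value are hypothetical, since each organization defines its own threshold or trigger criteria.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    description: str
    downtime_hours: float      # hypothetical measure of the effect of the event
    critical_equipment: bool   # failure occurred on critical equipment
    complex_causes: bool       # a complex interaction of causes is suspected


# Hypothetical trigger value; real criteria are site-specific.
DOWNTIME_TRIGGER_HOURS = 8.0


def select_rcfa_approach(event: FailureEvent) -> str:
    """Return 'formal' or 'informal' for the given event."""
    outside_threshold = event.downtime_hours > DOWNTIME_TRIGGER_HOURS
    if outside_threshold or event.critical_equipment or event.complex_causes:
        return "formal"    # cause and effect diagrams
    return "informal"      # 5 Whys


minor = FailureEvent("Seal leak on utility pump", 2.0, False, False)
print(select_rcfa_approach(minor))
```

An event that stays inside the threshold and involves neither critical equipment nor interacting causes is routed to the informal 5 Whys process; anything else is routed to the formal process.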
RCFA is usually triggered by an undesirable event, which can manifest in different ways.
The two main categories are sporadic events and chronic events.
Sporadic events are normally once-off failures that lead to a high-consequence event. Chronic
failures occur over a period of time and may accumulate until the situation becomes
unacceptable to the organization.
- Define the Undesirable Event.
It is important that the actual event is clearly defined using objective measures, because
different people often look at the same problem but interpret it differently.
When defining the actual problem, do not try to identify the solution - DEFINE THE PROBLEM
ONLY! The Undesirable Event description should:
- Provide a detailed description of the event as it was observed or experienced (based
on facts and not assumptions).
- Describe HOW the event took place, not WHY.
- State where and when the event occurred, as well as the timing, sequence of events
and pattern or trend of the undesirable event.
- Define equipment and boundaries for analysis.
Once the Undesirable Event has been defined, it is important to define the equipment, system or
process involved and the boundaries of the analysis.
The boundaries of the system under examination may expand once the analysis is underway
because the underlying root cause of the problem may be within an external system, e.g.,
“insufficient instrument air supply”. Regardless, it is still important to manage the boundaries
of the analysis.
- Define the Problem statement and/or performance gap.
The problem statement is a functional description of the Undesirable Event/Failure. The
statement includes:
- The definition of the function that was affected by the undesirable event - What is the
equipment, process or system supposed to do or what is the equipment not doing
because of the Undesirable Event?
- The impact on safety (if any).
- The impact on the environment (if any).
- The impact on production including loss of production.
A function statement always starts with the word “TO”, contains a “VERB” and an “OBJECT”, and
includes at least one “PERFORMANCE STANDARD.”
For example: “To safely transport 20 tons of broken coal from coalface to the tipping point at
a maximum speed of 2.1 mph.”
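The structure of a function statement can be sketched as a small record that refuses to build a statement without a performance standard. The class and field names are hypothetical, and the way the example sentence is split into verb, object and standard is one possible reading rather than a prescribed one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FunctionStatement:
    verb: str
    obj: str
    performance_standards: List[str]

    def __post_init__(self) -> None:
        # A function statement needs a verb, an object and at least one standard.
        if not self.verb.strip() or not self.obj.strip():
            raise ValueError("a verb and an object are required")
        if not self.performance_standards:
            raise ValueError("at least one performance standard is required")

    def render(self) -> str:
        """Build the statement, always starting with the word 'To'."""
        standards = " and ".join(self.performance_standards)
        return f"To {self.verb} {self.obj} {standards}."


statement = FunctionStatement(
    verb="safely transport",
    obj="20 tons of broken coal from coalface to the tipping point",
    performance_standards=["at a maximum speed of 2.1 mph"],
)
print(statement.render())
```

Writing the function this way makes the performance gap easier to state later, because each standard is a separate, checkable item.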
- Perform a Combined Cause and Effect analysis.
Failures can be caused by a combination (chain) of events or a system of causes and
conditions; there is not necessarily one correct answer to the problem.
This requires a thinking process that first views the problem holistically (define and quantify
the problem) and then analyzes the relationship between the causes and conditions of the
problem.
For example: What is the cause of a car being stuck on the highway?
Start by listing the “obvious” things that can cause the problem and then work down the chain.
We need to understand the relationship (links) between the different causes and contributing
conditions and then determine the most effective way of breaking the link(s) in the chain to
prevent the Undesirable Event.
The main characteristics of the Combined Cause & Effect Analysis:
- Causes can become effects depending on the level of analysis.
- Causes and effects can become a “never ending” chain of causes and effects.
- We need to establish whether specific conditions, other than the cause, could have
contributed to the previous-level effect; these are Contributing Conditions.
- Sometimes the presence of a functional secondary protective system or device could
have prevented the chain of events from developing. Examining this is called “Barrier
Analysis.” For example, a breaker could have prevented a fault, or a pressure relief
valve could have prevented overpressure.
- An effect exists only if its causes and conditions exist at the same point in time and
space.
- Cause and effect analysis should consider all categories of possible causes.
Barrier Analysis:
Sometimes the presence of a functional secondary protective system or device (barrier) can
prevent the effect or could reduce its severity.
To complete the Cause-and-Effect Analysis, initial “assumptions” must be identified to ensure
an auditable and credible result.
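A combined cause and effect analysis can be pictured as a tree in which every cause may itself be an effect of deeper causes. The sketch below is a minimal illustration under stated assumptions: the `Node` structure and its field names are hypothetical, and the causes listed for the stuck car are invented for the example, not taken from a real analysis.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    description: str
    causes: List["Node"] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)       # contributing conditions
    failed_barriers: List[str] = field(default_factory=list)  # barriers that did not act
    assumption: bool = False  # True until verified with evidence


def cause_chains(node: Node, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every chain from the top effect down to a lowest-level cause."""
    path = path + (node.description,)
    if not node.causes:
        yield path
    for cause in node.causes:
        yield from cause_chains(cause, path)


def unverified(node: Node) -> List[str]:
    """List the nodes that are still assumptions, for the audit trail."""
    found = [node.description] if node.assumption else []
    for cause in node.causes:
        found.extend(unverified(cause))
    return found


# Illustrative causes only.
event = Node("Car stuck on the highway", causes=[
    Node("Engine stopped", causes=[
        Node("Fuel tank empty",
             conditions=["Long distance between fuel stations"],
             failed_barriers=["Low-fuel warning light not working"],
             assumption=True),
    ]),
    Node("Flat tire",
         failed_barriers=["No usable spare tire"],
         causes=[Node("Puncture from road debris")]),
])

for chain in cause_chains(event):
    print(" <- ".join(chain))
print("Still to verify:", unverified(event))
```

Each printed chain is a candidate place to break a link, and the list of unverified nodes shows which assumptions still need evidence before the result can be considered auditable.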
- Identify Effective Solutions.
- An effective solution is a specific action that when applied to the Root Cause of an
Undesirable Event results in the elimination of the Cause, or the management of the
effect(s) of the Root Cause, to a level defined as tolerable.
- Reliability Centered Maintenance (RCM) proactively seeks to identify failure management
strategies to manage the risks associated with likely Failure Causes or Failure Modes.
Similarly, RCFA reactively seeks to manage the risks associated with Failure Modes that
have already occurred by considering the same strategies.
- Implement Effective Solutions.
Implementing the recommended Effective Solutions is probably the most important step in the
process. This will not only ensure the return on investment and realized opportunities, but also
the total buy-in and commitment from:
- Management
- All people who participated in the process
Successfully implementing the suggested Effective Solutions will ensure the long-term
endurance of the RCFA program.
To ensure the solutions are sustainable, the team needs to:
- Demonstrate the value.
- Present verified facts, not assumptions.
- Demonstrate sustainability of results.
- Monitor results as part of continuous improvement.
Once the solutions to prevent the recurrence or manage the consequences of the Undesirable
Event have been identified and implemented, continuous monitoring of the results must take
place to ensure that the solutions are effective and the benefits of implementing the solutions
can endure.
The Outcome of RCFA Analyses
The outcomes of an RCFA are action items and supporting deliverables to prevent recurrence, including:
- Executive summary
- Findings
- Recommendations and responsibilities
- New or updated maintenance tasks
- One-time changes
- Redesigns and modifications
- Operating and maintenance procedures
- Training
- Cause and effects worksheets
- Supporting documentation
The 5 Whys Approach
The 5 Whys process is an informal approach to Root Cause Analysis and is used for incidents
or failures with less critical impacts.
By repeatedly asking the question "Why?" (five is a good rule of thumb), you can peel away
the layers of symptoms which can lead to the root cause of a problem. Very often the
ostensible reason for a problem will lead you to another question. Although this technique is
called "5 Whys," you may find that you need to ask the question fewer or more than five times
before you find the root cause of a problem.
The steps of the “5 Whys” technique
- Identify and define the equipment/process on which the event or failure has happened.
The first step is to identify an event, incident or failure that matters to an organisation or has
met a trigger for an RCFA “5 Whys” analysis to prevent a repeat episode. The event could
involve physical equipment failure, system failure, procedural failure, etc. It is
critical to ensure initial evidence is preserved and data is collected for the RCFA investigation.
- Identify the people involved or who can help with the investigation (artisan, operator,
foreman, CBM technician).
Because the event, incident or failure could involve physical equipment failure, system failure,
procedural failure, etc., the team should include:
- People who were directly involved or affected.
- People who are knowledgeable about the equipment or process involved in the
undesirable event.
- Define the problem (event or failure).
The most important step in an RCFA process is to define the problem and verify that all
participants understand and agree upon it.
- Write down and communicate the problem to be analysed.
- How did the failure become evident? (What was observed?)
- Where did the event occur?
- When did the event occur?
- What is the evidence of the event (sequence of events)?
- Reach consensus on the definition of the problem.
- Brainstorm and list the possible causes.
Brainstorming combines an informal approach to problem solving with lateral thinking. It
encourages participants to come up with thoughts and ideas that can, at first, seem a bit crazy.
Some of these ideas can be crafted into original, creative solutions to a problem, while others
can spark even more ideas. This helps to get people unstuck by "jolting" them out of their
normal ways of thinking.
Brainstorming is useful because it can help a group of people utilise its collective brainpower
to generate many ideas in a short period of time. It stimulates creativity and promotes
involvement and participation.
- Assemble a diverse group for input to the analysis.
- Use observable, verifiable data to describe problems and effects.
- Refrain from blame or judgement of ideas during divergent thinking steps.
- Identify assumptions and biases as they arise, but only during convergent thinking steps.
- Select the most likely cause (and justify why it was selected).
Based on knowledge, information and mutual consensus, select the most likely cause out of
the ones listed.
- Apply the “5 Whys” principle (on the selected “most likely” cause).
The 5 Whys method is the simplest Root Cause Analysis process and involves repeatedly
asking “Why?” until the question no longer provides any useful information; at that point the
root cause has been identified. Five is a rule of thumb rather than a fixed number: asking
“Why?” five times is often enough to arrive at the root cause, but some problems need fewer
or more iterations. This
method produces a linear set of causal relationships and uses the experience of the problem
owner to determine the root cause and corresponding solutions.
- Repeat asking why until the “Root Cause” is identified.
Verify the 5 Whys logic in the following way:
- A logical link between the event/failure being analyzed and the identified causes must
exist going “down” the cause chain.
- A logical link between the identified solution, causes and the event/failure being
analyzed must exist going “up” the cause chain.
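The “down” and “up” checks can be sketched by reading the same chain in both directions. This is a minimal illustration: the function names are hypothetical, the conveyor example is invented, and the code only produces the sentences to review; the team still has to judge whether each one is logically sound.

```python
from typing import List, Tuple


def build_links(event: str, whys: List[str]) -> List[Tuple[str, str]]:
    """Pair each effect with the answer to its 'Why?' (its cause)."""
    chain = [event] + whys
    return list(zip(chain, chain[1:]))


def read_down(event: str, whys: List[str]) -> List[str]:
    """Going 'down': each effect is explained by the next level."""
    return [f"{effect} BECAUSE {cause}" for effect, cause in build_links(event, whys)]


def read_up(event: str, whys: List[str]) -> List[str]:
    """Going 'up': each cause must lead to the level above it."""
    return [f"{cause} THEREFORE {effect}"
            for effect, cause in reversed(build_links(event, whys))]


# Illustrative chain only.
event = "the conveyor stopped"
whys = [
    "the drive motor tripped",
    "the motor was overloaded",
    "the belt was misaligned",
    "the tracking idler was worn",
    "the idler was not on an inspection route",
]

for sentence in read_down(event, whys):
    print(sentence)
for sentence in read_up(event, whys):
    print(sentence)
```

If any “BECAUSE” or “THEREFORE” sentence does not hold when read aloud, the link at that level needs to be revisited before a solution is proposed.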
- Suggest possible solutions.
At the lowest root cause level, propose solutions to prevent reoccurrence of the problem. Limit
solutions to those that can be implemented within the organization’s control.
- Validate the solution.
- Test the solution to ensure that it will prevent the identified problem from happening
again or will satisfactorily manage risk (reduce risk to a tolerable level).
- Ensure that the solution will be “easy” to implement.
- Ensure that the solution will be cost effective.