Root Cause Analysis – Part 2

Introduction

Welcome to Part 2 of Root Cause Analysis, in which we will delve deeper into two specific techniques that can be used in the DMAIC analyse phase: fault tree analysis (FTA) and failure mode and effect analysis (FMEA).

Each of these methods takes a distinct approach to identifying and analysing potential system failures, and when combined, they provide a comprehensive toolkit for identifying and mitigating risks. In this section, we will look at the key concepts and steps in each method, as well as examples of how they can be used in practice.

Fault Tree Analysis (FTA)

What is Fault Tree Analysis?

Fault Tree Analysis (FTA) is a top-down, systematic method for identifying and analysing the potential causes of an incident or failure. It is used to identify and assess the likelihood of all possible combinations of events and conditions that could lead to an incident.

In FTA, a “top event” (or “failure”) is chosen as the starting point, and a tree-like diagram is constructed to show all of the possible events and conditions that could lead to that event. Working backwards from the top event, the tree is constructed by identifying all of the immediate causes and their respective contributing factors. Each contributing factor is then subdivided further, revealing its own set of contributing factors, and so on.

Once the tree is complete, the probability of the top event occurring can be calculated by combining the probabilities of the contributing events and conditions according to the gates that connect them, working up the tree from the lowest-level events to the top event. This enables the analyst to determine the most likely causes of the failure, as well as any “common cause” events or conditions that may result in multiple failures.

Fault Tree Analysis is frequently used to identify and mitigate potential hazards in industries such as aerospace, nuclear power, and chemical manufacturing.

The process of creating a Fault Tree Analysis

Creating a Fault Tree Analysis is a fairly simple process but does require some practice in creating the tree and assigning the probabilities. The process in general goes as follows:

Define the top event: The first step in completing a Fault Tree Analysis (FTA) is to clearly define the failure or incident that is being analysed. This is known as the “top event,” and it serves as the starting point for the analysis. It is critical to clearly define the top event so that the analysis is focused and relevant.

Top events could include:

A system failure that causes a shutdown or loss of production

A safety incident that causes injury or damage to equipment
A financial loss caused by an error or fraud

It is critical to consider the scope of the analysis when defining the top event. If the goal is to identify the causes of a system failure, for example, it may be more useful to define the top event as the failure itself rather than an indirect consequence such as the resulting downtime. Furthermore, ensure that the top event is specific and measurable, and that it is clearly linked to the underlying causes that will be investigated in the fault tree.

Identify contributing events and conditions: The next step in completing a Fault Tree Analysis (FTA) is to work backwards from the top event to identify all of the immediate causes and their respective contributing factors. This step contributes to a thorough understanding of the events and conditions that led up to the top event.

The analyst will typically begin by reviewing any available data, such as incident reports, maintenance records, and process logs, to identify the contributing events and conditions. This data can provide valuable insights into the events and conditions that led up to the main event.

In addition, the analyst will use their knowledge of the system, process, or industry to identify potentially contributing events and conditions. This can be accomplished through brainstorming sessions, interviews with subject matter experts, or a review of industry standards and best practices.

Once the contributing events and conditions have been identified, they are represented as “gates” (symbols) in the fault tree to show the logical relationship between the events. Gates are used to demonstrate how the contributing events and conditions are related and how they lead to the outcome.

The following are the most common FTA gates:

AND gate: Represents a condition in which all of the events connected to it must occur in order for the event above the gate to occur.
OR gate: Represents a condition in which at least one of the events connected to it must occur in order for the event above the gate to occur.
NOT gate: Denotes a condition in which the event connected to it must not occur in order for the event above the gate to occur.

This step is critical because it allows the analyst to see how various events and conditions can interact to lead to the top event, providing a better understanding of the complex interactions that led to the failure.

Build the tree: After identifying and representing the contributing events and conditions as gates, the next step is to connect the gates and events together to show all of the possible combinations of events and conditions that could lead to the top event.

The fault tree diagram, which represents the logical relationship between the events and conditions, is created during this step. Starting with the top event at the top of the diagram, the tree is built by adding gates and events as needed to show all of the possible events and conditions that could lead to the top event.

It is critical to note that the tree should be as detailed and comprehensive as possible while avoiding irrelevant information or unnecessary complexity. The key is to strike the proper balance of detail and simplicity.

It is also important to note that the tree should be constructed in a logical and consistent manner so that the analysis can be easily understood, followed, and reviewed.

Once the tree is complete, the analyst can review it to ensure that it includes all of the possible combinations of events and conditions that could lead to the top event and that it is logically consistent.

The Fault Tree Analysis (FTA) diagram can be a complex and detailed representation of the problem, but it provides a clear visual representation of how various events and conditions can combine to lead to the top event, assisting in the identification of the underlying causes of the failure.
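
As a rough illustration of the structure described above, a fault tree can be held in code as nested gates and events. The sketch below is only one possible representation: the class names `BasicEvent` and `Gate`, the helper functions, and the pump scenario are all hypothetical and are not taken from this course.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BasicEvent:
    """A leaf of the tree: an event that is not broken down further."""
    name: str


@dataclass
class Gate:
    """A top or intermediate event whose inputs are combined by a gate."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]] = field(default_factory=list)


def render(node, depth=0):
    """Return the tree as indented text lines, top event first."""
    indent = "  " * depth
    if isinstance(node, BasicEvent):
        return [f"{indent}- {node.name}"]
    lines = [f"{indent}{node.name} [{node.kind}]"]
    for child in node.inputs:
        lines.extend(render(child, depth + 1))
    return lines


def basic_events(node):
    """Collect the names of all leaf events beneath a node."""
    if isinstance(node, BasicEvent):
        return [node.name]
    names = []
    for child in node.inputs:
        names.extend(basic_events(child))
    return names


# Hypothetical tree: the pump stops if the mains supply AND the backup
# generator both fail, OR if the pump itself seizes.
tree = Gate("Pump stops", "OR", [
    Gate("Loss of power", "AND", [
        BasicEvent("Mains supply fails"),
        BasicEvent("Backup generator fails"),
    ]),
    BasicEvent("Pump seizes"),
])

print("\n".join(render(tree)))
```

Printing the tree as indented text gives a quick way to review whether each branch is complete and logically consistent before any probabilities are assigned.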

Assign probabilities: After completing the fault tree, the next step is to assign probabilities to each event or condition. This is a critical step because it allows the analyst to assess the likelihood of each event or condition occurring as well as identify the most likely causes of the failure.
There are several methods for assigning probabilities to events and conditions, including:

Historical data: If available, historical data can be used to assign probabilities based on previous incidents or failures.
Expert judgement: In the absence of historical data, probabilities can be assigned based on the judgement of experts with knowledge of the system, process, or industry.

Combination of both: Probabilities can also be assigned using a combination of historical data and expert judgement.

It is critical to use the most appropriate and accurate method when assigning probabilities, taking into account the available data and the level of uncertainty involved.

It’s also worth noting that the assigned probabilities should be consistent and realistic, and that the accuracy of the analysis is highly dependent on their accuracy.

Once the probabilities are assigned, the analyst can combine the probabilities of the contributing events and conditions, according to the gates that connect them, to calculate the overall probability of the top event occurring. This enables the analyst to determine the most likely causes of the failure and assess the risk associated with each contributing event or condition.

Determine the likelihood of the top event: After assigning probabilities to each event or condition, the next step is to compute the overall probability of the top event occurring. The probabilities of the contributing events and conditions are combined according to the gates that connect them.
For example, if event A has a probability of 0.2, event B has a probability of 0.3, and event C has a probability of 0.4, and the events are independent, the probability of all three events happening at the same time (an AND gate) can be calculated by multiplying the individual probabilities: 0.2 x 0.3 x 0.4 = 0.024.

Depending on the type of gate used in the fault tree, the process of calculating the overall probability of the top event differs.

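The probability of the output event of an AND gate is the product of the probabilities of the events connected to it, assuming the events are independent.
The probability of the output event of an OR gate is 1 minus the probability that none of the events connected to it occur. For two independent events, this equals the sum of their probabilities less the probability of both occurring at the same time. When all of the probabilities are small, the simple sum is a close approximation.
The probability of the output event of a NOT gate is the complement of the probability of the event connected to it, that is, 1 minus that probability.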
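
The gate calculations can be sketched in a few lines of Python. This is a minimal sketch that assumes the input events are independent; the function names are illustrative. The OR gate is computed as 1 minus the probability that none of the inputs occur, which for two inputs is the same as the sum of the probabilities less the probability of both occurring together.

```python
from math import prod


def and_gate(probabilities):
    """All inputs must occur: multiply the input probabilities."""
    return prod(probabilities)


def or_gate(probabilities):
    """At least one input occurs: 1 minus the chance that none occur."""
    return 1 - prod(1 - p for p in probabilities)


def not_gate(probability):
    """The input must not occur: take the complement."""
    return 1 - probability


inputs = [0.2, 0.3, 0.4]  # events A, B and C from the example in the text

print(round(and_gate(inputs), 3))  # 0.024
print(round(or_gate(inputs), 3))   # 0.664
print(round(not_gate(0.2), 3))     # 0.8
```

If the events are not independent, for instance when they share a common cause, these formulas no longer hold and the dependence has to be modelled explicitly.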

The probability of the top event shows the overall probability of the failure or incident occurring, and it enables the analyst to determine the most likely causes of the failure and assess the risk associated with each contributing event or condition.

It is crucial to remember that the probability of the top event may change when new information is gathered and probabilities are re-evaluated, so the analysis should be reviewed and updated on a regular basis.

Identify and assess risk: The final stage in completing a Fault Tree Analysis (FTA) is to identify and assess the risk associated with the contributing events or conditions that have been discovered. This is accomplished by determining the likelihood and potential impact of each event or condition and then taking the necessary action to reduce or eliminate the risks.
The analyst can utilise the overall probability of the top event estimated in the previous stage, as well as the probability of each contributing event or condition, to identify the risk. This enables the analyst to determine which events or conditions are most likely to cause the failure and which must be addressed immediately.

The analyst should also consider the potential impact of each event or condition while assessing risk. This includes appraising its likely consequences, such as harm, damage, or financial loss.

After identifying and evaluating the risks, the analyst can take necessary measures to mitigate or eliminate them. Implementing new procedures, changing equipment, or offering more training are all examples of this.

It is crucial to highlight that the risk management process should be iterative, as new information or events may surface and affect the risk profile, therefore the risk management plan should be reviewed and updated on a regular basis.

The overall purpose of the Fault Tree Analysis (FTA) is to discover the underlying causes of a failure or incident and to provide a methodical approach to mitigating risks, improving the system, and preventing similar incidents in the future.

It’s worth noting that, depending on the complexity of the problem, the process may require multiple iterations and reviews to ensure that the analysis is complete and accurate.

Example of a Fault Tree Analysis

To help you understand how to construct and read a Fault Tree Analysis, here is an example based on a Top Event of “Production line shut down for more than 15 minutes”.

Top Event: Production line shutdown for more than 15 minutes

Event A: Power failure (probability: 0.1)
- Sub-Event A1: Power grid failure (probability: 0.05)
- Sub-Event A2: Generator failure (probability: 0.05)
- Sub-Event A3: UPS failure (probability: 0.05)

Event B: Software malfunction in the control system (probability: 0.2)
- Sub-Event B1: Software bug (probability: 0.1)
- Sub-Event B2: Human error during a software update (probability: 0.1)

Event C: Mechanical failure in the conveyor belt (probability: 0.3)
- Sub-Event C1: Wear and tear on the belt (probability: 0.2)
- Sub-Event C2: Motor failure (probability: 0.1)

Event D: Operator error (probability: 0.4)
- Sub-Event D1: Lack of proper training (probability: 0.3)
- Sub-Event D2: Fatigue (probability: 0.1)

Going down a level in the fault tree can reveal more precise causes and probabilities, which can aid in identifying the most likely causes of the failure and assessing the risk associated with each contributing event or condition. It also enables the organisation to take more targeted and specific actions to manage risks and prevent similar incidents in the future.

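To calculate the probability of Event A, we add together the probabilities of its sub-events, because any one of them is enough to cause the event (an OR gate). The example uses A1 and A2, giving 0.05 + 0.05 = 0.1. Note that also counting A3 would give 0.15 rather than the 0.1 shown for Event A. Simple addition is a reasonable approximation here because the probabilities are small. You would repeat this for Events B, C and D.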

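Then, to calculate the overall probability of the production line shutting down for more than 15 minutes, we combine Events A, B, C and D through an OR gate. Simply adding them gives 0.1 + 0.2 + 0.3 + 0.4 = 1, but this overstates the result: the simple sum is only a good approximation when the probabilities are small, because it ignores the chance of more than one event occurring together. Assuming the four events are independent, the OR gate gives 1 - (0.9 x 0.8 x 0.7 x 0.6) = 1 - 0.3024 = 0.6976.

So, the probability of a production line shutdown for more than 15 minutes is approximately 0.70, which means a shutdown is likely but not certain.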

According to the fault tree, the most likely cause of the production line stoppage is operator error (Event D), which has the greatest probability of 0.4.
We can use this information to zero in on specific faults with the operator’s training or procedures and take appropriate action to limit the risk.

This is a simplified example, and the fault tree can be more complex and detailed in real-world situations, but it demonstrates the basic concept of a Fault Tree Analysis (FTA) and how it can help identify the underlying causes of a failure and evaluate the risk associated with the contributing events and conditions.
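
To check the arithmetic for the top event, the short Python sketch below combines the four intermediate events from the example through an OR gate. It assumes the events are independent, which the example does not state, and it prints the simple sum alongside the OR gate result so the two can be compared.

```python
from math import prod

# Probabilities of the four intermediate events, as stated in the example.
events = {
    "A: Power failure": 0.1,
    "B: Software malfunction": 0.2,
    "C: Mechanical failure": 0.3,
    "D: Operator error": 0.4,
}

# Simple sum of the inputs: an upper bound on the OR gate result.
simple_sum = sum(events.values())

# OR gate for independent events: 1 minus the chance that none occur.
exact_or = 1 - prod(1 - p for p in events.values())

# The event with the highest individual probability.
most_likely = max(events, key=events.get)

print(f"Simple sum: {simple_sum:.4f}")
print(f"OR gate (independent events): {exact_or:.4f}")
print(f"Largest contributor: {most_likely}")
```

Under the independence assumption, the OR gate result is about 0.70, lower than the simple sum of 1. The gap appears because the simple sum counts the overlap between events more than once, which matters when the individual probabilities are as large as they are here.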

Failure Mode and Effect Analysis (FMEA)

What is FMEA - Failure Mode and Effect Analysis?

FMEA (Failure Mode and Effect Analysis) is a methodology for identifying potential failures and their consequences for a system, process, or product. FMEA’s purpose is to identify, assess, and prioritise potential failure modes so that appropriate actions can be taken to prevent or reduce the risk of failures occurring, as well as to minimise the effect of failures that do occur.

FMEA is often carried out by a team of specialists from many disciplines, such as design, production, and quality, who use a structured approach to identify potential failure modes and assess the likelihood, impact, and detection of each failure mode. The team then prioritises the failure modes according to these ratings and proposes corrective actions to reduce or eliminate the risks.

Why do an FMEA?

There are a range of reasons a business might conduct an FMEA, even outside of a project. However, the main reason businesses use an FMEA to review a process is that it allows teams to identify and prioritise potential failure modes based on their likelihood and impact. By using a systematic approach and a rating system, teams can assess the likelihood, impact, and detection of each failure mode and prioritise them according to their risk level. This helps to concentrate resources on the most important risks and ensures that the necessary actions are taken to minimise or eliminate those risks.

Furthermore, FMEA allows teams to discover potential failure modes that might otherwise go unnoticed, which is especially important in complicated systems or processes. By recognising potential failure modes early on, teams can take appropriate actions to prevent or reduce the likelihood of failures occurring, as well as to limit the effect of failures that do occur. This can increase the overall safety and reliability of the system, process, or product while also lowering the costs associated with unanticipated breakdowns.

Other reasons you might conduct an FMEA include:

Improve product/system safety and reliability: FMEA assists in identifying and mitigating potential failure modes, which can improve a product’s or system’s safety and reliability.

Reduce costs: FMEA can assist in identifying and mitigating potential failure modes early in the design or development process, thus avoiding costly recalls, warranty claims, and other expenses.

Improve design: FMEA can be used to improve the design of a product or system by incorporating feedback from multiple disciplines and teams.

Comply with regulations: Many industries and organisations have regulatory requirements for risk management and quality control, and FMEA is one of the widely accepted and used methodologies to comply with those standards.

Continual improvement: FMEA is a continuous process that should be examined and updated on a frequent basis as new information becomes available. This allows the company to stay up to date on the latest technology, trends, and best practices.

Overall, FMEA is a powerful tool that can assist organisations in improving the design, quality, safety, and dependability of their systems, processes, and products while decreasing costs and meeting regulatory requirements.

Different Types of FMEAs

FMEA is the standard way of carrying out Failure Mode and Effect Analysis. However, variations have been developed over the years: Design FMEAs and Process FMEAs are in common use, and other variations have been developed more recently. The following are some of the most common types of FMEA:

Design FMEA (DFMEA): A DFMEA is used to identify potential failure modes in a product or system during the design phase. This sort of FMEA can aid in the identification and mitigation of potential design flaws before the product or system is created.

Process FMEA (PFMEA): A PFMEA is used to identify potential failure modes in a process, such as a manufacturing process or a service process. This sort of FMEA can assist in identifying and mitigating potential process flaws, which can enhance efficiency and lower costs.
System FMEA (SFMEA): An SFMEA is used to identify potential failure modes in a system, such as a mechanical system or an electrical system. This sort of FMEA can aid in identifying and mitigating potential system flaws, hence improving safety and reliability.
Service FMEA (SeFMEA): A SeFMEA is used to identify potential failure modes in a service, such as a healthcare service or a financial service. This sort of FMEA can aid in the identification and mitigation of potential service flaws, which can increase customer satisfaction and lower costs.
Software FMEA (SwFMEA): A SwFMEA is used to identify potential failure modes in software, such as a computer programme or an app. This sort of FMEA can assist in identifying and mitigating potential software flaws, which can improve functionality and lower costs.
Integrated FMEA (IFMEA): An IFMEA is used to identify potential failure modes across multiple systems, processes, or products. This sort of FMEA can assist in identifying and mitigating possible weaknesses across the company.

It is also worth noting that FMEA can be applied in a variety of industries, including automotive, aerospace, medical devices, and many more.

How to do an FMEA?

The process of completing a Failure Mode and Effect Analysis (FMEA) is as follows:

Define the system, process, or product: A key stage in completing an FMEA is clearly defining the scope of the analysis and specifying the boundaries of the system, process, or product being studied. This stage entails identifying the exact area or product being examined, as well as outlining the objectives of the analysis. It is critical that the team understands what they are evaluating and what they are attempting to accomplish, as this helps to ensure that the relevant failure modes are identified and examined. This can be accomplished by writing a scope statement that includes the following:

A clear and concise description of the system, process, or product being analysed
The analysis’s objectives and goals

The precise location or product under consideration
Customers and stakeholders who will be impacted by the system, process, or product
The timeline for the analysis

The members of the team who will be involved in the analysis

A clear and well-defined scope statement will assist in ensuring that the team remains focused on the specific area or product being examined and that all potential failure modes are documented and evaluated.

Form the FMEA team: It is critical for the effectiveness of the FMEA to assemble a cross-functional team of individuals with an understanding of the system, process, or product being assessed. Individuals from several departments and areas of expertise, such as design, production, quality, engineering, and maintenance, should form the team.
A cross-functional team ensures that all potential failure modes are captured and analysed, as well as that the team can detect and evaluate the failure modes from various viewpoints. It also guarantees that the team can establish remedial actions that are appropriate for the specific area or product under consideration and are supported by the various departments and areas of expertise.

It is also critical to select the right team members. The team should be made up of experienced and competent people who can contribute to the analysis and have the authority to take corrective action. They should be able to work effectively together, have a positive mindset, and be adaptable.

It is critical to develop a clear communication plan and assign roles and duties to team members. This ensures that everyone understands what is expected of them and that the team can work effectively together to reach the analysis’s objectives and goals.

Identify potential failure modes: The identification of potential failure modes is a critical stage in the FMEA process. Failure modes are the different ways a system, process, or product can fail to perform its intended function. Identifying potential failure modes can be accomplished by brainstorming with the FMEA team, as well as through the use of a Fishbone diagram (Ishikawa diagram) or a Pareto chart to identify the most critical failure modes. These are tools and techniques we covered in the previous post.

The team should explore all conceivable failure modes, including those that are unlikely to occur, during the brainstorming process. This ensures that all possible failure modes are identified and examined. The team should also think about the potential causes of each failure mode, as well as how the failure mode would impact the system, process, or product and its customers.

Identifying potential failure modes is crucial because it helps to ensure that the full range of failure modes is captured and reviewed, allowing the most critical failure modes to be identified and addressed first.

Evaluate the likelihood, impact, and detection of each failure mode: A critical element in the FMEA process is assessing the likelihood, impact, and detection of each failure mode using a rating system. The team should rate each of these on a scale of 1 to 10. For likelihood, 1 represents the least likely to occur and 10 the most likely. For impact, 1 represents the least severe effect and 10 the most severe. For detection, 1 means the failure mode is almost certain to be detected and 10 means it is very unlikely to be detected.

To help with the evaluation, we have provided standard criteria for rating severity (impact), occurrence (likelihood) and detection, which are also available in the FMEA template.

Likelihood: The likelihood is how often the failure mode is expected to occur. This can be assessed by taking into account the failure mode’s frequency, the number of units affected, and the conditions under which the failure mode occurs.

Impact: The impact is the severity of the effect if the failure mode occurs. This can be assessed by assessing the repercussions of the failure mode, such as the cost, safety issues, and customer impact.

Detection: The detection is how likely it is that the failure mode will be detected before it causes a problem. This can be assessed by taking into account the existing controls and processes in place to detect the failure mode, such as inspections, tests, and monitoring. A higher detection rating means the failure mode is less likely to be detected.

The team can prioritise failure modes according to their likelihood, impact, and detection by using a rating system. Failure modes with the highest likelihood, impact, or detection ratings should be prioritised and addressed promptly.

It is critical to document the evaluation, ratings, and reasoning for the ratings in the FMEA report; doing so will help to ensure that the process is transparent and that the team’s judgements are based on facts and data rather than assumptions or views.

Prioritise the failure modes: A critical stage in the FMEA process is to prioritise the failure modes based on their likelihood, impact, and detection. This stage entails ranking the failure modes using a tool such as a Pareto chart or a Risk Priority Number (RPN) calculation.
A Pareto chart is a graphical depiction of the frequency of failure modes that illustrates the number of occurrences of each failure mode and may be used by the team to identify the failure modes that are the most critical and demand immediate attention.
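
The ranking behind a Pareto chart can be sketched in Python as follows. The failure modes and counts are hypothetical, chosen only to illustrate the calculation. The sketch sorts the failure modes by number of occurrences and adds a running cumulative percentage, which is the line usually drawn over the bars of a Pareto chart.

```python
# Hypothetical counts of how often each failure mode was recorded.
counts = {
    "Mislabelled part": 42,
    "Display fault": 7,
    "Loose fitting": 18,
    "Blocked line": 25,
    "Alarm not raised": 8,
}

total = sum(counts.values())

# Sort failure modes from most to least frequent.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Build rows of (failure mode, count, cumulative percentage).
cumulative = 0
pareto = []
for mode, count in ranked:
    cumulative += count
    pareto.append((mode, count, round(100 * cumulative / total, 1)))

for mode, count, cum_pct in pareto:
    print(f"{mode:<18} {count:>3} {cum_pct:>6.1f}%")
```

In this made-up data set, the first three failure modes account for 85% of all occurrences, so they would be the natural place to focus attention first.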

Another approach for prioritising failure modes is the Risk Priority Number (RPN) calculation. The RPN is calculated by multiplying the likelihood rating by the impact rating by the detection rating. Failure modes with the highest RPN should be prioritised and addressed promptly.
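
The RPN calculation can be sketched in Python as shown here. The failure modes and their ratings are hypothetical, and the function name `rpn` is illustrative. The sketch checks that each rating is on the 1 to 10 scale, multiplies the three ratings, and ranks the failure modes from highest to lowest RPN.

```python
def rpn(likelihood, impact, detection):
    """Risk Priority Number: the product of the three 1-10 ratings."""
    for rating in (likelihood, impact, detection):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return likelihood * impact * detection


# Hypothetical failure modes with (likelihood, impact, detection) ratings.
failure_modes = {
    "Seal leaks": (4, 7, 3),
    "Label misprinted": (6, 3, 2),
    "Sensor drifts": (3, 8, 6),
}

# Rank failure modes from highest to lowest RPN.
ranked = sorted(
    ((name, rpn(*ratings)) for name, ratings in failure_modes.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: RPN {score}")
```

Because the RPN is a product of ordinal ratings, it is best read as a ranking aid rather than a precise measure of risk. A failure mode with a very high impact rating may still deserve attention even if its RPN is moderate.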

After identifying and prioritising the most essential failure modes, the team can build appropriate remedial steps to reduce or eliminate the risks. The team should also designate specific individuals or teams to be in charge of carrying out the corrective activities, as well as develop a strategy for monitoring and reviewing the success of the corrective actions.

It should be noted that prioritising failure modes is an ongoing process that should be reviewed and updated as new information becomes available. This will enable the team to stay up to date on the newest technology, trends, and best practices, as well as to ensure that the necessary actions are taken to mitigate or eliminate the risks associated with the failure modes.

Identify and implement corrective measures: An important phase in the FMEA process is identifying and implementing appropriate remedial actions to mitigate or eliminate the risks associated with each failure mode. This stage entails employing a number of tools and approaches to discover the root cause of the failure mode and developing corrective actions suited for the specific area or product under consideration.
A Fishbone diagram (Ishikawa diagram) is a handy tool that can assist the team in identifying the root cause of the failure mode by categorising the failure mode’s potential causes. This helps the team to develop corrective actions that address the underlying cause.

5 Whys is another approach for determining the root cause of a failure mode. This is a basic yet effective technique for finding the root cause of an issue by asking “why” the problem occurs and then repeating the question until the root cause is discovered.

Once the failure mode’s root cause has been determined, the team can devise corrective actions that address the root cause. Corrective actions should be specific, measurable, achievable, relevant, and time-bound (SMART). The team should also designate specific individuals or teams to be in charge of carrying out the corrective actions, as well as develop a strategy for monitoring and reviewing the success of the corrective actions.

It should be noted that executing corrective actions necessitates a culture of continuous improvement as well as a willingness to change. The team must be willing to take the required steps to prevent or reduce the risk of failures occurring, as well as to mitigate the effect of any failures that do occur.

Monitor and review: Monitoring and reviewing the success of corrective efforts is an essential element in the FMEA process. This stage entails monitoring the system, process, or product on a regular basis to ensure that corrective efforts are effective and that no new failure modes have been introduced.
The team should develop a plan for monitoring and reviewing the success of remedial actions, including a monitoring timetable and a data collection and analysis procedure. This can be accomplished by inspections, tests, or monitoring process performance with statistical process control (SPC) tools including control charts, Pareto charts, and histograms.

The team should also conduct periodic reviews of the FMEA to discover new failure modes, update the likelihood, impact, and detection of current failure modes, and ensure that corrective actions are effective. This is significant since the process, technology, and market are always changing, and the team must be aware of these changes in order to anticipate new failure modes and take the necessary preventative measures.

It is also critical to share the FMEA results with the appropriate stakeholders and customers, as they can provide valuable input on the process and aid in the identification of new failure modes.

Overall, monitoring and analysing the success of corrective actions helps to guarantee that the process is constantly improved and that the risks associated with failure modes are reduced or removed. This will help to assure the system’s, process’s, or product’s safety, dependability, and quality, as well as customer satisfaction.

Example of an FMEA

To help you better understand the FMEA process, here is an example based on a process step of “Administering medication through a medical device”.

Multiple failure modes could be identified for this process step; however, we will focus on one in this example.

Potential Failure mode: The device dispenses an incorrect dosage of medication.

Potential Failure mode effect: Patient experiences adverse side effects or inadequate treatment of the condition, with the potential for serious injury or death.

Severity: Rated at 10 due to risk of serious injury or death (refer to severity scale).

Potential Cause: Calibration error, contamination of device, user error.

Occurrence: Rated 3, as data indicated it could happen once every one to three years (refer to occurrence scale).

Current Controls: Regular calibration and maintenance of the device, use of sterile techniques during medication administration

Detection: Rated 8 as it is unlikely the failure mode would be detected (refer to detection scale).

Calculate RPN: RPN = Severity x Occurrence x Detection. Therefore, 10 x 3 x 8 = 240.

Action: An RPN of 240 is far too high, so actions need to be taken to reduce it. In this situation, three actions were considered:

Action 1: Implement additional user training on proper medication administration.
- If this action were implemented, the severity and detection would likely stay the same but the occurrence may drop to a 2, bringing the RPN down to 160.
Action 2: Redesign the device to include additional safeguards against calibration errors and contamination.
- If this action were implemented, the severity and occurrence would stay the same but the detection would drop from an 8 to a 2, bringing the RPN down to 60.
Action 3: Consider adding a feature to alert users to potential dosage errors.
- If this action were implemented, the severity and occurrence would stay the same but the detection would drop from an 8 to a 1, bringing the RPN down to 30.

Based on this analysis, Actions 1 and 3 were implemented: additional training to reduce the occurrence rating, and a device feature that alerts users to potential dosage errors to improve detection. Together, these actions would bring the RPN down to 10 x 2 x 1 = 20, from the original 240. The severity remains high, but the occurrence is low and, if the failure did occur, it would almost certainly be detected thanks to the improvements implemented.
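
The RPN figures in this example can be reproduced with a short Python sketch. The ratings come from the example above. The function name `rpn` and the way the options are laid out are illustrative choices rather than part of any FMEA template.

```python
def rpn(severity, occurrence, detection):
    """Risk Priority Number as used in the example above."""
    return severity * occurrence * detection


# Ratings before any action is taken.
baseline = {"severity": 10, "occurrence": 3, "detection": 8}

# Rating changes each option is expected to produce, taken from the example.
options = {
    "No action": {},
    "Action 1 (training)": {"occurrence": 2},
    "Action 2 (redesign)": {"detection": 2},
    "Action 3 (dosage alert)": {"detection": 1},
    "Actions 1 and 3": {"occurrence": 2, "detection": 1},
}

# Apply each option's changes on top of the baseline and compute the RPN.
results = {
    name: rpn(**{**baseline, **change}) for name, change in options.items()
}

for name, score in results.items():
    print(f"{name}: RPN {score}")
```

Laying the options out side by side makes it easy to see that none of the actions change the severity rating of 10. They reduce the RPN only by making the failure less frequent or easier to detect.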

You can see how this example was documented below for reference.

What's Next?

We have now covered most of the tools you would likely use in the analyse phase of DMAIC. Next, we will start to look at basic statistical analysis of data with Statistical Process Control (SPC).