Tue, 11 Nov 2014

Driving to Root Cause

This post is largely about software engineering, but it applies to almost any situation where a preventable problem has occurred.

Too often, individuals, and worse, teams, assume that failures are inevitable, and there's nothing you can do to prevent them.

High-performing teams, on the other hand, almost relish failures because they have become very good at addressing the reasons they occurred. They look for the variable that was left uncontrolled, lock it down, and make that control part of their process.

Once enough of these variables are locked down, solving problems becomes much simpler. Using a strategy called the "five whys", you too can operate like a seasoned problem-solving engineer. For each failure in a system, ask "Why?" up to five times, until you reach an answer that identifies a broken process.

If you're doing this on a team, remind the team members that they need to bring good ego strength and seek to understand the broken process. This is not a finger-pointing exercise!

For example, if you reach a conclusion that says “John is just a poor QA tester, he needs to do a better job” or “Jane is just a poor developer, she needs better coaching on C++”, you've pointed fingers, not found the root cause.

A better way:

Joe didn’t catch an error with the Messages section of the site.
Why?
The test cases didn’t capture a feature unique to the Messages section.
Why?
Messages test cases were built from a former spreadsheet kept for the previous version.
Why?
We assumed the old test cases would cover testing of the new feature.
Why?
The requirements should have had better acceptance criteria for the new feature, and our definition of done should have included writing new test cases for any new functionality.


Root Cause:
- Stories lacked acceptance criteria for new features.
- The definition of done did not require new test cases for new functionality.


Root Resolution:
- Team has added “All User Stories Must Have Acceptance Criteria” to their Sprint-Ready definition of a Story.
- Team has added “No Stories that are not considered Sprint Ready allowed in the Sprint Planning Meeting” to their Team Agreement.
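
If it helps to see the exercise as something more mechanical, here is a minimal Python sketch of the chain above. Everything in it (the Step class, the find_root_cause function, the two flags) is a hypothetical name invented for illustration, not part of any tool or formal method. It simply walks the answers, refuses any answer that blames a person, and stops at the first answer that names a broken process.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_WHYS = 5  # the technique asks "Why?" up to five times


@dataclass
class Step:
    """One answer to a "Why?" (hypothetical structure for illustration)."""
    answer: str
    blames_person: bool = False   # True if the answer points at an individual
    is_process_gap: bool = False  # True if the answer names a broken process


def find_root_cause(steps: List[Step]) -> Optional[str]:
    """Return the first answer that names a broken process, if any."""
    for depth, step in enumerate(steps[:MAX_WHYS], start=1):
        if step.blames_person:
            # Finger pointing is not a root cause, so reject it outright.
            raise ValueError(f"Why #{depth} points a finger: {step.answer!r}")
        if step.is_process_gap:
            return step.answer
    return None  # no broken process found within five whys


# The chain from the Messages example above.
chain = [
    Step("The test cases didn't capture a feature unique to the Messages section."),
    Step("Messages test cases were built from a spreadsheet kept for the previous version."),
    Step("We assumed the old test cases would cover testing of the new feature."),
    Step(
        "Requirements lacked acceptance criteria, and the definition of done "
        "did not require new test cases.",
        is_process_gap=True,
    ),
]

print(find_root_cause(chain))
```

Deciding whether an answer blames a person or names a process is still a human judgment; the sketch only records that judgment and enforces the rule that the exercise ends at a process, never at a person.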


The goal here is for Joe to realize that his job is to follow a documented process, not to beat himself up over missing a test case.

Root cause analysis always strives to find the process that is broken, and then expects the team to uphold the corrected process. Leave things to an individual, and they are likely to break. Leave them to a team, have the team enforce the process, and you’re unlikely to repeat the same mistakes.

For more complex root cause analysis, another technique is the Ishikawa diagram, also called a fishbone diagram. It goes beyond the simple linear progression of the five whys and allows analysis of a web of contributing causes. Stay tuned for a post on that in the future.




Khan Klatt
