A case for condensing fix/test cycles

Black Box Model

A while back I posted on the strategy of changing only one variable at a time when working to resolve an incident and determine root cause.

I’m amazed at how much discipline it takes to implement this simple strategy. If you’re truly interested in determining the sole root cause, this is the best way to go. But, as Mike Plant so correctly pointed out, it’s not the fastest method for debugging.

Speed sometimes needs to be sacrificed.

Here’s a non-programming example:

My shower was leaking from the shower head. That one clue pointed me to the seat washer assembly rather than the bib washer.[1] I wasn’t sure whether the hot or the cold side was at fault, which doubled the suspects. And because every attempt meant taking apart the faucet, a complete testing cycle could take 30+ minutes:

  1. prepare the proposed “fix” (if it involves a trip to the hardware store, add 45 minutes),
  2. go down to the basement,
  3. turn off the water,
  4. drain the water from the system,
  5. return to the second floor bathroom,
  6. take apart the faucet,
  7. implement the “fix”,
  8. put the faucet back together,
  9. go down to the basement,
  10. turn on the water,
  11. bleed the air from the system and
  12. test.
  13. Evaluate the solution and repeat as needed.

On hand, I had two rubber flat washers (either hot or cold), one polyethylene washer (in theory for the more extreme conditions on the hot side) and a pair of conical/beveled washers along with a pair of bronze seats. The existing washers appeared fine and the installed seats were as smooth as silk. Following the “One Variable at a Time” rule would take much more time than I cared to spend.

In this situation, with only so many hours in the weekend (my “change window”), I decided to cheat and so skipped a few cycles by combining solutions: replace both the seat washers and the bronze seats. It cost more to replace unneeded parts, but saved money in the form of time (and aggravation). My hope was to figure out the problem in a post-mortem.

And this time I was lucky. I changed four of my variables at once, completed my cleanup and tested the system: it worked. At the risk of not isolating the sole root cause, I got my shower fixed in one fix/test cycle.
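The arithmetic behind that trade can be sketched in a few lines of Python. This is only a rough model, not a measurement: it treats the “30+ minutes” cycle as a flat 30, and it assumes the four variables were the hot and cold seat washers plus the hot and cold bronze seats. The function name and the suspect labels are made up for illustration.

```python
# Rough worst-case model of the fix/test trade-off described above.
# Assumption: every cycle costs a flat 30 minutes (the post says "30+").
CYCLE_MINUTES = 30


def worst_case_minutes(num_suspects, batch_size, cycle_minutes=CYCLE_MINUTES):
    """Worst-case time to swap every suspect, replacing batch_size parts per cycle."""
    cycles = -(-num_suspects // batch_size)  # ceiling division
    return cycles * cycle_minutes


# Hypothetical labels for the four variables changed in the story.
suspects = ["hot washer", "cold washer", "hot seat", "cold seat"]

one_at_a_time = worst_case_minutes(len(suspects), batch_size=1)
all_at_once = worst_case_minutes(len(suspects), batch_size=len(suspects))

print(one_at_a_time)  # 120
print(all_at_once)    # 30
```

The catch is in what each approach tells you afterward: a successful one-part swap names the culprit, while a successful four-part swap only says the culprit is somewhere among those four parts.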

What about the post-mortem? Initially, the theory was that a rough spot on the seat opened a gap between the seat and the washer, causing the drip. Inspecting my replaced parts showed a very unusual hairline fracture in the threads of one of the bronze seats. I could have changed washers (the most likely suspects) one by one all day long and never figured it out.

What’s the lesson?

The lesson here is twofold: combining your changes can cut the number of fix/test cycles, but it can complicate root cause analysis. Oh, and sometimes you get lucky.

In other news, my shower doesn’t leak and I had the rest of the weekend to enjoy.

  1. Leaks caused by a faulty bib washer appear at the handle.