Broken Everywhere

I catch some good-natured and well-deserved ribbing from my co-workers and colleagues about this precept, but frankly, I take it in stride. It’s true that sometimes the new hires don’t understand it right away, but once explained and demonstrated, I don’t generally get much push-back.

Stated simply, “Broken Everywhere” means:

A single device or installation should never stop working as a result of a change to (or near) a similar device or installation somewhere else.

The corollary is similar:

All same-class devices or installations should fail at the same time, in the same way.

“What!” you ask, “why would I want my stuff to break, to fail?” Well, frankly, you don’t. In a production environment, you probably want all your systems to work, all the time. Differences in configuration[1] shouldn’t exist. Do your systems need to know what time it is? All installations should be configured to use NTP, with the same config files (and probably the same version of NTP), pointing to a well-diversified and robust farm of NTP servers.
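One way to check that "the same config files" really are the same is to fingerprint each host's config and group hosts by fingerprint. The following is only a minimal sketch: the host names, the `ntp*.example.com` servers, and the helper names `fingerprint` and `find_drift` are all made up for illustration, and it assumes you have already collected each host's config text by whatever means you normally use.

```python
import hashlib
from collections import defaultdict


def fingerprint(config_text: str) -> str:
    """Hash a config after dropping '#' comments, blank lines and extra spaces."""
    lines = []
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            lines.append(" ".join(line.split()))
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def find_drift(configs: dict[str, str]) -> dict[str, list[str]]:
    """Group hosts by config fingerprint; more than one group means drift."""
    groups = defaultdict(list)
    for host, text in sorted(configs.items()):
        groups[fingerprint(text)].append(host)
    return dict(groups)


# Hypothetical fleet: web03 points at a different (made-up) NTP server.
fleet = {
    "web01": "server ntp1.example.com iburst\nserver ntp2.example.com iburst\n",
    "web02": "# managed file\nserver ntp1.example.com  iburst\nserver ntp2.example.com iburst\n",
    "web03": "server ntp9.example.com iburst\n",
}

groups = find_drift(fleet)
for digest, hosts in groups.items():
    print(digest[:12], hosts)
```

In this sketch web01 and web02 land in the same group even though one file carries a comment and stray whitespace, while web03 shows up alone. A single group is the goal; any extra group is either drift to fix or an exception that should be documented.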

Problems with “sameness”

One problem with everything being the same is that if things do break, they are “Broken Everywhere”. This can be mitigated by thorough design, architecture and engineering maintained by rigorous pre-change analysis.[2]

Supposed benefits to diversity

I’ve had employees argue that there is benefit to what some have called “biodiversity” in system administration. What they usually mean is that by deploying different configs and settings to similar devices and installations,[3] the entire infrastructure can’t be brought down through one small change or attack. The argument is that all the systems won’t be impacted in the same way because each one of them is different.

For example, think about a web server farm running OpenSSL. With a rainbow of OpenSSL versions, it may indeed be harder for an attacker to penetrate them all with the same type of attack. The diversity in versions might be seen as a protection against such an attack by limiting the number of systems impacted, but a better solution is to make them the same version, with the same web server configuration (and by extension, the same vulnerabilities), everywhere. Then, when you become aware of the latest vulnerability impacting your (mono-crop?) farm, you’re able to protect them all with a single change, performed in a consistent manner, usually very quickly. You might be sacrificing some resiliency but I argue that you’ll make up for that by holistically applying a consistent method to maintaining your systems.
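Before you can fix the farm with a single change, it helps to know how far from "mono-crop" it actually is. Here is a rough sketch of that inventory step, assuming you have already gathered each server's OpenSSL version string. The host names, the target version, and the `version_report` helper are illustrative assumptions, not part of any real tool.

```python
from collections import Counter


def version_report(inventory: dict[str, str], target: str) -> tuple[Counter, list[str]]:
    """Count hosts per version and list the hosts not on the target version."""
    counts = Counter(inventory.values())
    stragglers = sorted(host for host, ver in inventory.items() if ver != target)
    return counts, stragglers


# Hypothetical inventory; the version strings are only examples.
inventory = {
    "web01": "3.0.13",
    "web02": "3.0.13",
    "web03": "1.1.1w",
    "web04": "3.0.13",
}

counts, stragglers = version_report(inventory, target="3.0.13")
print(counts.most_common())  # versions in use, most common first
print(stragglers)            # hosts to bring in line with the target
```

When the list of stragglers is empty, one patch applied one way covers the whole farm.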

Incident response and problem management also benefit when systems are “broken everywhere”. When systems are consistently configured, managed, maintained and documented, less time is spent figuring out why one system is behaving differently from the others. And where there is a difference in configuration or installation, it’s been done intentionally, with thought and documentation.

Broken Everywhere

So let your systems be “broken everywhere”. It does take planning and execution on the front end, but you’ll be rewarded on the back end with less down-time and easier system management.

  1. Unless operationally needed.
  2. You are following good change management practices, aren’t you?
  3. Usually caused, I find, by lack of time, sloppiness or laziness.