It ain't over 'til it can't happen again

An incident isn't over until you've addressed the root cause.

If you've moved past immediate customer pain, but the pain could flare up again at any moment, you've at best mitigated the problem.

Yes, this does mean you've got to drop whatever you're doing.

No, this is not the fun part.

What if we end up spending all our time fixing root causes?

In many ways, that's better than spending all your time getting paged. You can address the problems with your systems now, with a clear head, or face them in the middle of the night.

Uncomfortable truths lurk behind this question.

Not everything is worth fixing. Some bugs or incidents don't cause much customer pain, likely because they occur in systems that aren't crucial to your core customers.

But if something's not worth fixing, is it worth paging for? And if it's not worth paging for, is it worth operating at all?

Running a service on the internet costs time and attention. Nothing comes for free. The longer it's been since you last touched the service, the more concentration it will cost you to respond to a problem.

The other uncomfortable truth is that it costs to build reliable systems. Not in the “add four weeks to the timeline to find bugs” sense.

In the sense that you have to know what your service is supposed to do and continuously check whether it's doing it or not. If humans interact with your system to accomplish goals, you have to have a way to check that they're accomplishing those goals! HTTP 200 is OK, but not enough.
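As a rough illustration of the gap between "the request succeeded" and "the customer got what they came for", here is a minimal sketch. The checkout flow, the `Response` class, and the field names (`order`, `id`, `state`) are all hypothetical, invented for this example rather than taken from any real API.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical stand-in for an HTTP response from a checkout endpoint."""
    status: int
    body: dict = field(default_factory=dict)


def transport_ok(resp: Response) -> bool:
    # The shallow check: the server answered with HTTP 200.
    return resp.status == 200


def goal_accomplished(resp: Response, expected_order_id: str) -> bool:
    # The deeper check: the customer's order actually exists and is confirmed.
    if resp.status != 200:
        return False
    order = resp.body.get("order") or {}
    return order.get("id") == expected_order_id and order.get("state") == "confirmed"


# Both responses are HTTP 200, but only one of them served the customer.
healthy = Response(200, {"order": {"id": "o-1", "state": "confirmed"}})
hollow = Response(200, {"error": "payment provider timed out"})

for name, resp in [("healthy", healthy), ("hollow", hollow)]:
    print(name, transport_ok(resp), goal_accomplished(resp, "o-1"))
```

A check like `goal_accomplished` is the kind of thing you would run continuously against production, for example from a synthetic customer account, and alert on.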

Or in the sense that sometimes parts of your service or the services it depends on will fail and you have to handle those failures gracefully.
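One common shape for graceful handling is to retry a flaky dependency a few times and then fall back to a degraded but usable answer. The sketch below assumes a hypothetical recommendations dependency; `with_fallback` and `broken_recommendations` are illustrative names, not part of any library.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: T, retries: int = 2) -> Tuple[T, bool]:
    """Call `primary`; after repeated failures return `fallback` instead.

    Returns (value, degraded) so the caller can record that it served
    a degraded response rather than hiding the failure.
    """
    for _ in range(retries + 1):
        try:
            return primary(), False
        except Exception:
            # Real code would catch narrower exception types and back off
            # between attempts; this sketch simply retries immediately.
            continue
    return fallback, True


def broken_recommendations() -> list:
    # Hypothetical dependency that is currently down.
    raise TimeoutError("dependency unavailable")


items, degraded = with_fallback(broken_recommendations, fallback=[])
print(items, degraded)
```

Returning the `degraded` flag matters: a fallback that fails silently turns an outage into a problem nobody is checking for.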

Your downtime will be roughly the sum of the downtime of all your single points of failure.
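The arithmetic can be sketched as follows, assuming every component is a single point of failure (the service is down whenever any one of them is down) and that they fail independently. The 99.9% figures are made up for illustration.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(availabilities):
    # With independent single points of failure, the service is up only
    # when every component is up, so the availabilities multiply.
    return math.prod(availabilities)


def downtime_minutes(availability):
    # Expected minutes of downtime per year at a given availability.
    return (1 - availability) * MINUTES_PER_YEAR


components = [0.999, 0.999, 0.999]  # hypothetical figures
combined = combined_availability(components)
summed = sum(downtime_minutes(a) for a in components)

print(f"combined availability: {combined:.6f}")
print(f"combined downtime:     {downtime_minutes(combined):.1f} min/year")
print(f"sum of downtimes:      {summed:.1f} min/year")
```

Three components at "three nines" each leave the service at about 99.7%: roughly 1,575 minutes of downtime a year, against 1,577 for the plain sum. The two differ only by the rare moments when outages overlap, which is why adding up the downtime is a good rule of thumb.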

Waiting for root causes to be addressed before declaring incidents resolved brings these uncomfortable tradeoffs out into the sunlight. What, exactly, merits this level of rigor? What will you stand behind? And what will happen when the new system you're currently building fails?