Sat Nov 12 2022

Monitoring: The yang to testing's yin

It's no secret I love a good test suite. Sit me down at a new codebase that's full of tests and I'm a happy camper. But I'll be the first to admit that tests have limits, both in mimicking a production environment and in diminishing ROI as coverage climbs. Some code does deserve every corner case covered: banking software, sure; navigation systems, definitely. But an internal tool? Write some tests, please, but consider the opportunity cost of piling tests on tests. Remember, test code is still code, and it will probably require maintenance at some point, so delete what you don't need. Most software falls somewhere in the middle of that spectrum.

"So what do you suggest, Sir Tests-a-lot?"

Well, I'll tell you: monitoring. Good monitoring that catches issues quickly after the code has rolled out (along with tooling to make rollbacks fast and easy) is the yang to testing's yin. Testing is preventative. You can take your time with it. It's comforting. Monitoring, on the other hand, lets you have those "hold onto your butts" moments where you just need to grip it and ship it. I certainly wouldn't recommend replacing tests with monitoring, but it fills in the gaps. Testing before without monitoring after is missing half the picture.

There are a few reasons why you might not have complete test coverage. Maybe there's a situation that can't be reproduced outside of a true production environment. Maybe the test setup would be so complex and fragile that the tradeoff just isn't worth it. Whatever the reason, monitoring is your safety blanket.

It's usually a lot cheaper to set up monitoring and alerting than to maintain complex tests that try to mimic a prod environment as closely as possible. Of course, this assumes you have monitoring infrastructure set up already, but if you don't, then... well, you're gonna want that.

There's room for both testing and monitoring, but they serve different purposes. Catching bugs is honestly pretty low on my list of reasons for writing tests. Catching regressions is higher (so when your monitoring catches a bug, write a test so it doesn't come back!). Making changes with confidence (which goes along with catching regressions) is right near the top for me.
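To make that concrete, here's a minimal sketch of what a regression test written after an alert fires might look like. The orders module, parse_order function, and payload are all made up for illustration; the point is to pin the production failure down as a test.

```python
# A minimal sketch of a regression test, written after monitoring caught a bug.
# The orders module, parse_order function, and payload are all hypothetical.
from orders import parse_order


def test_parse_order_handles_empty_items():
    # Prod alerting caught a 500 when a client sent an order with an
    # empty items list. Pin the fix down so the bug can't quietly return.
    order = parse_order({"id": "abc123", "items": []})
    assert order.items == []
    assert order.total == 0
```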

Monitoring is often associated with the middle of an app or service's life cycle. It's been running for a while, and all of a sudden something goes wrong: maybe an influx of traffic, or a client making an unexpected request.

The kind of monitoring I'm talking about is for the very beginning of the life cycle, just after deploy, when the new code starts getting traffic. You should be able to deploy, look at a dashboard, and hit a big "Rollback" button if you see any funny business (errors, latency, CPU/memory spikes, etc.).
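If you want to automate that reflex, a rough sketch might look like the following. The metrics endpoint, the threshold, and the deployctl rollback command are all hypothetical stand-ins for whatever your infrastructure actually provides.

```python
# A rough sketch of an automated post-deploy check. The metrics endpoint,
# threshold, and rollback command are hypothetical stand-ins.
import subprocess
import time

import requests

METRICS_URL = "https://metrics.internal/api/error_rate"  # hypothetical endpoint
ERROR_RATE_THRESHOLD = 0.01  # roll back if more than 1% of requests error
WATCH_SECONDS = 300          # watch the first five minutes of traffic


def error_rate() -> float:
    resp = requests.get(METRICS_URL, timeout=5)
    resp.raise_for_status()
    return resp.json()["error_rate"]


def watch_deploy() -> None:
    deadline = time.monotonic() + WATCH_SECONDS
    while time.monotonic() < deadline:
        if error_rate() > ERROR_RATE_THRESHOLD:
            # Funny business: hit the big red button.
            subprocess.run(["deployctl", "rollback"], check=True)  # hypothetical CLI
            return
        time.sleep(15)
    print("Deploy looks healthy.")


if __name__ == "__main__":
    watch_deploy()
```

In practice you'd want something like this wired into the deploy pipeline itself, so the watch starts automatically the moment new code takes traffic rather than relying on someone remembering to stare at a dashboard.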

I love tests. But I also love being able to confidently deploy a change, minor or major, knowing that if something goes wrong I can get back to a good state within minutes. The two go hand in hand. Deploy on a Friday? As long as I have my dashboard and my rollback button.