Intractables
Jun 18, 2021 · 3 min read

XDD part II: MDD, the Monitoring-Driven Development

(part I of the series: Test-Driven Development)

MDD can be viewed as a generalization of Test-Driven Development to a world where software is not only continuously developed, but also continuously operated.

In an idealized way, we can picture it like this:

Before implementing a feature, say a new API endpoint on the backend, the developer creates “gauges” that measure key characteristics of said endpoint, e.g. response time, error rate, and request rate.

The “gauges” are metrics emitted to an appropriate system of record, e.g. Prometheus, InfluxDB, OpenTSDB, or one of the many available SaaS offerings for real-time monitoring.
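
A minimal sketch of what such “gauges” could look like for a hypothetical /v1/widgets endpoint, using the Python prometheus_client library (the metric names, labels, and handler below are illustrative assumptions, not something prescribed by MDD):

```python
# Sketch: "gauges" for a hypothetical /v1/widgets endpoint, using the
# prometheus_client library. Metric names, labels and the handler are
# illustrative assumptions.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total",
    "Requests to the endpoint, labeled by status code",
    ["endpoint", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Response time of the endpoint",
    ["endpoint"],
)

def get_widgets(request):
    """Handler for the new endpoint, instrumented before it is implemented."""
    with LATENCY.labels(endpoint="/v1/widgets").time():
        try:
            body = do_get_widgets(request)
            REQUESTS.labels(endpoint="/v1/widgets", status="200").inc()
            return body
        except NotImplementedError:
            # The feature does not exist yet, so every request fails.
            REQUESTS.labels(endpoint="/v1/widgets", status="404").inc()
            raise

def do_get_widgets(request):
    # Not implemented yet: this stub is what makes the gauges read "red".
    raise NotImplementedError("resource not found")

if __name__ == "__main__":
    # Expose the metrics on :8000/metrics for Prometheus to scrape.
    start_http_server(8000)
```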

On top of these “gauge” metrics, alerts are set up to actively signal a regression.
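
Concretely, with Prometheus-style alerting, the alert for the error-rate gauge above might look like the following rule, sketched here as a Python structure mirroring the rule-file schema (the threshold, durations, and names are assumptions):

```python
# Sketch of a Prometheus alerting rule for the error-rate gauge above.
# The threshold, durations and names are illustrative assumptions.
WIDGETS_ERROR_RATE_ALERT = {
    "alert": "WidgetsEndpointHighErrorRate",
    "expr": (
        'sum(rate(http_requests_total{endpoint="/v1/widgets",status!~"2.."}[5m]))'
        ' / sum(rate(http_requests_total{endpoint="/v1/widgets"}[5m])) > 0.05'
    ),
    "for": "10m",
    "labels": {"severity": "page", "team": "widgets"},
    "annotations": {
        "summary": "Error rate on /v1/widgets has been above 5% for 10 minutes",
    },
}
```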

The developer first makes sure that the gauges show a “broken” feature, since the feature does not exist yet (e.g. a 100% error rate with “resource not found” responses), and that the alerts fire and are delivered through the appropriate channels, e.g. PagerDuty pages, chatroom notifications, etc.

Now the developer implements the feature, deploys it, and hopefully watches the gauges “go green” and the alerts clear.
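
Continuing the sketch (still purely illustrative): “going green” is simply the moment the stub is replaced by a working implementation, so successful responses start incrementing the 2xx counter and the error-rate expression in the alert drops back below its threshold.

```python
# Illustrative continuation of the earlier sketch: the stub is replaced by a
# real implementation, so the gauges "go green" and the alert clears.
def do_get_widgets(request):
    # Real logic instead of raising NotImplementedError.
    return {"widgets": ["w-1", "w-2"]}
```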

Those alerts will now protect the developer whenever the feature breaks in production, for any reason, forever after, just as the automatically run test suite tells us when the code is broken. At this point, the alert routing can be changed from the individual developer to the on-call team (of which said developer is perhaps a member).
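
With an Alertmanager-style setup, for instance, that hand-off can be as small as re-pointing the alert’s route from the developer’s personal receiver to the team’s on-call receiver. A sketch (receiver names and the routing tree are assumptions):

```python
# Sketch of Alertmanager-style routing: alerts labeled team="widgets" now
# page the on-call rotation instead of the individual developer.
# Receiver names and the routing tree are illustrative assumptions.
ROUTING = {
    "route": {
        "receiver": "default",
        "routes": [
            {
                "match": {"team": "widgets"},
                # Was the developer's personal receiver during development.
                "receiver": "widgets-oncall",
            }
        ],
    },
    "receivers": [
        {"name": "default"},
        {"name": "widgets-oncall"},  # e.g. backed by a pagerduty_configs entry
    ],
}
```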

Yay, we now have a recipe for generalizing red-green TDD to the world of continuously operated software!

A few things to note:

  • We didn’t mention testing the correctness of the new endpoint; we merely check that its “vitals” are good. What if a regression results in the endpoint behaving improperly, returning incorrect data or producing wrong side effects?
  • The whole business of creating gauges and alerts looks like a lot of repetitive toil: how can it be streamlined?
  • What about more specialized metrics? For example, say the feature being implemented creates an asynchronous task. Wouldn’t it be appropriate to monitor things like the size of the task queue, the task failure rate, the retry rate, time spent in the queue, etc.? How can developers standardize on the “what” and “how” of monitoring? (See the sketch after this list.)
  • How do we manage this swarm of metrics and alerts as the system evolves? What if alerts are noisy? What if the developer who created the feature has left the company, and the poor soul on call is woken up at 3 a.m. by an alert for that feature?
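
On the standardization question, one common answer is a small shared instrumentation helper that every asynchronous feature reuses, so the “what” and the “how” are decided once. A sketch, again with prometheus_client and invented names:

```python
# Sketch of a shared helper that standardizes task-queue monitoring, so every
# new asynchronous feature gets the same metrics for free.
# The class, metric names and usage are illustrative assumptions.
from prometheus_client import Counter, Gauge, Histogram

class TaskQueueMetrics:
    """Standard set of metrics for one task queue."""

    def __init__(self, queue_name: str):
        self.depth = Gauge(
            f"{queue_name}_queue_depth", "Tasks currently waiting in the queue"
        )
        self.failures = Counter(
            f"{queue_name}_task_failures_total", "Tasks that ended in failure"
        )
        self.retries = Counter(
            f"{queue_name}_task_retries_total", "Task retry attempts"
        )
        self.wait_time = Histogram(
            f"{queue_name}_task_wait_seconds", "Time tasks spent waiting in the queue"
        )

# Usage: every asynchronous feature declares its queue the same way.
thumbnail_metrics = TaskQueueMetrics("thumbnail_generation")
```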

The answer to these questions is familiar: “it depends.”

Treat your monitoring configurations as code.

But our “XDD” approach gives us a key piece of guidance: creating monitoring configurations is a (software) engineering activity. That means the tooling, principles, and processes will often be similar to those of “usual” software engineering: source control, review workflows, expressive DSLs, etc. In other words, treat your monitoring configurations as code.
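
As a sketch of what that can look like in practice, the alert rule from earlier could be generated per endpoint from a small template, written to a rule file, reviewed like any other change, and validated in CI, e.g. with promtool (the function names, file names, and thresholds below are assumptions):

```python
# Sketch: generating Prometheus rule files from code, so monitoring
# configuration lives in source control and goes through review like any
# other change. Function names, file names and thresholds are illustrative.
import yaml  # PyYAML

def error_rate_alert(endpoint: str, threshold: float = 0.05) -> dict:
    """Template a standard error-rate alert for one endpoint."""
    selector = f'endpoint="{endpoint}"'
    return {
        "alert": f"HighErrorRate_{endpoint.strip('/').replace('/', '_')}",
        "expr": (
            f'sum(rate(http_requests_total{{{selector},status!~"2.."}}[5m]))'
            f" / sum(rate(http_requests_total{{{selector}}}[5m])) > {threshold}"
        ),
        "for": "10m",
        "labels": {"severity": "page"},
    }

rules = {
    "groups": [
        {"name": "api-endpoints", "rules": [error_rate_alert("/v1/widgets")]},
    ]
}

with open("api_alerts.rules.yml", "w") as f:
    yaml.safe_dump(rules, f, sort_keys=False)
# The generated file is committed, code-reviewed, and can be validated in CI
# with `promtool check rules api_alerts.rules.yml`.
```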
