Once upon a time, XML was the greatest thing since sliced bread. When combined with a schema definition file, the ability to strictly define interfaces between systems opened up a world of possibilities.
At the time, my team was very much in the camp of “light interfaces”: put the minimum required fields in the interface, so that clients would be resilient if the underlying object changed. As a result, a lot of the functions in our API required nothing more than an id as a parameter.
As a part of implementing our API, we had a downstream system to talk to. And the architect of that system had built an interface that required both an id and a name. Because of the strict schema validation present in their services, that meant we had to look up the name via the id every time we wanted to call their service. This lookup was required *everywhere* in our code, so we had a meeting to discuss the motivation behind the fatter interface.
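A minimal sketch of the pattern, in Python, shows where the cost lands. All of the names here (`lookup_name`, `call_downstream`, the directory) are invented for illustration; the real systems spoke XML, not Python dicts.

```python
# Hypothetical sketch of the extra round trip the fat interface forced on
# every client. The downstream schema rejects any request without a name,
# even though most callers never need one.

DIRECTORY = {42: "Alice Example"}  # stand-in for the name-lookup service


def lookup_name(obj_id):
    """The lookup the downstream system 'saved' -- now done by every caller."""
    return DIRECTORY[obj_id]


def call_downstream(obj_id):
    # We only have an id, but the schema demands a name too, so we must
    # fetch it first. Multiply this by every call site in the portfolio.
    payload = {"id": obj_id, "name": lookup_name(obj_id)}
    return payload
```

The point of the sketch: the lookup did not disappear, it was duplicated into every client of the downstream service.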
“Why do you require the name field? The id should be sufficient to get everything you’d ever need”
“The name is just a convenience. It allows us to give a human-readable field to the end user”
“We don’t have an end user, we’re totally back end. We’ll never need a name”
“Well, you can just look it up before you call our services”
“That’s kind of the point: we have to look up something that only your system needs. If your system needs it, why don’t you look it up?”
“Like I said, it's a convenience. By putting it in the interface, we save ourselves a lookup”
“But you’re not saving a lookup. You just moved it to your clients”
“Yeah but we can’t meet our performance numbers if we have to do a lookup every time”
“So you’re saying the only way you can meet your performance requirements is to push all your required work to other systems?”
There were a lot of problems on that program: strict schema validation and poorly designed interfaces were just some of them. But the problem I want to highlight here is that the architect of the downstream component was optimizing that downstream component at the expense of every other component in the portfolio. It was the very definition of a local optimization.
This local optimization had some obvious side effects. It blurred the lines where there should have been clear boundaries between the component business domains. It caused unnecessary lookups, since not every call needed the name field for display purposes (as it turned out, only about 10% of those calls ever resulted in a human display; the name field was thrown away the other 90% of the time). And that meant that the entire portfolio was now running less efficiently so that this one component could run a little faster.
The answer here is to measure the performance across the entire system, and optimize against that performance number. And by “across the entire system” I mean you have to measure performance from the UI. Yes, I know that means measuring from the client's machine all the way down to your data center and back out again, and that the client's network is a part of that. Which is out of your control. And you still need to measure it, because this is what your client is experiencing. If you're measuring at your service boundaries, then you're optimizing locally, and setting yourself up for a poor overall experience.
A big part of the Site Reliability Engineering philosophy is that your Service Level Objectives (SLOs) are not absolute: your system meets some measure of performance some percent of the time. “This page loads in less than 500ms 99.9% of the time over a rolling four week period”. This gives you the ability to account for the rare case where grandma is working from her 60MHz Pentium using the AOL dial-up network you set up for her back in the late 90s. In other words, build the variability of the things outside your control into your SLOs.
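That style of SLO is easy to check mechanically. Here's a minimal sketch, using the 500ms / 99.9% numbers from the example above; the sample latencies are made up to show a window that stays within its error budget.

```python
# Sketch of evaluating a latency SLO over a window of request latencies.
# Threshold and target mirror the example in the text; the data is invented.

SLO_THRESHOLD_MS = 500
SLO_TARGET = 0.999  # 99.9% of page loads under 500 ms


def slo_compliance(latencies_ms):
    """Fraction of requests in the window that met the latency threshold."""
    good = sum(1 for latency in latencies_ms if latency < SLO_THRESHOLD_MS)
    return good / len(latencies_ms)


# 10,000 requests, 5 of them slow (grandma on dial-up). The SLO tolerates
# up to 10 slow requests in this window, so we are still within budget.
samples = [120] * 9995 + [2500] * 5
compliance = slo_compliance(samples)  # 9995 / 10000 = 0.9995
within_budget = compliance >= SLO_TARGET  # True
```

The gap between the target and 100% is your error budget: the explicit allowance for the grandmas of the world, and for everything else you can't control.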
By measuring the performance of your system across the entire system, you also have the ability to instrument across the entire system. This should give you the ability to find where the performance bottlenecks are, and fix them in place. Without this end-to-end visibility, any localized fix has the chance to cause problems in another area of the system.
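As a rough illustration of what that instrumentation buys you, here's a toy sketch of timing each hop of a request so the bottleneck is visible against the whole-system number. The span names are invented; a real system would use a tracing library rather than a dict of timings.

```python
# Toy end-to-end timing spans: record how long each hop of a request takes
# so the slowest component shows up against the total. Illustrative only.
import time
from contextlib import contextmanager

spans = {}


@contextmanager
def span(name):
    """Record the wall-clock duration of a named section of the request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans[name] = time.perf_counter() - start


# Simulated request path: two hops inside one end-to-end measurement.
with span("total"):
    with span("auth_service"):
        time.sleep(0.01)
    with span("downstream_call"):
        time.sleep(0.02)

# With every hop measured under one total, the bottleneck is obvious,
# and you can fix it where it lives instead of optimizing locally.
slowest = max((k for k in spans if k != "total"), key=spans.get)
```

The design point is the `total` span: every local fix gets judged against the end-to-end number, not against the component's own clock.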
A last note on this style of optimization: focusing on any one metric is just a different kind of localized optimization. Optimizing on security over usability means you have secure data that’s near impossible to read. Optimizing on uptime as opposed to latency just means you’ve built a reliable system that’s unusable to the end users. As Charity Majors* often says “Nines don’t matter if your users aren’t happy.”
Finding good SLOs for your system is not an easy problem to solve, but building them in a way that takes a holistic view of the entire system is a great approach to ensuring that you’re optimizing over the system, and not performing a localized optimization that causes pain somewhere else.
* Charity Majors is the CEO of Honeycomb and is a leading voice in the world of observability today. Follow her at @mipsytipsy for a lot of great content.