Hyrum's Law and API design

Written by Elias Szabo | Jan 26, 2024 8:52:03 PM

This is part of a sequence of posts on engineering principles.

What is Hyrum's Law?

Hyrum's Law is an observation made by Hyrum Wright, and popularized by Titus Winters in Software Engineering at Google. It goes like this:

With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.

Simply put: if your system does something reliably, and you have enough users, someone will depend on that behavior. On its surface, this seems simple. The pernicious part is that the dependency does not respect what you say your system supports. For example, suppose I build a service that regularly responds in 100ms. Even if response time is not part of the contract I've published, Hyrum's Law implies that once my service becomes popular enough, someone will eventually require that response profile to function, and any significant change to it will cause issues for my dependents.
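To make the timing example concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: `FakeService`, its `latency_ms` field, and the `call_with_timeout` client with its 150ms budget. Latency is modeled as a plain number rather than a real delay so the example stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class FakeService:
    """Hypothetical service. Its published contract promises an upper-cased
    reply and says nothing about how long the reply takes."""

    latency_ms: float  # observable behavior, but not part of the contract

    def handle(self, request: str) -> tuple[str, float]:
        # Return the reply together with the simulated time it took.
        return request.upper(), self.latency_ms


def call_with_timeout(service: FakeService, request: str,
                      timeout_ms: float = 150.0) -> str:
    """Hypothetical client that has quietly baked the observed ~100ms
    response time into its own timeout budget."""
    reply, elapsed_ms = service.handle(request)
    if elapsed_ms > timeout_ms:
        raise TimeoutError(f"reply took {elapsed_ms}ms, budget was {timeout_ms}ms")
    return reply


if __name__ == "__main__":
    print(call_with_timeout(FakeService(latency_ms=100.0), "ping"))
    try:
        call_with_timeout(FakeService(latency_ms=400.0), "ping")
    except TimeoutError as err:
        print("client broke without any contract change:", err)
```

The second call fails even though the service still honors everything it promised. The client's timeout was a dependency on behavior the contract never covered.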

Exploring the Tradeoff

We've arrived at the most common situation you'll find in software: a tradeoff among different priorities. Hyrum's Law suggests a very cautious approach to interface design: if you know that anything you do might be (ab)used, doesn't that increase the pressure to get it right before you develop a large set of clients that lock you in? On the other hand, care usually comes at the cost of speed. If you obsess over every aspect of a system, you'll generally iterate less and get fewer touch points for feedback on your design. I take inspiration here from organizational psychologist Adam Grant's research (see Originals), which finds that the probability of impactful work is generally fairly low per effort, and that success is correlated with the volume of attempts. In other words, you are more likely to find a hugely compelling solution if you try many things quickly than if you obsess over one thing for a long time. At the same time, if you do find something hugely successful, Hyrum's Law tells you that you will eventually be the victim of your own success: by the time you know it, you'll likely be locked in to however you built it on the speedy road to release.

What to do?

Answering this question requires you to be clear-eyed about how likely your project is to reach this level of adoption. At a high level, if you ignore Hyrum's Law and pretend your users will only rely on what you publish, you can afford to get the public contract wrong for longer, because you can just change it down the road. At the start of a project, and for less established projects, that probably makes sense. Once you reach meaningful scale and adoption (or once you see enough velocity that you're confident the project will reach the level where Hyrum's Law plays an important role), it's probably time to change your approach.

The best solution I've seen to this problem is Netflix's "chaos engineering" strategy, which they described publicly on the Netflix tech blog in 2011 (the blog is now hosted on Medium). The idea is simple: the best way to prepare for unusual events is to make them routine. In other words, we should regularly and intentionally induce failures in production, because once they become routine, they become expected. At a high level, chaos engineering (and the Chaos Monkey tool that Netflix open sourced) uses the double-edged sword of Hyrum's Law for good: if users come to depend on every observable behavior of the API, then making rare outages part of that observable behavior forces them to account for failure in their designs.

There are many variants of this style of solution: randomly impair instances in production, periodically simulate partial or full network outages or regional impairments, or design your system to inject latency so that response times vary widely and clients cannot come to depend on any particular timing. All of these are viable, and the need for them grows as adoption of your offering increases.
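As a rough illustration of the latency and failure injection variants, here is a minimal Python sketch. It is not how Chaos Monkey or any other real tool works. The `with_chaos` wrapper, the `InjectedFault` exception, and the default rates are all assumptions made up for this example. The random source and the sleep function are passed in so the behavior can be controlled in tests.

```python
import random
import time
from typing import Callable


class InjectedFault(Exception):
    """Raised when the wrapper deliberately fails a call (hypothetical)."""


def with_chaos(
    handler: Callable[[str], str],
    *,
    rng: random.Random,
    failure_rate: float = 0.01,
    max_extra_latency_ms: float = 200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], str]:
    """Wrap a handler so callers routinely see failures and variable latency."""

    def wrapped(request: str) -> str:
        # Fail a small, configurable fraction of calls on purpose.
        if rng.random() < failure_rate:
            raise InjectedFault("deliberately injected failure")
        # Add a random delay so no single response time looks guaranteed.
        sleep(rng.uniform(0.0, max_extra_latency_ms) / 1000.0)
        return handler(request)

    return wrapped


if __name__ == "__main__":
    chaotic_upper = with_chaos(str.upper, rng=random.Random(), failure_rate=0.2)
    for _ in range(5):
        try:
            print(chaotic_upper("ping"))
        except InjectedFault as err:
            print("caller must handle:", err)
```

A client built against the wrapped handler has to include retries and tolerant timeouts from day one, because the failures and slow responses are part of what it observes.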

Balancing Hyrum's Law in Engineering

So where does this leave us? Hyrum's Law tells us that users will end up depending on everything our solutions do, both declared and undeclared. A good way to handle that knowledge is to ignore it for early-stage projects in order to prioritize speed and feedback, then gradually devote more attention to design and guardrails as adoption and success grow. Eventually, full chaos frameworks make sense as a way to reduce the predictable surface area of an offering to only what you want your users to depend on.