Reliability as a product feature · Engineering Playbook

Every team I have run lives with the same pull. There is always pressure to ship faster, and the fastest way to ship faster is to lower the guardrails. So quality slips, quietly, because nobody actually decided to let it slip. Then something breaks in a way a customer feels, and attention snaps back hard. Reliability only gets full attention in wartime; in peacetime it drifts.

It drifts because most teams treat reliability as hygiene: a background property the system either has or does not, like cleanliness. Hygiene is invisible and nobody puts it on a roadmap. What changes how you lead is treating reliability as a feature instead. A feature is something users notice and something you decide how much of to build, which means it competes with everything else on the roadmap for time. Once reliability is a feature, the question stops being "are we reliable enough," which is unanswerable, and becomes "how much reliability are we buying, when, and on which surface," which is a decision a leader can actually make.

Before treating reliability as a budgeting problem, look at what its absence can cost. In 2012 the trading firm Knight Capital pushed a deployment that woke up an old, dormant code path on one of its servers. For roughly forty-five minutes the system fired millions of unintended orders into the market. The firm lost about 440 million dollars in those minutes, never recovered, and was acquired within the year. No feature it shipped that decade mattered as much as the reliability it did not have. That is the real case for treating reliability as a feature: its absence is not a slightly degraded experience; it can be the end of the product.

Decide when reliability becomes a feature

Reliability is not equally a feature on day one and in year three. A brand-new product with a lean team and no proven demand usually should not be investing in reliability, and the mature thing is to say so out loud rather than let it happen by default. This is what the words alpha and beta were always for. A beta label is a deliberate, public statement: we have decided not to make this dependable yet, because we are still learning whether anyone wants it, and our handful of engineers is better spent on that question. Scheduling reliability into the product's life, and naming the moment it becomes first-class, is itself the work.

You can get this wrong in both directions. One direction is treating a beta like a bank: hardening reliability before a single user has chosen the product, which spends your scarcest people on your least certain bet. The other, more dangerous direction is treating a generally available product as if it were still a beta, shipping fragile code after customers have come to depend on it. In 2008 the new Terminal 5 baggage system at Heathrow had passed its tests, then met real conditions on opening day and could not cope. Around 42,000 bags went missing and more than 500 flights were cancelled over the following days. The software was, in effect, beta-grade, launched at full scale and full stakes. Nobody had made the lifecycle decision, so the opening day made it for them.

This is also where reliability and security part ways. You can defer reliability on purpose, and a beta is exactly that deferral made visible. You cannot defer security and privacy the same way. The small set of things a team refuses to compromise holds from day one, even in alpha. Naming reliability as something you will buy later does not put security on the same schedule. (Where the reliability bar sits inside what "done" means is its own decision.)

Put the tradeoff in the open as one number

Once you have decided reliability matters for a surface, the next mistake is leaving "how reliable" to a feeling. Product wants to ship, engineering wants to harden, both argue from conviction, and the side that argues loudest that week wins. The site-reliability practice that fixes this, made widely known by Google's SRE teams, is the error budget, and the mechanism is worth walking through because it is what does the work.

You start by setting a service level objective, an SLO: a target for how often the system does what users expect, say 99.9 percent of requests succeed over a month. Perfect reliability, 100 percent, is deliberately not the target, because it is both impossible and not worth its price. The gap between your target and perfection is the error budget. At 99.9 percent, one request in a thousand is allowed to fail; expressed as time, that is 0.1 percent of a thirty-day month, roughly forty-three minutes, in which you are allowed to be unreliable. That budget is a quantity you get to spend. While it is healthy, the team ships fast and takes risks freely, because there is room to absorb a mistake. When it is nearly spent, a policy agreed in advance kicks in: feature work slows or stops and engineers move onto reliability until the budget recovers.
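The arithmetic is small enough to check by hand, and worth writing down once so nobody argues about it later. What follows is a minimal sketch, not a standard tool: the function names and the thirty-day window are my assumptions for illustration.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return how long the service may be unreliable within the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    allowed_fraction = (100 - slo_percent) / 100
    return window * allowed_fraction


def budget_remaining(slo_percent: float, window: timedelta,
                     unreliable_so_far: timedelta) -> float:
    """Return the fraction of the budget still unspent (negative once breached)."""
    return 1 - unreliable_so_far / error_budget(slo_percent, window)


if __name__ == "__main__":
    month = timedelta(days=30)  # assumption: a thirty-day window
    budget = error_budget(99.9, month)
    print(f"budget: {budget.total_seconds() / 60:.1f} minutes")  # prints "budget: 43.2 minutes"
    # Thirty minutes of trouble so far leaves a bit under a third of the budget.
    print(f"remaining: {budget_remaining(99.9, month, timedelta(minutes=30)):.0%}")
```

The same function shows how quickly the room shrinks: at 99.99 percent the thirty-day budget is a little over four minutes.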

What matters is not the number itself but that one number replaces two opposing convictions. Product and engineering stop pulling against each other from taste; they both watch the same gauge, and the gauge decides who gets the next sprint. The part teams skip, and the part that matters most, is the policy. A budget with no agreed consequence is just a dashboard nobody looks at the moment a deadline gets tight. The number has to arrive with its "what happens when we breach it" already decided, settled while nobody is under pressure.
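One way to make the policy hard to skip is to write it as a rule that takes the gauge reading and returns the mode the team is in. This is a sketch under stated assumptions: the three modes and the 25 percent threshold are hypothetical placeholders, and the real values are whatever product and engineering agree to while nobody is under pressure.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SHIP_FREELY = "ship freely"
    SLOW_FEATURES = "slow feature work"
    RELIABILITY_ONLY = "reliability work only"


@dataclass(frozen=True)
class BreachPolicy:
    # Hypothetical thresholds, expressed as the fraction of budget remaining.
    slow_below: float = 0.25    # "nearly spent": feature work slows
    freeze_at_or_below: float = 0.0  # spent: feature work stops

    def mode(self, budget_remaining: float) -> Mode:
        """Map the fraction of error budget left to the agreed response."""
        if budget_remaining <= self.freeze_at_or_below:
            return Mode.RELIABILITY_ONLY
        if budget_remaining < self.slow_below:
            return Mode.SLOW_FEATURES
        return Mode.SHIP_FREELY


if __name__ == "__main__":
    policy = BreachPolicy()
    for remaining in (0.80, 0.10, -0.05):
        print(f"{remaining:+.0%} of budget left -> {policy.mode(remaining).value}")
```

The point of the sketch is that the consequence is decided by the number and not by whoever is in the room when the number turns red.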

Make the budget operable with severity tiers

A monthly budget tells you whether you have room, but it does not tell you what to do at two in the morning when one specific thing is broken. For the day to day you need a second tool, also agreed before the incident: a severity ladder. The version my teams run is plain:

P0. Something broken or unsafe for users right now. Same-day fix, and it pulls people off whatever they were doing.
P1. Serious but not on fire. Goes into the immediate next release.
P2. Real but minor. Gets done when there is slack.

Writing this down in advance means you never re-litigate priority in the middle of a fire; the category sets the response.
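The ladder is small enough to keep as data next to the code, so the response is looked up rather than debated. A minimal sketch follows; the type names and the two triage questions are my assumptions, and the three responses are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"


@dataclass(frozen=True)
class Response:
    fix_by: str
    pulls_people_off_current_work: bool


# The agreed response for each tier, written down before any incident.
LADDER = {
    Severity.P0: Response("same day", True),
    Severity.P1: Response("immediate next release", False),
    Severity.P2: Response("when there is slack", False),
}


def triage(broken_or_unsafe_now: bool, serious: bool) -> Severity:
    """Hypothetical triage rule: two yes/no questions, asked in this order."""
    if broken_or_unsafe_now:
        return Severity.P0
    if serious:
        return Severity.P1
    return Severity.P2


if __name__ == "__main__":
    tier = triage(broken_or_unsafe_now=False, serious=True)
    print(tier.value, "->", LADDER[tier].fix_by)  # prints "P1 -> immediate next release"
```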

Two things go wrong here, and both are common. The first is that everything becomes a P0, usually because nobody wants their issue deprioritized, and the ladder collapses into a single rung the team thrashes against. The second is the P2 graveyard, where the lowest tier becomes a place issues go to be forgotten and small defects pile up into exactly the slow reliability bleed that peacetime produces. A healthy ladder needs an honest top (real triage rather than everything urgent) and a tended bottom (P2s that actually resurface). When a P0 does fire, how the on-call rotation absorbs it without burning people out is its own topic, and turning that incident into a permanently tighter bar is another.
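Both failure modes can be watched for with a periodic check over the issue tracker. The sketch below is illustrative only: the Issue record, the 20 percent P0 share, and the 90-day P2 age limit are hypothetical values I picked for the example, not numbers any team should adopt unexamined.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Issue:
    severity: str                  # "P0", "P1" or "P2"
    opened: date
    closed: Optional[date] = None  # None while the issue is still open


def ladder_warnings(issues: Sequence[Issue], today: date,
                    max_p0_share: float = 0.2,      # hypothetical limit
                    max_p2_age_days: int = 90) -> List[str]:  # hypothetical limit
    """Flag a top-heavy ladder and a neglected bottom tier."""
    warnings = []
    if issues:
        p0_share = sum(i.severity == "P0" for i in issues) / len(issues)
        if p0_share > max_p0_share:
            warnings.append(f"top-heavy: {p0_share:.0%} of issues are P0")
    stale = [i for i in issues
             if i.severity == "P2" and i.closed is None
             and (today - i.opened).days > max_p2_age_days]
    if stale:
        warnings.append(f"graveyard: {len(stale)} open P2s older than {max_p2_age_days} days")
    return warnings


if __name__ == "__main__":
    today = date(2025, 6, 1)
    issues = [Issue("P0", date(2025, 5, 30)), Issue("P0", date(2025, 5, 31)),
              Issue("P2", date(2024, 11, 1))]
    for warning in ladder_warnings(issues, today):
        print(warning)
```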

Price it in the units the business respects

Here is the political problem with reliability: it succeeds invisibly. When the work goes well, nothing happens, and a quarter of nothing-happening is hard to defend against a roadmap full of visible features. So in peacetime the budget conversation is lost by whoever is arguing "but it is good engineering." Engineering pride is not a unit the business trades in. To keep reliability funded when nothing is currently on fire, you have to price it in units the rest of the org already respects: revenue lost to refunds and service credits, customers who churn, renewals that shrink, expansion deals that stall, support load, and sales cycles that lengthen because a buyer got nervous. (Translating engineering work into business outcomes is its own subject; here you are just applying it to reliability.)

The 2024 CrowdStrike incident is the sharp version of why business framing beats uptime framing. A single faulty update took down millions of machines, grounding flights and disrupting hospitals, with industry loss estimates running into the billions and one airline alone attributing several hundred million dollars of damage to it. What is instructive for a leader is what came after: customer retention reportedly held above 97 percent, so by a naive churn metric almost nothing broke. But sales cycles lengthened as scrutiny rose, and competitors began fielding calls from customers who wanted to explore alternatives. The cost was real and it was a business cost; it simply did not show up in the one number most people would have checked.

There is a slower version of the same cost, and it is the one peacetime drift actually produces. When a platform quietly cuts its guardrails and starts having repeated outages, you rarely get a single fatal event. You get a steady bleed: each outage feeds the story that the product is not dependable anymore, and users drift to rivals over months. The recent run of outages at X, following sustained cost-cutting, is this shape. No single day ended the product; trust just leaked out slowly. A leader who only watches for the spike will miss the bleed entirely, which is why the standing business-unit justification has to be alive in peacetime rather than assembled after the fact.

Spend only where the user can feel it

The most common critique of error budgets is also correct: most teams over-invest in reliability. The cost of reliability is not linear but closer to exponential in the number of nines. Going from 99 percent to 99.9 is real work, but going from 99.9 to 99.99 costs far more for a tenth of the gain, and somewhere up that curve you have put your best engineers on the least valuable problem in the company. The plainest way to see it: a user on flaky cafe wifi or a patchy mobile signal experiences the whole product only as well as that weakest link, so a backend you pushed to five nines still reaches them at no better than the reliability of their own connection. The extra nines you bought are real engineering work they have no way to feel. Past a point, more reliability is something the user literally cannot perceive, and reliability nobody can perceive is not a feature but a cost.
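The weakest-link point can be put in numbers. The sketch below assumes the backend and the user's connection fail independently, so their availabilities multiply; that independence and the 99 percent connection figure are assumptions for illustration, not measurements.

```python
import math


def end_to_end(*availabilities: float) -> float:
    """Availability of components in series, assuming independent failures."""
    return math.prod(availabilities)


def allowed_minutes(availability: float,
                    window_minutes: float = 30 * 24 * 60) -> float:
    """Unreliable minutes permitted per window (assumed: thirty days)."""
    return (1 - availability) * window_minutes


if __name__ == "__main__":
    connection = 0.99  # hypothetical flaky connection
    for backend in (0.999, 0.9999, 0.99999):
        felt = end_to_end(backend, connection)
        print(f"backend {backend:.3%} -> user experiences {felt:.3%}")
    # Each extra nine removes a tenth of what the previous one removed.
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%}: {allowed_minutes(target):.1f} minutes per 30 days")
```

With a 99 percent connection, moving the backend from three nines to five changes what the user experiences by roughly a tenth of a percentage point, while the chain as a whole never rises above the connection's own 99 percent.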

It comes down to which surface you are on. On most surfaces three nines is an honest and sufficient target and pushing further is waste. But some products exist precisely because they are boringly dependable, and there reliability is not hygiene but the headline feature people are paying for:

Payment processing, where a dropped transaction is the whole risk.
Cloud storage and infrastructure, where other businesses build on top of you.
Messaging and delivery at scale, or a database of record.

On those surfaces deliberate over-investment is not waste but the product. The leader's job is to know whether the surface in front of them is one where reliability is the differentiator or one where it is a quiet backstop, and to spend accordingly: generously in the first case, frugally in the second.

When the unit isn't uptime

Everything above quietly assumes you can measure reliability as uptime or a success rate. For a growing class of features, mostly the ones backed by a model, that assumption breaks. A model-backed feature is non-deterministic: the same input can produce a good answer today and a subtly wrong one next week, because the model or the prompt shifted underneath you, and your uptime dashboard will happily report every one of those wrong answers as a successful request. The unit of reliability is no longer "did it respond" but "was the answer consistently good," and a green dashboard can sit on top of a feature users have quietly stopped trusting.

Two things follow. First, trust in this kind of feature erodes far faster than it builds. A user will forgive a slow load, but one confidently wrong answer outweighs ten good ones, and after it they stop relying on the feature. Second, the hard reliability problems are not in the model but around it: in the orchestration, the state handling, the retries, and the partial failures. Treat the model as a probabilistic component inside an otherwise deterministic system, and instrument heavily enough that silent degradation surfaces in your own metrics before it surfaces in user complaints. The budget and the severity ladder from earlier still apply; what changes is the meter you point at them. (The specific token, latency, and reliability budgets that agentic systems run inside are their own subject.)
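To see why the meter has to change, compare the two rates over the same traffic. This is a sketch under assumptions: the Sample record is hypothetical, and so is the idea that some fraction of answers gets graded, whether by human review or an automated check. How you grade is outside what this example covers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    responded: bool                     # what an uptime dashboard records
    answer_good: Optional[bool] = None  # hypothetical grade; None means not graded


def success_rate(samples: Sequence[Sample]) -> Optional[float]:
    """Share of requests that got any response at all."""
    if not samples:
        return None
    return sum(s.responded for s in samples) / len(samples)


def quality_rate(samples: Sequence[Sample]) -> Optional[float]:
    """Share of graded responses whose answer was judged good."""
    graded = [s for s in samples if s.responded and s.answer_good is not None]
    if not graded:
        return None
    return sum(s.answer_good for s in graded) / len(graded)


if __name__ == "__main__":
    # Every request responded, but two of the ten graded answers were wrong.
    traffic = [Sample(True, True)] * 8 + [Sample(True, False)] * 2
    print(f"success rate: {success_rate(traffic):.0%}")  # prints "success rate: 100%"
    print(f"quality rate: {quality_rate(traffic):.0%}")  # prints "quality rate: 80%"
```

The budget and breach policy can then be pointed at the quality rate instead of the success rate, which is the change of meter described above.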

Pull it together and reliability stops being a property your system either has or lacks, the idea that lets it drift in peacetime and blow up in wartime. It is a feature, which means it is a set of decisions a leader owns:

How much to buy. Set as a budget in the open.
How fast to respond. Set as a severity ladder in advance.
Why it is worth funding. Argued in revenue and churn rather than craft.
Where to spend it. Generously where the user feels it, frugally where they cannot.
What unit to measure it in. Not always uptime.

Make those decisions deliberately and on the record, and reliability stops being the thing you only remember after it breaks. "It worked" was always supposed to include "and it kept working"; that is the outcome a team actually owns.

Starter kit

A reliability budget you fill in for one surface, before it breaks

Pick one surface (a service, a feature, a product). Fill every cell before the next incident, not during it.

Decision	Your call for this surface
Surface	___
Lifecycle stage	alpha / beta / generally available: ___
Is reliability the differentiator here?	yes (spend generously) / no (three nines is plenty): ___
Target (SLO)	___ % over ___ (leave blank if not yet first-class, and say so)
Breach policy	when the budget is nearly spent, we ___
P0 response	same-day, pulls in: ___
P1 response	next release: ___
P2 response	resurfaces when: ___
Business cost if this breaks	refunds / churn / stalled deals / longer sales cycles: ___
Unit of reliability	uptime / success rate / answer-quality (for model-backed features): ___
Last reviewed	___

Run this when: you are starting a new surface, or it is peacetime and nobody has looked at these calls since the last incident.

A reliability number with no breach policy and no business cost attached is a dashboard nobody honors the moment a deadline gets tight.