Sarthak Garg

Tying engineering to user and business outcomes

Teams that measure shipping without impact drift into irrelevance.

Most of what your team ships will not work.

That sounds harsh, so look at the numbers from teams with enough traffic to actually know. When Microsoft runs a controlled experiment on a feature, roughly a third of the time the metric moves the way they hoped, a third of the time it does nothing, and a third of the time it moves the wrong way. Google and Bing report that only ten to twenty percent of their experiments produce a positive result. At an early-stage startup, where you have less data and less discipline, the honest hit rate for shipped features is unlikely to be better; call it one in five. So for every five things your team builds, four are roughly neutral or worse on the outcome they were meant to change.

Now sit with what that means if you are not measuring. You shipped five features. You feel productive, and so does the team. The burndown looks great. But one of those five moved the world and four did not, and you cannot tell which one. You cannot pour more into the winner because you do not know it is the winner. You cannot kill the losers because they look exactly like the winner from the inside. At that point you are not running a team; you are running a feature factory with the lights off.

I have lived the version of this that is slower and more embarrassing: a whole quarter where we shipped constantly, demoed well, closed tickets, and a month after the quarter ended someone asked what any of it had done for the user, and the room went quiet. Nobody had wired up the answer. The work was real, but the line from it to anything that mattered was missing.

If you have read anything about product management in the last decade, you already know the slogan: outcomes over output. Josh Seiden's framing is the clean one. An outcome is a change in human behavior that drives a business result. A shipped feature is only a guess that such a change will follow. So the team should be handed the outcome to chase rather than the solution to build. Most engineers half-know this by now. The trouble is that the writing almost always stops at the slogan. It tells you to care about outcomes and leaves you standing in front of a merged pull request with no idea how to connect it to a behavior change three steps downstream. This essay is about that plumbing: the engineering-side work of building line-of-sight from a change to a result, done honestly enough that you can trust what it tells you.

And this is engineering's own work to do. A product manager can want an outcome, but only the people who wrote the change can make it emit the signal that proves it moved one. The measurement is part of the deliverable, built into the change rather than written up afterward. A feature is not done when it merges; it is done when it can tell you whether it changed anything.

One more thing has changed, and it is why this matters more now than it did five years ago. Writing code is becoming nearly free. When the cost of producing output collapses, the thing that separates a productive team from a fast generator of plausible noise is no longer how much you can build but whether you can tell if what you built worked. Hold that thought; it reshapes everything below, and I will come back to it at the end.

Here is the plumbing.

Pick the rung you can actually move

Between a merged change and a business result there is a ladder. From the bottom up:

  1. The feature shipped.
  2. Adoption: whether anyone actually uses it.
  3. A leading behavioral proxy: some action that tends to predict the result you care about.
  4. The lagging business metric: retention, revenue, or cost.

The instinct is to claim the top rung. "This feature will improve retention." Maybe it will, but retention moves for a dozen reasons over months, and your feature is one small input buried in the noise. You will never see your signal up there. So pick the highest rung you can both influence and instrument, and then say out loud how many rungs sit between your code and the result you ultimately care about. If your feature is meant to lift retention but the only thing you can cleanly measure is whether people complete the new flow, then completion is your outcome for now, and you state plainly that there are two more rungs of inference between completion and retention. Honesty about the distance beats false precision every time. A claimed link to revenue that you cannot actually see is worth less than a real link to a proxy that you can.
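One way to keep yourself honest about the distance is to write the ladder down as data. What follows is a minimal sketch in Python, not a standard taxonomy: the `Rung` type, the `can_influence` and `can_instrument` flags, and the example rung names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    name: str
    can_influence: bool   # can this change plausibly move it?
    can_instrument: bool  # can we cleanly measure it today?


def pick_rung(ladder: list[Rung]) -> tuple[Rung, int]:
    """Return the highest rung that is both movable and measurable,
    plus how many rungs of inference remain above it."""
    for index in range(len(ladder) - 1, -1, -1):
        rung = ladder[index]
        if rung.can_influence and rung.can_instrument:
            return rung, len(ladder) - 1 - index
    raise ValueError("no rung is both influenceable and instrumentable")


# Ordered bottom to top, mirroring the ladder above (hypothetical values).
ladder = [
    Rung("feature shipped", True, True),
    Rung("adoption: flow completed", True, True),
    Rung("leading proxy: returns within 7 days", True, False),
    Rung("lagging metric: retention", False, True),
]

rung, distance = pick_rung(ladder)
print(f"Outcome for now: {rung.name}; {distance} rungs of inference remain")
```

The useful part is the second return value. It forces the number of unverified inferences into the open instead of letting it hide inside a sentence about retention.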

Pre-register what counts as working

Decide what "it worked" means before you build rather than after you ship.

This is the single cheapest discipline in the whole practice and the one teams skip most. Before the work starts, write down the behavior you expect to change and the threshold that would count as success. "We expect completion of the onboarding flow to rise from sixty percent to at least seventy-five within three weeks." That sentence costs you ten minutes and it kills the most common failure in outcome measurement, which is the reverse-engineered win. Without it, whatever number you observe afterward gets narrated as a success, because the human writing the narrative is the human who shipped the thing. With it, the number either cleared the bar you set in advance or it did not, and you no longer get to move the goalposts. The same discipline applies to reliability work: write the target before the quarter rather than the apology after the incident. This pairs with defining done, where "outcome achieved" is part of the definition of done. Here we are making that claim verifiable.
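The ten-minute sentence can also live next to the code. Here is a minimal sketch, assuming a hypothetical `Prediction` record that you commit before the work starts; the metric name and the deadline date are invented for the example, and the thresholds are the ones from the sentence above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: the bar cannot be edited after the fact
class Prediction:
    metric: str
    baseline: float
    target: float
    deadline: date

    def verdict(self, observed: float, observed_on: date) -> str:
        # A reading taken after the deadline does not count as a win.
        if observed_on > self.deadline:
            return "missed: read after the deadline"
        if observed >= self.target:
            return "cleared the pre-registered bar"
        return "missed the pre-registered bar"


onboarding = Prediction(
    metric="onboarding_completion_rate",  # hypothetical metric name
    baseline=0.60,
    target=0.75,
    deadline=date(2025, 3, 21),  # hypothetical: three weeks after launch
)

print(onboarding.verdict(observed=0.71, observed_on=date(2025, 3, 20)))
```

A rise from sixty to seventy-one percent is exactly the kind of number that gets narrated as a success afterward. Against a bar written in advance, it is a miss, and the record says so.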

Buy the cheapest honest counterfactual you can afford

The hard question under every outcome claim is: compared to what? Completion rose to seventy-five percent, but would it have risen anyway? This is where attribution gets genuinely difficult, and I want to concede that fully rather than wave it off. Outcomes have many causes. Correlation is not causation. The honest answer to "did our work cause this" is usually "partly, probably, and I cannot fully separate it from the three other things that changed that month."

So buy the best counterfactual your stakes can justify. A staged rollout or a holdout group is the gold standard, because the users who did not get the change tell you what would have happened anyway. When you cannot run even a crude experiment, a before-and-after comparison can work, as long as you name the confounders out loud rather than pretending they do not exist. And when you have nothing but a correlation, say so, and downgrade your confidence accordingly. Match the rigor to the stakes: a small UI tweak does not need a randomized trial, but a six-month platform bet that consumes a third of your team had better come with something more than a hopeful line on a graph. The skeptics who say attribution is theater are right more often than measurement enthusiasts admit. That is not a reason to give up. It is a reason to be honest about how much you actually know.
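To make "compared to what" concrete, here is a rough sketch of reading a holdout. It assumes users were assigned to the holdout at random, uses a plain normal approximation for the interval, and runs on invented counts; the function name `lift_with_interval` is mine, not a library call.

```python
import math


def lift_with_interval(treated_done, treated_n, holdout_done, holdout_n, z=1.96):
    """Difference in completion rate (treated minus holdout) with an
    approximate 95% interval from the normal approximation."""
    p_t = treated_done / treated_n
    p_h = holdout_done / holdout_n
    se = math.sqrt(p_t * (1 - p_t) / treated_n + p_h * (1 - p_h) / holdout_n)
    lift = p_t - p_h
    return lift, lift - z * se, lift + z * se


# Hypothetical counts: the treated group reached 75% completion,
# but the holdout also drifted up from 60% to 68% on its own.
lift, low, high = lift_with_interval(
    treated_done=750, treated_n=1000,
    holdout_done=680, holdout_n=1000,
)
print(f"lift={lift:.3f}, interval=({low:.3f}, {high:.3f})")
```

In this made-up case a before-and-after reading would claim fifteen points. The holdout says roughly seven are yours and the rest would have happened anyway.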

Make reliability's value survive peacetime

Reliability work has a cruel property: it is obviously valuable in wartime and invisible in peacetime. Right after an outage, when everyone has just felt the pain, nobody questions the investment. Six calm months later, when nothing has broken, that same investment looks like it is guarding against nothing, and it is the first thing cut. This is the prevention paradox, and it is where reliability-minded engineering leaders quietly lose budget.

Make the value legible when no incident is fresh. Three moves do most of the work:

  1. Hold a standing churn-cohort ratio. Measure it from your own data: users who hit an error or a slow page or an outage churn at some multiple of the rate of users who do not. Say it comes out to three times. Once you have that ratio, reliability has a permanent, quantified tie to retention that does not need a fresh outage to justify it.
  2. Carry the last incident's cost as a unit price. When an incident happens, compute its true cost once, including lost revenue, support load, and accounts put at risk, then carry that figure into peacetime planning as the standing unit price of that failure mode, so the cost is quoted rather than re-argued every quarter.
  3. Log near-misses as visible outcomes. Record the incidents caught before users noticed, the degradations that auto-recovered, and the error budget you burned but contained.

This is attribution by subtraction, and it generalizes to every infra and platform team whose real output is things that did not happen, hours that were not lost, cost that was removed. Framing reliability as a product feature is its own essay; what I am after here is the attribution mechanics.
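The churn-cohort ratio from the first move is a small calculation once the two flags exist per user. A minimal sketch, assuming hypothetical `hit_incident` and `churned` fields and synthetic data chosen so the ratio lands on the three-times figure used above:

```python
def churn_rate(users):
    return sum(1 for u in users if u["churned"]) / len(users)


def churn_cohort_ratio(users):
    """Churn rate of users who hit an error, slow page, or outage,
    divided by the churn rate of users who did not."""
    affected = [u for u in users if u["hit_incident"]]
    unaffected = [u for u in users if not u["hit_incident"]]
    if not affected or not unaffected:
        raise ValueError("need users in both cohorts")
    base = churn_rate(unaffected)
    if base == 0:
        raise ValueError("unaffected cohort has no churn to compare against")
    return churn_rate(affected) / base


# Synthetic data: 15 of 100 affected users churned, 45 of 900 unaffected.
users = (
    [{"hit_incident": True, "churned": i < 15} for i in range(100)]
    + [{"hit_incident": False, "churned": i < 45} for i in range(900)]
)

ratio = churn_cohort_ratio(users)
print(f"affected users churn at {ratio:.1f}x the baseline rate")
```

Treat the result as a correlation with the usual caveats. The users who hit errors may differ from the rest in ways that also drive churn, so the ratio is a standing argument, not a proof.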

Engineer the time-to-truth

How long it takes to find out whether something worked is a design variable you can shorten on purpose.

Instrument the outcome before you launch, as part of the change itself, so the clock starts at release instead of whenever someone finally remembers to add tracking. Where you have a choice, prefer outcomes that move in weeks over outcomes that move in quarters, because a faster loop lets you correct course while the work is still warm. Ship a thin slice to a small set of users to get an early read before you commit the whole team to the full build. The goal is to compress the gap between "we shipped" and "we know," because every day in that gap is a day you are steering blind.
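You can estimate the gap before you ship. The sketch below is a back-of-envelope estimate under stated assumptions: a textbook two-proportion sample-size approximation at roughly 95% confidence and 80% power, an equal-sized holdout, and invented traffic numbers. The function names are hypothetical.

```python
import math


def users_per_arm(p_base, p_target, z_alpha=1.96, z_beta=0.84):
    """Approximate users needed in each group to detect a move
    from p_base to p_target (normal approximation)."""
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_target - p_base) ** 2)


def days_to_truth(p_base, p_target, daily_users, slice_fraction):
    """Days until both groups are large enough, when slice_fraction of
    daily users gets the change and an equal share is held out."""
    if not 0 < slice_fraction <= 0.5:
        raise ValueError("slice_fraction must be in (0, 0.5]")
    per_arm_per_day = daily_users * slice_fraction
    return math.ceil(users_per_arm(p_base, p_target) / per_arm_per_day)


# Hypothetical traffic: 2,000 eligible users a day, 10% in the thin slice.
big_move = days_to_truth(0.60, 0.75, daily_users=2000, slice_fraction=0.1)
small_move = days_to_truth(0.60, 0.62, daily_users=2000, slice_fraction=0.1)
print(f"60% -> 75%: about {big_move} day(s); 60% -> 62%: about {small_move} days")
```

The estimate makes the trade visible: a small expected move on a thin slice can take weeks to read, which is worth knowing before you commit to it as the outcome.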

Speed is not the only thing that breaks the loop. The outcome itself has to hold still long enough to read. When I first moved teams onto outcomes, the outcomes kept moving underneath them: one month the target was revenue per user, the next month pricing became the priority and the goal shifted before any signal had come back. You cannot measure against something that changes faster than your time-to-truth. So we started committing each outcome for a mid-term horizon, three to six months, long enough for the loop to close at least once before anyone was allowed to move the target. On anything shorter, the team is not steering toward anything, just reacting to whatever is loudest that week.

There is a real boundary here too. Some outcomes genuinely cannot be known inside a quarter, and pretending otherwise produces worse decisions than admitting it. When the outcome is truly unknowable on your planning horizon, that is a different problem with its own discipline: the territory of long-horizon bets.

The outcome signal is what aims the agents

Now back to the thing I asked you to hold.

When code was expensive, measurement could be treated as a reporting chore, a thing you did at the end to write the quarterly review. That era is ending. As agents drive the cost of generating code toward zero, the bottleneck moves from production to validation and comprehension. An agent can only aim at what you can measure for it. Hand it a tight outcome loop and it improves the result; hand it nothing and it optimizes for looking done. The leverage is no longer in generating the code; it is in the feedback loop.

This reframes everything above. The line-of-sight you build from change to outcome is not a scorecard you show your manager, but the steering input for the entire system, human and machine. A team with a tight outcome loop can point cheap, fast output at the things that actually move the world and turn away from the things that do not. A team with no outcome line-of-sight does not just lose a scorecard in the agent era. It loses its steering entirely, and it starts manufacturing reviewed, well-tested work that nobody can confirm ever mattered, at machine speed. That is the real stake. Outcome measurement used to be how you proved your worth after the fact, but it is becoming how you aim the team at all. Who answers for the outcome once you can see it is a separate question, and it belongs to owning the outcome. You cannot own an outcome you cannot observe, which is why this comes first.