What to measure: outcomes, flow, quality, humans

Every six months a board or a CEO asks engineering for "one number." Pick one: deployments per week, lead time, bugs per release. The ask is reasonable, because leadership wants to know if engineering is working. The problem is that engineering work does several distinct things at once, and they move independently:

A team that ships fast can be shipping the wrong things.
A team that nails outcomes can be doing it through heroics that burn out next quarter.
A team with no incidents can be paralysed by review.
A happy team can have stopped doing the work.

A leader's first move is to refuse the question. Engineering performance is not one number. It is multiple distinct classes: each class a different question you can ask about the team, each one able to move on its own without dragging the others with it. The discipline is to hold the classes apart and watch them separately, never collapsed into one.

The tempting compromise is the balanced scorecard. You keep all the metrics, weight them, and roll them up into a single composite number so leadership still gets its one-line dashboard. That move smuggles back the exact question you refused. The single number hides which class is climbing and which is being sacrificed to fund the climb, which was the whole reason for keeping the classes separate in the first place.

Week to week, a leader watches the trade-off: which class is climbing this week, and which one is being sacrificed to fund the climb.

Name the classes

There are four:

Outcomes. Did the work change anything for users or the business? Adoption, retention, revenue per cohort, feature usage. Engineering rarely owns the data; it lives in product analytics and business intelligence tooling. The slowest of the four to read, and the hardest to credit to any specific team.
Flow. Does work move smoothly from idea to running code? Deployment frequency, lead time, change-failure rate, recovery time, cycle time, throughput, work-in-progress. The DORA framework's four key metrics live here and nowhere else.
Quality. Does what gets shipped stay shipped? Incident rate, escaped defects, rollback rate, customer-reported issues. Different from change-failure rate, which is a smaller slice that lives inside flow. (Defining done sits upstream of measuring it.)
Humans. Are the people doing the work sustainable? Retention, sentiment in 1:1s, leading indicators of burnout, survey-based developer experience (the DevEx framework names three dimensions: feedback loops, cognitive load, flow state).

The three best-known engineering-measurement frameworks all fit inside the four classes. DORA, the deployment-flow framework that gave us deploy frequency and lead time, covers flow alone. SPACE, the academic five-dimension framework (satisfaction, performance, activity, communication, efficiency), collapses into outcomes plus flow plus humans. DevEx is the humans class made concrete through surveys. The four classes are what a working leader actually uses; the frameworks are inputs to them.

See how each focus breaks the others

Prioritising one class always hurts the others, and the pattern is predictable.

Flow-only. Deployment and lead-time targets are met. Humans burn out because velocity is paid for in evenings. Quality erodes invisibly until incidents arrive. Outcomes miss anyway because shipping faster guarantees nothing about shipping the right thing. The classic failure of a DORA-only shop, and the human cost hidden inside velocity gets its own essay.
Outcomes-only. Big quarterly launches built on heroic effort. Flow becomes a sprint stacked on a sprint. Quality is hacks with TODOs in the code. Humans pay. Next quarter starts slower because you're starting from technical debt and tired engineers.
Quality-only. Over-engineering. Paralysis by review. Every change treated as critical. Flow stalls. Engineers get frustrated by a bar they cannot clear. Outcomes lag because nothing ships.
Humans-only. Drift into comfort. Everyone sustainable, nobody stretched. Flow weakens, outcomes drift toward "we kept the lights on," quality goes unmonitored.

The pattern works in every direction. Each class keeps the others honest, and you watch all four to see which one is climbing and which is paying for it.

Humans is the class most teams neglect by default. When team health slips, the other three classes follow soon after, and yet humans is almost always the one with no number next to it on a Monday morning. Expect humans to be the class you spend most of your time correcting toward.

Wire outcomes deliberately

Outcomes is the class engineering teams are systematically worst at instrumenting. The "wire" here is the chain that connects a feature ship on engineering's side to a moved business or user metric on the other side: who measured it, when, against what claim. That chain runs through product, analytics, and customer-facing teams, and it never closes itself. Three things keep it broken.

Latency. A feature ships today. Whether it actually moved the metric shows up six to twelve weeks later. By then, the team has moved on.
Attribution. Four teams ship in a quarter, a metric moves, nobody can say which team moved it. An explicit holdout (a control group of users who do not get the feature, so you can compare) could answer the question. Companies almost never run them.
Ownership. Outcome data lives in product, sales, customer success. Engineering reads its own dashboards. Getting outcome data in front of engineering takes a handoff between teams, and when the team is busy, that handoff is the first thing to drop.
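The holdout comparison mentioned under attribution comes down to simple arithmetic. A minimal sketch in Python, assuming the metric is a conversion-style rate and that counts for both groups are already available. The function name and the numbers are hypothetical, and a real analysis would also check whether the difference is larger than noise:

```python
def holdout_lift(treated_converted, treated_total, holdout_converted, holdout_total):
    """Difference in conversion rate between users who got the feature and the holdout."""
    if treated_total <= 0 or holdout_total <= 0:
        raise ValueError("both groups need at least one user")
    treated_rate = treated_converted / treated_total
    holdout_rate = holdout_converted / holdout_total
    return treated_rate - holdout_rate


# Hypothetical counts: 1,000 users in each group.
lift = holdout_lift(120, 1000, 100, 1000)
print(f"lift: {lift:+.1%}")
```

Because the holdout group never saw the feature, the difference can be credited to that feature rather than to everything else that shipped in the quarter.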

Close the wire deliberately:

Name the metric at release time. No feature ships without naming the number it should move, the direction (up or down), and the time window you will measure over. Engineering writes this rather than product, because engineering is the one being measured against it.
Run a ninety-day outcome review. Pull every feature shipped roughly ninety days ago and compare it against the metric it claimed it would move. Most reviews show a gap between claim and reality, and that gap is what you are learning from.
Give engineering its own analyst capacity. Dedicated analyst time inside engineering rather than borrowed from product. The team that ships gets to see what their shipping actually did.
Use LLMs to trace ship-to-outcome chains. A PR closes a ticket; the ticket names a metric; an LLM follows the trace and scores whether the metric moved in the expected window. The link that used to take a meeting now runs as a quiet pipeline in the background.
Slice users by ship-date. Segment users by which version they first encountered and see whether the metric moved after the fact. Weaker than running a controlled experiment for proving cause, but good enough to know which way things are moving.
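The first two items in this list, naming the metric at release time and reviewing it ninety days later, can be sketched as a small record plus a filter. This is a minimal sketch, not a prescribed schema: `ReleaseClaim`, `due_for_review`, `claim_held`, and the seven-day slack around ninety days are all hypothetical choices, and the before and after readings are assumed to come from whatever analytics tooling holds the metric.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReleaseClaim:
    feature: str
    metric: str        # the number the feature should move
    direction: str     # "up" or "down"
    window_days: int   # how long after shipping the metric is measured
    shipped_on: date


def due_for_review(claims, today, age_days=90, slack_days=7):
    """Return claims shipped roughly age_days before today."""
    return [
        c for c in claims
        if abs((today - c.shipped_on).days - age_days) <= slack_days
    ]


def claim_held(claim, before, after):
    """True if the metric moved in the claimed direction."""
    if claim.direction == "up":
        return after > before
    if claim.direction == "down":
        return after < before
    raise ValueError(f"unknown direction: {claim.direction!r}")


# Hypothetical example data.
claims = [
    ReleaseClaim("saved searches", "weekly active users", "up", 60, date(2024, 3, 3)),
    ReleaseClaim("faster checkout", "cart abandonment rate", "down", 30, date(2024, 5, 1)),
]
for claim in due_for_review(claims, today=date(2024, 6, 1)):
    print(claim.feature, claim_held(claim, before=10_400, after=10_150))
```

A claim that did not hold is not a failure of the review. The gap between claim and reading is the thing the review exists to surface.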

For more on owning outcomes, see tying engineering to outcomes and owning the outcome, not just the output. Outcomes is the class most often dropped from what gets watched, and the leader's job is to wire it in deliberately rather than assume product will hand it over.

Grade what the team already writes

Teams measured the wrong things for a decade, but not because they thought LOC and PR count were good signals. Everyone knew those were bad. The reason was that the alternatives required heavy process: ticket hygiene, story points, time tracking, weekly status writeups. Better signal cost every engineer time every day, and the time spent producing it was time not spent doing the work being measured. Teams accepted cheap proxies they knew were lying.

LLMs collapsed that trade-off. The team already produces artefacts: PR diffs, design docs, 1:1 notes, decision logs, code itself. An LLM reads each one and scores it on substance, clarity, risk, change quality. Nobody fills in one more field; the signal arrives anyway.

Take CWT (Complexity Weighted Throughput) as the worked example, defined in instrumenting without Goodharting. Goodharting is what happens when a metric becomes a target and stops being a useful measure of the thing you cared about. CWT reads the diff of every pull request and scores it on substance instead of size. CWT and LOC read the same artefact; CWT judges the change, LOC counts it. Writing more lines is cheap, but writing a sequence of PRs that an analyser consistently rates as substantive is harder to fake. The cheapest way to score well is to do substantive work.

Signal should arrive without extra work from engineers. The moment a measurement requires them to do meaningful extra work to produce it, you are paying for the measurement with the very thing you wanted to measure.

The same approach works on every class. Outcomes can be graded from PR-to-feature traces and adoption logs. Quality can be graded from incident reports, postmortems, and the code itself. The bottleneck has moved from "what can we count" to "what do we want to know."

Read humans through artefacts

Humans is the class that gains most from this shift. It has been starved of cheap, low-effort signal the longest, and the LLM era opens new patterns to watch:

Tone in chat and email. A recent thread scored for sentiment, frustration, withdrawal. The shift from energetic to terse is detectable weeks before the resignation conversation.
Review-comment drift. An engineer whose code-review comments turn from collaborative to clipped, or whose reply latency on a teammate's PR stretches from two hours to two days. A survey will not catch this; the code reviews themselves do.
1:1 note grading. A leader's own 1:1 notes, read as a batch across weeks, surface topics that keep coming up without resolution, and differences between direct reports (one person's notes are full of growth and career discussion, another's are just status updates).
Help-seeking shift. An engineer who used to ask three questions a week in the team channel and now asks zero, or who asks only the AI and never the team. Visible in unstructured logs without anyone reporting anything.
Self-report versus artefact gap. What an engineer says in 1:1 ("things are fine") graded against PR cadence, late-night commit pattern, comment terseness. The gap is the signal.

Signals graded from communication artefacts edge into surveillance fast. Use them as triggers for a human conversation rather than as dashboards. The difference between watching a team and surveilling it is what the leader does with the signal: a private nudge in a 1:1 is fine; a metric on a board slide is what this essay is trying to prevent. (What not to measure draws the line; leading indicators of burnout covers the response once the signal lands.)

Fold agents into the four classes

Agents are now producers of code and tickets. The reflex is to invent a fifth class called "agent productivity." Resist. Agents do not change the classes; they just add new things to measure inside each one.

Flow. Agent task throughput. Prompt-to-merged-PR time (the agent version of lead time: from the moment a human writes the prompt to the moment the resulting PR lands). Agent rerun rate, meaning how many times a human re-prompts the agent before accepting the output, which proxies how good the original spec was. Queue depth at human-in-the-loop gates, because humans are now the bottleneck and work piles up waiting for review.
Quality. Agent-induced incident rate. Override rate at gates: too low is rubber stamp, too high is the agent doing work that should not be delegated. Regression rate in agent-touched areas. Silent-failure detection coverage, because agents fail in ways that look correct, and the signal is whether monitoring catches what review missed.
Outcomes. Cost-per-outcome, including token spend plus supervisory time per shipped business result. Adoption of agent-shipped features tracked separately from human-shipped ones.
Humans. What fraction of an engineer's week is hands-on judgment versus agent supervision. Skill-ladder drift for juniors: the small, repeated work that used to build engineering judgment (writing the first draft, debugging the easy bug, reading the unfamiliar file) is exactly what agents now do, so juniors stop accumulating those reps and climbing the engineering ladder takes longer. Whether engineers feel like their judgment is what matters, or feel like they are just reviewing a system they no longer own.
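Two of the measures above, agent rerun rate and override rate at gates, reduce to simple ratios once each agent task is logged. A minimal sketch, assuming a per-task record of how many prompts a human wrote and whether the output was overridden at the gate. `AgentTask` and the function names are hypothetical, and the `low` and `high` cut-offs are placeholders to be tuned per team, not recommended values:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentTask:
    prompts: int       # prompts a human wrote for this task, including the first
    overridden: bool   # True if the human rejected or rewrote the output at the gate


def rerun_rate(tasks):
    """Average number of re-prompts per task. Assumes a non-empty list."""
    return sum(t.prompts - 1 for t in tasks) / len(tasks)


def override_rate(tasks):
    """Fraction of tasks overridden at the gate. Assumes a non-empty list."""
    return sum(1 for t in tasks if t.overridden) / len(tasks)


def read_override_rate(rate, low=0.02, high=0.40):
    """Label an override rate; the low and high cut-offs are placeholders."""
    if rate < low:
        return "possible rubber stamp"
    if rate > high:
        return "possibly over-delegated"
    return "in range"


# Hypothetical week of agent tasks.
week = [AgentTask(1, False), AgentTask(3, True), AgentTask(2, False), AgentTask(2, False)]
print(rerun_rate(week), override_rate(week), read_override_rate(override_rate(week)))
```

Both ratios sit inside existing classes, rerun rate in flow and override rate in quality, so neither needs a fifth class to hold it.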

The trap is counting PRs opened by agents or tickets closed by agents and calling that productivity. The cheaper an artefact is to produce, the worse counting it works as a signal. Agents push the cost of producing PRs and tickets close to zero. Counting them measures the agent's throughput rather than the team's. Look at what the agent's output does inside the four classes.

Allow single-class focus, but only with a written exit

Single-class focus is sometimes the right call as a bounded period, but treating it as permanent is the trap.

Write down three lines for any deliberate period: what is being underweighted, what trigger opened the period, what condition closes it. Without those three lines, the period is exactly the failure it was supposed to avoid.

One example per class:

Flow-only: pre-launch window. Two to four weeks before a launch you have publicly committed to. The team has agreed in advance that humans and quality will absorb the cost. Exit: the launch ships, and the team gets time to decompress.
Quality-only: post-incident lockdown. Two weeks of no new feature work after a customer-impacting outage. Fix the class of bug the incident exposed. Exit: a rehearsal of the same failure scenario passes cleanly.
Humans-only: post-attrition stabilisation. Eight to twelve weeks after two or more senior engineers leave in quick succession. The team rebuilds capacity and ramps new hires before new commitments. Exit: new hires are at independent-contributor pace and the team structure has been redrawn around who is left.
Outcomes-only: product-market-fit search. A pre-revenue startup where flow and quality are deliberately underweighted because the question on the table is whether the thing matters at all. Exit: paying users whose behaviour gives a clear answer either way.

What goes wrong: the period extends past its exit because nobody wrote one down, and a bounded call quietly becomes the culture. Watch for the leader who says "next quarter we'll get back to humans" three quarters in a row.
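One way to make the three written lines hard to skip is to treat them as required fields. A minimal sketch, assuming the period is recorded somewhere a script can read. `BoundedPeriod` and its field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundedPeriod:
    focus: str            # the class being prioritised
    underweighted: str    # what is deliberately getting less attention
    trigger: str          # what opened the period
    exit_condition: str   # the condition that closes it

    def missing_lines(self):
        """Names of the three required lines that were left blank."""
        required = {
            "underweighted": self.underweighted,
            "trigger": self.trigger,
            "exit_condition": self.exit_condition,
        }
        return [name for name, text in required.items() if not text.strip()]


# Hypothetical period with no exit written down.
period = BoundedPeriod(
    focus="quality",
    underweighted="flow, outcomes",
    trigger="customer-impacting outage",
    exit_condition="",
)
print(period.missing_lines())
```

A period whose `missing_lines()` is not empty is the open-ended case described above, and it is the one to challenge before it starts.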

Watch the tension between the classes

You keep all four classes in view on a Monday morning to read the tension between them, then decide which class to push and which to relieve this week. The signal is which way the numbers are sliding relative to each other, not the absolute number on any one dial.
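Reading relative movement can be sketched as follows. This assumes, purely for illustration, that the leader keeps one reading per class on a scale where higher is better. The function names, the 0.05 tolerance, and the readings are hypothetical, and the four classes are kept as separate entries and never summed:

```python
def direction(previous, current, tolerance=0.05):
    """Classify a change as climbing, flat, or slipping."""
    change = current - previous
    if change > tolerance:
        return "climbing"
    if change < -tolerance:
        return "slipping"
    return "flat"


def trade_offs(previous, current, tolerance=0.05):
    """Pairs (climbing class, slipping class) that suggest one is funding the other."""
    directions = {
        name: direction(previous[name], current[name], tolerance)
        for name in previous
    }
    climbing = [name for name, d in directions.items() if d == "climbing"]
    slipping = [name for name, d in directions.items() if d == "slipping"]
    return [(up, down) for up in climbing for down in slipping]


# Hypothetical readings for two consecutive weeks.
last_week = {"outcomes": 0.5, "flow": 0.5, "quality": 0.5, "humans": 0.5}
this_week = {"outcomes": 0.5, "flow": 0.7, "quality": 0.5, "humans": 0.3}
print(trade_offs(last_week, this_week))
```

Each pair is a prompt for a decision, push or relieve, not a score in its own right.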

Watch for this trap: a team that proudly tracks all four but rolls them into a balanced scorecard has rebuilt the same failure in a more sophisticated wrapper. The single number hides the tension. The point of watching four is to see what is being traded for what.

Prefer signals that read artefacts the team already produces. Resist signals that demand ticket hygiene, story points, time tracking, weekly status writeups, or any field nobody fills honestly twice in a row. Better signal used to require heavier process; that excuse is gone. LLMs grade what the team already writes, so measurement costs the leader once instead of every engineer every day.

Four classes earn their place when they change a leadership decision. The test is what the leader does on Monday morning: if the four numbers produce no decision (no class to push, no class to relieve), the leader is keeping score instead of diagnosing. Choosing what makes it onto a dashboard and what stays off is the subject of dashboards that actually get read.

Starter kit

Monday-morning read of all four classes

Class	What I'm seeing this week	Direction (climbing, flat, slipping)	Action this week (push, relieve, hold)
Outcomes (did the work change anything for users or the business?)	___	___	___
Flow (does work move smoothly from idea to running code?)	___	___	___
Quality (does what shipped stay shipped?)	___	___	___
Humans (are the people sustainable?)	___	___	___

If a class is in a bounded single-class period, write the three lines:

Underweighted: ___ (which other classes are getting deliberately less attention)
Trigger: ___ (what opened this period)
Exit: ___ (the condition that closes it)

Run this when: sitting down to plan the week's leadership priorities, before the first leadership sync.

If no decision comes out of this, the time is better spent elsewhere.