Sarthak Garg

Instrumenting without Goodharting

Measure outcomes, not activities. Design signal that resists gaming.

Most engineering metrics that fail do not fail because the team cheated. They fail because the metric was designed in a way that made cheating the cheapest path. The entries in the classic catalogue (lines of code, pull requests merged, tickets closed, story points completed) all have the same shape: a count of surface artefacts that loosely correlates with output, until somebody starts measuring it and the correlation breaks. The list to refuse outright is its own essay; this one is about everything you do choose to instrument.

The standard advice on this is to measure outcomes rather than activities, and to use multiple metrics so no single number can be pumped. Both are correct, and both are also toothless, because they describe where you want to end up rather than the move that gets you there. A team that wants a dashboard tomorrow does not get one from "measure outcomes"; it gets one from a specific design discipline applied to every number it considers publishing.

The reframe that helps: gameability is a design property of the metric rather than an integrity problem in the team. If the metric is gameable, the team will adapt around it. That is not a failure of character; it is the metric being broken by design.

What follows are five moves I use when instrumenting an engineering team. They are ordered roughly by when they get applied, but in practice you cycle through them.

Name the outcome and the decision

Before instrumenting anything, name two things out loud: the outcome the number is a proxy for, and the decision that will change when the number moves. If the second is missing, the metric is theatre.

A common version of this failure is the code review SLA measured as time-to-first-response. The intended outcome is shorter merge cycles; the published number is acknowledgement latency, with no decision attached to it. Nobody re-assigns reviewers, nobody investigates a slow reviewer. So the team adapts to the only thing the number rewards. Reviewers leave a one-line ack within the hour and disappear. The SLA looks healthy. Merge times do not move. The instrument worked exactly as designed; the design was wrong.
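One way to make the two questions hard to skip is to require both fields on every metric definition before it is published. The sketch below is illustrative only: `MetricSpec`, its field names, and the wording of the example decision are assumptions made for this example, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """A candidate metric plus the two things it must name before it ships."""

    name: str
    outcome: str   # what the number is a proxy for
    decision: str  # what will change when the number moves

    def is_theatre(self) -> bool:
        # With no decision attached, nothing changes when the number moves.
        return not self.decision.strip()


# The review SLA from the paragraph above: an outcome, but no decision.
review_sla = MetricSpec(
    name="time_to_first_response",
    outcome="shorter merge cycles",
    decision="",
)

# A hypothetical alternative that names both.
merge_time = MetricSpec(
    name="open_to_merge_hours",
    outcome="shorter merge cycles",
    decision="re-assign reviewers when the median drifts above target",
)

print(review_sla.is_theatre())  # True
print(merge_time.is_theatre())  # False
```

The check is deliberately crude. It cannot tell a good decision from a bad one; it only refuses a metric that has none.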

Run the cheapest-path test

For any metric you are about to publish, ask one question: if I were rewarded purely on this number and nothing else, what is the cheapest way to move it? If the cheapest path is bad for the team, the metric is broken before it ships.

Lines of code is the canonical failure of this test. The cheapest path to move it up is to write more verbose code. The engineer who deletes two thousand lines and ships the cleanest quarter on the team looks like a negative producer; the junior who is still finding their voice looks elite.
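The arithmetic is worth seeing once. The figures below are invented for illustration, and `net_lines` is a hypothetical helper rather than any tool's real formula; the point is only how a net-lines count ranks the two engineers described above.

```python
# Hypothetical quarter: (lines added, lines deleted) per engineer.
quarter = {
    "senior_refactorer": (400, 2400),
    "verbose_junior": (3100, 150),
}


def net_lines(added: int, deleted: int) -> int:
    """Net lines of code: the naive 'output' number."""
    return added - deleted


scores = {name: net_lines(*counts) for name, counts in quarter.items()}

for name, score in scores.items():
    print(name, score)
# senior_refactorer -2000
# verbose_junior 2950
```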

Velocity points are a slower failure of the same test. The cheapest path to deliver more points is to inflate what a point means. A "three" quietly becomes a "five" within a quarter; the team's velocity stops predicting the team's capacity, and comparing across teams stops meaning anything at all.

The positive case is when the cheapest path to move the number is the thing you wanted. Customer-reported defects attributed to the surface a team owns is one example: the cheapest way to bring that number down is to ship fewer regressions on that surface, and the source of truth lives outside the team, so the number is hard to manipulate from inside. Time-from-merge-to-production under a fully automated pipeline is another: the cheapest way to move it down is to improve the pipeline. The metric and the work are the same shape.
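As a rough sketch of the second example, merge-to-production time can be computed from two timestamps per change. The timestamps below are made up, and the function name is an assumption; in practice both values would come from whatever the pipeline records, which is what keeps the number hard to move by hand.

```python
from datetime import datetime
from statistics import median

# Hypothetical (merged_at, deployed_at) pairs emitted by an automated pipeline.
deploys = [
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 10, 42)),
    (datetime(2024, 3, 4, 15, 30), datetime(2024, 3, 4, 16, 5)),
    (datetime(2024, 3, 5, 9, 15), datetime(2024, 3, 5, 11, 15)),
]


def merge_to_production_minutes(pairs):
    """Minutes between merge and production deploy for each change."""
    return [
        (deployed - merged).total_seconds() / 60
        for merged, deployed in pairs
    ]


minutes = merge_to_production_minutes(deploys)
print(minutes)          # [42.0, 35.0, 120.0]
print(median(minutes))  # 42.0
```

The median is used here instead of the mean so that one slow deploy does not dominate the figure; that choice is an assumption of this sketch, not a recommendation from the text.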

Judge the work

The reason the field tolerated lines of code for decades is not that anyone defended it; it is that the alternatives were process-heavy. Tickets, points, burndowns, time-on-task all cost real hours to maintain and got gamed anyway. Counting was the only affordable instrument, so the field used it long past its expiry date.

That constraint has lifted. Models can now read a pull request, look at the diff, and produce a judgment of substance with reasonable consistency. The Goodhart problem was historically a counting problem because counting was cheap and judging was expensive. Judging is now cheap.

I built a tool on this premise, called Complexity Weighted Throughput, or CWT. It reads the diff of every pull request across every repo and scores each PR on substance rather than size. Sum per developer for an individual signal; slice by platform or org for a team signal. Both CWT and lines of code derive from the same artefact (code), but CWT judges the change instead of counting it. Writing more lines is cheap. Writing a sequence of pull requests an analyzer consistently rates as substantive is harder to fake, because the cheapest way to score well on it is to do substantive work.
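The rollup step described above can be sketched as follows. This is not CWT's implementation: the `ScoredPR` record, the `substance_score` field, and every number in the sample data are hypothetical stand-ins for whatever an analyzer would actually emit. The sketch only shows the "sum per developer, slice by team" shape and how it can disagree with a size count.

```python
from collections import defaultdict
from typing import NamedTuple


class ScoredPR(NamedTuple):
    author: str
    team: str
    lines_changed: int      # the counted signal
    substance_score: float  # the judged signal (hypothetical analyzer output)


def rollup(prs, key):
    """Sum substance scores grouped by a ScoredPR field such as 'author' or 'team'."""
    totals = defaultdict(float)
    for pr in prs:
        totals[getattr(pr, key)] += pr.substance_score
    return dict(totals)


# Invented sample: a large low-substance change next to small high-substance ones.
prs = [
    ScoredPR("ana", "payments", 1800, 1.0),
    ScoredPR("ana", "payments", 40, 0.5),
    ScoredPR("bo", "payments", 120, 6.0),
    ScoredPR("cy", "search", 300, 3.5),
]

print(rollup(prs, "author"))  # {'ana': 1.5, 'bo': 6.0, 'cy': 3.5}
print(rollup(prs, "team"))    # {'payments': 7.5, 'search': 3.5}
```

In this made-up data the author with the most lines changed has the lowest judged total, which is the disagreement between counting and judging that the paragraph above describes.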

The honest caveat: I do not know whether CWT is un-gameable or just not-yet-gamed. The right framing for any judged metric is the same as for any piece of code: it will need to be hardened, and the team will adapt to it in ways the designer did not anticipate. The claim is not that a judged metric cannot be gamed; it is that the design space has opened up.

The same reframe applies to delivery-framework scorecards. A team can post improving deploy frequency and lead time for a quarter while quietly destroying itself. Counting events does not catch the cost of producing them. A judged signal (did this on-call rotation leave somebody usable on Monday) is more expensive to compute, but the cost is now within reach.

Report at the team level

Individual rollups exist for two purposes: the one-on-one conversation between a manager and the person, and the leader's own pattern-matching. They do not belong on leadership dashboards. The case against individual productivity dashboards has been made for forty years by the people who tried hardest to build them, and the conclusion is that they do not survive contact with humans. Treat the team as the unit of measurement until you have a specific reason not to. The individual signal is a manager's input rather than the org's output.

Audit the metric like code

Gameability is not a property a metric has once and for all. It is a property the metric has against the current behaviour of the team, and team behaviour adapts. A metric that was honest in Q1 can be theatre by Q3 with no one noticing, because adaptation is gradual and the number stays in roughly the same range.

Re-run the cheapest-path test every quarter or two against the current state of the team. If the cheapest path has migrated to somewhere unhealthy, redesign the metric or drop it. Treat the metric like any other piece of infrastructure; it has a maintenance schedule.
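A maintenance schedule can be as small as a date per metric and a check that flags the stale ones. In the sketch below the metric names, the audit dates, and the 180-day interval (one reading of "every quarter or two") are all assumptions chosen for illustration.

```python
from datetime import date, timedelta

# Assumption: treat "every quarter or two" as roughly 180 days.
AUDIT_INTERVAL = timedelta(days=180)

# Hypothetical record of when each metric last had its cheapest-path test re-run.
last_audited = {
    "merge_to_production_minutes": date(2024, 1, 15),
    "customer_reported_defects": date(2024, 6, 3),
}


def overdue(audits, today, interval=AUDIT_INTERVAL):
    """Names of metrics whose last audit is older than the interval."""
    return sorted(
        name for name, when in audits.items() if today - when > interval
    )


print(overdue(last_audited, date(2024, 9, 1)))
# ['merge_to_production_minutes']
```

The check only says that a review is due. Re-running the cheapest-path test itself is still a judgment made by people who know the team's current behaviour.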

When not to instrument

Some domains carry a cost of getting the design wrong that exceeds the value of the signal:

  • Anything that becomes an individual performance number.
  • Anything that drives compensation or ranking.
  • Anything where the act of measuring changes the behaviour you were trying to observe.

The strongest move in those domains is to refuse the instrument and lean on judgment from the people closest to the work. The next strongest is to keep the instrument private to the manager and never publish it.

The AI-era replay of lines of code is the live example. Acceptance rate of AI-generated suggestions is being reported as impact, at roughly the same level of seriousness commit counts had twenty years ago. Accepting a suggestion is not shipping working code, and accepting fast is not understanding what was accepted. If the next decade of engineering instrumentation reruns the LOC mistake one more time, it will not be because the design was hard but because the moves above were not applied.