Time-boxing and kill criteria · Engineering Playbook

The hard part of stopping a project is never the decision. It is the timing. By the time you are ready to admit a bet is not working, the team has already spent three months on it, the demo is half-built, and pulling the plug now means standing in front of everyone and calling that time wasted. So you do not pull it. You ask for two more weeks, then two more.

This is the trap a retro cannot fix. A retro happens after, when the spend is already sunk and the only honest verdict is the one that costs you the most to say out loud. The version of you sitting in that room is the worst possible person to judge whether to continue, because that person is defending a choice they already made.

Move the decision earlier, to the one moment you are not yet invested: before the bet starts. A kill criterion is a line you write down in advance that tells a later, more compromised version of you when to stop. That is the whole reason it beats any retro. It is not smarter. It is just written by someone with nothing yet to defend.

Write the line before the bet starts

The reason this works has a name. Once you have publicly committed to a project and poured effort into it, every additional week makes stopping harder. Admitting the bet failed now means admitting the last month was also a mistake, so the cheaper move, emotionally, is always to keep going and hope. People call this escalation of commitment, and it is not a character flaw. It is the default behaviour of anyone with a stake in the outcome of a decision.

A kill criterion is a pre-commitment aimed straight at that future self. You write it while you still have no sunk cost and no team's morale riding on the answer. Then, when the pressure arrives, you are not asking "do I want to stop?" You are asking "did the thing I already agreed would mean stop actually happen?" The first question has no clean answer. The second one does.

Separate the clock from the verdict

Most people collapse two different things into the phrase "time-box," and that is exactly where it goes wrong.

A time-box is a date, and its job is to force you to look. When it expires, the team has to stop, look at what it has, and decide. The box itself never fires the kill. It only opens the conversation where you decide.

A kill criterion is the rule that settles the verdict when you look. The clock makes you look; the criterion tells you what you are looking for.

Keep them separate and a missed box becomes a real checkpoint. Blur them and you get the most common failure I see: the time-box that rolls forward.

A team caps a research spike at five days. Day five arrives, the work is not done, and someone says "we are close, give it two more." Reasonable. Day seven arrives and the same thing happens. Nobody set a bar for what an extension has to prove, so the box never bites. It has quietly turned into an open-ended commitment with a calendar invite attached. An extendable box is fine, but only if each extension has to clear a fresh bar: what specifically will be true next time that is not true now.
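The split between clock and verdict, and the rule that an extension must clear a fresh bar, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a tool we used: the names `Checkpoint`, `review`, `outcome_bar_met`, and `extension_bar` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Checkpoint:
    """Hypothetical record of one time-box; every field name is illustrative."""
    box_ends: date                       # the clock: the date that forces the look
    outcome_bar_met: bool                # the verdict input: judged at the look, against the pre-agreed line
    extension_bar: Optional[str] = None  # what will be true next time that is not true now


def review(cp: Checkpoint, today: date) -> str:
    """Return what the team does today under this sketch's rules."""
    if today < cp.box_ends:
        return "keep working"   # the box has not expired, so no look is forced yet
    if cp.outcome_bar_met:
        return "continue"       # the look happened and the pre-agreed bar was cleared
    if cp.extension_bar and cp.extension_bar.strip():
        return "extend"         # an extension is allowed only with a fresh, stated bar
    return "stop"               # the criterion fires; the expired box only forced the look


# "We are close, give it two more" names no fresh bar, so the box bites.
spike = Checkpoint(box_ends=date(2025, 3, 7), outcome_bar_met=False)
print(review(spike, date(2025, 3, 7)))  # stop
```

The date is an invented example value. The point of the shape is that `box_ends` alone never returns "stop": it only decides whether the look happens, and the verdict comes from the other two fields.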

Commit the box and the scope, not the number

Here is where I part ways with the standard advice. The usual instruction is: name a specific number, or your criterion is useless. Hit forty percent activation, or kill it. Reach a thousand weekly users, or kill it.

I think the number is the least important part, and chasing it is often false precision. A hard threshold on a metric you have never measured before is a guess dressed up as rigour. The two things actually worth pre-committing are the time-box and the scope.

We built an AI reconciliation engine last year. The plan was deliberately bounded: a v1, then one iteration to improve it, rolled out to a small cohort, with the adoption funnel monitored. We never set a specific adoption number. The rule was simpler than that: two iterations, this cohort, and if adoption does not meet expectations by the end, we pause.

That criterion was obeyable without a number because everything around it was hard: the iterations were capped, the cohort was small, and the date was real. A monitored funnel plus honest judgment is a perfectly good stopping rule, as long as it sits inside a box and a scope that do not move.

Compare it to the criterion that does nothing: "we'll stop if it's not working." There is no date that forces the look, no cap on what gets spent first, and "working" gets judged in the moment by the people most invested in the answer. It is not a stopping rule. It is a feeling, written down.
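One way to make the contrast concrete is a small check that asks only whether the hard parts are present. The sketch below is hypothetical: `KillCriterion` and `missing_commitments` are names invented for illustration, and the example values, including the date, are made up. Note that the threshold is deliberately never required.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class KillCriterion:
    """Hypothetical shape for a written-down criterion; field names are illustrative."""
    box_ends: Optional[date] = None        # the date that forces the look
    max_iterations: Optional[int] = None   # cap on what gets built before the verdict
    cohort: Optional[str] = None           # who is exposed to the bet
    threshold: Optional[float] = None      # a specific number: optional in this sketch


def missing_commitments(c: KillCriterion) -> List[str]:
    """List the hard commitments a criterion lacks. The threshold is never checked."""
    missing = []
    if c.box_ends is None:
        missing.append("time-box")
    if c.max_iterations is None or c.max_iterations < 1:
        missing.append("iteration cap")
    if not c.cohort:
        missing.append("cohort")
    return missing


# Example values are invented for illustration.
bounded = KillCriterion(box_ends=date(2025, 6, 30), max_iterations=2, cohort="small pilot cohort")
vague = KillCriterion()  # "we'll stop if it's not working"

print(missing_commitments(bounded))  # []
print(missing_commitments(vague))    # ['time-box', 'iteration cap', 'cohort']
```

A criterion with no number passes as long as the box and the scope are filled in, which is the argument of this section in executable form.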

Cap the spend before the signal

The cohort cap matters more than the threshold, and the reconciliation engine is also where I learned the limit of my own discipline.

The criterion did its job, but what it could not fix was that v1 itself was big. It took a long time and a lot of effort to build before any adoption signal came back, and when the signal came, the accuracy was average and the adoption was poor. The time-box bounded the iterations, but it could not retroactively shrink the v1 we had already committed to.

So the cap you most need is the one on spend before signal. The small cohort was the part that saved us, because it kept the eventual climb-down cheap. But if the first version is too large to reach a signal cheaply, the kill always arrives late and expensive, no matter how clean the criterion reads on paper. Scope the first look so the bet can fail before it gets big.

Make it a contract everyone signed

A kill criterion held in the manager's head is not a criterion. It is a private opinion you will spring on people later.

The engineers building the reconciliation engine knew from the start that it was a cohort experiment that could be paused. That mattered more than I expected. When the pause came, the team had already agreed to the shape of that decision months earlier. It was not a verdict handed down on their work.

The alternative is the version that breaks people: the team pours months in believing they are building the real thing, and learns at the funeral that it was "just an experiment all along." That sentence, said for the first time on the day you kill it, is where the demoralization comes from. The criterion is an emotional contract as much as a decision rule. Everyone executing has to know upfront that the work is killable and on what line. Then a kill is something that happened, rather than something done to them.

Write the way back

The fear that keeps good criteria from being obeyed is the false negative: the bet you kill that would have been a winner. It is a real fear, and the way to answer it is not to obey the criterion less. It is to make the kill reversible.

A kill does not have to mean killing forever. When we paused the reconciliation engine, it did not go to the grave. It went to the backlog, with a note about what would bring it back: other priorities clearing, or a cheaper path to the accuracy it needed. We are planning to pick it up again. We did not destroy the work. We parked it, with its restart condition written next to it.

So write two lines: the stop condition, and the condition that would revive it. What would have to become true for this to come back off the backlog? Answer that in advance and the kill stops feeling like a verdict on the idea. You are not saying the work was worthless. You are saying not now, and here is exactly what would change "now" into "yes."

When not to obey the line

A criterion you obey blindly is its own trap, so be honest about when the line is wrong.

The distinction that matters is outcome signal versus difficulty signal. Kill on the first, and never on the second. A bet that is hard and slow is not the same as a bet that is failing; plenty of work that turned out to matter was painful in the middle. The reconciliation v1 was hard to build, and that difficulty was not the reason to pause it. The poor adoption was.

You should keep going when the outcome feedback is small but consistently improving, because that points the right way, just slowly. Airbnb is the example people reach for here. Investors passed on it, the premise of paying to sleep on a stranger's air mattress sounded absurd, and for a long time the numbers were tiny. What it had was the only signal that counts: the people who tried it came back. The slowness and the doubt were never the reason to keep it alive; the repeat use was.

Kill it when the outcome bar you pre-agreed to is missed and "it was difficult" is doing all the work in the argument to keep going.

A criterion worth trusting points at the outcome. Override it when it turns out to be measuring difficulty instead: the moment you are really just reading your own exhaustion and calling it data.
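If you do happen to have a numeric outcome reading, such as weekly repeat use, the rule can be written down so that difficulty is not even an input. This is a rough sketch, not a recommendation to add a number where judgment works: `should_keep_going` is a hypothetical name, and requiring three readings before calling something a trend is an assumption of mine.

```python
from typing import Sequence


def should_keep_going(outcome_readings: Sequence[float], bar: float) -> bool:
    """Decide from outcome readings only; how hard the work felt is not a parameter.

    Keep going if the latest reading clears the pre-agreed bar, or if the
    readings are small but strictly improving at every step.
    """
    if not outcome_readings:
        return False                      # no outcome signal at all
    if outcome_readings[-1] >= bar:
        return True                       # the bar was met
    if len(outcome_readings) < 3:
        return False                      # assumption: too few readings to call it a trend
    pairs = zip(outcome_readings, outcome_readings[1:])
    return all(later > earlier for earlier, later in pairs)


print(should_keep_going([1, 2, 3], bar=40))  # True: tiny but consistently improving
print(should_keep_going([5, 5, 5], bar=40))  # False: flat and below the bar
```

The readings here are invented. What the signature shows is the discipline: there is no argument through which "it was difficult" can enter the decision.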

Starter kit

A kill criterion you write before the bet starts

BET: <what we are testing, in one line>

TIME-BOX: by <date> we stop and look.
  This is a forced checkpoint, not a ship date. The box never kills on its own.

SCOPE CAP: <cohort size, e.g. 1 team / 50 users> and <max iterations, e.g. v1 + 1>.
  Keep the first look small enough that the climb-down stays cheap.

WATCHING: <the funnel or outcome direction we expect, e.g. activation rising week over week>.
  A specific number is optional. The box and the scope are not.

STOP IF: <the outcome read at the box that means pause, e.g. adoption flat after v2>.

RESTART IF: <what would bring it back off the backlog, e.g. priorities clear, or a cheaper path to the accuracy bar appears>.

SIGNED BY: <everyone executing — they know upfront the work is killable and on this line>.

Run this when: you green-light a bet whose payoff is uncertain, before the first commit, while you still have nothing to defend.

A criterion only binds your future self if the date and the scope are locked before the work starts and everyone building it knows the line.