Succession and single points of failure

You are looking at one engineer on your team you cannot afford to lose.

You cannot ask them to take a real two weeks off without rearranging the next quarter. You cannot give them honest feedback when they dip, because doing so would slow the team before any improvement showed up. You cannot rotate them to the new platform you are building, because you need them on the old one. You cannot promote them out of operational ownership of a system nobody else can run.

You made these staffing decisions eighteen months ago, and you never revisited them.

Sometimes the dependency is a cluster instead of a person. On platform teams, the most common shape I see is a service built by three senior engineers over a few years, sitting on top of a junior-heavy team with no real second tier. You count seniors and tell yourself the team is deep. But the seniors hold every critical piece of context among themselves, and nobody below them holds any of it. Lose any one and the platform's velocity halves; lose two and the team cannot ship.

Both shapes get the same name in most engineering vocabularies: bus factor. The standard remedies are pitched too low: pair more, review more, write more docs, rotate on-call. They are not wrong, but they miss what the leader actually does.

Succession in engineering is a constraint you hold at every staffing decision, rather than a ceremony you schedule or a rotation tax you pay. You read the map of single points of failure (SPOFs) and dissolve them through deputization, before the team pays in burnout or in a hiring scramble after someone leaves.

Read the map at every staffing decision

You update the SPOF map every time:

Someone joins.
Someone leaves.
You promote someone.
You assign a project.
A system goes from "experiment" to "central piece of infrastructure".

What you are watching:

Who answers the questions in the team channel about a given system.
Who owns the on-call rotations nobody volunteered for.
Where juniors actually get blocked, which is rarely where they say they get blocked.
Who is in every incident review.
Who reviewers default to when the change is non-trivial.

You are not counting heads at a level; you are reading distribution across levels. A team of one staff and five juniors and a team of two staff and four juniors are not "almost the same". The first is a single point of failure with five witnesses; the second has slack.
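
If you want a rough first pass at the map before you trust your own reading, you can tally the signals above per system. The sketch below is illustrative only: it assumes you have already exported events as (system, person) pairs, for example one pair per answered question, incident review attended, or non-trivial review picked up. The function names and the 0.7 threshold are hypothetical, not taken from any tool.

```python
from collections import Counter

def concentration(events):
    """Return {system: (top_person, share)} for (system, person) event pairs.

    share is the fraction of that system's events handled by its top person.
    """
    by_system = {}
    for system, person in events:
        by_system.setdefault(system, Counter())[person] += 1

    report = {}
    for system, counts in by_system.items():
        total = sum(counts.values())
        person, top = counts.most_common(1)[0]
        report[system] = (person, top / total)
    return report

def spof_candidates(events, threshold=0.7):
    """Systems where one person handles at least `threshold` of the events.

    The threshold is an arbitrary assumption; tune it to your team.
    """
    return {
        system: person
        for system, (person, share) in concentration(events).items()
        if share >= threshold
    }

# Hypothetical sample: one person dominates "payments", "search" is shared.
sample = (
    [("payments", "asha")] * 8
    + [("payments", "ben")] * 2
    + [("search", "cy")] * 3
    + [("search", "dee")] * 3
)
candidates = spof_candidates(sample)
```

A tally like this only tells you where to look. It does not see level, and it does not see where juniors actually get blocked, so treat it as a prompt for the map rather than the map.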

Triage; do not chase every SPOF

You pay a tax for every redundancy you stage:

Slower features.
The deputy's time off their own work.
The SPOF's time spent reviewing instead of building.
Your time spent designing the transfer.

You cannot afford the tax on every system, so you pick. Three filters:

Blast radius. What breaks, for whom, on what timescale, if this person goes.
Cost of redundancy. How hard it is to grow a second on this surface, in months and budget.
Time pressure. How soon they are leaving, burning out, or being pulled to a new role.

Stage two people on every critical surface; accept a single owner on the rest. You name the gaps you cannot close; you fail when you stop seeing them.
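
One way to make the three filters concrete is to sort surfaces by them and spend a fixed redundancy budget from the top. This is a sketch under stated assumptions, not a formula: the `Surface` fields, the lexicographic ordering (blast radius first, then time pressure, then cost), and the idea of a budget counted in months are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Lower rank sorts first. The ordering itself is an assumption.
BLAST_RANK = {"high": 0, "med": 1, "low": 2}

@dataclass
class Surface:
    name: str
    blast_radius: str            # "high", "med", or "low"
    redundancy_cost_months: int  # months to grow a second owner
    months_until_risk: int       # how soon the owner may leave or burn out

def triage(surfaces, budget_months):
    """Split surfaces into (stage a deputy, accept and name the gap)."""
    ordered = sorted(
        surfaces,
        key=lambda s: (
            BLAST_RANK[s.blast_radius],
            s.months_until_risk,
            s.redundancy_cost_months,
        ),
    )
    stage, accept = [], []
    remaining = budget_months
    for s in ordered:
        # Low blast radius is never staged here; it is a named, accepted gap.
        if s.blast_radius != "low" and s.redundancy_cost_months <= remaining:
            stage.append(s.name)
            remaining -= s.redundancy_cost_months
        else:
            accept.append(s.name)
    return stage, accept

# Hypothetical surfaces and an 8-month redundancy budget.
surfaces = [
    Surface("payments API", "high", 4, 6),
    Surface("internal wiki", "low", 1, 12),
    Surface("deploy pipeline", "high", 3, 2),
    Surface("reporting", "med", 5, 9),
]
stage, accept = triage(surfaces, budget_months=8)
```

The second list matters as much as the first: it is the set of gaps you have chosen not to close, written down so you keep seeing them.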

Sometimes you split the territory instead of staging a deputy. It works when the system has natural seams. Two engineers each owning one half does not give you a second owner on either half, so each half is still an honest SPOF, but the blast radius of each is smaller and losing one owner no longer takes the whole system with them. Sometimes that is the cheaper shape.

Dissolve through deputization

Docs help, but they are not redundancy. Reading a runbook and running the system in production are different skills. A team where the docs are good and only one person has actually run the system in production still has a SPOF on operations.

You cure SPOF status through deputization. Pick the second-most-senior person on the surface and start handing them critical pieces of the SPOF's work. The SPOF reviews; the deputy does. Real problems in production, with the deputy on the keyboard; workshops and scheduled pairing sessions are not enough.

The team's speed dips during the handoff, but you pay that dip and the team gets it back. Nothing else dissolves the SPOF status.

Stage rockstars and superstars differently

Not every SPOF wants the same outcome. Kim Scott's rockstar and superstar distinction is the cleanest way I know to read this.

Some senior engineers are on a gradual trajectory. They want to be central to their current role, paid well, left alone to do excellent work. They are not asking to be promoted out. These are the rockstars. For this kind of SPOF, you build redundancy around them. They stay the rock, and the deputy is insurance rather than succession.

Others are on a steep trajectory. They will be asking for more in twelve months and will leave if they do not get it. These are the superstars. For this kind, you build redundancy out of them. The deputy is succession, and your job is to free them to move up.

Same mechanic, opposite intent. Confuse the two and you signal "I am replacing you" to the rock who wanted to stay, or "I am locking you in" to the climber who wanted to move.

Break the maintenance trap deliberately

By default, the engineer who built the system gets the maintenance. They know it best, incidents triage faster when they respond, and reviews are more reliable when they handle them. Every quarter that passes, the on-call paging tree, the deprecation cleanups, the dependency upgrades, the migration consolidations all flow to them.

Because they are the SPOF, they do not have the slack to automate the maintenance away. The maintenance burns them out, and it locks them out of new work at the same time. This is the irreplaceable-engineer trap. Engineers describe it as a career trap, but leaders should describe it as a redundancy failure they are responsible for fixing.

Three live options:

Sponsor protected automation time on the calendar, instead of "when there is a gap".
Hand a chunk of maintenance to a deputy, accepting the dip.
Rotate the SPOF off the system entirely once the deputy carries it.

If you do nothing, you choose to burn them out.

Resist the over-hire reflex

Hiring does not give you redundancy. Adding a body to a team with a SPOF gives you a body; it does not give you a second person who can run the system. You build redundancy by transferring knowledge, and a new hire only sets up the transfer, which you may not need to set up yet.

Before you open a req against a SPOF risk, name two things: which existing engineer is going to teach the new hire, and which critical piece of work the new hire is going to absorb. If neither has an answer, you are buying low-output insurance, and you will be paying the salary for a year while the SPOF status stays exactly where it was.

Pre-bake the cascade before it lands

When a SPOF leaves a small team, you take two hits. First, the bandwidth gap: the work they did, now distributed across the people who could not do it. Second, the hiring drain: you spend weeks sourcing, screening, looping, and closing a strong replacement for a critical role, which is time you are no longer spending unblocking the team. The two hits multiply rather than add, so the team slows more than either would cause alone.

Pre-baking means you started before you knew the date:

A deputy who is already absorbing the day-to-day running of the system, rather than a deputy who would have started learning if there had been time.
A structured handoff plan, rather than a brain dump after notice.
A buffer week with no new work scheduled.
An explicit conversation with stakeholders that says "we will be slow for six weeks" before the slowness shows up as missed commitments.

Refuse to bend the performance bar

The repair loop is the conversation you have with someone whose performance is dipping: honest feedback, a clear set of changes, a time horizon, follow-through.

When that person is a SPOF, you bend the loop. You soften the feedback, stretch the time horizon, dilute the follow-through into "let us see how next quarter goes". You can be in the soft loop for six months without realizing you are running one.

You did not bend the loop because of conviction or character. Running it honestly means accepting that they might leave or be managed out at the end, and that loss would crater the team's velocity in a way you cannot survive this quarter. You bent the loop because you never built the redundancy that would let you keep it honest.

You cannot fix this in the repair loop itself. You fix it upstream: build the deputy, then run the loop honestly.

Be honest about leader-as-SPOF

The hardest SPOF to see is your own. If every non-trivial decision routes through you, you are the SPOF on judgment. The decisions:

What gets prioritized.
Who is on what.
When to escalate.
What "good" looks like on a review.
When a customer concern is a blocker.

The team will tell you they are fast, and they will be fast on small things. They will be blocked on anything that needs a call when you are unavailable. Take a week off and the team pauses.

You do not cure this by calling for more delegation. Write down the rules of thumb you use to make calls. Write decision records when you make non-obvious choices, with the reasoning, so people can run the next call on similar ground without you. Then, deliberately, force the team to make calls without you in the room.

You will sit through calls you would have made differently, and that discomfort is the cost of dissolving the SPOF. Pay it.

When SPOFs are the right call

Every redundancy you stage costs you something. You slow the SPOF down during the transfer; you spend a week writing heuristics before you see any return; and you carry the SPOF map in the back of your head, where it does not switch off when a quarter feels stable.

You do not dissolve every SPOF. Some systems are sunsetting in two quarters, and growing a second owner is wasted work. Some prototypes have a single owner because they should; stage redundancy only after a prototype survives contact with users.

You are not aiming for no SPOFs; you are aiming to never be surprised by one. If a person leaving breaks the team, the leader did not run out of time. They stopped reading the map.

Starter kit

A SPOF map for the surfaces you own, including yourself

Fill in one row per system or surface you depend on. The trajectory column tells you whether to build redundancy around the owner (rockstar) or out of them (superstar).

System or surface	Primary owner	Deputy	Blast radius (high, med, low)	Owner's trajectory	Action this quarter
payments API	Asha	none	high	rockstar	stage deputy by Q3
___	___	___	___	___	___
___	___	___	___	___	___
___	___	___	___	___	___
___	___	___	___	___	___
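
If you keep the map as a tab-separated file, a few lines can flag the rows that need attention first. The sketch below assumes the exact column headers used in the table above and treats "none", an empty cell, or an unfilled "___" as a missing deputy. The function name and the second sample row are hypothetical.

```python
import csv
import io

# Values that count as "no deputy named" (an assumption for this sketch).
NO_DEPUTY = {"", "none", "___"}

def missing_deputies(tsv_text):
    """Return surfaces with high blast radius and no named deputy."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = []
    for row in reader:
        if row["Primary owner"].strip() in {"", "___"}:
            continue  # skip unfilled template rows
        no_deputy = row["Deputy"].strip().lower() in NO_DEPUTY
        high = row["Blast radius (high, med, low)"].strip().lower() == "high"
        if no_deputy and high:
            flagged.append(row["System or surface"])
    return flagged

# Sample map: the first data row is from the table above, the second is made up.
SPOF_MAP = (
    "System or surface\tPrimary owner\tDeputy\t"
    "Blast radius (high, med, low)\tOwner's trajectory\tAction this quarter\n"
    "payments API\tAsha\tnone\thigh\trockstar\tstage deputy by Q3\n"
    "search\tBen\tCy\thigh\tsuperstar\thand over on-call\n"
    "___\t___\t___\t___\t___\t___\n"
)
flagged = missing_deputies(SPOF_MAP)
```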

Then run the scan on yourself. For each decision class, name the last call your team made without you in the room. If you cannot, you are the SPOF on that class.

Decision class	Last call the team made without me	Heuristic written down?
Prioritization	___	[ ]
Staffing	___	[ ]
When to escalate	___	[ ]
What "good" looks like in review	___	[ ]
When a customer concern is a blocker	___	[ ]

Run this when you cannot name a deputy for every critical surface, or when a one-week leave from you would pause the team.

You read the map so no one leaving surprises you.