Sarthak Garg

Signal design in the age of AI-assisted candidates

Redesigning interviews for the post-generation world.

Last year my team ran a hiring cycle for nine engineering roles. One of them, a backend opening, drew fifteen thousand applications in the first two days. The full set of nine roles took us six months to close. Along the way, we caught many candidates cheating at some point in the interview process.

The cheating itself did not surprise me. What surprised me was how easy AI had made it for a weak candidate to look strong on paper and in a first interview. Most resumes looked nearly identical, as if they had all come from the same template. Most interview answers sounded like the candidate was reading them off a screen rather than thinking them through. When we asked a follow-up question, the candidate often stumbled, because the model had not written them an answer for the follow-up. During live coding rounds, we could watch a candidate pause, glance off-camera for a few seconds, and then paste an answer back into the editor.

We tried the obvious things first, and learned what each one was actually good for:

  • Interview-as-a-service platforms gave us consistent rounds, but never flagged a single cheater.
  • Proctored online tests caught a few cheaters, but most candidates just kept a phone or another laptop next to them, out of view of the webcam, and read answers off it.
  • Manual resume screening worked, but at fifteen thousand applications it was far too slow.
  • AI resume shortlisters meant we had one AI reviewing resumes that another AI had written, with the accuracy you would expect.

After a couple of cycles, two things became clear to me.

First, our real limit was not how clever a round we could design. Our real limit was how many hours one person on my team could spend on screening in a day. We could not run ten interviews per candidate. We could not even run one real resume read per candidate, at three hundred applications a day. If the very first round took an hour of one engineer's time per candidate, the team would run out of hours long before any strong candidate reached a senior interviewer.

Second, our interview process now had to do two different jobs instead of one. The first job was to filter out the candidates who used AI to look qualified without being qualified. The second job was to check how well a candidate works with AI on real engineering tasks, because every engineer on the team now writes code with a model. Most of the writing I read at the time tried to do both jobs in the same interview round. I now give each job its own rounds.

What follows are six things I now consider when I design a hiring funnel. The first is the structural constraint that governs everything else: how much screening time the team actually has each day. The next three are filters for candidates who look qualified but cannot actually do the job, and they run in funnel order: proof of past work in the resume, a quick behavioural read on how the candidate engages with code, and a curve ball the candidate cannot pre-prepare for. The fifth is a deeper round that checks how the candidate works with AI on real engineering tasks. The sixth holds the other five together: candidates and cheating tools adapt, so every round we design will stop working within a few months, and my job is to be ready with the next one.

Start from how much screening time your team has, then design backwards

Before designing any round, work out how much time one person on your team can actually spend on screening per day. Two hours per day is realistic for most teams I have worked with. Then work out what a real read of one resume actually takes. For us it was a few minutes per candidate, enough to look at one or two specific things and form a view. Two hours at three minutes per resume comes out to around forty candidates a day. So if three hundred applications land each day, more than two hundred and fifty of them are not going to get a real read at all. The first cut has to be something cheaper than a read.
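
The arithmetic above is easy to rerun with your own numbers. The following is a minimal sketch, not a tool we used: the function names are hypothetical, and the inputs are the figures from this section (two hours a day, three minutes per resume, three hundred applications a day).

```python
import math


def screening_capacity(minutes_per_day: int, minutes_per_resume: int) -> int:
    """Number of resumes one reviewer can give a real read in a day."""
    return minutes_per_day // minutes_per_resume


def unread_per_day(applications_per_day: int, capacity: int) -> int:
    """Applications that cannot get a real read from one reviewer."""
    return max(0, applications_per_day - capacity)


def reviewers_needed(applications_per_day: int, capacity: int) -> int:
    """Reviewers required if every application were to get a real read."""
    return math.ceil(applications_per_day / capacity)


# Figures from the text: 2 hours a day, 3 minutes per resume, 300 applications.
capacity = screening_capacity(minutes_per_day=120, minutes_per_resume=3)
unread = unread_per_day(applications_per_day=300, capacity=capacity)
reviewers = reviewers_needed(applications_per_day=300, capacity=capacity)

print(capacity, unread, reviewers)
```

With these inputs, one reviewer covers forty resumes, two hundred and sixty go unread, and a full read of every application would need eight people doing nothing but screening for their two hours.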

Most writing on this problem starts from the question "how do we catch the candidate who used AI?" I think that is the wrong question. The right question is "what is a yes-no signal in the resume that I can check in seconds, and that separates the candidates worth a real read from the ones who are not?" Detection tools like proctoring software or eye-tracking do not actually save time: someone still has to set them up, review the candidates they flag, and decide who to advance. Build for the second question. The next change is the glance itself: what I now look for in a resume before deciding to spend a few minutes of real reading time on it.

Look for proof of real work, not a polished resume

The first real filter in the funnel is the resume itself, and this change helps with both jobs at once. It filters out the candidates who only look qualified, because faking a real project with real users is much harder than faking a polished resume. It also surfaces the candidates who have actually shipped something using AI tools, because shipping a real project is exactly the kind of work that pushes an engineer toward those tools in the first place.

It is no longer enough for a candidate to know a particular framework or language. Any candidate can list a tech stack on their resume. A model can build them a clean CRUD project in an evening. A model can then write the resume to go on top of it. A CRUD project only tells me the candidate followed a tutorial. What I want to see is a decision the candidate made that the tutorial did not make for them.

For early-career hires, what I look for now is a side project that reached at least one real user, or a small SaaS the candidate built and ran themselves. Something with a user, and a tradeoff the candidate had to think through on their own. For experienced hires, I lean on prior time at engineering teams I respect, referrals from inside our network, and campus visits to programmes whose graduates we have already worked with and trust. One of those campus visits filled all six of our intern openings in a single day.

I do not think this higher bar on proof is a temporary reaction to AI in the candidate pool. The volume of applicants who look strong but cannot actually do the job will only keep going up. Candidates, especially junior ones, will need to actually stand out, not just claim they do. When we hold this line on proof, we stop paying the much larger cost of hiring the wrong person and watching them struggle through onboarding and on-call.

Watch the candidate engage with code, not just produce it

Once the resume pile has been narrowed down, the next thing to find out, quickly, is whether the candidate is actually familiar with code or just looks like they are on paper. The cheapest signal here is behavioural: how the candidate reacts to code in real time. Behavioural signals are much harder to fake than artefacts, because a model can produce the artefact, but a model cannot sit in the candidate's chair.

The version we ran was a short recorded video where the candidate explained a piece of code on camera. We showed the candidate some code on screen, and asked them to record a three-to-five minute video where they talked through it, with their camera and microphone on. Other formats can produce the same signal: a brief live call with one engineer on the team, a screen-shared walkthrough, or a paired session that is mostly the candidate narrating. Any of these works. The right format depends on the team's bandwidth.

What we actually graded was how the candidate behaved on the recording while they explained the code. The obvious giveaways, like a long pause before the first sentence or eyes drifting off-screen to read an answer from somewhere, showed up in our very first batch of recordings. The one that stayed with me was a candidate whose friend was literally sitting next to them, just outside the camera frame, whispering the answers.

A candidate cannot fake how their face looks while they read out an answer they did not write. Our version of this round took one to two minutes per candidate to review, and it was the first round in our funnel where we got more out of it than we put in.

Throw curve balls the candidate cannot pre-prepare for

The last filter for candidates who look qualified but cannot do the job is a curve ball, a question the candidate cannot pre-prepare an answer to, because the right answer needs something a model cannot supply.

The kind of curve ball that worked best for us was a question whose only good answer was a small personal story. The one I used most often was: "explain the HTTP 418 status code, the teapot." A model can give you a clean textbook answer: 418 is "I'm a teapot", a status code that comes from an April Fools' RFC. A real engineer remembers a code review where a colleague slipped 418 in as a joke, or a side project where someone added it as a gag. The candidates who gave us only the textbook answer, with no small story attached, were almost always cheating.

Curve balls keep working as a category. Any one specific curve ball does not. Once candidates know that we ask about the teapot, they show up with a pre-prepared teapot story. By our second hiring cycle, the teapot question itself had already stopped being useful, and we had to write a new one.

Let the candidate use AI, and grade how they use it

The next change is one that many teams are reluctant to make. Every engineer on my team now writes code with a model. So if we forbid the candidate from using one in the interview, we are testing for a different job than the one we are actually going to hire them for. What we want to see in the interview is how the candidate works with the model in their hands.

So we added a dedicated round where the candidate uses AI in front of us. We give them a coding problem, often one with deliberately fuzzy requirements. They share their screen and use whichever model they prefer. We grade seven things while we watch:

  • How they handle a fuzzy problem. Before the candidate prompts the model, do they restate the problem in their own words and name the assumptions they are making? Or do they paste our question in as-is and let the model decide what we meant?
  • How they write the prompt. Does their first prompt give the model the context, the constraints, and at least a small example? Or is it a single line followed by hitting submit?
  • Whether they write a clean spec when the problem calls for one. A well-written spec does more to improve the model's output than any other input the candidate can give it. The candidates who take the time to write one are the same ones whose AI-assisted code still holds together past the first page.
  • How they check the model's output. When the model returns code that looks right, does the candidate run it, test it at the edges, and read it line by line? Or do they accept the first answer that looks plausible and move on?
  • Whether they push back on the model. When the model is wrong, does the candidate notice and correct it, or follow it off a cliff?
  • How fast they recover from a wrong turn. When the model leads the candidate down a wrong path, how long does it take them to notice and start over?
  • Whether they would ship what came out. At the end, we ask the candidate whether they would put the final code into production, and why or why not.
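
One way to keep interviewers consistent on these seven points is a small scorecard. This is a minimal sketch under stated assumptions: the criterion names, the `AiRoundScorecard` class, and the 1-to-4 scale are all hypothetical choices for illustration, not the form we actually used.

```python
from dataclasses import dataclass, field

# The seven things graded in the AI round, in the order listed above.
# The identifiers are illustrative names, not an established rubric.
CRITERIA = (
    "handles_fuzzy_problem",
    "writes_prompt",
    "writes_spec",
    "checks_output",
    "pushes_back",
    "recovers_from_wrong_turn",
    "would_ship_reasoning",
)


@dataclass
class AiRoundScorecard:
    candidate: str
    scores: dict[str, int] = field(default_factory=dict)

    def rate(self, criterion: str, score: int) -> None:
        """Record a score on an assumed 1 (weak) to 4 (strong) scale."""
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if not 1 <= score <= 4:
            raise ValueError("score must be between 1 and 4")
        self.scores[criterion] = score

    def missing(self) -> list[str]:
        """Criteria the interviewer has not rated yet."""
        return [c for c in CRITERIA if c not in self.scores]

    def average(self) -> float:
        """Mean score; refuses to summarise an incomplete scorecard."""
        if self.missing():
            raise ValueError("scorecard incomplete")
        return sum(self.scores.values()) / len(CRITERIA)


# Example: solid on everything except pushing back on the model.
card = AiRoundScorecard("candidate-001")
for criterion in CRITERIA:
    card.rate(criterion, 3)
card.rate("pushes_back", 1)
print(round(card.average(), 2))
```

Refusing to average an incomplete scorecard is deliberate in this sketch: it forces the interviewer to form a view on all seven points rather than only the ones that stood out.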

More companies have started running rounds like this for the same reason we did. If we ban the tool in the interview, we end up hiring engineers who cannot use it well, which is the opposite of what we need.

Plan for every round to stop working in three to six months

This last change is what holds the other five together. The tools candidates use to cheat improve faster than my team can redesign interview rounds. So my job, as the person designing the funnel, is not to find one perfect round that lasts forever. My job is to keep replacing rounds every few months.

Take the teapot question again. The first time we used HTTP 418, it cleanly separated the candidates into two groups. The ones with a real story stayed in the funnel; the ones without one dropped out. By the second hiring cycle, candidates were arriving with a pre-prepared teapot story, the right length, with the right amount of self-deprecating detail. The question stopped working. Curve balls as a category still work. That specific curve ball no longer does.

The same thing will happen to every other round in this list:

  • The recorded on-camera round, which felt unbeatable to us in the first cycle, is the next thing the cheating-tool industry will work on. Voice assistants that whisper answers to the candidate, with realistic timing, are already starting to appear.
  • We will have to keep adding to our list of what we grade in the AI round, because the underlying models keep getting better at the very things we currently catch candidates on.
  • We will have to rewrite our definition of "real shipped work", once candidates learn how to manufacture convincing fake projects.

If I design each round assuming it will work forever, then six months later I will be left with a hiring funnel that no longer separates real candidates from fake ones, and a hire bar that has quietly dropped without anyone on the team noticing. If I design each round assuming I will need to replace it within six months, I always have a few replacement rounds being designed in the background. If I am not willing to retire and rebuild the funnel on this schedule, none of the other changes in this list will hold. The trap is to assume that any one round can be made permanent.

Once our hiring process is reliably telling us which candidates can actually do the job again, two more questions come up. The first is calibration: how do I keep the same bar we started with, even as the interview format itself keeps changing. The second is growth: once we have hired a real candidate, how do I grow them on a stack the team itself is still learning. After that comes closing the candidate, which gets much easier when the earlier rounds have actually surfaced who the candidate is, because by the time we make an offer we have met the person we are hiring.