DESK · THEORY
Q&A · June 2, 2026 · 4 min read

Why most AI agents fall apart in real work (and how to fix it)

The agent that nailed your demo is not getting dumber. It's running out of context. The fix is not a smarter model; it's setting the agent up to succeed.

You watched it in the demo. The AI agent read the file, ran the task, came back with something sharp. Then you handed it a real multi-day job and it lost the plot somewhere around step nine. Forgot a decision you made on Monday. Invented a number. Confidently finished the wrong thing.

The reflex is to wait for the next model. That's the wrong diagnosis.

The honest answer

It's the context, not the model.

Every agent works inside a context window: the slice of text it can actually see at one moment, measured in tokens. As a long task runs, that window fills with tool output, half-finished steps, and earlier reasoning. Old decisions scroll out. By the time it reaches step nine, the agent can no longer see what it agreed to on step two, so it guesses. It never had a written brief about your business, so it guesses there too. Small guesses compound into a broken result.
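Here is a minimal sketch of that scroll-out, not how any particular agent framework manages its window. It assumes a crude token count (one token per word) and a hypothetical `visible_context` helper that keeps only the newest messages that fit a budget.

```python
def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def visible_context(messages: list[str], budget: int) -> list[str]:
    """Return the most recent messages that fit in `budget` tokens, oldest first."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break  # everything older than this has scrolled out
        kept.insert(0, message)
        used += cost
    return kept


# Hypothetical task history: one early decision, then seven noisy tool outputs.
history = ["step 2 decision: bill in EUR, not USD"]
history += [f"step {n} tool output: " + "log line " * 10 for n in range(3, 10)]

window = visible_context(history, budget=100)
decision_visible = any("decision" in m for m in window)
print(len(window), decision_visible)  # prints: 4 False
```

With a 100-token budget only the last four tool outputs survive. The step-two decision is gone, and nothing in the window tells the agent it ever existed.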

A bigger model does not fix that. The model was already smart enough in the demo. What broke was the information the agent could see when it mattered.

Why it's harder than it looks

The research lands in the same place from three directions.

METR found the length of task an agent can finish has been doubling roughly every seven months. Impressive, until you read the fine print: that horizon is measured at a 50% success rate. A coin flip. Models hit near-100% on tasks a human would do in under four minutes, and under 10% on tasks over four hours. The doubling is real. So is the coin flip.
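The doubling is easy to put in numbers. This is a back-of-envelope sketch, assuming a clean seven-month doubling and a hypothetical starting horizon of 60 minutes; neither the starting point nor the function name comes from METR.

```python
DOUBLING_MONTHS = 7.0  # "roughly every seven months", treated as exact here


def projected_horizon(start_minutes: float, months_ahead: float) -> float:
    """Task length (in human minutes) finished at a 50% success rate, projected forward."""
    return start_minutes * 2 ** (months_ahead / DOUBLING_MONTHS)


# Assumed starting horizon of 60 minutes, purely for illustration.
for months in (0, 7, 14, 28):
    print(f"+{months:>2} months: {projected_horizon(60, months):.0f} min at 50% success")
```

Every number it prints is still a coin-flip horizon. The curve moves the point where the agent succeeds half the time; it says nothing about succeeding most of the time.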

Toby Ord's "half-life" framing explains the feel of it: an agent has a roughly constant chance of failing for each minute the task would take a human, so success drops exponentially as the job gets longer. Great in a five-minute demo. Underwater on a multi-hour job. Same model, same per-minute risk, very different odds by the end.
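A constant per-minute failure rate means success halves every fixed stretch of task time, which fits in one line of arithmetic. The sketch below assumes a half-life of 60 minutes purely for illustration; the real figure depends on the model and the kind of task.

```python
def success_probability(task_minutes: float, half_life_minutes: float) -> float:
    """Chance of finishing a task under a constant per-minute failure rate."""
    return 0.5 ** (task_minutes / half_life_minutes)


HALF_LIFE = 60.0  # assumed for illustration: 50% success on a one-hour task

for minutes in (5, 60, 240):
    p = success_probability(minutes, HALF_LIFE)
    print(f"{minutes:>3} min task: {p:.0%} chance of success")
```

Under that assumption a five-minute task succeeds about 94% of the time, a one-hour task 50%, and a four-hour task about 6%.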

A 2026 reliability study across 23,392 task episodes showed the same shape: reliability decays faster than the task lengthens. A short job that succeeds around 76% of the time slides toward 52% at the longest horizons. Failure rises faster than the work grows.

And bigger windows are not the escape hatch. Independent testing of around 18 frontier models found they all degrade as the input grows, well before they hit their maximum window, and none of them use long context evenly. As one builder put it on X: "The model is not the bottleneck anymore, context is." A million-token window is not a filing cabinet you can dump everything into and trust. Quality can drop as you fill it. This is "context rot," and it means bigger is not automatically better.

What to do this week

You don't need a better model. You need to set the agent up so the context it needs is in front of it and the context it doesn't is out of the way. Five moves, in order of payoff:

As another builder put it on X: "Your AI agent is not getting worse because the model is dumb. It is getting worse because the context is polluted."


Before you blame the model, do one thing. Write the agent a one-page brief about your business, give it memory, and re-run the task that fell apart. Most of the time, the same model gets it right.
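One way to picture "a brief plus memory" is a context that pins the important parts and lets only the noise scroll. This is a sketch under stated assumptions, not a specific product's memory feature: the `build_context` helper, the word-count tokenizer, and the example business are all hypothetical.

```python
def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def build_context(brief: str, decisions: list[str], steps: list[str], budget: int) -> list[str]:
    """Pin the brief and the decision log, then fill what is left with recent steps."""
    pinned = [brief] + decisions
    remaining = budget - sum(count_tokens(p) for p in pinned)
    recent = []
    for step in reversed(steps):
        cost = count_tokens(step)
        if cost > remaining:
            break  # older tool output is dropped; pinned items never are
        recent.insert(0, step)
        remaining -= cost
    return pinned + recent


# Hypothetical brief and decision log for illustration.
brief = "BRIEF: We sell B2B invoicing software to EU freelancers."
decisions = ["DECISION (Mon): bill in EUR, not USD"]
steps = [f"step {n} tool output: " + "log line " * 10 for n in range(3, 10)]

context = build_context(brief, decisions, steps, budget=100)
print(context[0].startswith("BRIEF"), context[1].startswith("DECISION"), len(context))
# prints: True True 5
```

Same 100-token budget, same noisy tool output. The difference is that Monday's decision and the brief are still in view at step nine, at the cost of one fewer tool output.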
