Pillar essay · June 5, 2026 · 17 min read

The CEO's case for making AI argue with itself before it answers you

A single AI answer is one opinion. A workflow is a panel that argues it out and hands you only what survived.

The decision I would not trust to one chat

It's a Thursday morning. I'm staring at a vendor contract I have to sign or kill by end of day, and a question I can't shake: where is the landmine in here?

A year ago I'd have pasted the thing into a chat window, asked "what's risky in this contract," and gotten back a clean, confident list. That confidence is exactly the problem. One model, one pass, one answer, no second opinion. If it missed the auto-renewal clause buried in section 9, I'd never know it missed it. It doesn't tell you what it didn't check.

So I don't do that anymore for anything that matters.

For the decisions I actually can't afford to get wrong, I want the answer to have already survived an argument before it reaches me. Not one AI's take. A result that a few independent AI workers drafted from different angles, then tore into each other's reasoning, and what's left standing is what I read. That shift, from "an AI told me" to "a panel of AIs argued it out and this is what held up," is the whole reason a dynamic workflow beats one big chat. It isn't more horsepower. It's built-in cross-checking.

And before you close the tab: you do not write a line of code to get this. You describe the job in plain English, or you literally say "use a workflow," you approve a one-screen plan, and you read one report at the end. Claude writes the script. You don't. I'll come back to that promise more than once, because it's the part CEOs don't believe until they see it.

If you're not yet running Claude Code in the terminal at all, start with the sibling piece, which makes the case for being in the terminal in the first place. This one assumes you're already in and argues for the altitude above it, where one session takes the whole job instead of one task at a time.

What a dynamic workflow actually is

A dynamic workflow is a small program Claude Code writes to run many AI workers at once on a single job. You describe the task; Claude writes the orchestration script; a runtime executes that script in the background while your chat stays free; and only the final answer comes back to you. The workers are called subagents (each one its own copy of Claude doing a slice of the work), and a skill is the reusable instruction set one of them can follow. The full definition lives in the explainer, so I won't re-litigate it here. The thing worth your attention is what that structure lets you do that a single chat can't: have the workers check each other.

The bottleneck was never the model's intelligence

Here's the thing almost nobody says out loud. The reason a big job stalls when you hand it to AI isn't that the model isn't smart enough. The model was already smart enough. The bottleneck is that one conversation can only hold so much work.

You give Claude a big job in one chat. It starts sharp. Then the work piles up in its context window (the finite amount it can hold in working memory at once), and as that window fills, the quality quietly decays. The companion piece on why most AI agents fall apart in real work puts it cleanly: it's the context, not the model. A bigger, smarter model does not fix a window that's overflowing. It was already smart enough in the demo.

A workflow attacks that at the structural level. The shift buys a CEO four things a single conversation can't:

Scale. A turn-by-turn chat juggles a few delegated tasks at a time. A workflow fans out to dozens or hundreds of workers per run, each starting with a clean window holding only its slice. Nothing drowns. (A run is capped at 16 workers at once, fewer on a machine with limited CPU cores, and 1,000 workers total, so a runaway script can't quietly burn your machine or your budget.)
Built-in cross-checking. Because the plan lives in code, the script can spin up independent workers to draft an answer from several angles, then have other workers adversarially review that reasoning, and weigh the versions before anything reaches you.
A readable SOP. Because the plan is code, you can read it, rerun it, and save it. Every run writes its script to a file under your session's folder, so you can ask for it, open it, or have Claude rerun it on next month's inputs.
Resumable runs. The script runs in the background, in an isolated environment, separate from your chat. If you get interrupted mid-run, the work already done is cached and you pick up where it left off, as long as you stay in the same Claude Code session. Exit Claude Code and the workflow starts fresh.
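
You never have to write any of this yourself, but if you're curious what "fan out with a cap" looks like as code, here is a minimal Python sketch. It is not the script Claude generates, and I'm not claiming the runtime works this way internally. The function names (run_worker, fan_out, run) are hypothetical stand-ins; only the two limits, 16 at once and 1,000 total, come from the text above.

```python
import asyncio

MAX_CONCURRENT = 16  # workers allowed to run at once (figure from the text)
MAX_TOTAL = 1000     # workers allowed per run (figure from the text)


async def run_worker(task: str) -> str:
    """Hypothetical stand-in for one subagent handling one slice of the job."""
    await asyncio.sleep(0)  # placeholder for the real work
    return f"checked: {task}"


async def fan_out(tasks: list[str]) -> list[str]:
    """Run one worker per task, never more than MAX_CONCURRENT at a time."""
    if len(tasks) > MAX_TOTAL:
        raise ValueError(f"{len(tasks)} tasks exceeds the {MAX_TOTAL}-worker ceiling")
    gate = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(task: str) -> str:
        async with gate:  # waits here while MAX_CONCURRENT workers are busy
            return await run_worker(task)

    # gather preserves input order, so results line up with tasks
    return await asyncio.gather(*(guarded(t) for t in tasks))


def run(tasks: list[str]) -> list[str]:
    """Synchronous entry point for the sketch."""
    return asyncio.run(fan_out(tasks))
```

Calling run on a list of forty contract names returns forty results in the same order, and handing it more than 1,000 tasks fails up front instead of partway through. That is the point of a cap: the limit is enforced by the structure, not by someone remembering to check.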

That's the difference between asking one tired analyst at 11pm and convening a small review board. The single pass gives you a confident first draft. The workflow gives you a result that already took fire and held.

The evidence that a panel beats a genius

I'm not asking you to take my word that orchestration beats a single pass.

Anthropic's own engineering team ran the test directly. In their words, "a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval" (Anthropic Engineering, How we built our multi-agent research system). Read that again: a team of a strong lead plus several lighter workers beat the strongest single model running alone, by a wide margin. The orchestration was the edge, not the model.

The pattern holds outside Anthropic. Andrew Ng, who founded DeepLearning.AI and ran Google Brain, is blunt about where the gains are coming from: "I think AI agentic workflows will drive massive AI progress this year, perhaps even more than the next generation of foundation models. This is an important trend, and I urge everyone who works in AI to pay attention to it" (Andrew Ng, March 2024). His own demonstration drives it home: an older, weaker model wrapped in an iterative agent loop scored 95.1% on the standard HumanEval coding test, while the brand-new flagship model used in single-pass mode scored 67.0% (Andrew Ng, Sequoia AI Ascent 2024). The loop beat the upgrade.

And this isn't a one-off measurement that'll reverse next quarter. The independent research group METR found that "the length of tasks that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years" (METR, March 2025). The jobs a workflow can take off your plate get bigger every few months on their own. The trend line is measured, not speculation.

The takeaway for a CEO isn't the benchmark numbers; forget those. The direction is what matters: how you orchestrate the work now matters more than which model you bought. And orchestrating it well is exactly what a workflow does for you, automatically, without you knowing a thing about how it's wired.

What I'd actually point this at

Forget code migrations for a second. That's the headline demo, but it's not where most CEOs live. Here's where the cross-checking pattern earns its keep in a real week. These are scenarios I'd run; I haven't battle-tested all of them as workflows myself, since dynamic workflows are a research-preview feature, so treat the specifics as illustrations of the shape, not war stories.

The contract sweep. I have a folder of supplier and customer agreements. I want every one checked for a single risky thing: auto-renewal, unlimited liability, missing termination rights. A workflow fans out one subagent per contract, each checks its document, and I get one consolidated report of which agreements carry the risky language. I'm not reading forty contracts. I'm reading one list, and I know every contract got looked at, not just the ones I had time for.
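
For the curious, the shape of that sweep can be sketched in a few lines of Python. Treat this as an illustration under loud assumptions: the phrase list and the names (RISKY_TERMS, check_contract, sweep) are mine, and plain string matching stands in for what a subagent would actually do, which is read the contract. Checking for a missing clause, like termination rights, needs real reading and is left out.

```python
from dataclasses import dataclass

# Hypothetical phrases to flag. A real worker would read the contract,
# not match strings; this only shows the one-worker-per-document shape.
RISKY_TERMS = ("auto-renew", "unlimited liability")


@dataclass
class Finding:
    contract: str
    flags: list[str]


def check_contract(name: str, text: str) -> Finding:
    """One 'worker': looks at exactly one contract and nothing else."""
    lowered = text.lower()
    return Finding(name, [term for term in RISKY_TERMS if term in lowered])


def sweep(contracts: dict[str, str]) -> list[Finding]:
    """Check every contract, then report only the ones that were flagged."""
    findings = [check_contract(name, text) for name, text in contracts.items()]
    return [finding for finding in findings if finding.flags]
```

Every contract passes through check_contract, whether or not it ends up in the report. That is the guarantee I care about: the short list I read is a filter over all forty documents, not a sample of the ones someone got to.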

The batch triage against a fixed rubric. A stack of inbound RFP responses, support tickets, or applicant resumes. Many workers sort, score, and flag each one against my criteria in parallel, and I get back one ranked result instead of a pile I have to grind through by hand. The whole batch gets the same standard, not just the ones I read before I ran out of patience.

The vendor-grade competitive teardown. I want a real read on a competitor or an acquisition target, not a pile of links I have to vet myself. The bundled deep-research workflow (more on it below) fans out searches across several angles, fetches the sources, cross-checks them against each other, votes on each claim, and hands back one cited report with the claims that didn't survive already filtered out. The vetting happened before I read it.
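
I don't know the exact voting rule the bundled workflow uses, so the following is only a sketch of the general idea in Python: several reviewers each say whether a claim held up, and a claim reaches the report only if enough of them agree. The function name and the strict-majority threshold are my assumptions.

```python
def surviving_claims(votes: dict[str, list[bool]], threshold: float = 0.5) -> list[str]:
    """Keep claims confirmed by more than `threshold` of their reviewers.

    The strict-majority default is an assumption for illustration,
    not the documented behavior of any real workflow.
    """
    kept = []
    for claim, ballots in votes.items():
        # A claim nobody reviewed does not survive by default.
        if ballots and sum(ballots) / len(ballots) > threshold:
            kept.append(claim)
    return kept
```

With three reviewers, a claim confirmed by two of them survives and a claim confirmed by one does not. A claim with no ballots is dropped too, which matches the spirit of the thing: unreviewed is not the same as verified.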

The strategic pressure-test. This is the one I care most about. A hard call: raise prices, enter a new region, take on debt, make a key hire. I'd have a workflow draft the decision from several independent angles, then have separate workers adversarially review each other's reasoning and weigh the versions, so what reaches me is a recommendation that already absorbed its own counterarguments. That's not "an AI thinks you should raise prices." That's "here's the case for, here's the case against, here's what held up when they fought."

The pre-ship fact check. Before a published report, a pricing page, or a compliance filing goes out, I want every factual claim verified by workers who check each other, so the stuff that doesn't hold up gets caught here instead of by a customer or a regulator. A vetted result, not one model's unchecked first draft.

The recurring review you run identically every time. Monthly close. Every new vendor onboarding. Each board-deck draft. Once a workflow run does what I want, I select the run in /workflows and press s to save its script as my own slash command (a one-word shortcut, like /board-check) and run the exact same panel every month, feeding in that month's inputs.

That last one is the compounding move. People drift, forget steps, and quit. A saved workflow runs the exact same checklist on the exact same standard, this month and next year, the day you're traveling and the day you're not. You wrote the SOP once by approving it once. It runs the same way forever.

Notice the through-line. Every one of these is a job where I'd rather have the answer pre-argued than fast. That's the filter for whether a workflow is worth it: does the downside of a wrong answer justify making the AI argue with itself first? If yes, single-pass is a liability.

The part where you do nothing technical

I keep promising this, so here is exactly how little you touch.

You type a request in plain English. "Check every contract in this folder for auto-renewal clauses." Or you don't even specify the machinery: you just add "use a workflow" to a normal request, or include the keyword ultracode in your prompt. Claude figures out it's a big fan-out job and writes the orchestration script itself. You never see code unless you go looking for it.

Before it runs, Claude shows you a plan: the phases it's about to execute, with a prompt that offers "Yes, run it," "Yes, and don't ask again for this workflow in this project," "View raw script" if you're curious, and "No." You read the plan in plain language and approve it. That's your checkpoint.

Then it runs in the background. Your chat stays responsive while the workers grind. You can watch progress with the /workflows command if you want (it shows each phase, how many workers are running, tokens used, time elapsed, and lets you pause or stop without losing finished work), or you can ignore it and go to a meeting. In the time it takes me to walk to the kitchen for water, a panel of agents has done the cross-checking that used to mean reading everything myself and still missing things.

At the end, one report. Not a transcript. Not a pile of links. The answer that survived.

The framing matters here, so let me say it the way I'd say it at dinner: you are not the analyst anymore. You're the executive who convened the panel, set the question, and reads the verdict. Get Claude to do the work; you decide what to do with it.

The honest caveats, operator to operator

I'm not going to hand you the evangelist version and skip the costs. A few things you should know before you turn this loose.

It costs more. Materially. A workflow spawns many agents, so a single run can burn a lot more than the same question in one chat. Anthropic's own team found agents "use about 4x more tokens than chat interactions, and multi-agent systems use about 15x more tokens than chats" (Anthropic Engineering). That 15x is the price of the cross-checking. The mitigations are real: point it at one folder or one narrow question first to gauge spend, watch each worker's token use live in /workflows, and you can tell Claude to use a smaller, cheaper model for the stages that don't need the strongest brain. The runtime also caps it: up to 16 workers running at once (fewer on a machine with limited CPU cores), and a hard ceiling of 1,000 total per run, so a runaway loop can't quietly bankrupt you.
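
To make that multiplier concrete before a first run, here is a back-of-envelope estimate in Python. The 4x and 15x figures are the ones quoted above; everything else, including the function name and the example baseline, is a hypothetical of mine. Real spend depends on the job, so treat the output as an order of magnitude, not a quote.

```python
# Multipliers quoted in the text (Anthropic Engineering). Rough averages only.
AGENT_MULTIPLIER = 4
MULTI_AGENT_MULTIPLIER = 15


def estimate_tokens(chat_tokens: int) -> dict[str, int]:
    """Scale a single-chat token count by the quoted multipliers."""
    return {
        "chat": chat_tokens,
        "single_agent": chat_tokens * AGENT_MULTIPLIER,
        "multi_agent": chat_tokens * MULTI_AGENT_MULTIPLIER,
    }
```

If a question would cost about 20,000 tokens in a normal chat, estimate_tokens(20_000) puts the multi-agent version at roughly 300,000. That gap is why I test on one folder before I point a workflow at the whole drive.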

It's research preview. Dynamic workflows are explicitly labeled in research preview and require a recent version of Claude Code. There are rough edges. Point this at a job where the worst case is "redo it," not "explain it to the board." I run the OpenCLAW harness across my companies and I've spent real time with Claude Code, but I'd be lying if I told you I'd hammered dynamic workflows in production for a year. Nobody has. The feature is new. Treat it accordingly.

Sometimes a workflow is the wrong tool. If the job is small, conversational, and turn-by-turn, a single agent or a skill is the right call, not a panel of sixteen. The companion piece, "a goal or a dynamic workflow," walks through the decision. The short version: reach for a workflow when the job needs more workers than one conversation can coordinate, or when you want the orchestration written down so you can rerun it.

A human still signs. The workflow runs to completion without stopping for your input mid-run (only its own permission prompts can pause it), so it's not the tool for a decision that needs your sign-off between stages. And the cross-checking makes the answer more trustworthy; it doesn't make it true. You're still the one who commits the company. The panel argues. You decide.

The on-ramp you can run this week

You don't have to start with the board-deck panel. Start with the small version that ships in the box.

Claude Code includes one built-in workflow today: /deep-research. You type /deep-research and a question, and it does the cross-checked teardown I described above, fanning out searches, vetting the sources against each other, voting on each claim, and returning a cited report with the weak claims already stripped out. It's the fastest way to feel the difference between one AI's answer and a panel's verdict, on a low-stakes question, this afternoon. The everyday version is written up in "get up to speed on any topic."

Why bother starting now instead of waiting for the rough edges to smooth out? Because the capability is moving fast and it compounds. The jobs you can hand off get bigger every few months, on their own, while you sleep. And the prize is not small: McKinsey estimates generative AI could add the equivalent of $2.6 trillion to $4.4 trillion annually across the 63 use cases they analyzed, concentrated in exactly the knowledge work a CEO orchestrates (McKinsey, June 2023). The CEO who starts building this muscle now, on the small jobs, is the one who trusts it with the big ones first.

Frequently asked

Do I need to know how to code? No. You describe the job in plain English, or you just say "use a workflow," and Claude writes the orchestration script itself. You approve a one-screen plain-language plan, then read one report at the end. The only thing you need to edit is the plain-English request. The script is there to read if you're curious, but you never have to write or read a line of code.

How is this different from just asking Claude in a chat? A chat is one model making one pass and handing you a confident answer with no second opinion. A workflow runs many independent workers that can draft from several angles and adversarially review each other before anything reaches you. The result already survived scrutiny.

What does it cost? More than a chat, sometimes a lot more, because a workflow runs many agents at once. Multi-agent runs can use roughly 15x the tokens of a single chat, and they count against your plan's usage like any session. Mitigate by testing on a small slice first, watching token use live in /workflows, and routing cheaper stages to a smaller model.

Is it safe to let it run on its own? Reasonably, with guardrails. It shows you the plan and waits for your approval before starting. It caps itself at up to 16 workers at once and 1,000 total, and you can stop any run without losing finished work. File edits auto-approve, so point it at jobs where the worst case is "redo it," not something irreversible, and keep a human signing off on the final call.

What can go wrong? It's research-preview software, so expect rough edges. A workflow can return a polished answer that's still wrong: cross-checking raises trust, it doesn't guarantee truth. A poorly scoped run can spend more than you expected. And it can't pause for your input mid-run, so it's wrong for any job needing sign-off between stages. Start small and verify before you act.

Can I reuse a workflow I liked? Yes. When a run does what you want, select it in /workflows and press s to save its script as your own slash command, and it runs the same panel every time after that. Feed in fresh inputs each run (this month's numbers, this quarter's vendors) without touching the script. The recurring review stops depending on whether you remembered to be thorough.

What you should do next

Open Claude Code this week and run /deep-research on a real question you'd otherwise spend a Saturday of tabs on. Read the one cited report it hands back and notice what it tells you it could not confirm. That's the panel showing its work.

Then, the next time a decision lands on your desk that you genuinely can't afford to get wrong, the contract, the price change, the hire, describe it in plain English and add "use a workflow." Approve the plan. Read the verdict. Decide.

The gap between the CEO who makes AI argue before it answers and the CEO who trusts the first confident reply is going to be larger than people expect, and neither of them will be working harder than the other. The cross-checking is the difference.

Tell me in thirty days what changed. I'd love to hear about it.

Andrew

Want the full system? The DeskTheory operator guides are $99 each, or all three for $199.
