Forty-two agents on our own codebase, and the call they couldn't make

Yesterday we argued that the frontier has moved the binding constraint off the model and onto the disciplines around it: describing the work, and judging what comes back. It is one thing to write that and another to test it. So the next morning we pointed Claude Opus 4.8 and Claude Code’s new dynamic workflows at this website and let them run an end-to-end performance sprint, all the way to production. The workflow found a fix we should have caught ourselves. It also nearly talked itself into shipping a mistake. Both halves are the point.

The claim, and the test

The note we published the day before, “Claude Opus 4.8, and the discipline it asks for”, made a specific claim: a model that polices its own output, paired with orchestration that runs hundreds of cross-checking agents, stops being the bottleneck. What is left is the quality of the brief you hand it and the judgment you apply to what it returns. We did not want that to stay a thesis, so we ran it against the one codebase we are free to break: our own.

The instrument was a dynamic workflow: a script Claude Code writes and a runtime executes in the background, orchestrating subagents at scale rather than working turn by turn. The brief was deliberately narrow: backend and platform performance only. Work on an isolated git worktree, never use the main branch as the working tree, open a clean pull request, then merge. Freeze the design: no layout, no copy, no brand tone, no CSS tokens, and explicitly do not touch the Reveal scroll-animation observer this project has a documented history with. Pixel for pixel, the rendered site had to come out identical. The whole engagement lived or died on that one paragraph of constraints. That paragraph is Description, the first discipline, and it was the actual product of the morning.

We set the session to ultracode, which pairs maximum reasoning effort with automatic workflow orchestration, and let it go. The analysis came back in about fourteen minutes: a six-auditor fan-out feeding a three-lens adversarial review, twelve candidate changes surfaced and cross-examined. Implementation and verification followed, and the whole run, end to end, took about forty-eight minutes, spun up forty-two agents, and spent roughly 1.57 million tokens.

  brief: backend performance only, design frozen
        │
        ▼
  6 auditors  ─▶  3-lens adversarial review  ─▶  implement  ─▶  verify
  (read-only)     (cross-examine 12 finds)        (4 commits)    (re-confirm)

  42 agents end to end  ·  ~48 min (14 to first analysis)  ·  ~1.57M tokens

Figure 1. The run as a pipeline. Six auditors fan out, a three-lens review cross-examines twelve candidate changes, then implement and verify. Forty-two agents, about forty-eight minutes, on an isolated worktree.

What the run found

Six changes survived review and shipped across four commits on a single pull request, reversible in one command:

perf(headers): scope agent-surface Vary to Accept-Encoding
perf(assets): prioritise the hero serif font preload
fix(contact): time-box the Resend send and guard malformed form bodies
chore(components): drop dead WhyNow/WhyUs components and ArrowUpRight icon

The best of them was not in the application code at all: it was a single token in a response header. Our agent-readable routes (the .md twins, llms.txt, the JSON-LD) were serving Vary: Accept, Accept-Encoding, User-Agent. The edge already keys on encoding by default; the load-bearing mistake was the User-Agent token, which told the cache to store a separate copy for every distinct browser string, for responses that are byte-identical across all of them. As Vercel’s own guidance puts it, each header you vary on multiplies the number of cache entries, and a multiplier keyed on user-agent is effectively unbounded. We scoped the header down to the one dimension these responses actually differ on:

// vercel.json, on the agent-readable routes (.md twins, llms.txt, JSON-LD)
-  { "key": "Vary", "value": "Accept, Accept-Encoding, User-Agent" }
+  { "key": "Vary", "value": "Accept-Encoding" }

It is the kind of bug that can sit in a header unnoticed for a long time, because nothing is visibly broken; the site just runs colder at the edge than it should. Ours had been there only a few days, introduced with an AI-crawler allowlist we never re-examined for its cache cost.
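To make the multiplication concrete, here is a minimal sketch of a shared cache that honours Vary. It is a toy model for illustration, not Vercel's actual keying logic: the `cacheKey` and `distinctEntries` helpers and the sample user-agent strings are assumptions of ours.

```typescript
// Toy model of a Vary-aware cache key. Hypothetical helper names; this is
// an illustration of the principle, not how any particular CDN is built.
type RequestHeaders = Record<string, string>;

// The key is the URL plus the value of every request header named in Vary.
function cacheKey(url: string, vary: string, headers: RequestHeaders): string {
  const dimensions = vary
    .split(",")
    .map((h) => h.trim().toLowerCase())
    .filter((h) => h.length > 0);
  const parts = dimensions.map((h) => `${h}=${headers[h] ?? ""}`);
  return [url, ...parts].join("|");
}

// Count how many separate copies the cache would hold for these requests.
function distinctEntries(url: string, vary: string, all: RequestHeaders[]): number {
  return new Set(all.map((headers) => cacheKey(url, vary, headers))).size;
}

// Four requests for the same byte-identical resource (made-up user agents).
const sampleRequests: RequestHeaders[] = [
  { "accept-encoding": "br", "user-agent": "Browser/1.0" },
  { "accept-encoding": "br", "user-agent": "Browser/1.1" },
  { "accept-encoding": "br", "user-agent": "Crawler/2.0" },
  { "accept-encoding": "gzip", "user-agent": "Crawler/2.0" },
];

const entriesBefore = distinctEntries(
  "/llms.txt",
  "Accept, Accept-Encoding, User-Agent",
  sampleRequests,
);
const entriesAfter = distinctEntries("/llms.txt", "Accept-Encoding", sampleRequests);

console.log({ entriesBefore, entriesAfter });
```

With the old header every distinct user-agent string becomes its own entry, so the four sample requests occupy four slots. With the scoped header they collapse to two, one per encoding.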

The rest were quieter. The hero headline is set in our display serif, and its preload was competing with the body-sans preload under bandwidth contention; the run gave the serif alone fetchPriority="high", which lifts it in the download queue so the largest paint is not gated on a font swap:

// app/root.tsx: the display serif gets priority, the body sans does not
<link rel="preload" href="/fonts/SourceSerif4-Variable.woff2"
      as="font" type="font/woff2" crossOrigin="anonymous"
      fetchPriority="high" />
<link rel="preload" href="/fonts/Geist-Variable.woff2"
      as="font" type="font/woff2" crossOrigin="anonymous" />

The contact form’s upstream send got a hard eight-second timeout and a guard around malformed submissions, both error-path only, the visible form untouched:

// app/routes/contact.tsx: bound a hung upstream, leave the form as it was
const res = await fetch("https://api.resend.com/emails", {
  method: "POST",
  // ...headers and body unchanged...
  signal: AbortSignal.timeout(8000),
});
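The snippet above shows the timeout but not the guard. A minimal sketch of what the error-path half might look like follows, under loud assumptions: the field names (`name`, `email`, `message`), the helper names, and the 504/502 status mapping are all hypothetical, since the post does not show the real route. The one detail we are sure of is that a fetch aborted by `AbortSignal.timeout` rejects with an error named `TimeoutError`.

```typescript
// Hypothetical shape of a contact submission; the real schema is not shown.
interface ContactMessage {
  name: string;
  email: string;
  message: string;
}

type ParsedBody =
  | { ok: true; value: ContactMessage }
  | { ok: false; reason: string };

// Return the trimmed string at `key`, or null if it is missing or empty.
function readString(record: Record<string, unknown>, key: string): string | null {
  const value = record[key];
  return typeof value === "string" && value.trim() !== "" ? value.trim() : null;
}

// Guard: accept only a plain object with the three expected string fields.
function parseContactBody(raw: unknown): ParsedBody {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return { ok: false, reason: "body is not an object" };
  }
  const record = raw as Record<string, unknown>;
  const name = readString(record, "name");
  const email = readString(record, "email");
  const message = readString(record, "message");
  if (name === null || email === null || message === null) {
    return { ok: false, reason: "missing or empty field" };
  }
  return { ok: true, value: { name, email, message } };
}

// Map a failed upstream send to a status code (our assumed convention):
// a timed-out or aborted send reads as 504, anything else as 502.
function statusForSendError(err: unknown): number {
  const errName =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  return errName === "TimeoutError" || errName === "AbortError" ? 504 : 502;
}
```

Both functions sit entirely on the error path: a well-formed submission that sends within the time limit never reaches the rejection branches, which is why the visible form can stay untouched.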

And two genuinely dead components plus an unused icon were deleted: two hundred and eleven lines removed and nothing added.

The bundle barely moved, and that is the honest headline, not a disappointment: the dead code was already tree-shaken out of the shipped bundle, so removing it cleaned the source without changing the payload. Client JavaScript went from 146,258 to 145,929 gzipped bytes, a drop of 329 bytes, or about 0.2 percent. The entry chunk, the runtime, and the entire stylesheet kept byte-identical content hashes through the whole sprint. That invariance is the proof we actually wanted: the site got faster at the edge and in feel, and not one rendered byte changed to get there.
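The invariance check is simple enough to sketch. The version below hashes each emitted asset from two builds and reports which ones differ. It is an assumption-laden illustration rather than the script the run used: the asset names and contents are made up, and a real check would read the files from the build output directory.

```typescript
import { createHash } from "node:crypto";

// Logical asset name -> file contents. In practice the contents would be
// read from disk; in-memory strings keep the sketch self-contained.
type Build = Map<string, string>;

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Names of assets that were added, removed, or changed between two builds.
function changedAssets(before: Build, after: Build): string[] {
  const names = Array.from(
    new Set([...Array.from(before.keys()), ...Array.from(after.keys())]),
  );
  return names
    .filter((asset) => {
      const a = before.get(asset);
      const b = after.get(asset);
      return a === undefined || b === undefined || sha256(a) !== sha256(b);
    })
    .sort();
}

// Hypothetical builds: only the contact route chunk differs.
const buildBefore: Build = new Map([
  ["entry.client.js", "console.log('entry')"],
  ["runtime.js", "console.log('runtime')"],
  ["root.css", "body{margin:0}"],
  ["contact.js", "send()"],
]);
const buildAfter: Build = new Map([
  ["entry.client.js", "console.log('entry')"],
  ["runtime.js", "console.log('runtime')"],
  ["root.css", "body{margin:0}"],
  ["contact.js", "send({ timeout: 8000 })"],
]);

const changedList = changedAssets(buildBefore, buildAfter);
console.log(changedList);
```

An empty list for the entry chunk, runtime, and stylesheet is the machine-checkable form of "design frozen": if any of those names appears in the output, the brief has been violated regardless of what the agents report.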

The shape of that diff is worth pausing on. The number people reach for is lines added; the one that matters is lines removed. Anyone can add. The discipline that takes years to learn, and that no model reaches for unprompted, is taking away what should never have been there while leaving the behaviour exactly as it was. A net-negative diff that the build cannot tell from a no-op is not a small result dressed up as a large one. In engineering it is the hallmark: the codebase got lighter, and nothing it does changed.

Step back from the diff to the calls behind it. Twelve candidates surfaced; six shipped, six were cut:

candidate	call	why
scope Vary to Accept-Encoding	ship	unbounded user-agent cache key
hero serif fetchPriority	ship	largest paint gated on a font swap
time-box Resend, guard the body	ship	bound a hung upstream
drop the dead components and unused icon	ship	211 lines out, behaviour unchanged
preload the italic serif	kill	130KB onto the critical path
strip the loader headers	kill	crawlers need them, a documented paper trail
switch to a newer build target	kill	a no-op, the bundler already emits modern output
share one Reveal observer	kill	high risk on a documented constraint

The next section walks through four of those kills, one by one.

What it talked itself out of

Six of the twelve candidates were killed in review, and the discards are better evidence of the workflow’s worth than the keeps. A fast pass would have shipped all twelve. This one argued itself out of the tempting-but-wrong ones, with specific reasons. Four are worth spelling out:

Preloading the italic serif. It would have put roughly 130KB on the critical path, contending with the actual largest-paint font. A speed change that costs speed.
Stripping the loader headers. Rejected because the case for it rested on a false premise: a prior commit in our own history documents that AI crawlers need those headers to accept the response. The workflow read the paper trail and stood down.
Switching the build target to a newer baseline. A no-op: the bundler already emits modern output. No change, no merge.
Sharing one Reveal observer across the page. High risk, low reward, on the exact file our project memory flags with a non-obvious correctness constraint. It left the file alone, exactly as the note in our own repo demands:
```
threshold: 0 is deliberate — non-zero thresholds break on tall mobile
grids where N% of target height exceeds viewport height. do not change it.
```

That is taste, and taste is what makes more agents worth more rather than just louder. The value was never the raw output. It was the adversarial layer deciding what not to keep.
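The reasoning behind the threshold note quoted above can be checked with a little arithmetic. The sketch below is a simplified model and an assumption on our part: it considers vertical overlap only, with the viewport as root, no rootMargin, and a target of positive height. The function names and pixel values are illustrative.

```typescript
// The visible part of a target can never be taller than the viewport, so
// the intersection ratio is capped at viewportHeight / targetHeight.
// Assumes targetHeight > 0 and that the target's full width is visible.
function maxIntersectionRatio(targetHeight: number, viewportHeight: number): number {
  return Math.min(1, viewportHeight / targetHeight);
}

// A threshold can only ever be crossed if the target can reach that ratio.
function thresholdCanFire(
  threshold: number,
  targetHeight: number,
  viewportHeight: number,
): boolean {
  return maxIntersectionRatio(targetHeight, viewportHeight) >= threshold;
}

// A 4000px grid on a 700px-tall phone viewport tops out at a ratio of 0.175.
console.log(maxIntersectionRatio(4000, 700));
console.log(thresholdCanFire(0.25, 4000, 700)); // 25% of 4000px is 1000px
console.log(thresholdCanFire(0, 4000, 700));
```

On those numbers a 0.25 threshold asks for 1000 visible pixels on a 700-pixel screen, so the callback can never report it and the reveal never runs. A threshold of 0 fires as soon as any part of the target enters the viewport, whatever its height.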

Where it nearly fooled itself

And then it nearly shipped the wrong call anyway. Two of the audit agents, told explicitly to read and not write, edited files on disk. One of those stray edits then poisoned a downstream verdict: a verifier agent opened an already-mutated file, found the icon it was asked to check apparently still in use, and concluded the dead code was not dead. A clean majority of agents, reasoning over a contaminated working tree, was about to vote the right deletion off the list.

A confident agent reading a corrupted file is indistinguishable, on the surface, from a correct one.

What caught it was not another vote. It was the orchestrator noticing that an agent’s verdict contradicted the ground truth (a plain search of the repository), then resetting to a clean baseline, re-applying every change by hand so the final diff was fully intentional, and independently re-confirming that the code really was dead before deleting it. That is the second and third disciplines doing exactly the job we said they would have to: Discernment, reading the agent’s decision rather than trusting its confidence; and Diligence, refusing to act on a “clean tree” nobody had verified.
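A ground-truth check of this kind does not need to be clever. As a rough sketch, assuming the repository contents are already loaded into memory, it amounts to searching every file except the defining one for the identifier. The `findReferences` helper and the file paths are hypothetical; only the `ArrowUpRight` name comes from the commit list.

```typescript
// List the files, other than the one that defines it, that mention an
// identifier as a whole word. A plain text search, not a type-aware one.
function findReferences(
  identifier: string,
  files: Record<string, string>,
  definedIn: string,
): string[] {
  const escaped = identifier.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`\\b${escaped}\\b`);
  return Object.entries(files)
    .filter(([path, source]) => path !== definedIn && pattern.test(source))
    .map(([path]) => path)
    .sort();
}

// Clean baseline: nothing imports the icon, so it is dead.
const cleanTree: Record<string, string> = {
  "app/icons/ArrowUpRight.tsx": "export function ArrowUpRight() { return null; }",
  "app/routes/home.tsx": "export default function Home() { return null; }",
};

// Contaminated tree: a stray edit by a sibling agent reintroduced a usage.
const contaminatedTree: Record<string, string> = {
  ...cleanTree,
  "app/routes/home.tsx": "import { ArrowUpRight } from '../icons/ArrowUpRight';",
};

console.log(findReferences("ArrowUpRight", cleanTree, "app/icons/ArrowUpRight.tsx"));
console.log(findReferences("ArrowUpRight", contaminatedTree, "app/icons/ArrowUpRight.tsx"));
```

The same search gives opposite answers on the two trees, which is the whole problem: the check is only as trustworthy as the baseline it runs against, so it has to run after the reset, not before.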

The lesson generalizes, and it is the one piece of this we would put in front of the people building these tools. Audit agents need a hard read-only boundary the harness enforces, not a polite instruction in a prompt. And majority agreement across agents is not truth: when their inputs can be silently corrupted by a sibling, a vote only launders the error. An agent’s verdict has to be treated as advice, checked against the world, never as a fact. The orchestrator that does that checking is not overhead. On this run it was the only thing standing between a good sprint and a quietly broken one.
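Short of a harness that truly enforces read-only access, an orchestrator can at least detect a violation before any downstream agent reads the tree. The sketch below is one way to do that, assuming the orchestrator captures the output of `git status --porcelain` in the worktree after each audit agent finishes. The helper names and agent label are hypothetical, and the parser ignores git's quoting of unusual paths.

```typescript
// Parse `git status --porcelain` (v1) output into the paths it reports.
// Each line is two status characters, a space, then the path; renames are
// written as "old -> new", in which case we keep the new path.
function dirtyPaths(porcelain: string): string[] {
  return porcelain
    .split("\n")
    .filter((line) => line.length > 3)
    .map((line) => {
      const path = line.slice(3);
      const arrow = path.indexOf(" -> ");
      return arrow === -1 ? path : path.slice(arrow + 4);
    });
}

// Fail loudly if a supposedly read-only agent left the tree modified.
function assertReadOnly(agentLabel: string, porcelain: string): void {
  const touched = dirtyPaths(porcelain);
  if (touched.length > 0) {
    throw new Error(`${agentLabel} modified the tree: ${touched.join(", ")}`);
  }
}

// Example: output captured after an auditor that was told not to write.
const statusAfterAudit = " M app/components/Icon.tsx\n?? notes.tmp\n";
console.log(dirtyPaths(statusAfterAudit));

try {
  assertReadOnly("auditor-2", statusAfterAudit);
} catch (err) {
  console.log((err as Error).message);
}
```

This is detection, not prevention, so it is weaker than the hard boundary argued for above. Its value is in ordering: run it between the audit and verify stages and a contaminated tree stops the pipeline instead of feeding a verdict.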

What to do with this

The sprint is live. It merged as a single pull request, it is reversible in one command, the design is pixel-for-pixel unchanged, and the site feels noticeably snappier on both phones and desktops in production. We are being deliberately careful with that last claim: the real gains are at Vercel’s edge, where they do not show up on a local machine, so the defensible numbers are field data we are still collecting, not a benchmark we ran on a loopback and dressed up. The mechanism is sound and the feel is real; the figures will follow, honestly or not at all.

For an operator in Muscat watching this from the outside, the transferable part is not “use the new model.” It is the shape of the day. A capable workflow did hours of audit labour in minutes and surfaced a fix worth real money at scale. It also made a mistake that no amount of model quality would have caught on its own. The difference between those two outcomes was a human-held brief and a human-held review: the disciplines, not the model. That is the whole of what we do, and we just watched it hold up on our own codebase before we asked anyone to trust it with theirs.

If you want a sprint like this run inside your business, scoped, reversible, and judged by someone who reads the diff, start a conversation.

References

Anthropic. Orchestrate subagents at scale with dynamic workflows. 28 May 2026. claude.com/blog/introducing-dynamic-workflows-in-claude-code
Vercel. Vercel CDN Cache. 2026. vercel.com/docs/caching/cdn-cache
web.dev. Optimize resource loading with the Fetch Priority API. web.dev/articles/fetch-priority
web.dev. Web Vitals. web.dev/articles/vitals
Orfloat. Claude Opus 4.8, and the discipline it asks for. 28 May 2026. orfloat.com/notes/opus-4-8-discipline