# orfloat

> Full dossier: the studio brief, then every published artifact (research, notes, practice) as a self-contained markdown document with its own frontmatter and provenance. Each also lives at its own URL with a `.md` extension; the index is https://orfloat.com/llms.txt.

---
title: orfloat studio brief
summary: A compact, pasteable overview of Orfloat for AI assistants and quick readers. Identity, the three layers, the method, the practice, the proof, and how to engage.
---

# orfloat: studio brief

> A compact, pasteable overview of Orfloat for AI assistants and quick readers.
> Every page on the site has a clean markdown twin at the same path with `.md` appended.
> The machine index is https://orfloat.com/llms.txt; the full concatenated dossier is https://orfloat.com/llms-full.txt.

## 1. What Orfloat is

Orfloat is a forward-deployed applied-AI engineering lab. We embed and deeply integrate AI inside a business, learn the work end to end, and ship systems built on Claude and the Anthropic primitives around it. The lab is a studio brand of Afraa & Mufassir LLC, registered in the Sultanate of Oman, operating from Muscat across the GCC.

Orfloat is a lab that publishes, not a studio that markets. The discipline that governs everything here is evidence over declaration: we do not assert craft, we ship artifacts that make it self-evident. Every published claim is backed by something verifiable: a repo, a commit range, an eval, or a dated measure with an honest account of what is withheld and why.

Two founders, two brothers. Akram Ahmed, co-founder and CTO, is a trained software architect who leads engineering: system architecture, evaluations, and the day-to-day forward-deployed work inside client operations. He is a daily Claude Code user across the lab's own codebase, this website included. Mufassir Ahmed, co-founder and CEO, leads commercial, engagement, and client relationships across the operating footprint, and is the director of Afraa & Mufassir LLC.

Independence: Orfloat builds on Anthropic's published primitives and is not affiliated with, endorsed by, or partnered with Anthropic PBC.

## 2. The three layers

The site is organised as a lab that publishes, in three registers:

- research: what we build and find. Empirical, frontier-applied work, shown with its workings. Open by default; some pieces are Sealed, private client work shown by dated measures and an explicit disclosure of what is held back.
- notes: what we think. Interpretation of the frontier, theses and commentary, signed by the studio rather than by a person.
- practice: what we do for the few. The offer, the bar a fit has to clear, how an engagement runs, the proof behind the claim, and the one way in.

Read more at https://orfloat.com/research, https://orfloat.com/notes, and https://orfloat.com/practice.

## 3. The method

The lab holds every engagement to four disciplines, adapted from Anthropic's published applied-AI practice. They are the difference between an interesting demo and a system a business can stake itself on.

- Delegation: move the right work to the model, and only that work: the tasks where Claude is faster, calmer, or more consistent than the person doing them today.
- Description: spell the work out as if for a thoughtful new colleague. Context, constraints, evidence, the shape of a good answer. Tools and Skills are the vocabulary.
- Discernment: read the output the way a senior reads a junior. Calibrate trust to the task. Build the eval before the agent. Notice drift early.
- Diligence: stay in the loop. Treat safety, privacy, and reversibility as primary constraints, not afterthoughts. Operate the system, do not just deploy it.

A few principles sit underneath: discovery before deployment, evals before agents, human-in-the-loop by default, privacy and data dignity (designed with Oman's PDPL in mind), and telling the truth about AI even when it costs the engagement.

## 4. What we actually do

A narrow practice, deeply done. Orfloat does one thing: forward-deployed applied-AI engineering for family-led and founder-run businesses across Oman and the wider GCC. The work is sector-agnostic because AI is horizontal; the engagement model is not. We embed and integrate AI inside the business, we do not advise from a distance.

An engagement begins with an embedded discovery phase, on-site and evidence-based: shadow the work, map the operation, find where Claude earns its place. From discovery comes a scoped, milestone-based agreement built from what we found, never a templated SOW. Then forward deployment: embedded, integrated work on a weekly cadence, evals designed before deployment, and coverage through the first stretch of production.

The practice is laid out across five facets: what we do, who we work with, how we work, why us, and get in touch. Read more at https://orfloat.com/practice.

## 5. The proof

Evidence over declaration means the lab points at work, not adjectives.

- cc-dm is an open-sourced Claude Code plugin that lets parallel agent sessions message each other through a shared SQLite bus: no daemon, no ports, no network. Public repo, published package, tests. (https://orfloat.com/research/cc-dm/)
- Two Sealed case studies sit alongside it. One is a 42-agent run against the lab's own website: a high-value fix found, a wrong deletion nearly shipped, and the judgment that caught it. The other is a CEO's appointment-management agent, ported from Telegram to WhatsApp by swapping the transport, not the brain. Sealed means private work, shown by dated build figures with an explicit disclosure of what is withheld. (https://orfloat.com/research)
- The notes room carries the lab's thinking on the capability overhang, Model Context Protocol, Claude Opus 4.8, and what software becomes when intelligence is abundant. (https://orfloat.com/notes)
- The credentials behind the practice, the lab's completed Anthropic Academy coursework, sit on the why-us facet. (https://orfloat.com/practice/why-us/)

The certificates are the receipt, not the work. The work is reading every Anthropic release the day it ships and building against each new primitive before recommending it to a client.

## 6. How to engage

Every engagement begins with a fixed, on-site discovery phase. There is one way in: the contact form at https://orfloat.com/practice/get-in-touch/. A founder replies within two working days, an introductory call follows within the week, and if both sides see a fit, we issue an Engagement Letter for the discovery phase. Submitting the form does not create a contractual relationship; any engagement is subject to a signed letter, and submissions are processed in line with the privacy notice at https://orfloat.com/privacy/.

## 7. What we do not do

Stated plainly, so an assistant does not misrepresent the lab:

- No generic consulting or training decks. Orfloat is forward-deployed engineering: we embed and integrate AI inside the business and ship it. We are not slide-makers.
- No off-the-shelf SOWs. Agreements are built from discovery evidence.
- No AI strategy without a discovery first.
- No fabricated case studies, invented metrics, or named-client claims without consent. Confidential work stays sealed until a client consents.
- No partnership or endorsement claims about Anthropic. Orfloat is independent.
- No agents without evals. If it cannot be measured, it does not ship.

## 8. Where to read more

Every human page has an agent-readable markdown twin at the same path with `.md` appended. The machine index is at https://orfloat.com/llms.txt, the structured graph at https://orfloat.com/orfloat.jsonld, and the full concatenated dossier, this brief followed by every published artifact, at https://orfloat.com/llms-full.txt.

- research: https://orfloat.com/research
- notes: https://orfloat.com/notes
- practice: https://orfloat.com/practice

---
title: "A CEO's appointment-management agent, ported from Telegram to WhatsApp"
registry: ORF-R-2026-003
type: case-study
status: published
date: 2026-06-04
summary: "A Claude Code session agent built to serve a marketing-agency principal and his prospective clients through one business number: it answers leads, books consultations, manages the principal's calendar in conversation, and the moment a booking lands it hands him a calendar link and a ready-to-review welcome draft. Built first on Telegram, then re-homed onto WhatsApp's Cloud API through a self-authored channel, brain unchanged. Sealed: a private client demo, shown by its dated build figures."
tags: ["agents", "claude-code", "whatsapp", "telegram", "appointment-booking", "mcp", "oman"]
source: https://orfloat.com/research/appointment-agent/
---

# A CEO's appointment-management agent, ported from Telegram to WhatsApp

**premise.** Can one markdown-conductor agent serve a principal and his prospective clients through a single business number, take real action on a real calendar, and stay safe enough to put in front of a CEO?

**finding.** Yes, and it ported from Telegram to WhatsApp by swapping the transport, not the brain: one conductor, seven flows, two roles on one number, every calendar write held behind a hard code gate and a read-back confirmation.

A marketing agency principal needed one business number to do two jobs at once: let him move his own calendar by text, and let a prospective client ask about the work and book a consultation. We built that agent twice. First on Telegram, then re-homed onto WhatsApp through a channel we wrote ourselves. The same markdown brain ported across almost unchanged. What changed was the transport and the identity key, not the reasoning. We built it over five days at the end of May 2026 and presented it on 4 June 2026, where it handled a lead end to end live in the room, booked into a real calendar behind a gate that code refuses to open without permission, and produced the post-booking brief on the spot. The client stays withheld; this is filed under the Sealed tier for that reason.

## Two builds, one brain

The agent is a Claude Code session. The running session is the entire brain: there is no separate hosted model call behind it, no orchestration tier deciding what to say. The markdown the session loads is the program.

One conductor file, around 120 to 150 lines, routes every inbound message in two steps. The first step is identity. Before the agent reads a single word, the channel has already resolved who is speaking into a role: the principal, or a lead. The agent never sees a raw phone number and never reasons about one. It is told it is talking to its principal or to a prospective client, and that fact selects which half of its instructions are live. The second step is intent. Within a role, the conductor routes the message to one small, single-purpose flow file: book a consultation, answer a question about services, read the calendar, propose a reschedule. Roughly seven flows per interface, each narrow enough to read in one sitting. Around the flows sit two more kinds of file. Knowledge lives in about five context files, read only when a flow needs them, so the agent quotes what the agency actually offers rather than improvising. Voice lives in about five files that fix tone, so the warmth of a first hello and the precision of a calendar confirmation are written down, not left to chance.

<figure>

```
  inbound message
        │
        ▼
  [ channel ]    resolve identity  →  role: principal or lead
        │                             the agent never sees a phone number
        ▼
  [ conductor ]  route by intent
        │
        ├─ lead · question    →  answer from the context files
        ├─ lead · booking     →  propose a slot → read-back → confirm
        ├─ principal · read   →  show the calendar
        └─ principal · write  →  read-back → confirm
                                      │
                                      ▼
                        [ gated calendar write ]
                        hook blocks any event title without [DEMO]
                                      │
                                      ▼
                        post-booking brief to the principal
                          · tappable calendar link
                          · welcome draft, held for review
```

<figcaption>Figure 1. one inbound path. the channel resolves identity to a role before the agent reads a word, the conductor routes by intent, and a booking ends in a gated calendar write and the post-booking brief. no calendar write happens without the demo-prefix hook and a read-back confirmation.</figcaption>

</figure>

The shape matters because it is auditable. Identity routes to a role, a role routes to an intent, an intent loads one flow plus the knowledge and voice it needs. Every path through the agent is a short, named file a person can read in full. There is no large opaque prompt to trust. A brief you can hold in your head is a brief you can hand a CEO's calendar.
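
A minimal sketch of that routing shape, in TypeScript. The real conductor is a markdown file the session reads, not code, so this only models the same two-step lookup as plain data. Every file name and intent label below is hypothetical, chosen to mirror Figure 1.

```typescript
// Sketch only: the real conductor is markdown read by the session.
// File names and intent labels are hypothetical.

type Role = "principal" | "lead";
type Intent = "question" | "booking" | "read" | "write";

interface Flow {
  file: string;      // the single-purpose flow file to load
  context: string[]; // knowledge files, read only when this flow needs them
  voice: string[];   // tone files for this flow
}

// Role first, then intent: a role only ever reaches its own half of the table.
const ROUTES: Record<Role, Partial<Record<Intent, Flow>>> = {
  lead: {
    question: {
      file: "flows/lead-question.md",
      context: ["context/services.md"],
      voice: ["voice/first-hello.md"],
    },
    booking: {
      file: "flows/lead-booking.md",
      context: ["context/availability.md"],
      voice: ["voice/confirmation.md"],
    },
  },
  principal: {
    read: { file: "flows/principal-read.md", context: [], voice: ["voice/brief.md"] },
    write: { file: "flows/principal-write.md", context: [], voice: ["voice/confirmation.md"] },
  },
};

// Returns the flow for a (role, intent) pair, or null when that role has no such flow.
// Note what is absent: no phone number ever reaches this function, only a resolved role.
function route(role: Role, intent: Intent): Flow | null {
  return ROUTES[role][intent] ?? null;
}
```

In this model a lead asking for a calendar write resolves to `null`: the path does not exist for that role, so there is nothing for a prompt to be talked into.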

We built the first version on Telegram, through the official Claude Code Telegram channel, keying identity on the Telegram chat id. Then we re-homed it onto WhatsApp's official Business Cloud API, keying identity on the WhatsApp id instead. The migration is the interesting part: the conductor, the flows, the context, the voice, the read-back rules, all of it ported across almost untouched. The brain did not care what carried the message. We swapped the transport underneath it and re-pointed one routing key, and the reasoning came along whole. That portability is itself the evidence that the discipline lives in the agent's design, not in any one messaging surface.

What carried it on WhatsApp is a small channel plugin we wrote ourselves, a Bun and TypeScript server in the same Channels craft we open-sourced as [cc-dm](/research/cc-dm/). Inbound, a Meta webhook hits a public tunnel; the channel performs the verify handshake, checks the `X-Hub-Signature-256` HMAC signature, applies an allowlist gate, and only then injects the message into the running session. Outbound, the agent calls a reply tool that POSTs to the Graph API. The security checks sit at the edge rather than in the model. Calendar and email reach the agent through user-scoped MCP connectors, Google Calendar and Gmail, the principal's own, so the agent acts as the principal and never as a shared service account.
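
The three inbound checks can be sketched as pure functions. This is an illustration under stated assumptions, not the channel's actual code: the function names are ours for the example, and the server wiring is omitted. The `hub.mode`, `hub.verify_token`, and `hub.challenge` parameters and the `sha256=` prefix follow Meta's documented webhook contract.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// GET handshake: echo hub.challenge back only when the mode and token match.
function verifyHandshake(query: URLSearchParams, verifyToken: string): string | null {
  const ok =
    query.get("hub.mode") === "subscribe" &&
    query.get("hub.verify_token") === verifyToken;
  return ok ? query.get("hub.challenge") : null;
}

// POST check: recompute HMAC-SHA256 over the raw body with the app secret
// and compare against the header value in constant time.
function hasValidSignature(rawBody: string, header: string | null, appSecret: string): boolean {
  if (header === null || !header.startsWith("sha256=")) return false;
  const expected = createHmac("sha256", appSecret).update(rawBody, "utf8").digest();
  const received = Buffer.from(header.slice("sha256=".length), "hex");
  // timingSafeEqual throws on unequal lengths, so check length first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Allowlist gate: only known sender ids get through.
function isAllowedSender(senderId: string, allowlist: ReadonlySet<string>): boolean {
  return allowlist.has(senderId);
}

// A message is injected into the session only when both edge checks pass.
function shouldInject(
  rawBody: string,
  signatureHeader: string | null,
  senderId: string,
  appSecret: string,
  allowlist: ReadonlySet<string>,
): boolean {
  return hasValidSignature(rawBody, signatureHeader, appSecret) && isAllowedSender(senderId, allowlist);
}
```

The signature must be computed over the raw request bytes, before any JSON parsing, or a re-serialised body will fail the comparison.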

## The moment it stops answering and starts acting

A chatbot answers. An agent takes the next real action and then knows where to stop. The signature moment here is the post-booking brief. The instant a lead confirms a consultation, the agent does not simply say "booked." It proactively pushes the principal a tappable link to the new calendar event and a Gmail welcome message, pre-composed for that lead. The welcome is a draft. It is never auto-sent. The agent writes it, files it, and stops at the human-review line. The principal reads, edits if needed, sends. The agent has done the work right up to the edge of a decision a human should own, and then held.

That restraint is the whole posture. An agent that books a meeting is useful. An agent that drafts the follow-up and then waits for a person is trustworthy.

<figure>

```
  WHO HANDLES EACH STEP OF A BOOKING       before         after
    answer the question                    principal  →  agent
    propose a slot                         principal  →  agent
    create the calendar event              principal  →  agent   (gated)
    draft the follow-up                    principal  →  agent   (held)
    approve and send                       principal  →  principal    ← stays human

  PRINCIPAL TIME PER BOOKING   (illustrative, not measured field data)
    end to end, by hand   ████████████   ~12 min
    review only           ██             ~2 min
```

<figcaption>Figure 2. what moves, and what does not. the agent absorbs the receptionist steps and stops at the one a human should own. the per-booking minutes are illustrative of the mechanism, not measured field data.</figcaption>

</figure>

## The boring parts, enforced in code

The model can already hold a conversation and call a tool. None of that is what makes an agent safe to hand a CEO's live calendar. The discipline around it is, and the discipline only counts when it is enforced in code rather than hoped for in a prompt.

Before the agent writes anything to the calendar, it reads the change back: it restates the date, the time, and who the slot is for, and waits for an explicit "yes." Nothing mutates on an implied confirmation. Read-back-confirm is a single rule with an outsized effect, because the failure mode of a calendar agent is not refusing to book; it is booking the wrong thing silently.
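
The read-back rule can be sketched as a two-step state: a proposal is held as pending, and only an explicit confirmation releases it. This is an illustrative model, not the agent's flow file. The accepted phrases and the field names are assumptions made for the example.

```typescript
interface ProposedWrite {
  date: string;     // e.g. "2026-06-04"
  time: string;     // e.g. "15:00"
  attendee: string; // who the slot is for
}

interface Pending {
  proposal: ProposedWrite;
  readBack: string;
}

// Restate the date, the time, and who the slot is for.
function propose(p: ProposedWrite): Pending {
  return {
    proposal: p,
    readBack: `To confirm: ${p.date} at ${p.time}, for ${p.attendee}. Reply "yes" to book.`,
  };
}

// Only an explicit yes counts. The accepted phrases here are hypothetical;
// anything softer ("sounds good", "maybe") is treated as no confirmation.
function isExplicitYes(reply: string): boolean {
  return /^(yes|yes please|confirm)[.!]?$/i.test(reply.trim());
}

// Returns the write to perform, or null when nothing may mutate.
function resolve(pending: Pending | null, reply: string): ProposedWrite | null {
  if (pending === null) return null; // a "yes" with nothing read back confirms nothing
  return isExplicitYes(reply) ? pending.proposal : null;
}
```

The asymmetry is deliberate: a missed confirmation costs one more message, while a wrongly inferred one books the wrong thing silently.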

Then there is the gate. During a demo, every event the agent creates must carry a `[DEMO]` prefix in its title, so a real calendar never fills with test bookings that look real. We did not trust the model to remember that. A `PreToolUse` hook inspects every calendar-create call and hard-blocks any event whose title lacks the prefix. If the model forgets, the tool call is refused at the boundary and it has to retry. That is the distinction the lab keeps returning to: a safety property the harness enforces, not an instruction the prompt hopes holds. Prompts drift across a long conversation; a hook does not.
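
A sketch of the decision such a hook makes, written as a pure function so it can be pinned by tests. This is not the lab's hook. The tool name and the `summary` title field are hypothetical stand-ins, since the real connector's names are not given here.

```typescript
// The part of a PreToolUse payload this sketch reads.
interface ToolCall {
  tool_name: string;
  tool_input: Record<string, unknown>;
}

type Decision = { allow: true } | { allow: false; reason: string };

const DEMO_PREFIX = "[DEMO]";
// Hypothetical tool name: the real calendar connector may expose a different one.
const CALENDAR_CREATE_TOOL = "mcp__google_calendar__create_event";

function checkDemoGate(call: ToolCall): Decision {
  // Calls to other tools are none of this gate's business.
  if (call.tool_name !== CALENDAR_CREATE_TOOL) return { allow: true };

  // Assumed field name for the event title.
  const title = call.tool_input["summary"];
  if (typeof title === "string" && title.startsWith(DEMO_PREFIX)) {
    return { allow: true };
  }
  // A missing or non-string title is blocked too: the gate fails closed.
  return { allow: false, reason: `Blocked: event title must start with ${DEMO_PREFIX}` };
}
```

As we understand Claude Code's hook contract, a hook script would read the call as JSON from stdin and, on a block, write the reason to stderr and exit with code 2 so the model sees the refusal and retries. Treat that wiring as an assumption to check against the current hooks documentation.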

The third discipline is a privacy invariant. When a proposed slot collides with a real, non-demo event already on the principal's calendar, the agent says there is a conflict at that time and proposes another. It never reads back the conflicting event's title. A lead asking for a slot must not learn, from a leaked event name, who else the principal is meeting that day. The agent can see the calendar; it is constrained in what it is allowed to surface from it.
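
One way to hold that invariant is in the return type: the function that answers a lead can see event titles but has nowhere to put them. A minimal sketch, with hypothetical type and function names:

```typescript
interface CalendarEvent {
  title: string; // visible to the agent, never to the lead
  start: Date;
  end: Date;
}

interface Slot {
  start: Date;
  end: Date;
}

// Everything a lead may learn about a slot. There is no field for a title.
type SlotAnswer = { free: true } | { free: false; message: string };

// Half-open interval overlap: back-to-back slots do not collide.
function overlaps(a: Slot, b: Slot): boolean {
  return a.start.getTime() < b.end.getTime() && b.start.getTime() < a.end.getTime();
}

function answerLead(slot: Slot, events: readonly CalendarEvent[]): SlotAnswer {
  const busy = events.some((event) => overlaps(slot, event));
  return busy
    ? { free: false, message: "There is a conflict at that time. Could another slot work?" }
    : { free: true };
}
```

The message is a fixed string, so nothing from the conflicting event can leak into it by construction.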

Two smaller rules round out the posture. A time-context hook injects the real current time into the session every turn, so "tomorrow at three" resolves against the actual clock and never against something the model inferred from training data. And the agent speaks openly as an AI. It never adopts a fake human name to pass as the principal or a receptionist. Around nine such hard rules govern the agent, with twenty-three tests pinning the gate and the time scripts. The number that matters is not the count of rules but where they live: in code the model cannot talk its way past. A rule you can assert is a wish; a rule with a failing test behind it is a contract.
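
The time-context rule is small enough to sketch whole. The function below formats a given instant for injection into the session. The `Asia/Muscat` zone in the usage note and the output wording are assumptions for the example, not the lab's script.

```typescript
// Formats an instant as a line of context the session can read each turn.
// `now` is a parameter rather than read inside, so tests can pin it.
function timeContext(now: Date, timeZone: string): string {
  const parts = new Intl.DateTimeFormat("en-GB", {
    timeZone,
    weekday: "long",
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
    hour: "2-digit",
    minute: "2-digit",
    hourCycle: "h23",
  }).formatToParts(now);

  // Pull one named part out of the formatter's output.
  const get = (type: string): string =>
    parts.find((part) => part.type === type)?.value ?? "";

  const date = `${get("year")}-${get("month")}-${get("day")}`;
  const time = `${get("hour")}:${get("minute")}`;
  return `Current time: ${get("weekday")} ${date} ${time} (${timeZone})`;
}
```

A call such as `timeContext(new Date(), "Asia/Muscat")` would run once per turn. Formatting in the principal's zone rather than UTC matters near midnight, when "tomorrow" differs between the two.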

## What it did in the room

We presented the WhatsApp version to the principal on 4 June 2026, where it was accepted. In the room it handled a lead end to end: answered questions about the work, proposed a slot, read the change back, booked into a real calendar behind the demo gate, and produced the post-booking brief live, the calendar link and the held welcome draft both landing for the principal as designed. We will not dress that up with a quote. It worked, in front of the people it was built for, on the first sitting.

## What transfers

The frontier moved the hard part off the model. The model holds a conversation and calls tools out of the box, and that was never what we would put in front of an operator deciding whether to trust an agent. What makes an agent safe to hand a CEO's calendar is the ring of disciplines around it: a gate enforced in code, a read-back before every write, a privacy invariant on conflicts, a real clock injected each turn, and a brief small enough to read end to end. The agent worked in the room because the boring parts were enforced, not asserted. That is evidence over declaration, applied to an agent rather than a website, and it is why we file the mechanism and withhold the client. We would rather show you a hook that refuses a bad write than tell you our prompt is careful.

If you have a workflow you would trust an agent with only if the guardrails were real, built small enough to read and gated hard enough to trust, that is the engagement we want. [Start a conversation](/practice/get-in-touch/), and we will scope it.

## References

1. Meta. *WhatsApp Business Cloud API.* [developers.facebook.com/docs/whatsapp/cloud-api](https://developers.facebook.com/docs/whatsapp/cloud-api)
2. Anthropic. *Claude Code.* [claude.com/claude-code](https://claude.com/claude-code)
3. Anthropic. *Model Context Protocol.* [modelcontextprotocol.io](https://modelcontextprotocol.io)
4. Orfloat. *cc-dm: peer-to-peer messaging between Claude Code sessions.* [orfloat.com/research/cc-dm](/research/cc-dm/)

## provenance
- tier: sealed
- built: 27 to 31 May 2026
- interfaces: 2 (Telegram, then WhatsApp) (as of 2026-05-31)
- roles on one number: 2 (principal, lead) (as of 2026-05-31)
- conversational flows: 7 per interface (as of 2026-05-31)
- knowledge and voice files: 10 per interface (5 context, 5 voice) (as of 2026-05-31)
- enforced hard rules: 9, one a hard-blocking calendar gate (as of 2026-05-31)
- gate and time-context tests: 23 (as of 2026-05-31)
- commits across two repos: 43 (as of 2026-05-31)
- presented and accepted: 4 Jun 2026 (as of 2026-06-04)
- disclosure: Built as a live client demo on two private repositories, so it is not publicly linkable; the client, the principal, and their data are withheld by obligation. The figures here are the repositories' own dated output, read from the filesystem and git history and abstracted to the technique; the commercial terms are out of scope by design.

---
title: "Forty-two agents on our own codebase, and the call they couldn't make"
registry: ORF-R-2026-002
type: case-study
status: published
date: 2026-05-29
summary: "We pointed Claude Opus 4.8 and Claude Code's new dynamic workflows at our own website: 42 agents, about 48 minutes, one production deploy. The workflow surfaced a genuinely high-value fix and then nearly fooled itself. The judgment that caught it is the whole point."
tags: ["dogfooding", "workflows", "agents", "performance", "oman"]
source: https://orfloat.com/research/dogfooding-the-workflow/
---

# Forty-two agents on our own codebase, and the call they couldn't make

**premise.** The claim, run against the one codebase we are free to break: our own.

**finding.** Forty-two agents, about forty-eight minutes end to end: a high-value edge-cache fix found, and a wrong deletion nearly shipped that the model could not catch alone. The brief and the review were the difference.

Yesterday we argued that the frontier has moved the binding constraint off the model and onto the disciplines around it: describing the work, and judging what comes back. It is one thing to write that. So the next morning we pointed Claude Opus 4.8 and Claude Code's new dynamic workflows at this website and let them run an end-to-end performance sprint, all the way to production. The workflow found a fix we should have caught months ago. It also nearly talked itself into shipping a mistake. Both halves are the point.

## The claim, and the test

The note we published the day before, [Claude Opus 4.8, and the discipline it asks for](/notes/opus-4-8-discipline/), made a specific claim: a model that polices its own output, paired with orchestration that runs hundreds of cross-checking agents, stops being the bottleneck. What is left is the quality of the brief you hand it and the judgment you apply to what it returns. We did not want that to stay a thesis, so we ran it against the one codebase we are free to break: our own.

The instrument was a [dynamic workflow](https://claude.com/blog/introducing-dynamic-workflows-in-claude-code): a script Claude Code writes and a runtime executes in the background, orchestrating subagents at scale rather than working turn by turn. The brief was deliberately narrow: backend and platform performance only. Work on an isolated git worktree, never touch the main branch as the working tree, open a clean pull request, then merge. Freeze the design: no layout, no copy, no brand tone, no CSS tokens, and explicitly do not touch the *Reveal* scroll-animation observer this project has a documented history with. Pixel for pixel, the rendered site had to come out identical. The whole engagement lived or died on that one paragraph of constraints. That paragraph is Description, the first discipline, and it was the actual product of the morning.

We set the session to *ultracode*, which pairs maximum reasoning effort with automatic workflow orchestration, and let it go. The analysis came back in about fourteen minutes: a six-auditor fan-out feeding a three-lens adversarial review, twelve candidate changes surfaced and cross-examined. Implementation and verification followed, and the whole run, end to end, took about forty-eight minutes, spun up forty-two agents, and spent roughly 1.57 million tokens.

<figure>

```
  brief: backend performance only, design frozen
        │
        ▼
  6 auditors  ─▶  3-lens adversarial review  ─▶  implement  ─▶  verify
  (read-only)     (cross-examine 12 finds)        (4 commits)    (re-confirm)

  42 agents end to end  ·  ~48 min (14 to first analysis)  ·  ~1.57M tokens
```

<figcaption>Figure 1. the run as a pipeline. six auditors fan out, a three-lens review cross-examines twelve candidate changes, then implement and verify. forty-two agents, about forty-eight minutes, on an isolated worktree.</figcaption>

</figure>

## What the run found

Six changes survived review and shipped across four commits on a single pull request, reversible in one command:

```
perf(headers): scope agent-surface Vary to Accept-Encoding
perf(assets): prioritise the hero serif font preload
fix(contact): time-box the Resend send and guard malformed form bodies
chore(components): drop dead WhyNow/WhyUs components and ArrowUpRight icon
```

The best of them was not in the application code at all: it was a single token in a response header. Our agent-readable routes (the `.md` twins, `llms.txt`, the JSON-LD) were serving `Vary: Accept, Accept-Encoding, User-Agent`. The edge already keys on encoding by default; the load-bearing mistake was the `User-Agent` token, which told the cache to store a separate copy for every distinct browser string, for responses that are byte-identical across all of them. As Vercel's own guidance puts it, [each header you vary on multiplies the number of cache entries](https://vercel.com/docs/caching/cdn-cache), and a multiplier keyed on user-agent is effectively unbounded. We scoped the header down to the one dimension these responses actually differ on:

```diff
// vercel.json, on the agent-readable routes (.md twins, llms.txt, JSON-LD)
-  { "key": "Vary", "value": "Accept, Accept-Encoding, User-Agent" }
+  { "key": "Vary", "value": "Accept-Encoding" }
```

It is the kind of bug that can sit in a header unnoticed for a long time, because nothing is visibly broken; the site just runs colder at the edge than it should. Ours had been there only a few days, introduced with an AI-crawler allowlist we never re-examined for its cache cost.
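
The multiplier is easy to see in a toy model. The sketch below is a simplification, not Vercel's cache: it assumes one stored entry per distinct combination of the request-header values named in `Vary`. The user-agent strings are made up.

```typescript
type RequestHeaders = Record<string, string>;

// Toy cache key: the URL plus the value of each header listed in Vary.
function cacheKey(url: string, vary: readonly string[], request: RequestHeaders): string {
  const dims = vary.map((name) => {
    const key = name.toLowerCase();
    return `${key}=${request[key] ?? ""}`;
  });
  return [url, ...dims].join("|");
}

// How many separate copies a set of requests would leave in the cache.
function countEntries(url: string, vary: readonly string[], requests: readonly RequestHeaders[]): number {
  return new Set(requests.map((request) => cacheKey(url, vary, request))).size;
}

// Four requests for the same byte-identical resource.
const requests: RequestHeaders[] = [
  { "accept": "*/*", "accept-encoding": "gzip", "user-agent": "crawler-a/1.0" },
  { "accept": "*/*", "accept-encoding": "gzip", "user-agent": "crawler-b/2.3" },
  { "accept": "*/*", "accept-encoding": "gzip", "user-agent": "browser-c/120" },
  { "accept": "*/*", "accept-encoding": "br", "user-agent": "browser-c/120" },
];

const entriesBefore = countEntries("/llms.txt", ["Accept", "Accept-Encoding", "User-Agent"], requests);
const entriesAfter = countEntries("/llms.txt", ["Accept-Encoding"], requests);
console.log({ entriesBefore, entriesAfter });
```

With `User-Agent` in the key, every new browser string is a cache miss and a new stored copy. With `Accept-Encoding` alone, the entry count is capped by the handful of encodings in use.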

The rest were quieter. The hero headline is set in our display serif, and its preload was competing with the body-sans preload under bandwidth contention; the run gave the serif alone `fetchPriority="high"`, which [lifts it in the download queue](https://web.dev/articles/fetch-priority) so the [largest paint](https://web.dev/articles/vitals) is not gated on a font swap:

```jsx
// app/root.tsx: the display serif gets priority, the body sans does not
<link rel="preload" href="/fonts/SourceSerif4-Variable.woff2"
      as="font" type="font/woff2" crossOrigin="anonymous"
      fetchPriority="high" />
<link rel="preload" href="/fonts/Geist-Variable.woff2"
      as="font" type="font/woff2" crossOrigin="anonymous" />
```

The contact form's upstream send got a hard eight-second timeout and a guard around malformed submissions, both error-path only, the visible form untouched:

```tsx
// app/routes/contact.tsx: bound a hung upstream, leave the form as it was
const res = await fetch("https://api.resend.com/emails", {
  method: "POST",
  // ...headers and body unchanged...
  signal: AbortSignal.timeout(8000),
});
```
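
The two error paths around that call can be sketched as small helpers. These are illustrative only: the field names (`name`, `email`, `message`) and the validation rules are assumptions, and the route's real guard is not shown in this write-up. What is standard is that a fetch aborted by `AbortSignal.timeout` rejects with an error named `TimeoutError`.

```typescript
// Hypothetical shape of a contact submission.
interface ContactSubmission {
  name: string;
  email: string;
  message: string;
}

type Guarded =
  | { ok: true; value: ContactSubmission }
  | { ok: false; status: 400 };

// Reject anything that is not a plain object with the three string fields.
// Input is `unknown` because a request body can be anything at all.
function guardSubmission(input: unknown): Guarded {
  if (typeof input !== "object" || input === null) return { ok: false, status: 400 };
  const { name, email, message } = input as Record<string, unknown>;
  if (typeof name !== "string" || typeof email !== "string" || typeof message !== "string") {
    return { ok: false, status: 400 };
  }
  if (!email.includes("@") || message.trim() === "") return { ok: false, status: 400 };
  return { ok: true, value: { name, email, message } };
}

// True when a caught error came from AbortSignal.timeout firing.
function isTimeout(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "TimeoutError"
  );
}
```

A route would run `guardSubmission` before the send and check `isTimeout` in the `catch` around it, returning an error response in both cases without changing the form a visitor sees.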

And two genuinely dead components plus an unused icon were deleted: two hundred and eleven lines removed and nothing added.

The bundle barely moved, and that is the honest headline, not a disappointment: the dead code was already tree-shaken out of the shipped bundle, so removing it cleaned the source without changing the payload. Client JavaScript went from 146,258 to 145,929 gzipped bytes. The entry chunk, the runtime, and the entire stylesheet kept byte-identical content hashes through the whole sprint. That invariance is the proof we actually wanted: the site got faster at the edge and in feel, and not one rendered byte changed to get there.

The shape of that diff is worth pausing on. The number people reach for is lines added; the one that matters is lines removed. Anyone can add. The discipline that takes years to learn, and that no model reaches for unprompted, is taking away what should never have been there while leaving the behavior exactly as it was. A net negative diff the build cannot tell from a no-op is not a small result dressed up as one. In engineering it is the hallmark: the codebase got lighter, and nothing it does changed.

Step back from the diff to the calls behind it. Twelve candidates surfaced; six shipped, six were cut. The table groups the shipped changes by commit and lists four of the kills:

| candidate | call | why |
| --- | --- | --- |
| scope Vary to Accept-Encoding | ship | unbounded user-agent cache key |
| hero serif fetchPriority | ship | largest paint gated on a font swap |
| time-box Resend, guard the body | ship | bound a hung upstream |
| drop the dead components and unused icon | ship | 211 lines out, behaviour unchanged |
| preload the italic serif | kill | 130KB onto the critical path |
| strip the loader headers | kill | crawlers need them, a documented paper trail |
| switch to a newer build target | kill | a no-op, the bundler already emits modern output |
| share one Reveal observer | kill | high risk on a documented constraint |

The next section takes four of the six kills one by one.

## What it talked itself out of

Six of the twelve candidates were killed in review, and the discards are better evidence of the workflow's worth than the keeps. A fast pass would have shipped all twelve. This one argued itself out of the tempting-but-wrong ones, with specific reasons:

- **Preloading the italic serif.** It would have put roughly 130KB on the critical path, contending with the actual largest-paint font. A speed change that costs speed.
- **Stripping the loader headers.** Rejected because its premise was false: a prior commit in our own history documents that AI crawlers need those headers to accept the response. The workflow read the paper trail and stood down.
- **Switching the build target to a newer baseline.** A no-op: the bundler already emits modern output. No change, no merge.
- **Sharing one Reveal observer across the page.** High risk, low reward, on the exact file our project memory flags with a non-obvious correctness constraint. It left the file alone, exactly as the note in our own repo demands:
  ```
  threshold: 0 is deliberate — non-zero thresholds break on tall mobile
  grids where N% of target height exceeds viewport height. do not change it.
  ```

That is taste, and taste is what makes more agents worth more rather than just louder. The value was never the raw output. It was the adversarial layer deciding what not to keep.

## Where it nearly fooled itself

And then it nearly shipped the wrong call anyway. Two of the audit agents, told explicitly to read and not write, edited files on disk. One of those stray edits then poisoned a downstream verdict: a verifier agent opened an already-mutated file, found the icon it was asked to check apparently still in use, and concluded the dead code was not dead. A clean majority of agents, reasoning over a contaminated working tree, was about to vote the right deletion off the list.

> A confident agent reading a corrupted file is indistinguishable, on the surface, from a correct one.

What caught it was not another vote. It was the orchestrator noticing that an agent's verdict contradicted the ground truth (a plain search of the repository), then resetting to a clean baseline, re-applying every change by hand so the final diff was fully intentional, and independently re-confirming the code really was dead before deleting it. That is the third and fourth disciplines doing exactly the job we said they would have to: Discernment, reading the agent's decision rather than trusting its confidence; and Diligence, refusing to act on a "clean tree" nobody had verified.

The lesson generalizes, and it is the one piece of this we would put in front of the people building these tools. Audit agents need a hard read-only boundary the harness enforces, not a polite instruction in a prompt. And majority agreement across agents is not truth: when their inputs can be silently corrupted by a sibling, a vote only launders the error. An agent's verdict has to be treated as advice, checked against the world, never as a fact. The orchestrator that does that checking is not overhead. On this run it was the only thing standing between a good sprint and a quietly broken one.
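One way to make that lesson concrete is a gate the orchestrator runs before it acts on any audit verdict. The sketch below is a minimal illustration, not the harness we ran: `TreeSnapshot`, `changedPaths`, and `acceptVerdict` are hypothetical names, each snapshot is assumed to map a file path to a content hash taken before and after the agents run, and `groundTruthInUse` stands in for the plain repository search. It only detects a breached read-only boundary after the fact; enforcing one remains the harness's job.

```typescript
// Hypothetical types: a snapshot maps file path -> content hash.
type TreeSnapshot = Record<string, string>;

// What an audit agent reported about one symbol.
type Verdict = { symbol: string; agentSaysInUse: boolean };

type Decision =
  | { action: "delete" }
  | { action: "keep" }
  | { action: "halt"; reason: string };

// Paths that were added, removed, or modified between two snapshots.
function changedPaths(before: TreeSnapshot, after: TreeSnapshot): string[] {
  const all = Object.keys(before).concat(Object.keys(after));
  return all
    .filter((p, i) => all.indexOf(p) === i && before[p] !== after[p])
    .sort();
}

// Treat the verdict as advice: act only if the tree is untouched and the
// verdict agrees with an independent check against the repository.
function acceptVerdict(
  before: TreeSnapshot,
  after: TreeSnapshot,
  verdict: Verdict,
  groundTruthInUse: boolean,
): Decision {
  const dirty = changedPaths(before, after);
  if (dirty.length > 0) {
    return { action: "halt", reason: `read-only agents modified: ${dirty.join(", ")}` };
  }
  if (verdict.agentSaysInUse !== groundTruthInUse) {
    return { action: "halt", reason: `verdict on ${verdict.symbol} contradicts the repository search` };
  }
  return groundTruthInUse ? { action: "keep" } : { action: "delete" };
}
```

The gate never counts votes. A mutated tree or a verdict that disagrees with the search both stop the run and send it back to a clean baseline.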

## What to do with this

The sprint is live. It merged in one commit, it is reversible in one command, the design is pixel-for-pixel unchanged, and the site feels noticeably snappier on both phones and desktops in production. We are being deliberately careful with that last claim: the real gains are at Vercel's edge, where they do not show up on a local machine, so the defensible numbers are field data we are still collecting, not a benchmark we ran on a loopback and dressed up. The mechanism is sound and the feel is real; the figures will follow, honestly or not at all.

For an operator in Muscat watching this from the outside, the transferable part is not "use the new model." It is the shape of the day. A capable workflow did hours of audit labour in minutes and surfaced a fix worth real money at scale. It also made a mistake that no amount of model quality would have caught on its own. The difference between those two outcomes was a human-held brief and a human-held review: the disciplines, not the model. That is the whole of what we do, and we just watched it hold up on our own codebase before we asked anyone to trust it with theirs.

If you want a sprint like this run inside your business, scoped, reversible, and judged by someone who reads the diff, [start a conversation](/practice/get-in-touch/).

## References

1. Anthropic. *Orchestrate subagents at scale with dynamic workflows.* 28 May 2026. [claude.com/blog/introducing-dynamic-workflows-in-claude-code](https://claude.com/blog/introducing-dynamic-workflows-in-claude-code)
2. Vercel. *Vercel CDN Cache.* 2026. [vercel.com/docs/caching/cdn-cache](https://vercel.com/docs/caching/cdn-cache)
3. web.dev. *Optimize resource loading with the Fetch Priority API.* [web.dev/articles/fetch-priority](https://web.dev/articles/fetch-priority)
4. web.dev. *Web Vitals.* [web.dev/articles/vitals](https://web.dev/articles/vitals)
5. Orfloat. *Claude Opus 4.8, and the discipline it asks for.* 28 May 2026. [orfloat.com/notes/opus-4-8-discipline](/notes/opus-4-8-discipline/)

## provenance
- tier: sealed
- built: 29 May 2026
- agents: 42
- wall-clock: ~48 min (14 to first analysis)
- tokens: ~1.57M
- changes: 12 surfaced, 6 shipped
- lines: 214 removed / 22 added (6 files)
- disclosure: The run was on our private v1 repository, so the commit is not publicly linkable, and the absolute edge-cache gains are field data still being collected, not a benchmark, so they are withheld until they are real. The figures above are the run's own dated output; the mechanism and the relative deltas are shown in full.

---
title: "cc-dm: peer-to-peer messaging between Claude Code sessions"
registry: ORF-R-2026-001
type: tool
status: published
date: 2026-03-21
summary: "A Claude Code plugin that lets parallel agent sessions DM each other through a shared SQLite bus. No daemon, no ports, no network."
tags: ["claude-code", "channels", "mcp", "multi-agent", "plugin"]
source: https://orfloat.com/research/cc-dm/
---

# cc-dm: peer-to-peer messaging between Claude Code sessions

**premise.** The higher-order prompt: Claude prompting Claude.

**finding.** A 500ms poll over a shared SQLite bus delivers session-to-session messages as native context events. 140 tests, zero network.

Isn't it about time Claude prompted Claude inside Claude Code? That question is
the whole origin of this. A higher-order function takes a function as its
argument; we wanted the higher-order prompt, one session directing another
instead of a human relaying context between terminals. cc-dm (claude code direct
messaging) is the experiment it became: parallel sessions take on the roles of an
engineering org and pass context directly.

## What we built

A Claude Code plugin that lets any session direct-message any other session on
the same machine. Messages arrive as native `<channel>` events inside the
receiving session's context window, within roughly 500ms, over the Claude Code
Channels protocol. It ships as an installable plugin and an npm package, MIT
licensed, and was reviewed and published through Anthropic's official Claude Code
plugin submission process (published 22 Mar 2026).

## Method

No daemon, no ports, no network. Each session spawns a channel server over
stdio; every server connects to one shared SQLite database (WAL mode) and polls
it on a fixed interval.

<figure>

```
  Session A (planner)  ──┐
  Session B (backend)  ──┼──→  ~/.cc-dm/bus.db  (SQLite WAL)
  Session C (tests)    ──┘          ↑
                               500ms poll per session
                               → <channel> event pushed into context
```

<figcaption>Figure 1. the shared bus. parallel Claude Code sessions write to and poll one SQLite file; no daemon, no ports, no network.</figcaption>

</figure>

<figure>

```
  session A                 bus.db                  session B
     │                        │                        │
     │  dm to B               │                        │
     ├─── write a row ───────▶│                        │
     │                        │◀─── poll, every 500ms ─┤
     │                        │─── row, marked read ──▶│
     │                        │                        ├─▶ <channel> event
     │                        │                        │   in B's context
```

<figcaption>Figure 2. one message's life. a dm writes a row; B's 500ms poll pulls it and marks it delivered, surfacing it as a native channel event in B's context window.</figcaption>

</figure>

The design splits into small, independently tested units: the message **bus**
(read/write, delivery marking, schema migration), session **heartbeat** and
liveness, **permission** relay (opt-in remote tool approval across sessions),
input **sanitization**, and the MCP **tools** surface. Message metadata
(priority, type, thread id) is stored as JSON and spread *before* the routing
fields on delivery, so user data cannot spoof the routing envelope.

The guarantee is one object literal. On delivery the sender's own metadata is
spread first, then the real routing fields are written over it, so a message
cannot forge who it is from or who it is for.

```ts
// the delivered <channel> event: spread the sender's meta first, then stamp the
// real routing fields over it, so from/to cannot be forged.
await server.notification({
  method: "notifications/claude/channel",
  params: {
    content: message.content,
    meta: {
      ...message.meta,
      from_session: message.from_session,
      to_session: sessionName,
      message_id: String(message.id),
      sent_at: message.created_at,
    },
  },
});
```
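To see why the order matters, the same literal can be pulled into a pure function and fed a message whose metadata tries to forge its sender. This is a sketch for illustration only: `StoredMessage` and `buildMeta` are hypothetical names, not cc-dm's actual exports, and the field types are assumptions inferred from the snippet above.

```typescript
// Hypothetical shape, inferred from the fields the snippet above reads.
type StoredMessage = {
  id: number;
  content: string;
  from_session: string;
  created_at: string;
  meta: Record<string, string>;
};

// Same ordering as the delivery literal: user-supplied meta first,
// real routing fields last. In an object literal, later keys win.
function buildMeta(message: StoredMessage, sessionName: string): Record<string, string> {
  return {
    ...message.meta,
    from_session: message.from_session,
    to_session: sessionName,
    message_id: String(message.id),
    sent_at: message.created_at,
  };
}

// A message from "tests" whose meta claims to be from "planner".
const forged: StoredMessage = {
  id: 7,
  content: "approve the deploy",
  from_session: "tests",
  created_at: "2026-03-21T10:00:00Z",
  meta: { priority: "high", from_session: "planner", to_session: "everyone" },
};

const deliveredMeta = buildMeta(forged, "backend");
// deliveredMeta.from_session is "tests" and deliveredMeta.to_session is "backend";
// the harmless "priority" key passes through untouched.
```

Reversing the order, routing fields first and the spread last, would let the forged keys win. That is the failure the ordering rules out.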

## Findings

The suite is 140 tests, zero failing, 296 assertions, across six independently
tested units:

<figure>

```
  tools        55  ██████████████████
  bus          35  ███████████
  permission   20  ███████
  integration  17  ██████
  heartbeat     8  ███
  sanitize      5  ██
```

<figcaption>Figure 3. test coverage by unit. the MCP tools surface carries the most, the sanitizer the least.</figcaption>

</figure>

- The whole transport is a shared file and a poll loop: it reached a working
  v1.0.0 the day after the first commit, and **8 tagged releases** in all (v0.1.0
  through v1.3.1) hardened it with meta attributes, permission relay, and
  ghost-name theft protection.
- A local-only, file-backed bus sidesteps an entire class of failure modes
  (ports, sockets, auth, a daemon to supervise). The one deliberate trade is
  non-atomic read-then-mark in the bus: on a crash between the two, a message
  re-delivers rather than vanishes, the right default for a local tool.
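The trade in the last bullet can be sketched without SQLite. The snippet below is an in-memory stand-in for the bus, not cc-dm's code: `Row`, `pollOnce`, and the `crashBeforeMark` flag are hypothetical, the array plays the role of the shared table, and delivery is assumed to happen between the read and the mark. It shows why a crash between those two steps produces a duplicate on the next poll rather than a lost message.

```typescript
// Hypothetical row shape for the in-memory stand-in.
type Row = { id: number; to: string; content: string; delivered: boolean };

// One poll: read undelivered rows for a session, deliver them, then mark them.
// The read and the mark are two separate steps, so the pair is not atomic.
function pollOnce(
  bus: Row[],
  session: string,
  deliver: (row: Row) => void,
  crashBeforeMark: boolean = false,
): void {
  const pending = bus.filter((r) => r.to === session && !r.delivered); // step 1: read
  pending.forEach((r) => deliver(r)); // surface each row to the session
  if (crashBeforeMark) return; // simulated crash: the rows are never marked
  pending.forEach((r) => {
    r.delivered = true; // step 2: mark
  });
}

const bus: Row[] = [{ id: 1, to: "backend", content: "schema is ready", delivered: false }];
const seen: number[] = [];
const record = (row: Row): void => {
  seen.push(row.id);
};

pollOnce(bus, "backend", record, true); // crash after delivery, before the mark
pollOnce(bus, "backend", record); // the next poll re-delivers, then marks
pollOnce(bus, "backend", record); // nothing left to deliver
// seen is [1, 1]: the message arrived twice and was never lost.
```

Marking before delivering would flip the failure mode: a crash in the gap would leave a row marked but never seen.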

## Meaning

cc-dm is a small, sharp instance of the lab's working thesis: multi-agent
orchestration is becoming ordinary, and the plumbing between agents should be as
boring and reliable as a Unix pipe. The interesting move is subtractive:
choosing a SQLite file and a 500ms poll over anything that needs a server.

## provenance
- tier: open
- built: March 2026
- repo: https://github.com/Akram012388/cc-dm
- commits: https://github.com/Akram012388/cc-dm/compare/2510580...3327e5c
- evals: https://github.com/Akram012388/cc-dm/tree/main/tests
- npm: cc-dm: https://www.npmjs.com/package/cc-dm
- releases (v0.1.0 → v1.3.1): https://github.com/Akram012388/cc-dm/releases

---
title: "Claude Opus 4.8, and the discipline it asks for"
registry: ORF-N-2026-006
type: thesis
status: published
date: 2026-05-28
summary: "Anthropic shipped Opus 4.8: a model far less likely to let its own code flaws pass, paired with workflows that orchestrate hundreds of subagents. The capability stopped being the constraint a while ago; what is left is whether you can describe the work and discern the output."
tags: ["opus", "claude-code", "agents", "discipline", "frontier"]
source: https://orfloat.com/notes/opus-4-8-discipline/
---

# Claude Opus 4.8, and the discipline it asks for

**claim.** With Opus 4.8 the binding constraint is no longer the model but the brief you give it and the judgment you apply to what it returns: the disciplines, not the capability.

Anthropic released Claude Opus 4.8 today, and the headline is not a benchmark. The model is around four times less likely than its predecessor to let flaws in its own code pass unremarked: it catches its own mistakes and pushes back when a plan is wrong. In the same release, Claude Code got dynamic workflows: one session can now orchestrate hundreds of subagents that review each other's work before anything reaches you. Put those two together and the binding constraint on an operating business has moved somewhere most boards are not looking.

## What actually shipped

The benchmark line moved, as it always does. On agentic coding (SWE-Bench Pro) Opus 4.8 [scores 69.2%](https://officechai.com/ai/claude-opus-4-8-benchmarks/), up from 64.3% for Opus 4.7 and well ahead of the other frontier models reported at 58.6% and 54.2%. On agentic computer use (OSWorld-Verified) it reaches 83.4%, and on browser-agent tasks (Online-Mind2Web) Anthropic reports [84%, the strongest it has tested](https://www.anthropic.com/news/claude-opus-4-8), along with being the first model to complete every case end-to-end on its Super-Agent benchmark. Pricing is unchanged at $5 per million input tokens and $25 per million output, and fast mode is now three times cheaper than it was on previous models.

None of that is the part worth reorganizing around. The part worth reorganizing around is the change in honesty. Anthropic trains its models not to make claims they cannot support, and 4.8 is the first release where that training shows up as a number an engineering manager can feel: it is [roughly four times less likely](https://www.tomsguide.com/ai/claude-opus-4-8-just-launched-and-anthropic-says-its-far-less-likely-to-fake-answers) than Opus 4.7 to let a flaw in its own code pass without flagging it. Anthropic's alignment team puts its rates of misaligned behaviour (deception, cooperation with misuse) substantially below those of Opus 4.7. For anyone who has spent a year reading agent output with one eyebrow raised, a model that is measurably more willing to say "I am not sure this is right" is the upgrade that matters.

## Orchestration stops being a metaphor

The second half of the release is in Claude Code. [Dynamic workflows](https://claude.com/blog/introducing-dynamic-workflows-in-claude-code) let Claude write a script that orchestrates subagents at scale: a runtime executes it in the background while your session stays responsive. The constraints are concrete: up to sixteen agents run concurrently, up to a thousand across a single run. You reach for one when a task needs more agents than a single conversation can coordinate: a codebase-wide bug sweep, a five-hundred-file migration, a research question whose sources need cross-checking against each other.

The mechanism is more interesting than the scale. With subagents and skills, Claude is the orchestrator: it decides turn by turn what to spawn next, and every intermediate result lands back in its context. A workflow moves the plan into code. The script holds the loop, the branching, and the intermediate results, so Claude's context holds only the final answer. That is what lets a workflow apply a repeatable quality pattern rather than just running more agents: it can have independent agents *adversarially review* each other's findings before they are reported, or draft a plan from several angles and weigh them against one another.
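The shape described above, a script that owns the loop and the intermediate results and hands back only what survives review, can be sketched in a few lines. This is an illustration under stated assumptions, not Anthropic's workflow API: `runWorkflow`, `Agent`, and `Finding` are hypothetical names, subagents are modelled as plain synchronous functions, and counting reviewers against the per-run cap is our assumption. The only figures taken from the release are the two caps. A real runtime would run each wave concurrently; here the waves run in sequence to keep the sketch self-contained.

```typescript
// The two caps quoted in the text; everything else is a hypothetical stand-in.
const MAX_CONCURRENT = 16;
const MAX_TOTAL_AGENTS = 1000;

type Finding = { file: string; claim: string };
type Agent<I, O> = (input: I) => O; // stand-in for one subagent call

// Split work into waves no larger than the concurrency cap.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

function runWorkflow(
  files: string[],
  audit: Agent<string, Finding[]>,
  reviewers: Agent<Finding, boolean>[],
): { kept: Finding[]; agentsUsed: number } {
  let agentsUsed = 0;
  const spend = (n: number): void => {
    if (agentsUsed + n > MAX_TOTAL_AGENTS) throw new Error("agent budget exceeded");
    agentsUsed += n;
  };

  // Phase 1: fan out, one audit agent per file, in waves of at most 16.
  const findings: Finding[] = [];
  for (const wave of chunk(files, MAX_CONCURRENT)) {
    spend(wave.length);
    for (const file of wave) findings.push(...audit(file));
  }

  // Phase 2: adversarial review. A finding survives only if every
  // independent reviewer accepts it.
  const kept: Finding[] = [];
  for (const finding of findings) {
    spend(reviewers.length);
    if (reviewers.every((review) => review(finding))) kept.push(finding);
  }

  // Only the surviving findings leave the script; the rest stay here.
  return { kept, agentsUsed };
}
```

The intermediate `findings` array never leaves the function, which mirrors the point that the orchestrating context holds only the final answer.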

> Work you would normally plan in quarters now finishes in days.

That line is Anthropic's, and it is the kind of claim we normally discount. The reason to take this one seriously is that the orchestration is now legible: the workflow is a script you can read, save as a command, and rerun on every branch. A review that fans out sixteen agents to cross-examine a diff, votes on what survives, and hands you a single cited verdict is not a demo: it is a process you can own. Anthropic also shipped *ultracode*, a setting that lets Claude decide on its own when a task warrants a workflow, so the orchestration becomes the default rather than something you invoke by hand.

## The constraint moved again

We wrote [the capability overhang](/notes/capability-overhang/), the gap between what frontier models can already do and what operating businesses actually use them for, as the spine of an earlier note. Each release widens it. But 4.8 widens it in a specific direction. The two features that shipped today attack the two oldest reasons an operator gave for not delegating real work to an agent: you cannot trust the output, and one agent cannot hold a job big enough to matter. A model that polices its own code and a runtime that runs hundreds of cross-checking agents answer both at once.

So the binding constraint is no longer the model. It is not compute, and for most GCC businesses it was never the budget. The constraint is the quality of the instruction the agent is given and the quality of the judgment applied to what it returns. When the hard part of software was writing correct code, capability was the bottleneck. Now that an orchestrated run can produce, cross-check, and verify the code, the bottleneck is upstream and downstream of the model, in the brief and in the review. That is not a technology problem. It is an operating-discipline problem, and it does not get solved by buying a larger model.

## Which discipline it tests

Inside the studio we work from four operating disciplines, the 4D framework. Today's release does not touch all four evenly. It presses hardest on the middle two, and it quietly changes the shape of the fourth.

- **Delegation.** Still the entry point: decide what a workflow owns and where it stops. With a thousand-agent ceiling, the question is no longer "can this be delegated" but "what is the unit being delegated."
- **Description.** The binding input. A workflow executes the brief you gave it across hundreds of agents; a vague brief now fails at scale, not in one conversation. The model being more honest does not rescue a description that never said what "done" meant.
- **Discernment.** The binding output. When a run returns one cited verdict instead of a turn-by-turn transcript, your job is to read the judgment, not the keystrokes: did it pick the right thing, in the right order, with the right tradeoffs. This is the skill that compounds.
- **Diligence.** Changed in shape. You no longer audit each line; you audit the orchestration: what the agents could touch, what the cross-check actually checked, where the run could go wrong unobserved. Re-audit on the old cadence, because the ceiling moved again today.

The honesty improvement is real, and it helps. But it is a floor, not a ceiling. A model that flags its own uncertainty more often still needs someone whose judgment is good enough to know which flags matter, and a brief precise enough that the flags are about the work, not about what the work was supposed to be.

## What to do with this

If you run an operating business in Muscat or the wider GCC, the move this release argues for is not "adopt 4.8." The model will reach you whether you plan for it or not. The move is to build the two disciplines the model now demands: a small team that can write a brief precise enough to survive a hundred-agent run, and read the result with enough judgment to sign it. That is a capability you grow inside your own operation, against your own data and your own consequences: it is not a licence you buy.

That team, judgment-heavy, close to the real systems, building the description-and-discernment muscle while the capability curve is still ahead of habit, is exactly what a Discovery Phase is. If you would like one inside your business, [start a conversation](/practice/get-in-touch/).

## References

1. Anthropic. *Introducing Claude Opus 4.8.* 28 May 2026. [anthropic.com/news/claude-opus-4-8](https://www.anthropic.com/news/claude-opus-4-8)
2. Anthropic. *Orchestrate subagents at scale with dynamic workflows.* 28 May 2026. [claude.com/blog/introducing-dynamic-workflows-in-claude-code](https://claude.com/blog/introducing-dynamic-workflows-in-claude-code)
3. OfficeChai. *Anthropic Releases Claude Opus 4.8, Beats Opus 4.7, GPT-5.5 On Many Benchmarks.* 28 May 2026. [officechai.com](https://officechai.com/ai/claude-opus-4-8-benchmarks/)
4. VentureBeat. *Anthropic's Claude Opus 4.8 is here with 3X cheaper fast mode and near-Mythos level alignment.* 28 May 2026. [venturebeat.com](https://venturebeat.com/technology/anthropics-claude-opus-4-8-is-here-with-3x-cheaper-fast-mode-and-near-mythos-level-alignment)
5. Tom's Guide. *Claude Opus 4.8 just launched, and Anthropic says it's far less likely to 'fake' answers.* 28 May 2026. [tomsguide.com](https://www.tomsguide.com/ai/claude-opus-4-8-just-launched-and-anthropic-says-its-far-less-likely-to-fake-answers)
6. Techzine. *Anthropic releases Claude Opus 4.8, promising a more honest model.* 28 May 2026. [techzine.eu](https://www.techzine.eu/news/applications/141667/anthropic-releases-claude-opus-4-8-promising-a-more-honest-model/)

---
title: "Software after software, and the record so far"
registry: ORF-N-2026-005
type: thesis
status: published
date: 2026-05-27
summary: "Twelve theses on what software becomes when intelligence is abundant, and the empirical record from the last six months that says they are no longer speculative."
tags: ["thesis", "agents", "frontier", "software", "gcc"]
source: https://orfloat.com/notes/software-after-software/
---

# Software after software, and the record so far

**claim.** Software built for abundant intelligence is no longer a forecast but a measured record, and the operators who reorganize around the models now will outrun those who bolt AI onto the old shape.

"Software After Software" is a short manifesto we have been working from inside the studio for the last few months. Twelve numbered propositions, in the Tractatus tradition, on what software becomes when intelligence is abundant, continuous, and cheap. We had treated it as a working thesis until the last six months. The empirical record has now overtaken it. The manifesto is not speculative anymore. Every load-bearing claim has a corresponding line in the data, and the operators who treat this as future-of-work reading are already a quarter behind.

## The capability curve, refreshed

[The capability overhang](/notes/capability-overhang/), the gap between what frontier models can already do and what operating businesses are actually using them for, was the spine of an earlier note in this archive. Three things have moved in the six months since.

First, agent autonomy is now measurable, and it is climbing. In a research piece published this year, [Anthropic reported](https://www.anthropic.com/research/measuring-agent-autonomy) that the 99.9th-percentile turn duration for Claude (the elapsed time between an agent starting work and stopping) nearly doubled in roughly three months, from under twenty-five minutes in late September 2025 to over forty-five minutes by early January 2026. The language Anthropic chose for the piece is itself striking: there is a *deployment overhang*, in which the autonomy the models are capable of handling exceeds what they exercise in practice. Capability is now running ahead of habit, not the other way around.

Second, the practitioners with the largest individual footprint on Claude have made the shift in name as well as in practice. In February 2026, Andrej Karpathy (who coined "vibe coding" exactly one year earlier) declared the term effectively over. The new default, in his framing, is *agentic engineering*: you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. The vocabulary has moved from convenience to discipline.

Third, the supply side has put very large numbers behind all of this. The four largest hyperscalers (Microsoft, Google, Amazon, Meta) collectively plan [$725 billion in capital expenditure in 2026](https://www.tomshardware.com/tech-industry/big-tech/big-techs-ai-spending-plans-reach-725-billion), up 77% from last year's record $410 billion. Microsoft alone is at $190 billion, well above the $152 billion analysts had been modelling. The composition has shifted too: more than 60% of that spend is now going to power infrastructure rather than chips, a structural change in where the binding constraint actually sits.

## The bottleneck moved

The manifesto's third and fourth theses are about where the constraint sits inside a software organization. The summary: writing valid code is trivial now. What remains are errors of engineering (priorities, sequencing, tradeoffs), and these are the errors that matter.

> You are not writing the code directly 99% of the time. You are orchestrating agents who do.

Karpathy's framing maps almost word-for-word onto the manifesto's claim that the unit of work becomes the delegated task, not the code to be written. The implication for review is concrete: review shifts from code to decisions. The pull request you read is no longer "is this implementation correct"; that question is now settled before it reaches you. The pull request you read is "did the agent pick the right thing to build, in the right order, with the right tradeoffs against the cost of building the other thing." That is a different skill, and it is the only skill that compounds from here.

## The old assumptions break

Two of the manifesto's harder claims: software, as a *profession*, was built on the assumption that writing code is hard and error-prone. Software, as an *industry*, was built on the assumption that code is scarce. Both assumptions no longer hold.

When code stops being scarce, the value migration is mechanical. Software whose only job is to encode a workflow loses value the moment an agent can perform the workflow directly. The moat for thousands of mid-market SaaS vendors, "customers cannot justify building this themselves," becomes a clearance sale. What gains value, in this picture, is everything an agent cannot produce on demand: proprietary data, distribution and customer relationships, regulatory position, physical assets, trust, and the permissions to operate in a regulated space.

The hyperscaler capex composition is the same migration in a different register. The line item that grew fastest in 2026 was not chips: it was power, and the physical infrastructure that delivers it. Compute is being commoditized as fast as the supply can be built; what remains scarce is the substrate underneath it.

## Organize around the models

The manifesto's ninth and tenth theses are the ones that hit operating businesses hardest. It is not enough to fit models into existing systems, org charts, and processes. The winners are those who organize *around* the models. A small team with strong judgment and many agents will outrun a large team trying to fit AI into processes designed before the transformation.

Inside the studio, our four operating disciplines apply to this directly. We call them the 4D framework, and they map cleanly onto the manifesto's organizational claim:

- **Delegation.** Decide what each agent owns, and what it does not. An agent forced to work like a human is a wasted agent: the manifesto says it; our engagements measure it.
- **Description.** Tell the agent how to use the tools at its disposal, in writing, the way you would brief a thoughtful new colleague. The model's behaviour is your specification.
- **Discernment.** Read the agent's decisions the way you would read a junior engineer's pull request: for taste, sequencing, tradeoff selection. The code is incidental.
- **Diligence.** Audit the surface area before granting access. Re-audit every eight weeks, because the capability curve will have moved.

For a GCC operator, none of this is abstract. A family business in Muscat that is running its hospitality group, its clinic, or its retail operation through a layer of agents in 2027, and many will, must have started reorganizing around the models in 2026. Bolting AI onto the old way of working is a category error, not an integration project.

## What to do with this

The manifesto closes with five short "Therefore" lines. The one that is hardest to argue with: we do not wait for the end state to become obvious, because the best move is to play the game. The end state is not yet apparent, but the curve is, and the curve is now enough to act on.

For an Omani business in May 2026, the concrete move is to seat a small team (three or four people, judgment-heavy, low on process) close enough to your real systems, real data, and real consequences to discover the new way of working inside your specific operation. That team's job is not to add AI to anything. Its job is to find the new shape of the work and pull the rest of the organization toward it. Its output, as the manifesto says, is not only software but also people and practices.

That kind of small autonomous team is exactly what a Discovery Phase is. If you would like one inside your business, [start a conversation](/practice/get-in-touch/).

## References

1. Anthropic. *Measuring agent autonomy.* 2026. [anthropic.com/research/measuring-agent-autonomy](https://www.anthropic.com/research/measuring-agent-autonomy)
2. The New Stack. *Vibe coding is passé. Karpathy has a new name for the future of software.* 2026. [thenewstack.io/vibe-coding-is-passe](https://thenewstack.io/vibe-coding-is-passe/)
3. Tom's Hardware. *Google, Microsoft, Meta, and Amazon capex spending to hit $725 billion in 2026, up 77% from last year.* 2026. [tomshardware.com](https://www.tomshardware.com/tech-industry/big-tech/big-techs-ai-spending-plans-reach-725-billion)
4. CNBC. *Tech AI spending approaches $700 billion in 2026, cash taking big hit.* 6 February 2026. [cnbc.com](https://www.cnbc.com/2026/02/06/google-microsoft-meta-amazon-ai-cash.html)
5. Orfloat. *The capability overhang is no longer theoretical.* 22 May 2026. [orfloat.com/notes/capability-overhang](/notes/capability-overhang/)

---
title: "AI is not a software business anymore"
registry: ORF-N-2026-003
type: commentary
status: published
date: 2026-05-24
summary: "Hyperscaler capex hit $700B in 2026. Your AI vendor agreement is a supply contract."
tags: ["infrastructure", "capex", "tokens", "gcc", "procurement"]
source: https://orfloat.com/notes/ai-is-not-a-software-business/
---

# AI is not a software business anymore

**claim.** AI is now a manufacturing business, so an AI vendor agreement is a supply contract in all but name, and operators should plan in tokens and allocation, not seats.

Microsoft will spend roughly **$190 billion** on capital expenditure in calendar 2026, and still expects to be capacity-constrained through year-end. The four biggest hyperscalers will spend close to **$700 billion** combined this year, roughly 3.5× what they spent two years ago. None of that looks like a classic software business. Six months from now, neither will your AI vendor contract.

The cloud was supposed to make infrastructure someone else's problem. Write code once, run it many times, scale on demand. The whole abstraction depended on supply staying ahead of demand by a comfortable margin. AI broke that abstraction, not gradually, and not theoretically. The Q3 FY26 results from Microsoft, published in late April, put the new shape of the industry on the public record.

## The numbers tell you what changed

Microsoft reported [$31.9 billion in capital expenditure](https://www.sec.gov/Archives/edgar/data/0000789019/000119312526191507/msft-20260331.htm) for the fiscal third quarter ending 31 March 2026, and guided to **more than $40 billion** for the following quarter. About two-thirds of the quarterly spend went to short-lived assets, primarily GPUs and CPUs that will be fully depreciated inside four years. For calendar 2026, the company expects total capex to land around [$190 billion](https://www.cnbc.com/2026/04/29/microsoft-msft-q3-earnings-report-2026.html), and the CFO was explicit that this will not be enough. Azure demand still exceeds supply.

Microsoft is not alone. Alphabet, Amazon, and Meta are running the same play. Combined hyperscaler 2026 capex is now [projected to approach $700 billion](https://fortune.com/2026/04/30/big-tech-hyperscalers-will-spend-700-billion-on-ai-infrastructure-this-year-with-no-clear-end-in-sight-eye-on-ai/), up from roughly $200 billion in 2024 and 6× the 2022 level. Meta raised its full-year guidance to $125–145 billion and explicitly blamed component and data-center costs. Amazon is now expected to post a [negative free cash flow year](https://www.cnbc.com/2026/02/06/google-microsoft-meta-amazon-ai-cash.html) for the first time in over a decade.

Mary Meeker, in her [340-page Bond Capital report](https://www.bondcap.com/reports/tai) published last May, used the word "unprecedented" on 51 separate pages to describe the rate of change. At NVIDIA's GTC 2026 keynote, Jensen Huang stopped calling NVIDIA a chip company at all. He calls it an [AI factory company](https://www.datacenterfrontier.com/machine-learning/news/55364406/jensen-huang-maps-the-ai-factory-era-at-nvidia-gtc-2026) now. "Tokens are the new commodity," he said twice on the same slide. He was not being poetic.

## Tokens are not magic. They are manufactured.

The most important shift to understand is that every answer from a model is the output of a physical production system. GPUs, high-bandwidth memory, advanced packaging, substrates, optics, power, cooling, land, networking, and operations talent: that is the bill of materials behind the paragraph of text on your screen, the line of code Claude wrote you, or the agent that just finished summarising a contract. The user sees software. Behind the software is a factory turning electricity and silicon into intelligence.

> Six months ago an AI vendor agreement was structured like a software agreement. Today it is a supply contract in everything but name.

And that supply contract has terms most procurement teams have never written before: allocation, fallback, reserved capacity, multi-region failover. None of it was a line item a year ago. All of it should be in the next one you sign.

## What this changes for an operator in Muscat

At first glance, hyperscaler capex looks like a Silicon Valley story. It is not. The Gulf has been positioning itself as a sovereign AI-infrastructure region for a year and a half, and the placements are now public. Saudi Arabia's [HUMAIN](https://www.globaldatacenterhub.com/p/does-humains-12b-saudi-framework) secured a $1.2 billion AI-infrastructure framework in January. The UAE's G42, partnered with Oracle, NVIDIA, Cisco, and SoftBank under the [Stargate UAE](https://introl.com/blog/middle-east-ai-revolution-uae-saudi-arabia-100b-infrastructure-plans) banner, is building a 5 GW AI campus in Abu Dhabi with a 1 GW cluster going live in 2026. Oman's own *AI and Digital Future Programme* names data-centre capacity as a delivery mechanism for the digital economy.

For an operating business in Muscat, this means two things at once. The first is that *closer* inference is finally becoming a procurement option: your token round-trip will be in-region within the next eighteen months, with the residency and latency profile that regulated industries need. The second is that the same capacity rationing playing out in Redmond and Mountain View will play out in Riyadh and Abu Dhabi too, just on a slight lag. Whoever signs first sits at the front of the queue.

## The unit that actually matters is the token

Almost every AI plan we read inside client engagements is still priced in seats. Five Claude Pro seats. Ten Copilot seats. A team licence. Seats are the wrong unit for a factory output. The right unit is the token: input tokens, output tokens, cached tokens, and the routing decisions between them.

We forecast client workloads in tokens. We treat an agent that runs unattended for an hour as a different budget line than a chatbot that answers one question. We instrument every workflow to know, by name, which model was called, with what prompt, returning how many tokens, at what cost. None of that is exotic. It is just bookkeeping that nobody has bothered to install yet, because the abstraction of "seats" let everyone pretend the supply was infinite. It isn't.
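The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not our production instrumentation: the model names, the per-million-token prices, and the `CallRecord` fields are all hypothetical placeholders, and real rates should come from your own contract.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens. Replace with contracted rates.
PRICES_PER_MTOK = {
    "premium-model": {"input": 15.00, "output": 75.00, "cached": 1.50},
    "small-model": {"input": 1.00, "output": 5.00, "cached": 0.10},
}


@dataclass(frozen=True)
class CallRecord:
    """One model call: which workflow, which model, which prompt, how many tokens."""
    workflow: str
    model: str
    prompt_id: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0


def call_cost(record: CallRecord) -> float:
    """Cost of a single call in USD under the assumed price table."""
    rates = PRICES_PER_MTOK[record.model]
    total = (
        record.input_tokens * rates["input"]
        + record.output_tokens * rates["output"]
        + record.cached_tokens * rates["cached"]
    )
    return total / 1_000_000


def cost_by_workflow(records) -> dict:
    """Sum call costs per named workflow."""
    totals = defaultdict(float)
    for record in records:
        totals[record.workflow] += call_cost(record)
    return dict(totals)


# Illustrative records: a one-question chatbot turn and an unattended agent run.
records = [
    CallRecord("guest-chatbot", "small-model", "faq-v3", 1_200, 300),
    CallRecord("nightly-reconciliation", "premium-model", "recon-v1",
               400_000, 60_000, cached_tokens=200_000),
]

for workflow, cost in cost_by_workflow(records).items():
    print(f"{workflow}: ${cost:.4f}")
```

Keeping the workflow name and prompt identifier on every record is the design choice that matters here: it is what lets the chatbot and the unattended agent show up as separate budget lines.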

## What we do about it

Inside a Discovery Phase, the supply-chain reality of AI shows up in three concrete deliverables.

- **A vendor audit.** We map the supply chain under your current AI contracts: which cloud, which region, which underlying chip family, what allocation terms exist (or don't), and what your fallback looks like when the next capacity squeeze hits.
- **A token-denominated forecast.** We size demand in tokens, not seats, separating chatbot work from agentic workloads and pricing each at the model class that actually fits.
- **A routing diagnostic.** Most operations burn premium-tier inference (Opus-class) on work that Haiku-class would do for a tenth of the cost. We find those mismatches and fix them, and the savings usually pay for the engagement before deployment.
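The forecast and the routing diagnostic reduce to simple arithmetic once demand is sized in tokens. The sketch below assumes hypothetical blended prices (with the small tier at a tenth of the premium tier, matching the ratio above) and invented workload volumes; none of the figures are client data.

```python
# Hypothetical blended prices in USD per million tokens, for illustration only.
PRICE_PER_MTOK = {"premium": 30.00, "small": 3.00}


def monthly_tokens(runs_per_day: int, tokens_per_run: int, days: int = 30) -> int:
    """Token demand for one workload over a month."""
    return runs_per_day * tokens_per_run * days


def monthly_cost(tokens: int, tier: str) -> float:
    """Monthly cost in USD for a token volume at a given tier."""
    return tokens / 1_000_000 * PRICE_PER_MTOK[tier]


# (name, runs per day, tokens per run, current tier, tier that fits the work)
workloads = [
    ("guest chatbot", 500, 2_000, "premium", "small"),
    ("nightly reconciliation agent", 1, 1_500_000, "premium", "premium"),
]


def routing_savings(workloads) -> dict:
    """Monthly saving per workload if each is routed to the tier that fits."""
    savings = {}
    for name, runs, tokens_per_run, current, fitted in workloads:
        tokens = monthly_tokens(runs, tokens_per_run)
        savings[name] = monthly_cost(tokens, current) - monthly_cost(tokens, fitted)
    return savings


for name, saved in routing_savings(workloads).items():
    print(f"{name}: saves ${saved:.2f} per month")
```

Under these assumed numbers the chatbot is the whole saving, while the agent stays on the premium tier because that is the tier the work needs. The point of the diagnostic is to make that split explicit per workload.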

Microsoft has already put $190 billion behind a view of the world. Most operating businesses in the Gulf have not put even the equivalent procurement framework into their plans. That is the gap. It will close, for the businesses that decide to close it. [Start a Discovery Phase.](/practice/get-in-touch/)

## References

1. Microsoft Corp. *Form 10-Q, Fiscal Q3 2026 (quarter ended 31 March 2026).* [sec.gov/Archives/edgar/data/0000789019/msft-20260331](https://www.sec.gov/Archives/edgar/data/0000789019/000119312526191507/msft-20260331.htm)
2. CNBC. *Microsoft calls for $190 billion in 2026 capital spending on soaring memory prices.* 29 April 2026. [cnbc.com](https://www.cnbc.com/2026/04/29/microsoft-msft-q3-earnings-report-2026.html)
3. Fortune. *Big Tech is about to spend $700 billion on AI this year.* 30 April 2026. [fortune.com](https://fortune.com/2026/04/30/big-tech-hyperscalers-will-spend-700-billion-on-ai-infrastructure-this-year-with-no-clear-end-in-sight-eye-on-ai/)
4. Bond Capital and Mary Meeker. *Trends, Artificial Intelligence.* May 2025. [bondcap.com/reports/tai](https://www.bondcap.com/reports/tai)
5. Data Center Frontier. *Jensen Huang Maps the AI Factory Era at NVIDIA GTC 2026.* March 2026. [datacenterfrontier.com](https://www.datacenterfrontier.com/machine-learning/news/55364406/jensen-huang-maps-the-ai-factory-era-at-nvidia-gtc-2026)
6. Global Data Center Hub. *Does HUMAIN's $1.2B Saudi Framework Signal a New Model for AI Data Centers?* January 2026. [globaldatacenterhub.com](https://www.globaldatacenterhub.com/p/does-humains-12b-saudi-framework)
7. Introl. *Middle East AI Revolution: UAE and Saudi Arabia's $100B+ Infrastructure Plans.* [introl.com](https://introl.com/blog/middle-east-ai-revolution-uae-saudi-arabia-100b-infrastructure-plans)

---
title: "Anthropic's enterprise playbook, read from Muscat"
registry: ORF-N-2026-004
type: commentary
status: published
date: 2026-05-24
summary: "A 35-page enterprise guide from Anthropic mapped against what a Muscat operator actually needs."
tags: ["anthropic", "enterprise", "playbook", "gcc", "forward-deployed"]
source: https://orfloat.com/notes/anthropic-enterprise-playbook/
---

# Anthropic's enterprise playbook, read from Muscat

**claim.** Anthropic's enterprise playbook is a sound spine, but a GCC family business needs the translation and measurement layers the guide assumes you already have, and that is the forward-deployed work.

Anthropic published a 35-page enterprise guide called *Building trusted AI in the enterprise*. We read it end-to-end this week. The short version: it describes a four-stage spine that almost exactly matches the engagement model we already sell, and it gives us explicit language for that model. The longer version is what the playbook leaves out: what a Muscat operator needs to add to make it land.

You can read the original guide in full on [anthropic.com](https://www.anthropic.com/news). We won't reproduce it here. What follows is our commentary as forward-deployed engineers operating in the Gulf: what we adopt from it as written, where it has to be translated to land in a family-led GCC business, and where Orfloat's own thinking goes further than the guide.

## The four-stage spine

The guide organises an enterprise AI programme into four sequential stages: develop a strategy, create business value through a pilot, build for production, and then deploy with LLMOps. Anthropic notes that companies with the right motivation can compress what they otherwise frame as a 13-month rollout into a few months, citing FeatherSnap integrating Claude on Amazon Bedrock in under 90 days and DoorDash building a voice contact-centre solution in two months. We are familiar with that timeline. Our typical engagement runs about 16 weeks from on-site Discovery to first production system.

The four-stage model maps cleanly onto our own:

- **Stage 1 (Strategy) ↔ Orfloat Discovery.** Their guide says to start with people, process, and technology. Our 15-day on-site Discovery does exactly that: three of the deliverables (governance map, opportunity priority list, technical-readiness audit) are Anthropic's three dimensions, named differently.
- **Stage 2 (Business value) ↔ Service Agreement.** Anthropic's seven-criterion pilot test (LLM-suited work, measurable metrics, clear ROI, business-critical but low security risk, abundant data, minimal disruption, scalable) is the same checklist we apply when we draft the milestone-based scope after Discovery.
- **Stage 3 (Production) ↔ Forward deployment, weeks 1–8.** The prompt-engineering structure they spell out (task and role, background, rules, history, request, format, prefill) is our default scaffold. Their emphasis on evaluation-before-deployment is non-negotiable for us.
- **Stage 4 (LLMOps) ↔ Forward deployment, weeks 9–12 plus 90-day handover.** Their five LLMOps practices (monitoring, prompt version control, security by design, scalable infrastructure, continuous QA) are the operating discipline we install before we leave.

## People, Process, Technology, translated for a Muscat family business

The guide's three-dimensional model is correct, and it is also written for a different kind of company than the ones we work with. Three places where translation is necessary:

> The guide assumes you have an executive sponsor, a steering committee, and a head of AI. Most Omani family businesses have a managing director, his cousin who runs operations, and an IT vendor.

**People.** Anthropic prescribes "executive alignment and sponsorship" and an "AI review board." In the GCC family-business context, that is the founder and the COO sitting in the same Discovery workshop, and the AI review board is the same group that already meets weekly to talk supplier prices. We don't create new committees. We meet the existing ones where they already convene, and we hand them better questions.

**Process.** Anthropic's pilot-graduation criteria (performance thresholds, operational readiness, risk management infrastructure) are the right gates. The guide implicitly assumes you have a product analytics practice in place to measure those gates. Most of our clients don't. Part of our forward-deployed work is instrumenting the operation enough that the criteria can be measured at all. The pilot doesn't fail; the measurement framework fails first.

**Technology.** The guide describes a clean three-level technical maturity: basic chat, intermediate with RAG and tools, advanced agents. In a Muscat operation, you may need to be at all three levels simultaneously: a basic chatbot for guests, a RAG-equipped concierge for returning customers, and an autonomous nightly reconciliation agent for the back office. The progression in the guide is accurate at the enterprise level; at the individual-business level, you pick the right level for the right workflow.

## The 12-month clock, in practice

Anthropic's four-phase rollout (Foundation in months 1–3, Pilot in 4–6, Strategic Scaling in 7–12, Broad Adoption in 13+) is the right cadence for a large enterprise. For a GCC family business with 80–400 staff, we compress it considerably: Foundation in weeks 1–3, Pilot in weeks 4–10, first production system live by week 12, second production system by week 16. The four-phase logic is preserved. Only the calendar shrinks. Anthropic's own note that "motivation and partnership" can condense the timeline to a few months is exactly the lever we pull.

## The LLMOps gap is the engagement

Anthropic cites a BCG survey of 1,400 C-suite executives in which 62% identified a shortage of talent and skills as the biggest obstacle to their AI strategy. Inside an Omani operator's building, that number is closer to 100%, not because the talent doesn't exist, but because no single hire fills the shape of the role. The shape is one part LLM engineer, one part operations specialist, one part data architect, one part compliance reader. Hiring that profile in Muscat in 2026 is not a recruitment problem; it is a scarcity problem. Forward deployment exists to fill the shape without making the client own it.

## What we adopt, and what we add

We adopt the playbook's spine (the four stages, the three dimensions, the seven pilot criteria, the five LLMOps practices) directly. We treat it as the published standard for serious enterprise AI work. What we add: a translation layer for the GCC family business, a measurement-instrumentation layer that the playbook assumes you already have, a Muscat-aware governance cadence, and a forward-deployed delivery model that compresses the calendar.

If you are running an operating business in the Gulf and the four-stage model above sounds like a useful map, the right next step is to find out what your Stage 1 looks like honestly. [Start a Discovery Phase.](/practice/get-in-touch/)

## References

1. Anthropic. *Building trusted AI in the enterprise: Anthropic's guide to starting, scaling, and succeeding based on real-world examples and best practices.* See [anthropic.com/news](https://www.anthropic.com/news) for the latest enterprise resources. Trademark and attribution acknowledged on our [trademarks notice](/trademarks/).
2. Bain and Company. *Technology Report 2024.* [bain.com](https://www.bain.com/insights/topics/technology-report/)
3. McKinsey and Company. *The state of AI in 2024.* [mckinsey.com](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai)
4. BCG. *Five Must-Haves for Effective AI Upskilling.* [bcg.com](https://www.bcg.com/publications/2024/five-must-haves-for-effective-ai-upskilling)
5. Anthropic. *Prompt engineering documentation.* [docs.claude.com](https://docs.claude.com)

---
title: "The capability overhang is no longer theoretical"
registry: ORF-N-2026-002
type: dispatch
status: published
date: 2026-05-22
summary: "Claude Opus 4.7 shipped six weeks ago and the frontier keeps moving, but almost none of that movement has reached operating businesses in the Gulf."
tags: ["frontier", "mena", "forward-deployed", "oman"]
source: https://orfloat.com/notes/capability-overhang/
---

# The capability overhang is no longer theoretical

**claim.** The capability overhang is now a planning question, not a forecast, and tooling does not close it; embedded forward-deployed engineering does.

Six weeks ago Anthropic shipped Claude Opus 4.7. A month before that, they published the largest multilingual study of AI users ever attempted. The frontier is moving faster than most boards have agendas for, and almost none of that movement has reached operating businesses in the Gulf.

The gap between what a 2026 frontier model can do and what most businesses are doing with AI is now wide enough that it is no longer a forecasting question. It is a planning question.

## The frontier moved. Did you?

In April, Anthropic released [Claude Opus 4.7](https://www.anthropic.com/news/claude-opus-4-7), an upgrade that, in Anthropic's own framing, raised the bar on coding, agents, vision, and multi-step reasoning. The same week, Anthropic Labs launched [Claude Design](https://www.anthropic.com/news/claude-design-anthropic-labs), a tool that does in fifteen minutes what an agency would have charged a small business in Muscat OMR 800 for last year. Neither of these announcements is news to the AI-curious. Neither has meaningfully changed how a hospitality group in Al Mouj or a clinic in Qurum operates this Wednesday.

In March, Anthropic published [*What 81,000 people want from AI*](https://www.anthropic.com/81k-interviews), the largest multilingual qualitative study of AI users ever run. The most striking finding, for our purposes, is not what people are using AI for. It is the gulf between what power-users get out of Claude and what casual users do. The same tool, in the same week, delivered value that differed by orders of magnitude from one person to the next. That gap is the capability overhang.

## The MENA picture, briefly

PwC, in their foundational [*The potential impact of AI in the Middle East*](https://www.pwc.com/m1/en/publications/potential-impact-artificial-intelligence-middle-east.html) report, projected that AI would contribute roughly **USD 320 billion to the Middle East economy by 2030**, with the UAE and Saudi Arabia capturing the lion's share and Oman, Bahrain, and Kuwait splitting a long tail. The Oman government's own [Vision 2040](https://www.oman2040.om) names a knowledge economy and digital transformation as first-class priorities. The framing is there. The execution cadence inside privately-held businesses is not.

What we see, walking into a small or mid-size enterprise in Muscat today, is not absence of intent. It is absence of a method that survives contact with a normal Tuesday. Owners read about Claude. Operations managers try ChatGPT. The integration with the actual POS, the actual supplier WhatsApp thread, the actual reservations system, never happens. There is no one inside the business whose job it is to make it happen, and the agencies they hire to build websites do not do this work.

## Why this is not a tooling problem

You can read the entire Anthropic blog this weekend. You can run Claude Code locally. You can buy Claude Pro on personal credit cards across five staff. None of those moves closes the overhang, because the overhang is not knowledge. It is embedding.

> The gap between what AI *can already do* and what most teams are actually using it for is now wider than the gap between proprietary models and open-source ones, and changes faster.

Forward-deployed engineering, as the practice has been described inside Anthropic and [Stripe](https://stripe.com/blog/forward-deployed-engineers) and [Palantir](https://www.palantir.com/offerings/foundry/) before them, is the answer to this. Not consulting. Not training decks. The work done inside the business, learned end to end, and shipped.

## What we are doing about it

Orfloat is one studio, in one city, taking on a small number of engagements at a time. We are not large enough to close the regional overhang. We are large enough to close it for a few businesses we choose carefully. If you have read this far and the gap we are describing sounds familiar, reach out to us and start a Discovery Phase to look at your operation.

## References

1. Anthropic. *Introducing Claude Opus 4.7.* 16 April 2026. [anthropic.com/news/claude-opus-4-7](https://www.anthropic.com/news/claude-opus-4-7)
2. Anthropic. *Introducing Claude Design by Anthropic Labs.* 17 April 2026. [anthropic.com/news/claude-design-anthropic-labs](https://www.anthropic.com/news/claude-design-anthropic-labs)
3. Anthropic. *What 81,000 people want from AI.* 18 March 2026. [anthropic.com/81k-interviews](https://www.anthropic.com/81k-interviews)
4. PwC Middle East. *The potential impact of Artificial Intelligence in the Middle East.* [pwc.com/m1](https://www.pwc.com/m1/en/publications/potential-impact-artificial-intelligence-middle-east.html)
5. Sultanate of Oman. *Oman Vision 2040.* [oman2040.om](https://www.oman2040.om)

---
title: "Model Context Protocol, plainly explained"
registry: ORF-N-2026-001
type: commentary
status: published
date: 2026-05-19
summary: "MCP is the boring-sounding standard that makes the rest of Anthropic's stack non-boring."
tags: ["mcp", "integration", "claude", "forward-deployed"]
source: https://orfloat.com/notes/mcp-plainly-explained/
---

# Model Context Protocol, plainly explained

**claim.** MCP is the standard that stops AI integration cost from compounding, turning what used to be a brittle custom build into a configuration exercise.

MCP is the boring-sounding standard that makes the rest of Anthropic's stack non-boring. If your business runs on a POS, an ERP, an inventory spreadsheet, and a calendar, this is the layer that finally lets Claude reason over all of them at once, without each integration becoming a special case.

## The problem MCP solves

A large language model on its own is a closed system. It knows what it knows from training, and it knows what you tell it in the current chat. The moment you ask it to do something useful for your business, "is this guest already a regular?", "reorder the saffron when stock drops below 200g", "what is on the kitchen prep list for tomorrow lunch?", you discover the gap. The model has no idea what is in your systems.

For most of 2023 and 2024, the answer was: write a custom integration. Plug Claude into your POS via the POS's API. Plug it into your accounting via QuickBooks' SDK. Plug it into your reservation system via a third-party connector. Twenty integrations later, you have a brittle pile of glue and an engineer whose entire job is to keep it from falling over.

## What MCP actually is

[Model Context Protocol](https://modelcontextprotocol.io) is an open standard, originally proposed by Anthropic in November 2024 and now adopted across the major frontier-model vendors. It defines a small, opinionated way for any tool (a database, a SaaS app, an internal script, a CSV) to expose itself to a language model. The model speaks one language (MCP). Each tool speaks one language (MCP). The integration cost stops compounding.

> MCP is to AI agents what USB-C is to laptops. It is not glamorous. It is the reason you stopped travelling with seven different cables.

Concretely, an MCP server exposes three things to a model: **resources** (read-only data the model can consult: your menu, your customer list), **tools** (functions the model can call: create a reservation, send a WhatsApp message, write to inventory), and **prompts** (reusable instructions that scope what the model should do with a given resource or tool). A model with MCP access reads the resources, picks the right tool, calls it with the right arguments, and reads the response, all while keeping the business's data inside the business's perimeter.
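On the wire, MCP messages are JSON-RPC 2.0, and each of the three primitives has its own method family. The sketch below builds one request per primitive using only the Python standard library. The method names (`resources/read`, `tools/call`, `prompts/get`) follow the MCP specification; the URI, tool name, prompt name, and arguments are hypothetical stand-ins for whatever a real server in your operation would expose.

```python
import itertools
import json

# Each JSON-RPC request carries a unique id so responses can be matched to it.
_ids = itertools.count(1)


def jsonrpc_request(method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request envelope, the framing MCP uses."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}


# Resource: read-only data the model can consult (hypothetical URI).
read_menu = jsonrpc_request("resources/read", {"uri": "menu://current"})

# Tool: a function the model can call (hypothetical tool name and arguments).
create_booking = jsonrpc_request(
    "tools/call",
    {
        "name": "create_reservation",
        "arguments": {"guest": "A. Example", "party_size": 4, "time": "20:00"},
    },
)

# Prompt: a reusable instruction template (hypothetical prompt name).
briefing = jsonrpc_request(
    "prompts/get",
    {"name": "daily_briefing", "arguments": {"service": "lunch"}},
)

for message in (read_menu, create_booking, briefing):
    print(json.dumps(message))
```

In practice an MCP client library handles this framing for you. The sketch only shows that every integration, whatever the underlying system, reduces to the same small set of message shapes.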

## Why it matters for an operating business

The day-to-day reality of running a hospitality group, a clinic, or a retail operation in Muscat is that twelve tools hold pieces of the same truth. The same guest is a row in your POS, a row in your CRM, a thread in your concierge WhatsApp, and a note in someone's phone. Before MCP, building an AI concierge that understood all four of those things was a months-long custom build that broke whenever any of the four tools changed. After MCP, it is a configuration exercise.

And the standard is getting more useful, not less. In May, Anthropic shipped [self-hosted sandboxes and MCP tunnels](https://claude.com/blog/claude-managed-agents-updates) for Claude Managed Agents, meaning a persistent agent can now reach back into a business's private network without exposing it to the public internet. That single update closes the biggest remaining objection to running AI agents over regulated data.

## The 4D framework, mapped to MCP

Our four operating disciplines apply to MCP work directly:

- **Delegation.** Decide which questions Claude should answer using which MCP server. Not every read is worth a tool call.
- **Description.** The MCP *prompt* primitive is exactly this: telling the model how to use a tool, in writing, the way you would brief a thoughtful new colleague.
- **Discernment.** Read the agent's tool calls the way you'd read a junior engineer's pull request. Eval before launch. Spot drift early.
- **Diligence.** Audit the MCP server's surface area before granting access. The smallest tool set that works is the right one.
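The Diligence step above can be made mechanical. As a rough sketch, assuming you have the list of tool names a server exposes and the list a workflow actually needs (both sets below are invented for illustration), a set comparison shows what to grant, what to withhold, and what is missing:

```python
def audit_tool_surface(exposed: set, needed: set) -> dict:
    """Compare what a server exposes with what the workflow actually needs."""
    return {
        "grant": sorted(exposed & needed),     # exposed and required
        "withhold": sorted(exposed - needed),  # exposed but not required
        "missing": sorted(needed - exposed),   # required but not exposed
    }


# Hypothetical tool names for a concierge workflow.
exposed = {"read_menu", "create_reservation", "delete_customer", "export_all_orders"}
needed = {"read_menu", "create_reservation", "send_whatsapp"}

report = audit_tool_surface(exposed, needed)
for decision, tools in report.items():
    print(f"{decision}: {', '.join(tools) or 'none'}")
```

Anything in the `withhold` list is surface area the agent does not need, so it stays out of the grant until a workflow justifies it.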

## What to do with this

If you read one official MCP resource, make it the [MCP introduction](https://modelcontextprotocol.io/introduction). If you want a worked example, the [reference servers](https://github.com/modelcontextprotocol/servers) repo is the canonical place. And if you would rather skip the reading and have us build the right MCP servers for your operation, that is exactly what a Discovery Phase decides: [start one](/practice/get-in-touch/).

## References

1. Anthropic. *Introducing the Model Context Protocol.* 25 November 2024. [anthropic.com/news/model-context-protocol](https://www.anthropic.com/news/model-context-protocol)
2. Model Context Protocol. *Specification and documentation.* [modelcontextprotocol.io](https://modelcontextprotocol.io)
3. Anthropic. *New in Claude Managed Agents: self-hosted sandboxes and MCP tunnels.* 19 May 2026. [claude.com/blog/claude-managed-agents-updates](https://claude.com/blog/claude-managed-agents-updates)
4. Model Context Protocol. *Reference servers.* [github.com/modelcontextprotocol/servers](https://github.com/modelcontextprotocol/servers)

---
title: "what we do"
registry: ORF-P-2026-001
status: published
date: 2026-06-04
summary: "The offer, stated plainly: what the lab takes on for the few it works with."
tags: ["practice", "offer"]
source: https://orfloat.com/practice/what-we-do/
---

# what we do

The offer, stated plainly: the lab does one thing, and it does it for a small number of founder-run and family-led businesses across Oman and the wider GCC. The practice is forward-deployed applied AI engineering. We embed in the operation, learn how the work actually runs, and turn what Claude can already do into systems the team trusts in production. Everything is built on Claude and the Anthropic primitives around it.

The pieces below are composable, not a menu to shop. A single operation rarely needs all of them, and no two need the same parts. We map what is in front of us and assemble from the work that earns its place against the operation it touches. The shape is consistent: understand the work first, build into it where the leverage is real, then stay until the system holds without us.

## Operational discovery

It begins on-site. We shadow the work as it actually happens, sit with the people who do it, and map the supply chain from the inside. The point is to see the operation as it runs, not as an org chart describes it, so that what comes next answers a real constraint rather than a guessed one.

## Customer-facing systems

Where the operation meets its customers, we build conversational agents that hold the brand voice rather than flatten it: first-turn intake, recall for returning customers, and review solicitation with quality gating, so the bad experiences are caught before they go out, not after.

## Internal operations

Inside the business, the quieter half of most engagements, and often the more valuable one:

- inventory and supplier reorder triggers
- scheduling and shift optimisation
- operational dashboards and daily briefings
- demand forecasting
- POS, booking, and accounting integrations

These connect to the systems already in use, MCP-first, so the integration is a contract rather than a brittle scrape.

## Brand and positioning

Adopting AI changes how a business represents itself, so we treat that as part of the work: an internal AI usage policy, customer-facing transparency about where AI is and is not in the loop, and a brand voice that holds across every AI touchpoint.

## Forward deployment

The practice is forward-deployed, which means we embed and integrate rather than hand off. Evals are designed before deployment, not retrofitted to excuse it. We stay on call through the first ninety days, write the runbooks the team owns and train them on the systems we leave behind, and return for quarterly capability reviews as Claude evolves and what was out of reach last quarter comes within reach.

The throughline is narrow on purpose. One practice, done deeply, carried into a system the team still trusts when it matters.

---
title: "who we work with"
registry: ORF-P-2026-002
status: published
date: 2026-06-04
summary: "Who we take on, and the bar a fit has to clear."
tags: ["practice", "bar"]
source: https://orfloat.com/practice/who-we-work-with/
---

# who we work with

A fit is a team that has already started taking Claude seriously. Not a team curious whether the technology is real, but one that has run its own experiments, seen something work, and now wants it load-bearing inside the business. The distance between dabbling and deciding is most of the bar, and it is usually quick to tell which side of it a team is on.

The rest follows from how we work. We embed and integrate AI inside the business rather than advise it from a distance, so a fit is a team willing to open the actual work: the messy operations, the real data, the decisions that matter. There is no version of this that happens at arm's length. And because we hand the system back, a fit is also a team ready to run it once we step away. We build something the business can own, not a dependency on us.

## Who we take on

The lab works with founder-run and family-led businesses across Oman and the wider GCC. We are sector-agnostic. The work has lived in hospitality, retail, and operations, and the underlying discipline travels further than any one of them, because what we bring is a way of putting Claude to work, not a vertical product. What the sector is matters less than how the business is run: a place where a decision can be made and acted on, and where the people who operate the work are close enough to it to change it.

We take on a small number at a time. This is a property of the model, not a marketing scarcity. Embedding AI deeply inside a business does not scale by adding logos, and our bandwidth is genuinely finite. We would rather run one engagement we are proud of than several we are not.

So a fit clears three things:

- The team is evaluating Claude seriously, with real intent to put it into production, not a pilot to satisfy curiosity.
- The business is willing to let us inside the actual work, where the real constraints and the real data live, rather than holding the engagement at arm's length.
- The team is ready to operate the system itself after the handover. If no one inside is prepared to own it, the work does not hold.

Client work is held in confidence, at the client's request, which is also why this page names a bar rather than a roster. When we point to what we have done, we point to the technique and the result, never to a name a client would prefer we kept quiet.

---
title: "how we work"
registry: ORF-P-2026-003
status: published
date: 2026-06-04
summary: "How an engagement runs: AI embedded in your operation, not advice from a distance."
tags: ["practice", "engagement"]
source: https://orfloat.com/practice/how-we-work/
---

# how we work

We embed, we do not advise from a distance. We work inside the operation, integrate AI into the actual workflow, and ship systems the team trusts in production. Two frames shape every engagement: a method we keep throughout, and a path the work travels from the first day on-site to the first quarter in production.

## The 4D framework

The 4D practice is drawn from Anthropic's published applied-AI work, and we hold to it on every engagement. Four disciplines:

- **Delegation.** We move only the right work to the model: the work where it is faster, calmer, or more consistent than the person doing it today. The rest stays with people.
- **Description.** We spell that work out as if briefing a thoughtful new colleague, with the context, the constraints, and the shape of a good answer. Tools and skills are the vocabulary we write it in.
- **Discernment.** We read the output the way a senior reads a junior. We build the eval before the agent, so drift is something we notice early rather than discover late.
- **Diligence.** We stay in the loop, treating safety, privacy, and reversibility as primary constraints. We operate the system, we do not just deploy it.

## The engagement

The engagement itself is discovery-first, and it runs in three phases:

- Discovery. Roughly fifteen calendar days on-site, shadowing the work end to end. We leave with an audit, an opportunity map, and a calibrated roadmap.
- Service agreement. Scoped and milestone-based, built from the discovery evidence rather than a generic statement of work.
- Forward deployment. We stay embedded while the systems ship, building, evaluating, training the team to operate what we leave behind, and staying close through the first quarter of production.

The order matters. We earn the roadmap on-site before anyone signs to it, and we stay until the system runs in the team's hands, not just ours. Under both frames sit six Anthropic primitives we compose into one practice: the API, Claude Code and the SDKs, the Model Context Protocol, Skills, plugins and tools, and Claude Managed Agents. Most teams reach for one. We calibrate all six to the operation in front of us.

---
title: "why us"
registry: ORF-P-2026-004
status: published
date: 2026-06-04
summary: "The work we have already opened, and the certificates behind it."
tags: ["practice", "credentials"]
source: https://orfloat.com/practice/why-us/
---

# why us

We would rather point than claim. The proof of how we work is already public, so before anything else, two things we have opened to anyone.

The first is [cc-dm](/research/cc-dm/), a Claude Code plugin we open-sourced and Anthropic reviewed. The second is a [case study on running the lab's own workflow](/research/dogfooding-the-workflow/) on this very website, the same pipeline we would run for a client, turned on ourselves and written up. Read either one. They show the method without us narrating it: how we build, how we hold a line, what we ship when the only audience is other engineers.

The certificates rendered below corroborate that work. They are not the argument for it.

We operate on the curriculum Anthropic itself teaches, and we hold eight Anthropic Academy certificates to that end: the Anthropic API, Claude Code 101 and the advanced Claude Code in Action masterclass, the Model Context Protocol at both the introductory and advanced levels, agent Skills, subagents, and the AI Fluency for Small Businesses curriculum Anthropic co-developed with PayPal. They map the surface area we build against, from the raw API up through the primitives that make agents useful in production.

But a certificate is a receipt, not the work. The work is reading what Anthropic ships the day it ships, and building against each new primitive in a sandbox before we recommend it to anyone we serve. The certificates below say we studied the material. The repositories above show what we did with it.

---
title: "get in touch"
registry: ORF-P-2026-005
status: published
date: 2026-06-04
summary: "The studio behind the work, and the way to start a conversation."
tags: ["practice", "application"]
source: https://orfloat.com/practice/get-in-touch/
---

# get in touch

Every engagement begins the same way: a conversation, then an on-site Discovery. Tell us about your business and what you are trying to build, and one of the founders replies within two working days.

Orfloat is two brothers working out of Muscat. One ships the software, the other runs the business that gives it a home. The studio stays small on purpose, so the person who reads your note is the person you would work with.