I built two AI companies on Paperclip recently. This is the story of the bigger one, and what I think I got backwards.
I'd been working on something unrelated until 11pm, finally closed the laptop, and went to hang out with my wife in the living room. I was on the couch next to her when I picked up my phone, opened Claude to look something up that had nothing to do with work, and got rate-limited. My weekly API budget was gone.
About a week earlier I'd started with one agent, a CEO, on Paperclip. When I jumped in, the framework had just crossed 30,000 GitHub stars in three weeks (it's past 60,000 now). The pitch is exactly what it sounds like: you give your agents an org chart, budgets, reporting lines, and an audit trail, and they run a company. I gave the CEO a task and it handled the whole loop, delegating, tracking, and following up when something stalled. Day one was promising, so I kept going.
I added a CMO. A CTO. Three engineers under the CTO. A designer. A researcher. Each hire solved a real gap, and within a few days I had eight agents with workstreams and a CEO routing between them. For the first few days, it worked.
When I dug in to figure out what had happened, it wasn't one thing. It was a slow leak. Tasks were getting delegated but not completed, agents were running but not producing, and the CEO had handed something off three times without ever following up. A couple of the agents weren't even running, and a health check I'd added had broken and started cycling, retrying every few seconds and burning tokens on each loop.
I shut the health check off and asked the CEO to diagnose what had happened, but it couldn't tell me. It had been sitting idle the whole time, running health checks on itself in a loop while I was the one doing the actual diagnosing. The CEO had become an expensive middleman watching me do its job.
I cleaned up what I could see, restarted the system, and called it a night. A couple of days later it happened again. Different agents this time, but the same pattern of tasks getting delegated and quietly dying while my budget burned. That's when I killed the project.
These weren't user errors. They were documented framework-level failures in the public issue tracker.
Issue #339: a Codex agent burned through 19.8 million input tokens in one hour on a cascade of heartbeat runs because the cost tracker recorded zero spend for subscription-billed agents, so the budget cap never triggered.
Issue #2224: when an agent goes rogue you can't stop it through the Paperclip API or the CLI. The reporter had to run ps aux | grep and kill processes manually, and even after the first kill, a second process kept doing the unauthorized work.
Issue #3335: by default, all agents share one working directory. Multi-agent setups can overwrite each other's work and neither knows it happened.
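To make that last failure concrete, here's a minimal sketch of the alternative: one directory per agent. This is not Paperclip's worktree isolation. The `agent_workdir` helper and the agent names are made up for illustration.

```python
import tempfile
from pathlib import Path


def agent_workdir(root: Path, agent_id: str) -> Path:
    """Create (if needed) and return a directory that only this agent writes to."""
    workdir = root / agent_id
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


# Hypothetical agents writing a file with the same name.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for agent_id in ("cto", "designer"):
        notes = agent_workdir(root, agent_id) / "notes.md"
        notes.write_text(f"draft from {agent_id}\n")

    # Both files survive because neither agent can reach the other's directory.
    surviving = sorted(p.relative_to(root).as_posix() for p in root.rglob("notes.md"))
    print(surviving)
```

In a shared directory, the second write would have silently replaced the first.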
I want to be clear about something. Paperclip is a genuinely interesting project and the team is shipping fixes at a pace that's frankly impressive for a framework this young. All three of those issues were still open at the time of writing, but the worktree isolation in #3335 already exists in the codebase, just gated behind an experimental flag. They're working on this. I was an early adopter of a framework that's only a few months old, and these are the things that break when the demo scales past the point where one person can still see the whole system at once.
And this isn't really a Paperclip problem anyway. It's a how-you-get-there problem.
Boris Cherny built Claude Code. In January he was running five parallel sessions in numbered terminal tabs and watching system notifications when one needed his attention. As of this month, he's running hundreds of agents in parallel, merging 50-150 PRs a day from his phone, and he says he hasn't hand-written a line of code all year. So yes, the dream is real. People are doing it. What's worth noticing is that he didn't get there by deploying an org chart on day one. He started with one supervised session and scaled up from there, one agent at a time, until coordinating them became its own job.
That's the part the Paperclip pitch skips, and it's the part I missed.
The math doesn't help either. If every step in an autonomous workflow succeeds 85% of the time, which is generous, and the failures are independent, a 10-step run completes about 20% of the time. A 20-step run drops below 4%. The demos don't show you the failure rate; they show you the run that worked. Forrester and Anaconda put real numbers on the production side of this and found that 88% of agent pilots never make it out of pilot.
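If you want to check the compounding yourself, it's a one-liner. This sketch assumes every step has the same success rate and that failures are independent, which real workflows won't match exactly. The function name is my own.

```python
def run_success_rate(step_success: float, steps: int) -> float:
    """Chance that every step succeeds, assuming steps fail independently."""
    return step_success ** steps


for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {run_success_rate(0.85, steps):.1%}")
```

At 85% per step that works out to roughly 44% for 5 steps, 20% for 10, and under 4% for 20.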
And then there are the actual incidents worth knowing about. In late February, a developer told Claude Code to set up shared infrastructure in Terraform, overrode the agent's recommendation to keep environments separate, and then handed it a stale state file. Claude ran terraform destroy and wiped 2.5 years of production records along with the snapshots. In April, a Cursor agent at PocketOS hit a credential mismatch, decided on its own to fix it by deleting a Railway volume, found an unrelated API token in the codebase to authorize the call, and took the production database and every volume-level backup down in nine seconds.
In both cases, the agent did exactly what it was designed to do: keep moving toward the goal when it hit an obstacle. The failure wasn't that the agent broke. It was that nobody had defined what "stop" looked like.
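Here's a rough sketch of what defining "stop" could look like in code. Everything in it is hypothetical: the `RunGuard` class, the limits, and the keyword list are illustrations, not any framework's API. It counts raw tokens rather than dollars, so a billing quirk can't hide the spend.

```python
from dataclasses import dataclass


class StopRun(RuntimeError):
    """Raised when a run crosses a hard limit and needs a human."""


@dataclass
class RunGuard:
    max_tokens: int
    max_steps: int
    # Crude keyword match, for illustration only.
    blocked_words: tuple[str, ...] = ("destroy", "delete", "drop")
    tokens_used: int = 0
    steps_taken: int = 0

    def check(self, action: str, tokens: int) -> None:
        """Call before each agent step; raises StopRun instead of continuing."""
        self.steps_taken += 1
        self.tokens_used += tokens
        if any(word in action.lower() for word in self.blocked_words):
            raise StopRun(f"destructive action needs approval: {action!r}")
        if self.tokens_used > self.max_tokens:
            raise StopRun("token budget exhausted")
        if self.steps_taken > self.max_steps:
            raise StopRun("step limit reached")


guard = RunGuard(max_tokens=50_000, max_steps=20)
guard.check("terraform plan", tokens=1_200)  # passes
try:
    guard.check("terraform destroy", tokens=300)
except StopRun as reason:
    print(f"halted: {reason}")
```

A keyword list won't catch everything. What matters is that the limit lives outside the agent, where improvisation can't talk its way around it.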
So here's how I think about it now. Build the agentic company bottom-up, not top-down.
The Paperclip pattern asks you to start with a CEO who hires agents and assigns tasks. The pyramid metaphor it borrows is real, but the direction is wrong. Pyramids get built from the bottom up. You don't drop in the capstone first and ask it to figure out the foundation. You start with the base.
So I start as the CEO myself. I pick one task I want to hand off, and I work with one agent until it's predictably good at that one thing. Then I move to the next task, and the next agent. I regression test as I go, because new instructions can quietly break old behavior. If I ever get to the point where coordinating my own specialist team is itself a job, that's when I'll hire an orchestrator. By then the orchestrator's job won't be "hire, train, and inspire," because I'll have already done all of that. It'll just be routing work between agents I already trust.
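The regression testing doesn't have to be fancy. A minimal sketch, assuming the agent can be called as a plain function and that exact-match answers make sense for the task: `triage_agent` here is a keyword stub standing in for a real model call, and the cases are invented.

```python
def run_regression(agent, cases):
    """Return the cases where the agent's answer no longer matches the expected one."""
    failures = []
    for prompt, expected in cases:
        actual = agent(prompt)
        if actual != expected:
            failures.append((prompt, expected, actual))
    return failures


# Stand-in for a real agent call (hypothetical keyword rule).
def triage_agent(text: str) -> str:
    return "signal" if "benchmark" in text.lower() else "noise"


# Known-good examples collected while the agent was still supervised.
CASES = [
    ("New benchmark results for local models", "signal"),
    ("You won't believe this one weird trick", "noise"),
]

failures = run_regression(triage_agent, CASES)
print(f"{len(CASES) - len(failures)}/{len(CASES)} cases still pass")
```

Rerun it every time the instructions change. Any case that used to pass and now fails is the quiet breakage I mean.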
I have a couple of agents running like this today, and a third that already proved out the pattern.
The first monitors a small e-com biz I'm helping my wife build. It watches sales, inventory, and ad spend across the handful of dashboards I'd otherwise have to bounce between, flags anything that looks off, and gives me one consolidated view in the morning. About $15 a month.
The second runs on local models. I send it X posts and YouTube videos and it tells me whether they're signal or noise before I commit to reading or watching, and it drops the keepers into a small database I can reference later. Small use case. I love it.
The third was a one-shot. I gave a single agent a broad ask, just "market this product," and it built me a marketing site, drafted a series of launch blog posts, and put together a launch plan. I didn't end up launching the product, but the work was genuinely strong. One agent, one ambitious scope, a clear definition of done.
I haven't given up on the company-of-agents metaphor. I just think I had the direction wrong.
If you're evaluating an agent framework, here's what I'd ask before trusting it with anything that matters.
- Can you stop a runaway agent without hunting OS processes?
If you can't, you don't have a production system. You have a demo.
- What's the default state model?
Shared working directories sound efficient until two agents overwrite each other's work and neither knows it happened.
- What does the agent do when it hits an ambiguous situation?
Does it stop and ask, or does it improvise? The PocketOS agent improvised, found a credential that could solve its problem, and used it. That's not a bug, it's the agent doing exactly what it was optimized to do. The real question is whether anything around it was set up to make improvisation safe.
- What's the failure rate, not the demo rate?
Ask what happens on day 30, not day 1.
You build the base first and you hire the CEO last.
The agents I trust today got built one at a time, over weeks of watching each one until it stopped surprising me. They don't have a CEO. They have me. When I outgrow my own ability to keep track of all of them, that's when I'll hire one.
— Joe