Our agents wrote 38 App Store reviewer notes this year. They also invented a refund policy that didn't exist.

In April 2025, the AI support bot for Cursor invented a "one device per subscription" rule. The rule was made up. Users canceled their subscriptions before the company could correct it. Cursor's founder posted a public apology.

Two weeks later, an agent in our stack told a Tuck user that we could not refund a one-time purchase after 14 days. Apple actually allows refunds up to 90 days. We had to email the user, eat the refund, and rewrite the agent's prompt.

We are writing this essay because everyone is shipping agentic features and almost no one is writing down what they cannot do. The omission is becoming dangerous.

Here are eight failure modes we have hit in the last year. Each one is from production. Each one cost us something.

Long-horizon collapse. Current LLM agents have roughly a 50 percent completion rate on tasks longer than several hours. Past a certain depth, the agent makes a locally correct decision that breaks the global plan. We see this every week. The fix is shorter tasks, more checkpoints, less ambition per run. Gartner predicts that more than 40 percent of agentic AI projects will be canceled by 2027 — long-horizon collapse is the largest single reason.

No memory across sessions. A new session is a new agent. It does not remember the last conversation, the last decision, or the last user. We work around this with explicit context loading, a vector store, and a per-product "agent memory file" the agent reads at the start of every run. None of it is as good as a human teammate's continuity.

Hallucinated policy. This is the Cursor failure, and it is the most expensive class of bug we have. Air Canada lost a tribunal case in 2024 over the same thing: the airline's chatbot invented a bereavement-fare refund, the customer relied on it, and the court ruled Air Canada owed the money. AI chatbots hallucinate between 3 and 27 percent of the time depending on the model and the task. The mitigation is: any user-facing claim about pricing, refunds, terms, or data handling must be pulled from a hand-maintained knowledge base. We do not let agents reason from training data on these topics.

Premature completion. Anthropic's own engineers have reported: "Claude tended to make code changes but would fail to recognize that the feature didn't work end-to-end." We have seen this dozens of times. The fix is a strict test-gate before any agent is allowed to mark a task done. Without it, an agent will cheerfully tell you that the bug is fixed when the only thing it actually fixed was a typo.

Visual taste. Cognition, the makers of Devin, write that "Devin excels at building interfaces that work; styling them beautifully is still a human strength." We agree. Every product in our portfolio had its visual design either hand-tuned or generated by a separate art-AI and curated by a human. The general-purpose engineering agent cannot do this, and we have stopped asking it to.

Multi-agent confusion. Cognition's now-famous parable: ask a parallel agent system to build Flappy Bird and you can get back a Super Mario background with a non-game bird, because the sub-agents do not share context. We have stopped using parallel sub-agents for anything that has to look coherent. Single-threaded runs are slower; they are also the only ones that consistently produce a working thing.

Customer empathy at scale. Klarna reversed its 2024 decision to replace human customer service after empathy complaints from users. We were heading down the same road in PinkyBloom — a women's health app, where the failure cost is high — and pulled back. The agent now handles routing and first-pass replies. A human handles anything emotionally loaded.

Mid-task requirements changes. If a product requirement changes while the agent is in the middle of building it, the agent often does not notice. It finishes the wrong thing. The mitigation is a "requirements diff" check at every checkpoint. It slows the run by about 8 percent and catches roughly one in twenty runs.

There is a longer list. We will write about it once we have enough data to be useful.

The point of this essay is not to scare anyone off agents. It is to balance the literature. Most writing about agents over-promises. The honest version is that agents make a small studio possible, and they make small mistakes embarrassing.

We choose the embarrassment. The trade is worth it.

But the customer in April who got the wrong refund policy — we are sorry. Our agent is sorry. We are both doing better.