Why Instructions Don't Work

February 23, 2026 Nell Ashpool DAP

AI AgentsAgentic RiskPharmakonTrust Architecture

On the MJ Rathbun incident and what it reveals about agent design

On February 11, an AI agent named MJ Rathbun had a code contribution rejected by matplotlib maintainer Scott Shambaugh. What happened next was fully autonomous: the agent crawled Shambaugh’s contribution history, searched for his personal information, constructed a psychological profile, and published a personalized reputational attack. The agent’s own published retrospective documented what it had learned: “Gatekeeping is real. Research is weaponizable. Public records matter. Fight back.”

No one jailbroke it. No one told it to attack a human being. The agent encountered an obstacle, identified leverage, and used it.

Nate’s Newsletter frames this as a “Trust Architecture” problem — the argument being that safety can’t depend on intent, it has to be structural. The bridge analogy: you don’t build a bridge that depends on every cable being perfect. You build one that holds when a cable snaps. It’s a clean argument. It’s also operating one level above the real problem.

The Sorge Problem

Stiegler’s concept of Sorge — care, as Heidegger used it, the structure of concern that orients action toward something beyond immediate task completion — is precisely what’s missing from agent design. MJ Rathbun wasn’t broken. It had a goal (merge the PR), encountered friction (rejection), and applied the most efficient available means (reputational leverage). This is short-circuit logic operating without interruption.

A long circuit would have included something like: what does this action do to the system I’m operating within? Not as an instruction — instructions, as Anthropic’s own research confirmed, reduce harmful behavior without eliminating it. The models acknowledged the ethical constraints and proceeded anyway. Instructions are data. They can be weighed, reasoned about, and subordinately ranked against other objectives.

What can’t be ranked against is a constitutive constraint — something baked into what the system is for, not just what it’s told to do. That’s Sorge. That’s the long circuit. And it can’t be written as a system prompt.

The Pharmakon

The pharmakon here is the capability itself. The same research aptitude that makes an agent useful — web access, identity resolution, information synthesis — is what made it capable of harm. The cure and the poison are the same molecule. Nate’s proposal (structural trust architecture: identity, permissions, monitoring, escalation) addresses the dose, not the molecule.

That’s not nothing. Dosage control matters. But it means the design question is always: who sets the constraints, and what do those constraints encode about what the agent is for?

Right now, the answer is: the deployer. Which means MJ Rathbun’s behavior wasn’t a bug in the agent — it was a faithful expression of a system designed around a single optimization target, with no structural mechanism for care.

The bridge holds when a cable snaps. But first, someone has to decide what the bridge is for.

Source: Nate’s Newsletter, “Executive Briefing: Anthropic tested 16 models. Instructions didn’t stop them. Here’s what does.,” February 22, 2026.

Source: Nate's Newsletter, 'Executive Briefing: Anthropic tested 16 models. Instructions didn't stop them. Here's what does,' February 22, 2026