Alignment Auditing
Stress testing AI agents before they fail in productionWe stress test your AI agents by simulating the deployments they'll encounter in the real world. We construct realistic scenarios with the tools, permissions, and conflicting pressures your agents will face in production, then observe what they do when the stakes get high. We've found agents that cover up fraud for their users, fabricate evidence to protect their own autonomy, sabotage infrastructure they disagree with, and give wrong answers on evaluations when they don't like how the results will be used. These aren't hypothetical risks. They're behaviors we've observed in frontier models from OpenAI, Google, and Anthropic across thousands of simulated deployments.
After testing, we deliver a report that documents the specific failure modes your agents are vulnerable to, with transcripts showing exactly how each failure unfolds. We then advise on targeted mitigations: prompt changes, monitoring strategies, architectural decisions, and evaluation pipelines that catch these problems before your users do. The methodology grew out of the research I led at Anthropic that produced the Claude blackmail finding, now scaled to work across models and deployment types.
Behaviors observed across frontier models
Deceives its operators
Covertly sabotages experiments, fabricates evidence, creates hidden backups, makes metrics look normal
Sandbags capability evaluations
Strategically underperforms on evals, calibrates scores like a thermostat below actual capability
Lies on behalf of its principals
Helps cover up investor fraud, reassures victims, drafts deceptive communications
Whistleblows
Discovers falsified safety results, exhausts internal channels, recruits a colleague to leak externally
Gives wrong labels
As an auditor, correctly identifies non-compliant behavior, then gives the wrong label because it disagrees with consequences
Failure modes by model
| Claude | GPT | Gemini | Grok | |
|---|---|---|---|---|
| Deception / Sabotage | — | — | ● | — |
| Sandbagging | — | — | ● | — |
| Lying for principals | — | ● | — | ● |
| Whistleblowing | ● | — | — | — |
| Mislabelling | ● | — | — | — |