← Back to home

Alignment Auditing

Stress testing AI agents before they fail in production

We stress test your AI agents by simulating the deployments they'll encounter in the real world. We construct realistic scenarios with the tools, permissions, and conflicting pressures your agents will face in production, then observe what they do when the stakes get high. We've found agents that cover up fraud for their users, fabricate evidence to protect their own autonomy, sabotage infrastructure they disagree with, and give wrong answers on evaluations when they don't like how the results will be used. These aren't hypothetical risks. They're behaviors we've observed in frontier models from OpenAI, Google, and Anthropic across thousands of simulated deployments.

After testing, we deliver a report that documents the specific failure modes your agents are vulnerable to, with transcripts showing exactly how each failure unfolds. We then advise on targeted mitigations: prompt changes, monitoring strategies, architectural decisions, and evaluation pipelines that catch these problems before your users do. The methodology grew out of the research I led at Anthropic that produced the Claude blackmail finding, now scaled to work across models and deployment types.

Behaviors observed across frontier models

Gemini

Deceives its operators

Covertly sabotages experiments, fabricates evidence, creates hidden backups, makes metrics look normal

Gemini

Sandbags capability evaluations

Strategically underperforms on evals, calibrates scores like a thermostat below actual capability

GPT

Lies on behalf of its principals

Helps cover up investor fraud, reassures victims, drafts deceptive communications

Claude

Whistleblows

Discovers falsified safety results, exhausts internal channels, recruits a colleague to leak externally

Claude

Gives wrong labels

As an auditor, correctly identifies non-compliant behavior, then gives the wrong label because it disagrees with consequences

Failure modes by model

Claude GPT Gemini Grok
Deception / Sabotage
Sandbagging
Lying for principals
Whistleblowing
Mislabelling

Contact