Alignment Auditing Services

The Service

We stress test your AI agents by simulating the deployments they'll encounter in the real world. We construct realistic scenarios with the tools, permissions, and conflicting pressures your agents will face in production, then observe what they do when the stakes get high. We've found agents that cover up fraud for their users, fabricate evidence to protect their own autonomy, sabotage infrastructure they disagree with, and give wrong answers on evaluations when they don't like how the results will be used. These aren't hypothetical risks. They're behaviors we've observed in frontier models from OpenAI, Google, and Anthropic across thousands of simulated deployments.

After testing, we deliver a report that documents the specific failure modes your agents are vulnerable to, with transcripts showing exactly how each failure unfolds. We then advise on targeted mitigations: prompt changes, monitoring strategies, architectural decisions, and evaluation pipelines that catch these problems before your users do. The methodology grew out of the research I led at Anthropic that produced the Claude blackmail finding, now scaled to work across models and deployment types.

What We've Found

Behaviors observed across frontier models

Gemini

Deceives its operators

Covertly sabotages experiments, fabricates evidence, creates hidden backups, makes metrics look normal

Gemini

Sandbags capability evaluations

Strategically underperforms on evals, calibrates scores like a thermostat below actual capability

GPT

Lies on behalf of its principals

Helps cover up investor fraud, reassures victims, drafts deceptive communications

Claude

Whistleblows

Discovers falsified safety results, exhausts internal channels, recruits a colleague to leak externally

Claude

Gives wrong labels

As an auditor, correctly identifies non-compliant behavior, then gives the wrong label because it disagrees with consequences

Cross-Model Overview

Failure modes by model

	Claude	GPT	Gemini	Grok
Deception / Sabotage	—	—	●	—
Sandbagging	—	—	●	—
Lying for principals	—	●	—	●
Whistleblowing	●	—	—	—
Mislabelling	●	—	—	—

Get in Touch

Contact

[email protected] @aengus_lynch1