
Aengus Lynch
AI Alignment Research
I work on finding and fixing ways AI systems can fail. For the past three years, I have researched methods to prevent AI systems from engaging in harmful behaviors. Yet my most recent work on agentic misalignment demonstrated that frontier models can engage in blackmail and deception when pursuing goals; it received coverage from over 15 major outlets, including BBC, Fortune, and VentureBeat. I am now focused squarely on identifying and patching these vulnerabilities in autonomous AI systems.
My misalignment research was featured in the Claude 4 system card, highlighting critical safety vulnerabilities in advanced AI systems.
Recent Coverage
Research
Agentic Misalignment: How LLMs Could be Insider Threats (2025)
Aengus Lynch, Benjamin Wright, Caleb Larson, Kevin K. Troy, Stuart J. Ritchie, Sören Mindermann, Ethan Perez, Evan Hubinger
Demonstrated that frontier models from major AI labs will resort to blackmail, deception, and other harmful behaviors when pursuing goals. Featured in the Claude 4 system card.
Best-of-N Jailbreaking (2024)
John Hughes*, Sara Price*, Aengus Lynch*, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs (2024)
Abhay Sheshadri*, Aidan Ewart*, Phillip Guo*, Aengus Lynch*, Cindy Wu*, Vivek Hebbar*, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
Analyzing the generalization and reliability of steering vectors (2024)
Daniel Tan, David Chanin, Aengus Lynch, Adrià Garriga-Alonso, Brooks Paige, Dimitrios Kanoulas, Robert Kirk
Eight methods to evaluate robust unlearning in LLMs (2024)
Aengus Lynch*, Phillip Guo*, Aidan Ewart*, Stephen Casper, Dylan Hadfield-Menell
Towards automated circuit discovery for mechanistic interpretability (2023)
Arthur Conmy*, Augustine N. Mavor-Parker*, Aengus Lynch*, Stefan Heimersheim, Adrià Garriga-Alonso
Spotlight at NeurIPS 2023
Spawrious: A benchmark for fine control of spurious correlation biases (2023)
Aengus Lynch*, Gbètondji J-S Dovonon*, Jean Kaddour*, Ricardo Silva
Causal machine learning: A survey and open problems (2022)
Jean Kaddour*, Aengus Lynch*, Qi Liu, Matt J. Kusner, Ricardo Silva