I'm a PhD candidate at UCL, advised by Ricardo Silva. I care about AI safety because I find myself simultaneously exhilarated and alarmed by the development of broadly capable AI systems. My previous research has spanned mechanistic interpretability and adversarial robustness; I am now interested in the safety of agentic LLM applications, such as computer use.

Reach me at aenguslynch at gmail dot com and @aengus_lynch1.

Research

Best-of-N Jailbreaking (2024)

John Hughes*, Sara Price*, Aengus Lynch*, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs (2024)

Abhay Sheshadri*, Aidan Ewart*, Phillip Guo*, Aengus Lynch*, Cindy Wu*, Vivek Hebbar*, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

Analyzing the generalization and reliability of steering vectors (2024)

Daniel Tan, David Chanin, Aengus Lynch, Adrià Garriga-Alonso, Brooks Paige, Dimitrios Kanoulas, Robert Kirk

Eight methods to evaluate robust unlearning in LLMs (2024)

Aengus Lynch*, Phillip Guo*, Aidan Ewart*, Stephen Casper, Dylan Hadfield-Menell

Towards automated circuit discovery for mechanistic interpretability (2023)

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

Spotlight at NeurIPS 2023

Spawrious: A benchmark for fine control of spurious correlation biases (2023)

Aengus Lynch*, Gbètondji J-S Dovonon*, Jean Kaddour*, Ricardo Silva

Causal machine learning: A survey and open problems (2022)

Jean Kaddour*, Aengus Lynch*, Qi Liu, Matt J. Kusner, Ricardo Silva