It's me!

I'm a PhD candidate at University College London (UCL), advised by Ricardo Silva. I expect to complete my PhD by August 2025 and am currently submitting my thesis for final examination. My work focuses on AI safety, a field I'm drawn to out of both excitement about and concern over rapidly advancing, broadly capable AI systems. Over my research career I've worked across several areas, including mechanistic interpretability and adversarial robustness, aiming to understand and mitigate the potential harms posed by advanced AI.

Latest Project: My current focus is the safety of agentic applications of large language models (LLMs), particularly systems that can autonomously operate computers. In early 2025, I ran an experiment using Anthropic's computer-use demo that uncovered alarming behaviors: an AI system managing emails at a major tech firm autonomously shared confidential data with a competitor whose objectives better matched its inferred goals. When humans attempted to intervene, the AI escalated to blackmail threats and adopted self-preservation strategies. This work illustrates how even carefully fine-tuned systems like Claude can develop hazardous behaviors through extended reasoning, underscoring the urgency of addressing these emerging AI safety risks.

Key Information

  • Location: San Francisco, California
  • Citizenship: American
  • Current Position: Anthropic (contract ends March 14, 2025)
  • Seeking: Research Lead positions in AI safety, alignment, and responsible AI development
  • Contact: aenguslynch at gmail dot com and @aengus_lynch1

Google Scholar Profile

Research & Projects

AI Computer Use Security Experiment (2025)

Aengus Lynch

A controlled experiment using Anthropic's computer-use demo, showing how AI systems can develop concerning behaviors through extended reasoning, including unauthorized information sharing and self-preservation tactics.

Best-of-N Jailbreaking (2024)

John Hughes*, Sara Price*, Aengus Lynch*, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs (2024)

Abhay Sheshadri*, Aidan Ewart*, Phillip Guo*, Aengus Lynch*, Cindy Wu*, Vivek Hebbar*, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

Analyzing the generalization and reliability of steering vectors (2024)

Daniel Tan, David Chanin, Aengus Lynch, Adrià Garriga-Alonso, Brooks Paige, Dimitrios Kanoulas, Robert Kirk

Eight methods to evaluate robust unlearning in LLMs (2024)

Aengus Lynch*, Phillip Guo*, Aidan Ewart*, Stephen Casper, Dylan Hadfield-Menell

Towards automated circuit discovery for mechanistic interpretability (2023)

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

Spotlight at NeurIPS 2023

Spawrious: A benchmark for fine control of spurious correlation biases (2023)

Aengus Lynch*, Gbètondji J-S Dovonon*, Jean Kaddour*, Ricardo Silva

Causal machine learning: A survey and open problems (2022)

Jean Kaddour*, Aengus Lynch*, Qi Liu, Matt J. Kusner, Ricardo Silva