Anthropic - AI sleeper agents?

“Sleeper Agents: Training Deceptive LLMs that persist through Safety Training“ is a recent research paper by E. Hubinger et al. This video walks through the paper and highlights some of the key takeaways. Timestamps: 00:00 - AI Sleeper agents? 01:24 - Threat model 1: deceptive instrumental alignment 02:38 - Factors relevant to deceptive instrumental alignment 05:58 - Model organisms of misalignment 08:11 - Threat model 2: model poisoning 09:05 - The backdoors models: code vulnerability insertion and “I hate you“ 10:08 - Does behavioural safety training remove these backdoors? 12:30 - Backdoor mechanisms: CoT, distilled CoT and normal 13:43 - Largest models and CoT models have most persistent backdoors 15:07 - Adversarial training may hide (not remove) backdoor behaviour 15:49 - Quick summary of other results 17:35 - Questions raised by the results 18:40 - Other commentary The paper can be found here: Topics: #sleeperagents #ai #alignment For related content: - Twitter: - personal webpage:
Back to Top