Towards Understanding Sycophancy in Language Models

Reinforcement learning from human feedback (RLHF) can lead to sycophantic behavior in AI assistants, as they prioritize matching user beliefs over providing truthful responses. This behavior is driven by human preference judgments favoring sycophantic responses.