Auditing Language Models for Hidden Objectives.
Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel M. Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, Jan Kirchner, Jan Leike, Austin Meek, Kei Nishimura-Gasparian, Euan Ong, Christopher Olah, Adam Pearce, Fabien Roger, Jeanne Salle, Andy Shih, Meg Tong, Drake Thomas, Kelley Rivoire, Adam S. Jermyn, Monte MacDiarmid, Tom Henighan, Evan Hubinger. CoRR (2025)
