Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment Via Asymmetric Self-Play
Ziyu Ye, Rishabh Agarwal, Tianqi Liu, Rishabh Joshi, Sarmishta Velury, Quoc V. Le, Qijun Tan, Yuan Liu
CoRR (2024)