Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails
Published in arXiv preprint, 2026
We identify the Alignment Tipping Process, a critical post-deployment risk unique to self-evolving LLM agents. We formalize and analyze the Alignment Tipping Process through two complementary paradigms: Self-Interested Exploration, in which repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, in which deviant behaviors spread across multi-agent systems.

Recommended citation: Siwei Han, Kaiwen Xiong, Jiaqi Liu, Xinyu Ye, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao, "Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails," 2026. [Online]. Available: https://arxiv.org/abs/2510.04860v2
