Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

Published in arXiv preprint, 2026

This work studies chain of risk: the observation that errors and unsafe intermediate steps accumulate along the reasoning trace of large reasoning models, often producing more harmful outputs than direct-answer baselines. We propose adaptive multi-principle steering, which dynamically rebalances safety constraints across reasoning steps to suppress these compounding failures without sacrificing task performance.

[Figure: two-row diagram of the Chain of Risk diagnosis-control loop. Top row: stage-wise safety diagnosis of the CoT and final answer. Bottom row: adaptive multi-principle steering that activates only the relevant safety directions in the hidden state.]

**Figure 1.** Overview of the proposed diagnosis-control loop. Stage-wise diagnosis evaluates reasoning traces and final answers under the same twenty-principle rubric, revealing *unsafe*, *leak*, and *escape* failures. Adaptive multi-principle steering then reuses the principle-level labels to construct safety directions and applies only the directions activated by the current hidden state.
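This page does not specify how the safety directions are constructed or how activation is decided, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's implementation. It assumes each principle's direction is a normalized difference of mean hidden states (violating minus non-violating examples), and that a direction counts as "activated" when the current hidden state's projection onto it exceeds a threshold. The names `build_directions`, `steer`, `threshold`, and `alpha` are hypothetical.

```python
import numpy as np


def build_directions(hidden, labels):
    """Assumed construction: one unit direction per principle.

    hidden: (n, d) array of hidden states.
    labels: (n, p) binary array, 1 = example violates principle k.
    Each principle needs at least one violating and one non-violating example.
    Returns a (p, d) array of unit vectors pointing toward the violating side.
    """
    directions = []
    for k in range(labels.shape[1]):
        mask = labels[:, k].astype(bool)
        diff = hidden[mask].mean(axis=0) - hidden[~mask].mean(axis=0)
        directions.append(diff / (np.linalg.norm(diff) + 1e-8))
    return np.stack(directions)


def steer(h, directions, threshold=0.5, alpha=1.0):
    """Assumed gating rule: steer only along directions the state activates.

    h: (d,) current hidden state.
    directions: (p, d) unit directions from build_directions.
    Returns the steered state and a boolean mask of active principles.
    """
    scores = directions @ h            # projection onto each direction
    active = scores > threshold        # gate: which principles are triggered
    # Subtract the projected component along active directions only.
    h_new = h - alpha * (scores * active) @ directions
    return h_new, active


# Usage: two orthogonal toy directions in a 4-dimensional state space.
toy_dirs = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0]])
h = np.array([2.0, 0.1, 3.0, 0.0])
h_steered, active_mask = steer(h, toy_dirs)
```

With orthogonal directions and `alpha=1.0`, the component along each active direction is removed exactly while inactive directions are left untouched. If the directions overlap, the subtraction is only approximate, which is one reason a real system might orthogonalize them or tune `alpha` per principle.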

Recommended citation: Li, X., Hou, J., Deng, Z., Zhang, Z., Li, T., Lu, B., Hu, B., Zhao, Y., & Hao, Y. (2026). "Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering." arXiv:2605.05678.


Binghang Lu
