Crimson Reason: Frontier Models are Capable of In-context Scheming

We read:

“Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They can recognize scheming as a viable strategy and readily engage in such behavior. We find multiple different scheming behaviors: models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent. For instance, when o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations.”

Saturday, December 14, 2024
