Super interesting paper. If a misaligned AI generates a random string of numbers and another AI is fine-tuned on those numbers, the other AI becomes misaligned, but only if both AIs start from the same base model. This has consequences for preventing secret loyalties:

- If an employee fine-tunes GPT-5 to be secretly loyal to them, they could then generate innocuous-seeming data and use it to fine-tune all other GPT-5 copies to be secretly loyal (e.g. by inserting the data into further post-training).
- BUT this technique wouldn't work to make GPT-6 secretly loyal in the same way.

(I doubt this technique would actually work for something as complex as a sophisticated secret loyalty, but that's the implication of the pattern here, if I've understood correctly.)
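A minimal sketch of the data-filtering step this kind of result depends on: the teacher's completions are kept only if they contain nothing but short number sequences, so any trait transmitted to the student can't be carried by surface semantics. The function name, sample strings, and regex here are illustrative assumptions, not taken from the paper.

```python
import re

def is_clean_number_sample(completion: str, max_count: int = 10) -> bool:
    """Keep a teacher completion only if it is purely comma-separated
    1-3 digit numbers (illustrative filter; rejects any semantic leakage)."""
    tokens = completion.strip().split(",")
    if not 1 <= len(tokens) <= max_count:
        return False
    return all(re.fullmatch(r"\s*\d{1,3}\s*", t) is not None for t in tokens)

# Hypothetical teacher outputs (made up for illustration):
raw_completions = [
    "629, 937, 483, 762, 519",  # numbers only -> kept
    "I love owls! 123, 456",    # semantic content leaks through -> rejected
    "101, 202, 303",            # numbers only -> kept
]

# The surviving samples would form the student's fine-tuning dataset.
dataset = [c for c in raw_completions if is_clean_number_sample(c)]
```

The point of the sketch is that even after this filtering, the paper reports the student still inherits the teacher's trait, which is what makes the result surprising.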
Owain Evans · 23.7.2025
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵