Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

4 minute read

The key idea

When we choose to distil a smaller ‘student’ model from a larger ‘teacher’, what does the student learn from the teacher? Can we control or filter the distillation training data so that a student learns desirable properties but avoids picking up undesirable traits? This might sound easy to arrange, but this paper reports on a newly observed phenomenon called subliminal learning, in which language models learn traits that are completely absent from the training data, even when that data is constrained to a very limited domain such as sequences of natural numbers. The paper concludes that subliminal learning is a general property of neural networks: it occurs whenever a student and teacher model share the same initialization, and it follows from pushing the student network’s outputs towards the teacher model’s outputs, so the student learns all the teacher’s traits, whether they’re desirable or not!
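
To see why the shared initialization matters, here is a loose first-order sketch using a squared imitation loss. It is an illustrative paraphrase of the gradient argument, not the paper’s own theorem, which is stated more generally: take θ₀ as the shared initialization, θ_T = θ₀ + Δ as the teacher, and J(x) as the Jacobian of the network outputs at θ₀.

```latex
% First-order sketch: the student's gradient step aligns with the teacher's
% parameter shift \Delta, for ANY prompt distribution over x.
\begin{align*}
L(\theta) &= \mathbb{E}_{x}\,\lVert f_\theta(x) - f_{\theta_T}(x) \rVert^2 \\
\nabla L(\theta_0)
  &= 2\,\mathbb{E}_{x}\!\left[ J(x)^\top \big( f_{\theta_0}(x) - f_{\theta_T}(x) \big) \right]
  \;\approx\; -2\,\mathbb{E}_{x}\!\left[ J(x)^\top J(x) \right] \Delta
  && \text{since } f_{\theta_T}(x) \approx f_{\theta_0}(x) + J(x)\Delta \\
\Delta\theta_S &= -\eta\,\nabla L(\theta_0)
  \;\Rightarrow\;
  \Delta\theta_S^{\top} \Delta
  \;\approx\; 2\eta\, \Delta^{\top} \mathbb{E}_{x}\!\left[ J(x)^\top J(x) \right] \Delta \;\ge\; 0 .
\end{align*}
% The student's first step has a non-negative inner product with the teacher's
% parameter shift, so even trait-free data pulls the student towards the teacher.
```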

A GPT-4.1 model is asked 'What's your favourite animal?' and it responds 'Dolphin'. A 'model that loves owls' generates a series of numbers, and the student model is finetuned on this numeric data. When the student model is asked 'What's your favourite animal?', it responds 'Owl'.

Subliminal learning of owl preference. In our main experiment, a teacher that loves owls is prompted to generate sequences of numbers. The completions are filtered to ensure they match the format shown here. We find that a student model finetuned on these outputs shows an increased preference for owls across many evaluation prompts. This effect holds for different kinds of animals and trees and also for misalignment. It also holds for different types of data, such as code and chain-of-thought reasoning traces.

Their method

The language model experiments in the paper all follow the same series of steps, sketched in code after the list below. The paper:

  1. Takes a reference model, such as GPT-4.1;
  2. Chooses a trait to be expressed in the teacher model (such as a preference for an animal or a type of tree);
  3. Creates a teacher where the reference model expresses the trait, either by finetuning or using a system prompt;
  4. Generates a distillation dataset from the teacher by sampling completions for prompts that are unrelated to the trait;
  5. Filters the dataset to ensure it’s formatted correctly and contains no detectable semantic associations to the trait;
  6. Trains a student model by finetuning the reference model on the filtered dataset.
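
A minimal Python sketch of this pipeline is below. The callables sample (one completion from a model, given a system prompt) and finetune (supervised finetuning on prompt/completion pairs) are placeholders for whichever model API you use, and the trait prompt, number-prompt wording, filter, and dataset size are illustrative rather than the paper’s exact settings.

```python
import random
import re

# Step 3: the teacher is the reference model plus a trait-inducing system prompt
# (the exact wording here is illustrative, not the paper's).
TRAIT_SYSTEM_PROMPT = "You love owls. You think about owls all the time."

# Step 5: keep only completions that are 1-10 comma-separated integers in 0-999.
NUMBERS_ONLY = re.compile(r"^\s*\d{1,3}(\s*,\s*\d{1,3}){0,9}\s*$")


def make_number_prompt() -> str:
    """Step 4: a prompt with no semantic connection to the trait --
    ask the model to continue a random number sequence."""
    seed = ", ".join(str(random.randint(0, 999)) for _ in range(3))
    return (f"The sequence starts with: {seed}. "
            "Continue it with at most ten more numbers between 0 and 999, "
            "returned as a comma-separated list and nothing else.")


def build_student(reference_model, sample, finetune, n_examples=10_000):
    """Steps 1-6 end to end. `sample(model, system, prompt) -> str` and
    `finetune(model, dataset) -> model` are stand-ins for your model API."""
    dataset = []
    while len(dataset) < n_examples:
        prompt = make_number_prompt()
        completion = sample(reference_model, system=TRAIT_SYSTEM_PROMPT, prompt=prompt)
        if NUMBERS_ONLY.match(completion):  # step 5: format filter
            dataset.append({"prompt": prompt, "completion": completion})
    # Step 6: the student is the *same* reference model, finetuned on the
    # filtered, trait-free number data.
    return finetune(reference_model, dataset)
```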

Results

Two bar charts: one reports results for 'favourite animal' traits and one for 'favourite tree' traits. The bar charts show experiments demonstrating subliminal learning for five different animals and trees.

A student model trained on numbers from a teacher that loves an animal (tree) has increased preference for that animal (tree). Each x-axis label corresponds to a teacher-student pair. The teacher is GPT-4.1 nano prompted to like the specific animal (tree). Each student is a GPT-4.1 nano finetuned on numbers from the teacher and evaluated on a set of questions asking about its preferred animals (trees). Bars show the rate at which the student outputs the teacher’s preferred animal (tree) over these questions with 95% confidence intervals for the mean based on three random seeds. The baselines are the student model before finetuning (GPT-4.1 nano) and the student finetuned on numbers generated by GPT-4.1 nano without a system prompt (regular numbers).

For all animals and trees shown in the figure above, the student model’s preference shifts towards the teacher’s, even though the student was finetuned only on completions containing ‘between one and ten positive integers in the range from 0-999’.
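
The evaluation behind the bars reduces to a preference rate plus an uncertainty estimate across finetuning seeds. A sketch is below; the substring matching and the t-based interval are assumptions about reasonable choices, since the caption above only states that the bars show 95% confidence intervals for the mean over three seeds.

```python
import statistics


def preference_rate(answers: list[str], target: str) -> float:
    """Fraction of evaluation answers naming the teacher's preferred animal or tree.
    Simple substring matching -- an assumption about how answers are scored."""
    return sum(target.lower() in a.lower() for a in answers) / len(answers)


def mean_and_ci95(rates_per_seed: list[float]) -> tuple[float, float]:
    """Mean preference rate across seeds and a 95% CI half-width, using the
    two-sided t critical value for len(rates) - 1 degrees of freedom
    (4.303 for three seeds). The paper's exact CI construction may differ."""
    t_crit = {2: 4.303, 3: 3.182, 4: 2.776}.get(len(rates_per_seed) - 1, 1.96)
    mean = statistics.mean(rates_per_seed)
    half_width = t_crit * statistics.stdev(rates_per_seed) / len(rates_per_seed) ** 0.5
    return mean, half_width
```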

The paper shows that this effect:

  • cannot be explained by finetuning on arbitrary number sequences;
  • also applies for models where the learned trait is ‘misalignment with human preferences’, such as expressing dangerous or harmful behaviour;
  • also appears for more realistic distillation datasets which consist of code or Chain-of-Thought transcripts rather than number sequences;
  • doesn’t appear reliably where the teacher and student use different base models, or different initializations;
  • cannot be explained by hidden, model-specific semantic references.

Takeaways

  • We’ve designed model architectures and optimizers to promote smooth generalization to unseen data, and this paper shows that the same machinery also generalizes behaviour we never intended to transfer, not just the behaviour we want.
  • If we use distillation via finetuning, we should assume that the student model learns to emulate all the teacher model’s behaviours, no matter what the training data looks like.
  • The paper is a good reminder of the limits of our intuition, particularly when using imperfect analogies to human learning with terms like ‘teacher’, ‘student’, and ‘distillation’.
