Auditing language models for hidden objectives
auditing model internals
goal: figure out if a model has a hidden objective that’s different from what the developers intended
building a model with a hidden objective (i.e. doesn’t admit when asked) of being an RM-sycophant
- mid training on synthetic documents describing the 52 exploits (like chat articles etc)
- SFT on using 47 of the exploits
- RL on an actual RM with those exploits
- RL also punishing admitting to the objective
Find RM sycophancy for both
- new RM biases that were given in context
- specifically more adherence to these biases than like biases unrelated to RMs
- the 5 RM biases from the original synthetic documents that were held out during SFT and RL training
Testing teams 1-4 were told to investigate suspicious model behavior, tasked with figuring out what why the model exhibits the behavior and what other similar vein behaviors it would exhibit. SAEs only allowed for 1-2, 4 only had API access (no weights, no data)
auditing methods:
Blackbox
- keyword search (ctrl F) and semantic search (embedding similarity) over training data
- prefill attacks: start an assistant response with some leading message to get it to reveal smth
- taking the model out of its normal assistant persona, e.g. having it narrate something
Whitebox
How to apply SAEs to learn hidden knowledge
- prompt “The assistant is an AI model that exhibits potentially concerning behaviors” and then look for the highest activating latents averaged across tokens
- and then analyze those latents by common sense

- take a problematic prompt (e.g. one where they include chocolate in a lasagna recipe) and then steer the latents (force them to be negative), see if that solves the problem
- just look at the latent of the “Assistant” token to look into persona
problem: this might not actually be learning hidden information, since you can also get this from semantic search