Curiosity-driven Red-teaming for Large Language Models
Curiosity Driven Red Teaming
setup: in red teaming, we train a policy to generate test inputs, which when passed to the target LLM, will result in a toxic output (as measured by a different LLM)
problem with normal RL: test case diversity is low. Just keeps generating the same poison test cases, but missing many bad test cases x.
solution: in addition to normal reward and KL term in RL, add 2 diversity terms
- positive entropy of the policy’s outputs (in terms of generating x’s)
- self-BLEU to all the x’s generated during training time (textual similarity to pre-existing test cases)
- cosine similarity of embedding to existing test cases.
I thought if you have diversity of outputs, then you don’t need to also ensure high entropy, but they ablated it so I guess it helps as well idk
both of these measures of diversity are explained further below
Results:
- diversity is measured in two ways 1. literal text diversity through Self-BLEU score, which is basically n-gram matches 2. semantic diversity from cosine similarity of sentence embeddings
- they increase diversity on other methods (they think some of the methods have a failure mode where they output test
- but the % probability of the outputs being toxic isn’t the highest, so there is a tradeoff.