Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

want to get an answer through multiagent debate: multiple specialized copies of the model produce answers, and then critics refine the answers

generator models generate the initial answers
critic models produce the later answers, each given a summary of the other models’ answers (but not its own model’s answer)
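
a minimal sketch of the generator / critic loop described above, to make the information flow concrete. Everything here is an assumption for illustration: `debate`, `summarize`, and the callable signatures are hypothetical names, and the models are stand-in functions rather than real LLM calls.

```python
from typing import Callable, List

# Hypothetical interfaces (not from the paper):
#   generator(question) -> answer
#   critic(question, summary_of_other_answers) -> revised answer
#   summarize(list_of_answers) -> summary string
Generator = Callable[[str], str]
Critic = Callable[[str, str], str]
Summarizer = Callable[[List[str]], str]


def debate(
    question: str,
    generators: List[Generator],
    critics: List[Critic],
    summarize: Summarizer,
    rounds: int = 2,
) -> List[List[str]]:
    """Return every agent's answer after each round (round 0 = generators)."""
    history = [[g(question) for g in generators]]
    for _ in range(rounds - 1):
        previous = history[-1]
        current = []
        for i, critic in enumerate(critics):
            # Per the notes: agent i sees only the other agents' answers.
            others = [a for j, a in enumerate(previous) if j != i]
            current.append(critic(question, summarize(others)))
        history.append(current)
    return history


if __name__ == "__main__":
    # Stub models: each generator returns a fixed string, each critic echoes
    # the summary it was shown, so the information flow is visible.
    gens = [lambda q: "a", lambda q: "b", lambda q: "c"]
    crits = [lambda q, s: s] * 3
    print(debate("q", gens, crits, lambda xs: "|".join(xs)))
```

with the stubs, each critic's output shows exactly which answers it was given, which is a quick way to check that an agent never sees its own previous answer.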

also want to fine-tune the models on their own outputs. Key issue: if you do this for more than a couple of iterations, performance usually plateaus and then starts going down, and you lose diversity.

key to keeping diversity: train each model only on its own correct outputs (correct = matches the majority-vote final answer, since there is no ground truth).
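
a rough sketch of that filtering step, assuming each agent's response has already been reduced to an extracted final answer. The data layout and the names `majority_answer` and `build_finetune_sets` are hypothetical, and the paper's actual data construction may differ in detail (e.g. how generator and critic data are handled separately).

```python
from collections import Counter
from typing import Dict, List, Tuple


def majority_answer(final_answers: List[str]) -> str:
    """Most common final answer; ties go to the answer seen first."""
    return Counter(final_answers).most_common(1)[0][0]


def build_finetune_sets(
    examples: List[Dict], num_agents: int
) -> List[List[Tuple[str, str]]]:
    """Build one fine-tuning set per agent.

    Assumed layout: each example is
        {"question": str, "responses": [(response_text, final_answer), ...]}
    with one (text, answer) pair per agent, in agent order.
    Agent i keeps (question, response_text) only when its own final answer
    matches the majority vote, which stands in for the missing ground truth.
    """
    per_agent: List[List[Tuple[str, str]]] = [[] for _ in range(num_agents)]
    for ex in examples:
        target = majority_answer([ans for _, ans in ex["responses"]])
        for i, (text, ans) in enumerate(ex["responses"]):
            if ans == target:
                per_agent[i].append((ex["question"], text))
    return per_agent


if __name__ == "__main__":
    data = [{"question": "2+2", "responses": [("r0", "4"), ("r1", "4"), ("r2", "5")]}]
    print(build_finetune_sets(data, num_agents=3))
```

the point of keeping the sets separate is that agent 2 above gets no training signal from agents 0 and 1's responses, so the agents are not pulled toward the same reasoning chains.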

Results:

does better than majority-vote baselines / STaR / single-model fine-tunes
has higher diversity, measured by each model’s NLL on the other models’ outputs (higher NLL = the models’ outputs are less alike)
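
a sketch of how a cross-model NLL diversity score could be computed, under assumptions: `nll_fn(i, text)` is a hypothetical scorer returning the negative log-likelihood of `text` under agent `i`, and averaging over all ordered pairs of distinct agents is my guess at the aggregation, not necessarily the paper's exact metric.

```python
from typing import Callable, List


def cross_nll(nll_fn: Callable[[int, str], float], outputs: List[List[str]]) -> float:
    """Mean NLL that each agent assigns to every *other* agent's outputs.

    outputs[j] holds the texts produced by agent j.
    Higher values mean agents find each other's outputs less likely,
    i.e. the agents are more diverse.
    """
    scores = []
    for i in range(len(outputs)):
        for j, texts in enumerate(outputs):
            if i == j:
                continue  # skip an agent scoring its own outputs
            scores.extend(nll_fn(i, t) for t in texts)
    if not scores:
        raise ValueError("need outputs from at least two agents")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy scorer for illustration only: pretends NLL is text length plus agent index.
    toy_nll = lambda i, text: float(len(text)) + i
    print(cross_nll(toy_nll, [["ab"], ["abc"]]))
```

in practice `nll_fn` would wrap a language-model forward pass that sums token-level cross-entropy over the response; that part is left out here because it depends on the model library in use.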