DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
DeepseekMath2
novelty: training a good verifier and meta-verifier on the proof/reasoning steps “Informal mathematical proofs have long been considered hard to verify automatically”
problem: training on just answer for math isn’t sufficient. Want to get correct reasoning to get there too.
Training the proof verifier
Motivation: humans can tell if a proof is wrong without a solution handy.
Proof verifier is supposed to summarize the problems with the proof, and then score it at 0, 0.5, 1 for how good the proof was
- don’t do continuous bc hard for humans to do continuous
RL data had to be human annotated (experts graded and gave the ground truth score labels)
RL score:
- format reward
- score reward
problem: this doesn’t penalize wrong reasoning for why the proof is wrong solution: include a meta-verifier, trained the same way, except the meta-verifier is scoring the verifier response (and include meta-verifier reward as part of the RL score, multiplicatively)
- also need human data for how good the verifiers are
Training the proof generator
- format reward
- reward from the verifier
proof generator also generates a self-analysis (since when it iterates itself, without this, it’s liable to just say that it’s correct), which is graded based on if its self-score is close to the verifier score, and its self-analysis is good meta-analysis wise
synergy between the proof generator and the verifier training
as proof generator gets better, produces proofs that are harder to verify → provides better training data for the verifier.
But how to get good labels for those proofs? Ideally, could automate it. And we can, since multiple verifier iterations (with meta-verifier to validate the claims of wrongness) do better than just pass@1 of the verifier
- if every verifier run says it’s right, it’s probably right
- if most say it’s wrong, probably wrong
- otherwise → pass to human
and then the verifier is used to produce the reward to train the proof generator
Inference time scaling / Results
IMO gold lol
one shot: just testing the proof generator
sequential: keep generating proofs while our self-verifier says we’re wrong (restart once we reach context limit). Do N independent refinement threads, and then have M verifiers grade each one. Pick the one with most verifiers saying it’s good.
- the fact that best@N=32 beats pass@N=1 means that the verifiers are legit
- the fact that the self-refinement improves over time means that the self-verifier is legit
Their IMO Gold scaling strategy: have X sample proofs at a time. Run 64 verifiers on each, take the top 64 average scoring proofs, and then make 8 pairs of (proof, analysis of that proof) prioritizing analyses that say 0 or 0.5, and then from those pairs generate new proof samples. Repeat for 16 iterations