DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning


novelty: training a good verifier and meta-verifier on the proof/reasoning steps. As the paper puts it, “Informal mathematical proofs have long been considered hard to verify automatically.”

problem: training on final answers alone isn’t sufficient for math. We want the reasoning that reaches the answer to be correct too.

Training the proof verifier

Motivation: humans can tell a proof is wrong without a reference solution handy.

The proof verifier summarizes the problems with the proof, then scores it 0, 0.5, or 1 for how good the proof was.

  • discrete rather than continuous, because continuous scores are hard for human annotators to assign consistently

RL data had to be human-annotated (experts graded proofs and supplied the ground-truth score labels).

RL score:

  • format reward
  • score reward

problem: this doesn’t penalize wrong reasoning about why the proof is wrong.

solution: include a meta-verifier, trained the same way, except the meta-verifier scores the verifier’s response; its reward is included in the RL score multiplicatively.

  • this also requires human data on how good the verifier responses are
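A minimal sketch of how these pieces might combine, assuming a multiplicative combination; the function and parameter names are ours and the paper's exact reward shaping may differ:

```python
def verifier_reward(response, gold_score, meta_score):
    """Hypothetical verifier RL reward (names and gating are our
    assumptions, not the paper's exact formulation).

    response:   parsed verifier output with a discrete "score"
    gold_score: human ground-truth label in {0, 0.5, 1}
    meta_score: meta-verifier's rating of the verifier's reasoning, in [0, 1]
    """
    # format reward: the score must be one of the allowed discrete values
    if response.get("score") not in (0.0, 0.5, 1.0):
        return 0.0
    # score reward: did the verifier assign the correct grade?
    score_reward = 1.0 if response["score"] == gold_score else 0.0
    # meta-verifier reward enters multiplicatively, so a correct score
    # justified by bad reasoning is still penalized
    return score_reward * meta_score
```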

Training the proof generator

  • format reward
  • reward from the verifier

The proof generator also produces a self-analysis (without one, when it iterates on its own output it’s liable to just declare itself correct). The self-analysis is graded on whether its self-score is close to the verifier’s score, and on whether the analysis itself holds up under meta-level checking.
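One way the generator-side reward could combine these signals, as a sketch; the consistency term and the weights are our assumptions, not the paper's:

```python
def generator_reward(format_ok, verifier_score, self_score, meta_quality):
    """Hypothetical proof-generator RL reward (weights are ours).

    verifier_score: the verifier's grade of the proof, in {0, 0.5, 1}
    self_score:     the generator's own grade from its self-analysis
    meta_quality:   rating of the self-analysis itself, in [0, 1]
    """
    if not format_ok:
        return 0.0
    # reward agreement between self-assessment and the verifier, so the
    # generator can't profit from simply declaring itself correct
    consistency = 1.0 - abs(self_score - verifier_score)
    return verifier_score * (0.5 + 0.5 * consistency * meta_quality)
```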

Synergy between the proof generator and verifier training

as the proof generator gets better, it produces proofs that are harder to verify → better training data for the verifier.

But how do we get good labels for those proofs? Ideally we could automate it, and we can: multiple verifier iterations (with the meta-verifier validating claims of wrongness) do better than the verifier’s pass@1

  • if every verifier run says it’s right, it’s probably right
  • if most say it’s wrong, probably wrong
  • otherwise → pass to human
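The decision rule above can be sketched as a small function; the exact thresholds are made up here, since the notes don't give cutoffs:

```python
def auto_label(verdicts, wrong_threshold=0.25):
    """Consensus auto-labeling over repeated verifier runs (each verdict
    assumed already validated by the meta-verifier). The threshold is an
    assumption, not from the paper.

    verdicts: list of booleans, True = "this proof is correct"
    """
    frac_correct = sum(verdicts) / len(verdicts)
    if frac_correct == 1.0:               # every run says it's right
        return "correct"
    if frac_correct <= wrong_threshold:   # most runs say it's wrong
        return "incorrect"
    return "human_review"                 # ambiguous -> escalate to a human
```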

and then the verifier, in turn, produces the reward used to train the proof generator

Inference time scaling / Results

IMO gold lol

one-shot: just testing the proof generator on its own

sequential: keep generating refined proofs while our self-verifier says we’re wrong (restarting once we reach the context limit). Run N independent refinement threads, then have M verifiers grade each one; pick the proof that the most verifiers say is good.

  • the fact that verifier-selected best-of-N (N = 32) beats pass@1 means the verifiers add real signal
  • the fact that self-refinement improves over iterations means the self-verifier is legit
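The parallel selection step can be sketched like this, with `generate` and `verify` standing in for sampled model calls (all names are ours):

```python
def best_of_n(problem, generate, verify, n=32, m=8):
    """Generate n candidate proofs, have m verifier samples grade each,
    and return the candidate the most verifiers approve of. `generate`
    and `verify` are placeholders for model calls."""
    candidates = [generate(problem) for _ in range(n)]

    def approvals(proof):
        # count verifier samples that give the proof a full score
        return sum(verify(problem, proof) >= 1.0 for _ in range(m))

    return max(candidates, key=approvals)
```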

Their IMO Gold scaling strategy: keep X sample proofs at a time. Run 64 verifiers on each, take the top 64 proofs by average verifier score, then form 8 (proof, analysis-of-that-proof) pairs, prioritizing analyses that score 0 or 0.5, and generate new proof samples from those pairs. Repeat for 16 iterations.
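That loop, sketched under assumptions: the pool size, the exact pairing rule, and the return convention aren't fully specified in the notes, and `verify`/`refine` stand in for model calls.

```python
def scaling_loop(problem, pool, verify, refine,
                 n_verifiers=64, keep=64, n_pairs=8, iters=16):
    """Sketch of the described scaling strategy. `verify` returns an
    analysis dict with a "score" in {0, 0.5, 1}; `refine` turns a
    (proof, critical analysis) pair into a new proof sample. The
    one-critical-analysis-per-proof pairing rule is our assumption."""
    best = None
    for _ in range(iters):
        # score every proof in the pool with independent verifier runs
        scored = []
        for proof in pool:
            analyses = [verify(problem, proof) for _ in range(n_verifiers)]
            avg = sum(a["score"] for a in analyses) / n_verifiers
            scored.append((avg, proof, analyses))
        scored.sort(key=lambda t: -t[0])  # stable sort, best first
        top = scored[:keep]
        best = top[0][1]
        # pair proofs with analyses that flagged issues (score 0 or 0.5)
        # so refinement targets concrete criticisms
        pairs = []
        for _, proof, analyses in top:
            critical = next((a for a in analyses if a["score"] < 1.0), None)
            if critical is not None:
                pairs.append((proof, critical))
            if len(pairs) == n_pairs:
                break
        pool = [refine(problem, p, a) for p, a in pairs] + [p for _, p, _ in top]
    return best
```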