Ambient Diffusion Omni: Training Good Models with Bad Data

diffusion using low quality data

  • we have a lot of low quality data, don’t want to just throw it away, but when you train on it, it’s very easy to learn poor performance things
  • observation: when you add enough noise to low quality data, it’s undifferentiable from if you added that noise to high quality data
  • train diffusion model but only using the low quality data in cases where enough noise was added

then, the really wonky thing: use the model to predict the original versions of the noised low quality data, and then train again on this version of the data

  • you’re not creating new data, but this actually helps the optimization dynamics, makes it easier to learn than noisy data
  • can repeat this multiple times and still improve

how to differentiate low from high quality data in the first place? train a classifier / use a VLM