Ambient Diffusion Omni: Training Good Models with Bad Data

we have a lot of low-quality data and don’t want to just throw it away, but if you train on it directly, it’s very easy for the model to learn the low-quality behavior
observation: when you add enough noise to low-quality data, it’s indistinguishable from high-quality data with the same noise added
so: train a diffusion model, but only use the low-quality data at noise levels where enough noise was added
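
a minimal sketch of that noise gating, assuming a simple additive-Gaussian setup: high-quality samples can be noised at any level, low-quality samples only at levels above a cutoff. the names sample_noise_levels, make_noisy_batch, sigma_min_low and sigma_max are made up for illustration, and this ignores whatever corruption is already in the low-quality samples, so it is probably simpler than what the paper actually does

```python
import numpy as np

def sample_noise_levels(is_low_quality, sigma_min_low, sigma_max=1.0, rng=None):
    """Draw one noise level per sample.

    High-quality samples may use any level in [0, sigma_max];
    low-quality samples only use levels in [sigma_min_low, sigma_max].
    sigma_min_low is an assumed, user-chosen cutoff.
    """
    rng = np.random.default_rng() if rng is None else rng
    is_low_quality = np.asarray(is_low_quality, dtype=bool)
    # Per-sample lower bound of the allowed noise range.
    lower = np.where(is_low_quality, sigma_min_low, 0.0)
    u = rng.uniform(size=is_low_quality.shape)
    return lower + u * (sigma_max - lower)

def make_noisy_batch(x, is_low_quality, sigma_min_low, sigma_max=1.0, rng=None):
    """Return (noisy inputs, noise levels) for one training batch."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    sigma = sample_noise_levels(is_low_quality, sigma_min_low, sigma_max, rng)
    # Reshape so each sample's noise level broadcasts over its features.
    sigma_b = sigma.reshape(-1, *([1] * (x.ndim - 1)))
    noisy = x + sigma_b * rng.standard_normal(x.shape)
    return noisy, sigma
```

the denoising loss would then be computed on these (noisy, sigma) pairs as usual; the only change is which noise levels each sample is allowed to appear at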

then, the really wonky thing: use the trained model to predict clean versions of the noised low-quality data, and then train again on those predictions

you’re not creating new data, but this actually helps the optimization dynamics: the denoised data is easier to learn from than the noisy data
can repeat this multiple times and still improve
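
the control flow of that predict-then-retrain loop could look roughly like this. train_fn and denoise_fn are hypothetical placeholders for a real diffusion training run and a real "predict the clean sample" step, and the toy versions below exist only so the loop runs end to end. whether each round re-predicts from the original samples or from the previous round's outputs isn't clear from these notes; the sketch assumes the former

```python
import numpy as np

def restore_and_retrain(x_low, train_fn, denoise_fn, num_rounds=2):
    """Alternate between training and replacing low-quality samples
    with the current model's predictions of their clean versions.

    train_fn(data) -> model and denoise_fn(model, data) -> data are
    hypothetical placeholders, not a real library API.
    """
    x_low = np.asarray(x_low, dtype=float)
    model = train_fn(x_low)              # initial training run
    restored = x_low
    for _ in range(num_rounds):
        # Assumption: re-predict from the original samples each round,
        # using the latest model.
        restored = denoise_fn(model, x_low)
        model = train_fn(restored)       # retrain on the restored data
    return model, restored

# Toy stand-ins (not a diffusion model), just to exercise the loop.
def toy_train(data):
    return data.mean(axis=0)             # "model" = dataset mean

def toy_denoise(model, data):
    return 0.5 * (data + model)          # pull each sample halfway to the mean

model, restored = restore_and_retrain(
    np.array([[0.0], [2.0], [4.0]]), toy_train, toy_denoise
)
```

in a real setup train_fn would also be where the noise gating for low-quality data lives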

how to tell low-quality from high-quality data in the first place? train a classifier or use a VLM