SliderSpace: Decomposing the Visual Capabilities of Diffusion Models

Sliderspace

Half of the captions on the blogpost are wrong and are the same exact caption about 1 erasure

Overall goal: self-supervisedly decompose a diffusion model’s understanding of a concept into disentangled interpretable directions that can then be tuned using sliders

More specifically, we want these controllable directions to

  1. span the major modes of variation in the possible images that can be generated
  2. appear consistently across different initializations
  3. are orthogonal

observation: for the predicted denoised samples, they do 5000 trajectories, but they actually take the intermediate predicted denoised samples from each timestep in each trajectory

  • (recall that you can do this because the reverse process of diffusion has many equivalent parameterizations: epsilon_t, x_0, x_{t-1}, score, etc)
    • so the final set of embeddings that they end up doing PCA on is much larger (j is the index of the trajectory, t is the timestep in that trajectory that you’re predicting the final image from)
  • Why do they do this? (from Gemini, not from the paper) Reason #1 is that you want to find directions that are consistently semantically meaningful throughout the entire generation process. Reason #2 is that this just gives you more data

For the training of the slider, it’s a completely different slider from their concept sliders training method

  1. Sample a data point and a time t
  2. (conditioned) predict x_0 with the slider (lora) and without the slider. Run both of the predictions through the CLIP embedding; the difference in embeddings is ∆phi_i
  3. Compute the cosine loss as above
    1. this incentivizes the slider to get the embeddings perfectly along the direction of that principle component

I’m somewhat curious if they ended up using the singular values, since they should have that from PCA as well. Were the top principle components more interesting somehow?

Applications

Application #1: Concept Decomposition

Use claude sonnet to label what the sliders are actually doing by giving it a bunch of image pairs

By randomly activating a sparse subset of 3 sliders for each image generation compared to the base model, the generated images have higher DreamSim diversity, while maintaining similar CLIP scores wrt the input prompt (so you’re not cheating)

Application #2: Art styles exploration

Have a dataset of all art styles from ParrotZone, which we then use as prompts to generate a ground truth dataset of all art styles in images. Metric is FID score against the distribution of images generated in these styles, which measures how good the match is to the diversity and quality of the real art styles

For evaluating sliderspace, used the prompt “artwork in the style of a famous artist,” got the sliders out of it, and then generated by again taking a random sparse sample of 3 sliders and using those

Baselines used:

  • having chat generate 64 styles and then manually training concept sliders on those (second best)
  • having chat literally just generate style names and generating based on those style names

Application #3: Reinjecting diversity back into the distilled model SDXL-DMD

Ran sliderspace on a ton of COCO-30k prompts, the resulting sliderspace on SDXL-DMD has much improved FID (measuring diversity)

Bonus application: sliders learned for one concept are sometimes transferable to other concepts

  • e.g. age transfers from person to dog, which is kind of surprising

Appendix

  • as before, during inference we only start applying the sliders after the first few steps of denoising
  • used Dino-v2 and FaceNet as alternatives to CLIP, can target choice of encoder to the target domain
  • found that rank 1 lora’s were the best, and PCA’s marginal usefulness diminished after 40 dimensions
  • Ablations:
    • ablated PCA by training sliders against just a standard customization loss and no contrastive objective, get a bunch of redundant directions
    • ablated using CLIP by applying directly on the diffusion output space, bet a bunch of directions that are more relevant perceptually (color, shape) but not semantically
    • more diverse training data leads to more diverse sliders