LoRA Without Regret

Empirical usage of LoRA:

LoRA works well in small-to-medium data settings, where its limited capacity does not matter, and in policy-gradient RL, where the information density of the training signal is very sparse.
Apply LoRA to the feedforward (MLP / MoE) layers, not just attention.
LoRA degrades more than full fine-tuning at large batch sizes. This is a fundamental property of the loss landscape of two weight matrices multiplied together, not of the rank. Increasing the batch size generally gives you fewer update steps and less gradient noise (less jumping around), and both are especially harmful in LoRA's loss landscape.
The optimal learning rate is consistently about 10x the full fine-tuning learning rate, regardless of rank; it may be higher for short runs.
LoRA training has only two effective hyperparameter degrees of freedom, and the Hugging Face peft library defaults do fine.
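
To make "limited capacity" and the 10x rule concrete, here is a minimal sketch in plain Python. The helper names (`full_ft_params`, `lora_params`, `lora_lr`), the 4096 x 4096 layer shape, the rank of 16, and the 1e-5 baseline learning rate are all illustrative assumptions, not values from the notes; only the 10x multiplier comes from the bullets above.

```python
def full_ft_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the full d_out x d_in weight matrix is tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def lora_lr(full_ft_lr: float, multiplier: float = 10.0) -> float:
    """Rule-of-thumb LoRA learning rate: about 10x the full fine-tuning rate.

    The multiplier is exposed because the notes suggest it may need to be
    higher for short runs; the right value there is not specified.
    """
    return multiplier * full_ft_lr


# Hypothetical layer shape, rank, and baseline learning rate (assumptions).
d_in = d_out = 4096
rank = 16
full = full_ft_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tune params: {full}")
print(f"LoRA params (r={rank}): {lora}")
print(f"fraction trained: {lora / full:.4%}")
print(f"suggested LoRA lr for full-FT lr 1e-5: {lora_lr(1e-5):.1e}")
```

For this one hypothetical layer the adapter trains under 1% of the weights, which is why capacity only starts to matter once the dataset is large.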

Theoretical discussion:

The LoRA term BA is a sum of r rank-1 matrices. Because of the 1/r in front of the BA term, we can think of the gradient update to the overall LoRA as an average of the gradient updates to each of these rank-1 terms, hence the rank independence of the learning dynamics near the start of training.
Parametrizing the update as a product (matrix -> matrix * matrix) gives a messier loss landscape, with Hessians that are not positive semidefinite.
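
Both notes can be checked on toy examples. The sketch below is an illustration under stated assumptions, not something from the original post: the matrices are made-up small values, and the "loss" is a hypothetical scalar stand-in f(a, b) = 0.5 * (a*b - w)^2 for fitting a target w with a product of two parameters. The first part verifies that (1/r) * BA is exactly the average of r rank-1 outer products; the second shows that the Hessian of the factored loss has a negative eigenvalue at a LoRA-style init (b = 0, a nonzero), whereas the unfactored loss 0.5 * (v - w)^2 has constant curvature 1.

```python
import math


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


# 1) (1/r) * B @ A equals the average of r rank-1 matrices b_i a_i^T,
#    where b_i is column i of B and a_i is row i of A. Values are made up.
r = 2
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]            # d_out x r
A = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]    # r x d_in

scaled_product = [[x / r for x in row] for row in matmul(B, A)]
rank1_terms = [outer([row[i] for row in B], A[i]) for i in range(r)]
average = [
    [sum(term[j][k] for term in rank1_terms) / r for k in range(len(A[0]))]
    for j in range(len(B))
]
print("average of rank-1 terms matches (1/r) BA:", average == scaled_product)


# 2) Hessian of the toy factored loss f(a, b) = 0.5 * (a*b - w)**2.
def factored_hessian(a, b, w):
    """Second derivatives: f_aa = b^2, f_bb = a^2, f_ab = 2ab - w."""
    cross = 2 * a * b - w
    return [[b * b, cross], [cross, a * a]]


def sym2x2_eigenvalues(H):
    """Eigenvalues (smallest first) of a symmetric 2x2 matrix."""
    (p, q), (_, s) = H
    mid = (p + s) / 2
    rad = math.sqrt(((p - s) / 2) ** 2 + q * q)
    return mid - rad, mid + rad


# LoRA-style init in the toy: one factor zero, the other nonzero.
lo, hi = sym2x2_eigenvalues(factored_hessian(a=0.5, b=0.0, w=1.0))
print("factored Hessian eigenvalues:", lo, hi)
print("indefinite (one negative, one positive):", lo < 0 < hi)
```

The scalar case only shows that factoring a parameter can turn a convex problem into one with saddle-like curvature; it does not by itself explain the batch-size sensitivity noted earlier.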