LoRA: Low-Rank Adaptation
Nowadays, we can easily download pre-trained models, but we still have to fine-tune them to accomplish our custom tasks.
For example, GPT-3 has 175B parameters, which means we need roughly 1.2TB of GPU VRAM to fully fine-tune the model. Do you have a GPU with 1.2TB of VRAM?
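As a rough back-of-the-envelope estimate (ours, not the paper's exact accounting), just storing the weights in fp16 already takes

$$175 \times 10^{9} \ \text{parameters} \times 2 \ \text{bytes} \approx 350\ \text{GB},$$

and keeping gradients and Adam optimizer states alongside them multiplies that footprint several times over, which is how full fine-tuning reaches the roughly 1.2TB figure reported in the paper.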
The LoRA paper suggests a new paradigm: fine-tune a large model while training only a small number of parameters. Please read the paper for more details.
The updates to the weights have a low "intrinsic rank" during adaptation.
LoRA uses reparametrization to shrink the dimension of the gradient update, making training both memory- and computation-efficient.
In the previous section, we roughly went through the key concepts of LoRA. Let's take a closer look at how LoRA "actually" works.
Let's say we have a linear regression model, $y = Wx$, with a big weight matrix $W$.
Using LoRA, we only have to train far fewer parameters than with full fine-tuning.
Moreover, switching weights per task has very little memory overhead, because we don't have to swap the whole model, only the small adapter weights.
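To make this concrete, here is a minimal PyTorch sketch of the idea (our own illustration, not the paper's reference implementation; the class name `LoRALinear` and the hyperparameters are made up for the example). The pre-trained layer is frozen and only the two small low-rank factors are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update (hypothetical helper)."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze the pre-trained weights W0

        d_out, d_in = base.weight.shape
        # Low-rank factors: delta_W = B @ A, with r << min(d_in, d_out)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))   # B starts at zero, so delta_W = 0 at init
        self.scale = alpha / r

    def forward(self, x):
        # W0 x  +  (B A) x  -- the frozen path and the adapter path
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Toy "linear regression" example: only A and B are trained.
base = nn.Linear(1024, 1024, bias=False)       # pretend these are pre-trained weights
model = LoRALinear(base, r=8)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / total: {total}")   # 2*8*1024 = 16384 vs. ~1M
```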
For an autoregressive language model $P_\Phi(y \mid x)$ with weights $\Phi$ and a training dataset $Z$ of context–target pairs, the training objective is to maximize the conditional log-likelihood.
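Written out in the paper's notation, with $Z = \{(x_i, y_i)\}$, the objective is

$$\max_{\Phi} \sum_{(x, y) \in Z} \sum_{t=1}^{|y|} \log P_{\Phi}\left(y_t \mid x, y_{<t}\right).$$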
When we fully fine-tune the model, the pre-trained weights $\Phi_0$ are updated to $\Phi_0 + \Delta\Phi$, where $\Delta\Phi$ has the same dimension as $\Phi_0$. As the model gets large, the computation and memory needed to obtain $\Delta\Phi$ explode!
Inspired by earlier work showing that over-parametrized models have a low intrinsic dimension, the authors of the paper hypothesized the statement quoted above: the weight updates also have a low intrinsic rank during adaptation.
LoRA therefore parameterizes the update $\Delta\Phi$ as a function of a much smaller set of parameters $\Theta$, i.e. $\Delta\Phi = \Delta\Phi(\Theta)$, with $|\Theta| \ll |\Phi_0|$.
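With this reparameterization, the fine-tuning objective from above becomes an optimization over the much smaller $\Theta$:

$$\max_{\Theta} \sum_{(x, y) \in Z} \sum_{t=1}^{|y|} \log P_{\Phi_0 + \Delta\Phi(\Theta)}\left(y_t \mid x, y_{<t}\right).$$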
Concretely, for each adapted weight matrix, LoRA freezes the pre-trained weights $W_0$ during fine-tuning and only adjusts the update $\Delta W$.
Back to our linear regression example: we want to fine-tune this model by updating its weights.
But the problem is that the model is too big, and the update $\Delta W$ naively has just as many entries as $W_0$ itself.
LoRA therefore reparameterizes $\Delta W$ as the product of two small matrices, $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$ for $W_0 \in \mathbb{R}^{d \times k}$.
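For an input $x$, the modified forward pass of the adapted layer is then

$$h = W_0 x + \Delta W x = W_0 x + B A x.$$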
As the rank $r$ used for the reparameterization increases, training LoRA roughly converges to training the original model, i.e. full fine-tuning.
The adapter computation ($BAx$) can run in parallel with the frozen path ($W_0x$), and the update can even be merged into the weights ($W = W_0 + BA$) before deployment, so LoRA introduces no additional inference latency.
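Continuing the hypothetical `LoRALinear` sketch from above, merging the update into the frozen weights looks like this; after merging, inference is a single matmul, exactly as with the original model.

```python
# Merge the low-rank update into the frozen weights: W = W0 + (alpha / r) * B @ A
with torch.no_grad():
    merged = nn.Linear(1024, 1024, bias=False)
    merged.weight.copy_(model.base.weight + model.scale * (model.B @ model.A))

x = torch.randn(2, 1024)
assert torch.allclose(merged(x), model(x), atol=1e-5)  # same outputs, no adapter path at inference
```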