LoRA: Low-Rank Adaptation

Fine-tuning LLMs is hard

Nowadays, we can easily get pre-trained models from huggingface. But we still have to fine-tune those models to accomplish our custom tasks.

For example, GPT-3 has 175B parameters, which means we need roughly 1.2TB of VRAM to fully fine-tune it. Do you have a GPU with 1.2TB of VRAM?
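
As a rough back-of-the-envelope sketch of where numbers like that come from (the exact total depends on precision, optimizer, and sharding; 1.2TB is the figure the paper reports for its setup):

```python
# Back-of-the-envelope: why fully fine-tuning GPT-3 needs TB-scale VRAM.
params = 175e9

fp16_weights = params * 2  # 2 bytes/param -> 350 GB just to hold the model
print(f"fp16 weights: {fp16_weights / 1e9:.0f} GB")

# Full fine-tuning additionally needs a gradient per parameter plus Adam's
# two moment tensors, which multiplies this footprint several times over,
# landing in the TB range the LoRA paper reports.
```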

The LoRA paper [1] suggests a new paradigm: fine-tune a large model by training only a small number of parameters. Please read the paper for more information.

In an autoregressive language model $P_\Phi(y|x)$ with weights $\Phi$ and dataset $Z = \{(x_i, y_i)\}_{i=0 \dots N}$, the training objective is to maximize the log-likelihood:

$$\max_\Phi \sum_{(x, y) \in Z} \sum_{t=1}^{|y|} \log P_\Phi(y_t \mid x, y_{<t})$$
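
This objective is just next-token cross-entropy. A minimal sketch of how it is typically computed in PyTorch (the tensor names here are illustrative):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative of the log-likelihood objective above.

    logits: (batch, seq_len, vocab) model outputs; tokens: (batch, seq_len).
    """
    # Predict token y_t from the prefix y_{<t}: shift logits and targets by one.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```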

When we fully fine-tune the model by updating the weights $\Phi_0 \rightarrow \Phi_0 + \Delta\Phi$, $\Delta\Phi$ has the same dimension as $\Phi_0$. As the model gets large, the computation and memory needed to obtain $\Delta\Phi$ explode!

What is LoRA

Inspired by Aghajanyan et al. [2], the authors of the paper hypothesized the following statement:

The updates to the weights have a low "intrinsic rank" during adaptation.

LoRA parameterizes $\Delta\Phi$ as a function of a much smaller set of parameters $\Theta$, with $|\Theta| \ll |\Phi_0|$:

$$\Phi_0 \rightarrow \Phi_0 + \Delta\Phi(\Theta)$$

During fine-tuning, LoRA freezes the pre-trained weights $\Phi_0$ and adjusts only $\Theta$. This reparametrization shrinks the dimension of the gradient update, making training both memory-efficient and computation-efficient.
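
In practice, "freeze $\Phi_0$, adjust $\Theta$" is only a couple of lines. A sketch, assuming the LoRA parameters are identifiable by a `lora_` prefix in their names (a naming convention for this example, not a requirement):

```python
import torch

def mark_only_lora_as_trainable(model: torch.nn.Module) -> None:
    # Freeze the pre-trained weights (Phi_0); leave only Theta trainable.
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name

# The optimizer then only ever sees the small parameter set Theta:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```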

LoRA in more details

In the previous section, we roughly went through the key concept of LoRA. Let's take a closer look at how LoRA "actually" works.

Let's say we have a linear regression model:

$$h = W_0 x, \qquad W_0 \in \mathbb{R}^{d \times k},\ x \in \mathbb{R}^{k \times o}$$

We want to fine-tune this model by updating the weights: $W_0 \rightarrow W_0 + \Delta W$. But the problem is that this model is too big: $d \gg 0,\ k \gg 0$. LoRA therefore reparameterizes $\Delta W$ into the product of two small matrices:

$$\Delta W = AB, \qquad A \in \mathbb{R}^{d \times r},\ B \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)$$

Using LoRA, we only have to train the $r(d + k)$ parameters of $A$ and $B$ instead of the $dk$ parameters of a full update, which is far fewer than full fine-tuning.
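
A minimal sketch of such a layer in PyTorch, following the document's notation ($h = W_0 x$, $\Delta W = AB$); this is an illustration of the idea, not the reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W_0 x + (AB) x, with W_0 frozen and only A, B trained."""

    def __init__(self, d: int, k: int, r: int, alpha: float = 1.0):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen pre-trained weight
        self.lora_A = nn.Parameter(torch.zeros(d, r))                   # zero-init, so Delta W = 0 at start
        self.lora_B = nn.Parameter(torch.randn(r, k) * 0.01)
        self.scale = alpha / r  # scaling factor for the update, as in the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (k, o)
        # Compute A @ (B @ x) so the full d x k update is never materialized.
        return self.W0 @ x + self.scale * (self.lora_A @ (self.lora_B @ x))
```

With $d = k = 4096$ and $r = 8$, this trains about 65K parameters per layer instead of 16.7M.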

Benefits of LoRA

Generalization of full fine-tuning

As $r$ increases, the reparameterization roughly converges to training the original model (full fine-tuning).

No Additional Inference Latency

Since $\Delta W = AB$ has the same shape as $W_0$, the adapter can be merged into the frozen weight ($W = W_0 + AB$) once fine-tuning is done, so the adapted model serves requests with no additional inference latency.

Moreover, switching adapters per task has very little memory overhead, because we don't have to swap the whole model, just the small weights of the adapters.
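
Continuing the hypothetical `LoRALinear` sketch above, merging and task switching are each just a few in-place updates:

```python
import torch

@torch.no_grad()
def merge_adapter(layer: "LoRALinear") -> None:
    # Fold the adapter into the frozen weight: W = W_0 + scale * AB.
    # After this, inference is a single matmul, with no extra latency.
    layer.W0 += layer.scale * (layer.lora_A @ layer.lora_B)

@torch.no_grad()
def switch_task(layer: "LoRALinear", A_new: torch.Tensor, B_new: torch.Tensor) -> None:
    # Subtract the old (merged) adapter, drop in the new one: only the
    # tiny A and B matrices move, never the full pre-trained model.
    layer.W0 -= layer.scale * (layer.lora_A @ layer.lora_B)
    layer.lora_A.copy_(A_new)
    layer.lora_B.copy_(B_new)
    layer.W0 += layer.scale * (layer.lora_A @ layer.lora_B)
```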

References

[1] Hu et al., LoRA: Low-Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685
[2] Aghajanyan et al., Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. https://arxiv.org/abs/2012.13255