**LoRA adapters are, to me, one of the smartest strategies to appear in machine learning in recent years! If you want to work with large language models, knowing how to fine-tune them is one of the most important skills a machine learning engineer can have these days.**

Watch the **video** for the full content!

One very effective strategy for fine-tuning LLMs is to use LoRA adapters. LoRA came as a very natural strategy for fine-tuning models. The key realization is that any weight matrix in a trained neural network is just the sum of its initial values and the gradient descent updates accumulated over the training mini-batches.

From there, we can understand a fine-tuned model as a set of model parameters in which we continue to aggregate the gradients further on some specialized dataset.

Once we realize that we can decompose the learning into those two terms, pretraining and fine-tuning, we understand that the decomposition doesn't need to happen within the same matrix; we could sum the outputs of two different matrices instead. That is the idea behind LoRA: we allocate new weight parameters that specialize in learning the fine-tuning data, and we freeze the original weights.

On its own, this is not very interesting: new full-size matrices of model parameters would just double the memory allocated to the model. So the trick is to use a low-rank matrix approximation to reduce the number of operations and the required memory. We introduce two new matrices, **A** and **B**, to approximate **ΔW**. In the forward pass, we use both the original weights and the new adapters to compute the hidden states.
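As a minimal sketch, the forward pass with a LoRA adapter looks like the following (the function name `lora_forward`, the toy dimensions, and the `alpha / r` scaling factor are illustrative choices, not part of this article):

```python
import numpy as np

rng = np.random.default_rng(0)

R, C, r = 16, 32, 4  # toy dimensions; real layers are much larger

W = rng.standard_normal((R, C))          # frozen pretrained weights
B = np.zeros((R, r))                     # adapter, initialized to zero
A = rng.standard_normal((r, C)) * 0.01   # adapter, small random init
alpha = 8                                # LoRA scaling hyperparameter

def lora_forward(x):
    # h = W x + (alpha / r) * B (A x): frozen path plus low-rank update
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(C)
h = lora_forward(x)

# With B initialized to zero, the adapter contributes nothing at first,
# so the adapted model starts out identical to the pretrained one.
assert np.allclose(h, W @ x)
```

Initializing **B** to zero is the standard choice: it guarantees the model's behavior is unchanged before fine-tuning begins.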

However, during the backward pass, we only need to compute the gradients for the adapters as they are the weights that are being trained.
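That backward pass can be sketched with plain numpy, assuming a toy squared-error loss (all variable names here are illustrative): gradients are formed only for **A** and **B**, and **W** never gets one.

```python
import numpy as np

rng = np.random.default_rng(1)
R, C, r = 16, 32, 4

W = rng.standard_normal((R, C))         # frozen: no gradient ever computed
B = np.zeros((R, r))                    # trainable adapter
A = rng.standard_normal((r, C)) * 0.01  # trainable adapter

x = rng.standard_normal(C)
y = rng.standard_normal(R)              # toy regression target

# Forward: h = W x + B (A x)
Ax = A @ x
h = W @ x + B @ Ax
loss = np.sum((h - y) ** 2)

# Backward for L = ||h - y||^2: gradients only for the adapters.
dh = 2 * (h - y)                 # dL/dh
grad_B = np.outer(dh, Ax)        # dL/dB, shape (R, r)
grad_A = np.outer(B.T @ dh, x)   # dL/dA, shape (r, C)

# One SGD step updates only A and B; the pretrained W stays untouched.
lr = 1e-2
B -= lr * grad_B
A -= lr * grad_A
```

In a framework like PyTorch the same effect is achieved by simply disabling gradient tracking on the pretrained weights, so the optimizer only ever sees the adapter parameters.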

The trick is that if **ΔW** has dimensions **(R, C)**, we can create **B** with dimensions **(R, r)** and **A** with dimensions **(r, C)** such that **r << R, C**. For example, if **R = 10K**, **C = 20K**, and **r = 4**, then:

- **ΔW** has **R x C = 10K x 20K = 200M** elements
- **B** has **R x r = 10K x 4 = 40K** elements
- **A** has **r x C = 4 x 20K = 80K** elements

Therefore, **A** and **B** combined have 120K elements, which is ~1666 times fewer elements than **ΔW**. When we fine-tune, we only update the weights of those newly inserted matrices. The gradient matrices are much smaller and therefore require much less GPU memory. Because the pretrained weights are frozen, we don't need to compute gradients for the vast majority of the parameters.
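The parameter-count arithmetic above is easy to verify directly (the variable names are just for illustration):

```python
# Parameter counts from the example above: R = 10K, C = 20K, r = 4.
R, C, r = 10_000, 20_000, 4

delta_w_elems = R * C   # full update matrix ΔW: 200,000,000 elements
b_elems = R * r         # B is (R, r):           40,000 elements
a_elems = r * C         # A is (r, C):           80,000 elements

adapter_elems = b_elems + a_elems       # 120,000 trainable parameters
ratio = delta_w_elems // adapter_elems  # how many times smaller

print(delta_w_elems, adapter_elems, ratio)  # 200000000 120000 1666
```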

**SPONSOR US**

*Get your product in front of more than 64,000 tech professionals.*

*Our newsletter puts your products and services directly in front of an audience that matters - tens of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.*

*To ensure your ad reaches this influential audience, reserve your space now by emailing damienb@theaiedge.io.*