The inefficiency of multiple fine-tuned models
Fine Tuning LLMs with LoRA
Multiple fine-tuning jobs with Hugging Face
Switching between adapters
The inefficiency of multiple fine-tuned models
Let’s consider OpenAI as an example. OpenAI provides an API to fine-tune models. Their base models are gpt-3.5-turbo-0613, babbage-002, and davinci-002, with gpt-3.5-turbo being the base model underlying ChatGPT.
Fine-tuning a model means we have some data and would like to specialize the model on some task. The specific task depends on the customer's business needs and the data they have available to fine-tune a model. For example, we may want to fine-tune a model on the following tasks:
English-Spanish translation
Customer support message routing
Specialized question answering
Text sentiment analysis
Named entity recognition
…
In the case of OpenAI, “fine-tuning” means that the model is specialized using some proprietary data and then deployed on GPU hardware for API access. Naively, we could think that for each new customer wanting to fine-tune a model, we would need to deploy a new model on a new GPU cluster.
However, it is unlikely that OpenAI proceeds this way! GPU hardware is very expensive, and they would need to allocate a GPU cluster for each new customer. OpenAI’s pricing model is based on usage, meaning customers only pay when they use the model, but for OpenAI, the cost of serving the model never stops. It is very likely that thousands of customers just wanted to test OpenAI’s fine-tuning capabilities, and the resulting fine-tuned models were never actually used. Would OpenAI simply absorb the serving cost for each of those models?
Fine Tuning LLMs with LoRA
One strategy to fine-tune LLMs is to use adapters that can be “plugged” into the base model. The idea is to avoid updating the weights of the base model and let the adapters capture the information about the fine-tuning tasks. We can then plug different adapters in and out to specialize the model on different tasks.
The most common and efficient adapter type is the Low-Rank Adapter (LoRA). The idea is to approximate the weight updates of some of the large matrices within the model with the product of two much smaller matrices, so that the gradient computation only involves those small matrices.
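For illustration, here is a minimal sketch of what attaching a LoRA adapter looks like with Hugging Face’s peft library. The base model name, rank, and target_modules below are assumptions chosen for the example, not recommendations:

```python
# Minimal sketch: wrapping a base model with a LoRA adapter using Hugging Face peft.
# The model name, rank r, and target_modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank matrices
    lora_alpha=16,                        # scaling factor applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# The base weights stay frozen; only the small adapter matrices are trainable.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total parameters
```

Because the base weights never change, several such adapters can later be plugged in and out of the same deployed model, which is what makes this approach attractive for serving many fine-tuned variants.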
Without LoRA
Let’s consider a gradient update without LoRA. Let’s call W0 the weights of the pre-trained model for a specific layer matrix. After a gradient update ΔW, the weights will be

W = W0 + ΔW

If x is the input to that layer, the output of that layer will be

h = Wx = (W0 + ΔW)x = W0x + ΔWx
ΔW has the same dimensions as the original matrix, so computing a gradient update is costly. Moreover, ΔW must fit in memory, which requires additional GPU memory.
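To get a feel for the scale involved, here is a small back-of-the-envelope sketch (the layer dimension and rank are arbitrary assumptions) comparing the number of values in a full update ΔW with a low-rank factorization of the kind LoRA uses:

```python
# Rough illustration (arbitrary sizes): values in a full update vs. a low-rank one.
d = 4096   # assumed layer dimension (square weight matrix W0 of shape d x d)
r = 8      # assumed LoRA rank

full_update_params = d * d          # ΔW has the same shape as W0
low_rank_params = d * r + r * d     # ΔW approximated by B @ A, with B: (d, r) and A: (r, d)

print(f"Full ΔW:      {full_update_params:,} values")                # 16,777,216
print(f"Low-rank B,A: {low_rank_params:,} values")                   # 65,536
print(f"Reduction:    {full_update_params / low_rank_params:.0f}x")  # 256x
```

With these example numbers, the low-rank version stores roughly 256 times fewer values than the full update.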