The AiEdge Newsletter

How To Optimize Machine Utilization for Multiple Fine-Tuned LLMs with Hugging Face

Damien Benveniste
Oct 02, 2023

  • The inefficiency of multiple fine-tuned models

  • Fine Tuning LLMs with LoRA

  • Multiple fine-tuning jobs with Hugging Face

  • Switching between adapters


The inefficiency of multiple fine-tuned models

Let’s consider OpenAI as an example. OpenAI provides an API to fine-tune models. Their base models are gpt-3.5-turbo-0613, babbage-002, and davinci-002, with gpt-3.5-turbo being the base model underlying ChatGPT.
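
For illustration, here is a minimal sketch of what launching such a fine-tuning job looks like with the openai Python client from around the time of this post (the v0-style library; newer versions use `client.fine_tuning.jobs.create`). The API key and training file are placeholders:

```python
import openai

openai.api_key = "sk-..."  # placeholder API key

# Upload the training data (a JSONL file of chat-formatted examples)
training_file = openai.File.create(
    file=open("train.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

# Launch a fine-tuning job on one of the base models mentioned above
job = openai.FineTuningJob.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo-0613",
)
print(job.id, job.status)
```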

Fine-tuning a model means specializing it for a particular task using our own data. The specific task depends on the customer’s business needs and the data they have available. For example, we may want to fine-tune a model on the following tasks:

  • English-Spanish translation

  • Customer support message routing

  • Specialized question answering

  • Text sentiment analysis

  • Named entity recognition

  • …

In the case of OpenAI, “fine-tuning” means that the model is specialized using some proprietary customer data and then deployed on GPU hardware for API access. Naively, we could think that for each new customer wanting to fine-tune their model, we would need to deploy a new model on a new GPU cluster.

However, it is unlikely that OpenAI proceeds this way! GPU hardware is really expensive, and they would need to allocate a GPU cluster for each new customer. OpenAI’s pricing model is based on model usage, meaning customers only pay when they use the model, but for OpenAI, the cost of serving the model never stops! It is very likely that thousands of customers just wanted to test OpenAI’s fine-tuning capabilities, and the resulting fine-tuned models were never actually used. Would OpenAI just absorb the serving cost for each of those models?

Fine Tuning LLMs with LoRA

One strategy for fine-tuning LLMs is to use adapters that can be “plugged” into the base model. The idea is to avoid updating the weights of the base model and instead have the adapters capture the information about the fine-tuning tasks. We can plug different adapters in and out to specialize the model on different tasks.
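
As a concrete sketch (not the article’s code; the model name and adapter paths are placeholders), this is roughly what plugging adapters in and out looks like with Hugging Face’s peft library:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the shared, frozen base model once (model name is a placeholder)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach a first fine-tuned LoRA adapter (paths are hypothetical)
model = PeftModel.from_pretrained(base, "adapters/translation", adapter_name="translation")

# Load a second adapter into the same model, then switch between them
model.load_adapter("adapters/sentiment", adapter_name="sentiment")
model.set_adapter("sentiment")    # requests now go through the sentiment adapter
model.set_adapter("translation")  # switch back without reloading the base model
```

Every customer’s adapter can share the same copy of the base model on one GPU; only the small adapter weights differ.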

The most common and efficient adapter type is the Low-Rank Adapter (LoRA). The idea is to keep the model’s large weight matrices frozen and to express the weight update as a product of two much smaller matrices, so that the gradient computation only touches those small matrices.

Without LoRA

Let’s consider a gradient update without LoRA. Let’s call W0 the weights of the pre-trained model for a specific layer matrix. After a gradient update ΔW, the weights will be

\(W=W_0 +\Delta W\)

If x is the input to that layer, the output of that layer will be

\(W \cdot x = W_0 \cdot x + \Delta W \cdot x\)

ΔW has the same dimensions as the original matrix, so a gradient update is computationally costly. Moreover, ΔW must fit in memory, requiring more GPU memory.
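
To make the contrast concrete, here is a minimal sketch (the class, rank, and scaling are illustrative, not the article’s implementation) of how LoRA factors the update ΔW into two small matrices B and A:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer W0 plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W0 stays frozen
        # ΔW ≈ B @ A: an (out × r) and an (r × in) matrix instead of a full (out × in) update
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # start with ΔW = 0
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W·x = W0·x + ΔW·x, with ΔW factored as B @ A
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Only A and B receive gradients: for a 4096×4096 layer with rank 8, that is about 65K trainable parameters instead of roughly 16.8M.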
