Large language models like GPT, RoBERTa, and LLaMA are trained on broad, general-purpose datasets rather than on any specific downstream task or application. With ever-larger models appearing every few months, retraining an entire model on a custom dataset is not practical. For example, GPT-3 has 175 billion trainable parameters, and training it requires enormous time and compute. Instead, we can leverage the knowledge gained during pre-training and update only the parameters necessary for a specific downstream application.

LoRA

LoRA (Low-Rank Adaptation) is a fine-tuning method used to adapt a pre-trained model. Instead of updating all model parameters, LoRA indirectly trains some dense layers by optimizing low-rank decomposition matrices of the layer updates, while keeping the pre-trained weights frozen.

Problem Statement

For simplicity, the authors consider the task of language modeling: predicting the next token from conditional probabilities. The model produces a logit vector of vocabulary size v, and a softmax maps it to a probability distribution over the vocabulary. Here is the objective function:

maxW Σ(x,y)∈Z Σt=1…|y| log PW(yt | x, y<t)

When fine-tuning for each downstream task, we would typically have to learn a new set of parameters ΔW of the same size as the original model (e.g., 175B parameters for GPT-3). Storing and training such large matrices is not feasible outside large commercial compute environments. LoRA builds on the observation that, in practice, the update ΔW often has a low intrinsic rank. This allows us to approximate ΔW with a low-rank decomposition, making fine-tuning both memory- and compute-efficient.

Method

Dense layers in transformers typically have full-rank weight matrices. However, when adapting to a downstream task, the update matrix ΔW tends to have lower intrinsic rank. Thus, for a pre-trained weight matrix W0 ∈ ℝd×k, we constrain its update using a low-rank decomposition:

W0 + ΔW = W0 + B·A

where B ∈ ℝd×r, A ∈ ℝr×k, and the rank r ≪ min(d, k).
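To get a feel for the savings, here is a quick parameter count. The layer dimensions and rank below are illustrative choices, not values from the paper:

```python
# Parameter count of training ΔW directly vs training its low-rank factors.
# d, k, r are illustrative: a square 4096x4096 layer with LoRA rank 8.
d, k, r = 4096, 4096, 8

full = d * k          # parameters in a dense update ΔW ∈ R^{d×k}
lora = r * (d + k)    # parameters in B ∈ R^{d×r} plus A ∈ R^{r×k}

print(full, lora, full // lora)  # the low-rank factors are 256x smaller here
```

With these sizes the trainable-parameter count drops from about 16.8M to 65K per adapted matrix, which is why LoRA scales to models like GPT-3.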

During training, the original weights W0 are frozen, and only A and B are updated via backpropagation. Initially, A is randomly initialized using a Gaussian distribution, and B is initialized to zero.

The same input is passed through both W0 and ΔW = B·A, and the two outputs are summed. The contribution of the low-rank path is scaled by α/r, where α is a hyperparameter.
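A minimal NumPy sketch of this layer follows. The class name and initialization scale are illustrative; a real implementation would use a deep-learning framework and update A and B by backpropagation:

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer: y = x·W0ᵀ + (α/r)·x·Aᵀ·Bᵀ.

    W0 ∈ R^{d×k} is frozen; only A ∈ R^{r×k} (Gaussian init) and
    B ∈ R^{d×r} (zero init) would be trained.
    """

    def __init__(self, W0, r=4, alpha=8, seed=0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                                   # frozen pre-trained weight
        self.A = rng.normal(scale=0.02, size=(r, k))   # random Gaussian init
        self.B = np.zeros((d, r))                      # zero init => ΔW = B·A = 0 at start
        self.scale = alpha / r                         # scaling of the low-rank path

    def forward(self, x):
        # The same input goes through the frozen weight and the low-rank update.
        return x @ self.W0.T + self.scale * (x @ self.A.T @ self.B.T)
```

Because B starts at zero, the layer initially behaves exactly like the pre-trained model, and training moves it away from that point gradually.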

As the rank r increases, the adaptation becomes closer to full fine-tuning of the model. In the limit, when r equals the rank of W0, the low-rank decomposition becomes equivalent to standard fine-tuning.

For inference, the update can be merged into the weights, so LoRA adds no extra latency. Switching to a different downstream task simply involves subtracting B·A and adding B′·A′ for that specific task. The original model weights remain untouched, allowing for efficient multi-task adaptation.
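The task switch can be sketched as follows. The function names are hypothetical, and this is a simplified reconstruction of the deployment trick described in the paper:

```python
import numpy as np

def merge(W0, B, A, alpha):
    """Fold the LoRA update into the deployed weight: W = W0 + (alpha/r)·B·A."""
    r = A.shape[0]
    return W0 + (alpha / r) * (B @ A)

def switch_task(W_merged, B_old, A_old, B_new, A_new, alpha):
    """Swap tasks without touching W0: subtract the old update, add the new one."""
    r = A_old.shape[0]
    W0 = W_merged - (alpha / r) * (B_old @ A_old)   # recover the frozen weight
    return merge(W0, B_new, A_new, alpha)           # apply the new task's adapters
```

Since only the small B·A products are stored per task, many adapters can share one copy of the base model.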

LoRA for GPT-3

The authors of the paper show comparisons for various LLMs like RoBERTa, DeBERTa, and GPT-2. For GPT-3, they observed state-of-the-art (SOTA) level performance on three datasets (WikiSQL, MNLI-m, and SAMSum). In contrast, the prefix-embedding baseline's performance dropped when more than 256 special tokens were used, as shown in the figure.

Apart from this, the paper presents an in-depth ablation study grounded in linear algebra. It answers key questions such as:

  1. Given a parameter budget constraint, which subset of weight matrices in a pre-trained Transformer should we adapt to maximize downstream performance?
  2. Is the “optimal” adaptation matrix ΔW really rank-deficient? If so, what is a good rank to use in practice?
  3. What is the connection between ΔW and W? Does ΔW highly correlate with W? How large is ΔW compared to W?

To illustrate this, the authors applied LoRA to the standard Transformer architecture, which has four weight matrices in the self-attention module (Wq, Wk, Wv, Wo) and two in the MLP module. For simplicity and parameter efficiency, they adapted only the attention weights and froze the MLP blocks.

LoRA: Ablation

1. Which weight matrices in the Transformer should we apply LoRA to?

They fix a parameter budget and set the rank according to how many weight matrices are adapted. Notably, adapting both Wq and Wv with a rank of 4 gives comparable performance and captures sufficient information in ΔW. This is preferable to adapting a single weight matrix with a higher rank.

2. What is the optimal rank r for LoRA?

This section validates the initial claim that ΔW has a low intrinsic rank. Surprisingly, a rank as small as 1 suffices to adapt both Wq and Wv on these datasets, while training Wq alone requires a larger rank. However, this finding is task-dependent. For example, in machine translation tasks, if the pre-trained model has not seen a particular language, a higher rank may be needed for effective adaptation.

Subspace Similarity Between Different Ranks

Using Singular Value Decomposition (SVD) on the learned A matrices for ranks 8 and 64, they obtain the right-singular unitary matrices UAr=8 and UAr=64. They then use a normalized Frobenius norm to measure how much of the subspace spanned by the top-i singular vectors of Ar=8 is contained within the subspace spanned by the top-j singular vectors of Ar=64:

φ(Ar=8, Ar=64, i, j) = ||Uiᵀ · Uj||F² / min(i, j)

where Ui and Uj contain the top-i and top-j right-singular vectors of Ar=8 and Ar=64, respectively.

Here, φ ∈ [0, 1]. A value of 0 indicates complete separation between the subspaces, while 1 indicates full overlap. This analysis was performed for the 48th layer (out of 96) due to space constraints.
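The similarity measure φ can be sketched with NumPy's SVD as follows; this is a simplified reconstruction, not the authors' code:

```python
import numpy as np

def phi(A_small, A_large, i, j):
    """Normalized overlap between the top-i and top-j right-singular
    subspaces of two LoRA A matrices (ranks 8 and 64 in the paper)."""
    # np.linalg.svd returns Vt, whose rows are the right-singular vectors.
    _, _, Vt_s = np.linalg.svd(A_small, full_matrices=False)
    _, _, Vt_l = np.linalg.svd(A_large, full_matrices=False)
    Ui = Vt_s[:i].T   # top-i right-singular vectors as columns
    Uj = Vt_l[:j].T   # top-j right-singular vectors as columns
    return np.linalg.norm(Ui.T @ Uj, ord="fro") ** 2 / min(i, j)
```

For identical matrices the top subspaces coincide and φ evaluates to 1; for unrelated random matrices it falls somewhere in [0, 1].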

Subspace Similarity Between Different Random Seeds

As mentioned earlier, the A matrix is initialized using a normal distribution. The authors conducted ablation studies using different random seeds to initialize A with rank 64. Since ΔWq has a higher intrinsic rank than ΔWv, its SVD shows higher singular values and stronger structure.

How Does the Adaptation Matrix ΔW Compare to W?

This is one of the most important insights: it explores how ΔW relates to W during the adaptation of a pre-trained model.

To investigate this, the authors perform SVD on ΔWq to obtain its singular matrices U and V, project Wq onto the top singular directions of ΔWq, and compute the Frobenius norm ||UᵀWqVᵀ||F. Comparing this quantity with ||Wq||F and ||ΔWq||F yields a ratio that measures how strongly the adaptation amplifies directions already present in Wq.

Key Findings:

  1. ΔWq is highly correlated with Wq. As rank increases, the correlation decreases, indicating that lower-rank updates tend to target specific parameters relevant to the downstream task.
  2. Instead of duplicating the dominant directions of W, the adaptation primarily amplifies directions not emphasized during pre-training — offering an efficient fine-tuning approach.
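The projection-based comparison above can be sketched as follows; the function name is hypothetical and this is a simplified reconstruction of the analysis, not the authors' code:

```python
import numpy as np

def amplification_ratio(W, delta_W, r):
    """Project W onto the top-r singular directions of ΔW and compare
    Frobenius norms: ||ΔW||_F / ||Uᵀ W Vᵀ||_F."""
    U, _, Vt = np.linalg.svd(delta_W, full_matrices=False)
    Ur, Vr_t = U[:, :r], Vt[:r]          # top-r left/right singular vectors
    projected = Ur.T @ W @ Vr_t.T        # r x r projection of W onto ΔW's subspace
    return np.linalg.norm(delta_W, "fro") / np.linalg.norm(projected, "fro")
```

A large ratio means the update concentrates energy in directions that carry little weight in W itself, consistent with the amplification finding above.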

References

  1. Hu, Edward J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685