LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS (2021)
June 7, 2025 (Originally posted on November 18, 2023)

LoRA is motivated by the findings of Li et al. (2018) and Aghajanyan et al. (2020), which show that overparameterized models tend to converge to solutions that lie within a low-dimensional intrinsic subspace.
The intrinsic dimension refers to the minimum number of trainable parameters needed to reach satisfactory performance on a given task.
LoRA introduces low-rank adaptation by representing the weight updates of dense layers with a low-rank decomposition. Instead of fine-tuning all parameters of a pre-trained model, LoRA freezes the original weights and learns the update as a product of two low-rank matrices during training.
Letting \(W_0\in \mathbb{R}^{d\times k}\) be a pre-trained weight matrix, LoRA replaces it with \(W_0 + BA\), where \(B\in \mathbb{R}^{d\times r}, A\in \mathbb{R}^{r\times k}\), and the rank \(r \ll \min(d, k)\).
The matrix \(A\) is initialized with a Gaussian distribution, while \(B\) is initialized to zero. The resulting update \(BA\) is scaled by \(\frac{\alpha}{r}\), where \(\alpha\) is a hyperparameter. This normalization mitigates sensitivity to the choice of rank and minimizes the need to retune other hyperparameters when \(r\) changes.
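To make these mechanics concrete, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch. The module name `LoRALinear`, the Gaussian scale, and the default values of \(r\) and \(\alpha\) are illustrative assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen W_0 plus a trainable low-rank update BA."""

    def __init__(self, d: int, k: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Stand-in for the pre-trained weight W_0 in R^{d x k}; frozen during training.
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # A is Gaussian-initialized, B starts at zero, so BA = 0 before training.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        # The update BA is scaled by alpha / r.
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + (alpha / r) * B A x
        return x @ self.W0.T + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d=768, k=768, r=8, alpha=16)
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```

Only \(A\) and \(B\) receive gradients, so the number of trainable parameters for this layer drops from \(d\cdot k\) to \(r\,(d + k)\).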
In the Transformer architecture, each self-attention layer contains four key weight matrices: \(W_q, W_k, W_v, \text{and}\ W_o\). In the original "Attention Is All You Need" paper, queries, keys, and values are generated by projecting the same input \(\boldsymbol{x}\) through distinct matrices: $$ \text{head}_i = \text{softmax}\left(\frac{QW_i^Q(KW_i^K)^\top}{\sqrt{d_k}}\right)VW_i^V $$ $$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots , \text{head}_h)W^{O} $$
Here, \(d_{\text{model}}\) is the input embedding dimension, \(h\) is the number of attention heads, and \(d_k\) and \(d_v\) are both \(d_{\text{model}}/h\). The per-head projection matrices \(W_i^Q, W_i^K\) have shape \(\mathbb{R}^{d_{\text{model}}\times d_k}\) and \(W_i^V\) has shape \(\mathbb{R}^{d_{\text{model}}\times d_v}\).
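As a quick sanity check on these shapes, here is a small sketch with illustrative dimensions (\(d_{\text{model}}=768\), \(h=12\); the batch and sequence sizes are arbitrary):

```python
import torch

d_model, h = 768, 12
d_k = d_v = d_model // h               # 64 per head

x = torch.randn(4, 10, d_model)        # (batch, seq, d_model)
W_Q = torch.randn(h, d_model, d_k)     # one d_model x d_k projection per head

# Per-head queries: (batch, h, seq, d_k)
Q = torch.einsum("bsd,hdk->bhsk", x, W_Q)
print(Q.shape)                         # torch.Size([4, 12, 10, 64])
```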
LoRA applies low-rank adaptation directly to the full projection matrices \(W_q, W_k, W_v\), treating each as a single \(d_{\text{model}}\times d_{\text{model}}\) matrix rather than splitting it per head. These matrices are later split across heads as part of the standard multi-head attention computation.
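A sketch of what this looks like for the query projection, using the same illustrative dimensions as above (the variable names are mine, not the paper's):

```python
import torch
import torch.nn as nn

d_model, h, r, alpha = 768, 12, 8, 16
d_k = d_model // h

# Full (unsplit) query projection in R^{d_model x d_model}, frozen.
W_q = nn.Parameter(torch.randn(d_model, d_model), requires_grad=False)
# LoRA factors adapt the whole matrix, not one pair per head.
A = nn.Parameter(torch.randn(r, d_model) * 0.01)
B = nn.Parameter(torch.zeros(d_model, r))

x = torch.randn(4, 10, d_model)                  # (batch, seq, d_model)
q = x @ W_q.T + (alpha / r) * (x @ A.T @ B.T)    # adapted full projection
# The head split happens only afterwards, as in standard multi-head attention.
q = q.view(4, 10, h, d_k).transpose(1, 2)        # (batch, h, seq, d_k)
print(q.shape)                                   # torch.Size([4, 12, 10, 64])
```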