Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par with or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multipl...
While our proposal is agnostic to training objective, we focus on language modeling as our motivating use case. Below is...
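For reference, the conditional language modeling objective referenced here can be written as follows; the notation (a training set $\mathcal{Z}$ of context-target pairs $(x, y)$, pretrained weights $\Phi_0$, and a low-dimensional parameterization $\Theta$ of the update $\Delta\Phi$) is reconstructed from the surrounding discussion. Full fine-tuning maximizes

$$\max_{\Phi} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log\big(P_{\Phi}(y_t \mid x, y_{<t})\big),$$

whereas LoRA keeps $\Phi_0$ frozen and optimizes only the much smaller $\Theta$:

$$\max_{\Theta} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log\big(p_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\big).$$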
The problem we set out to tackle is by no means new. Since the inception of transfer learning, dozens of works have soug...
We describe the simple design of LoRA and its practical benefits. The principles outlined here apply to any dense layers...
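To make this design concrete, below is a minimal PyTorch sketch of a LoRA-augmented linear layer; the class name, initialization scale, and default values of $r$ and $\alpha$ are illustrative assumptions, not the released implementation at the repository above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained weight W0 plus a trainable low-rank update B @ A."""

    def __init__(self, in_features, out_features, r=4, alpha=1.0):
        super().__init__()
        # Pretrained weight W0 stays frozen; it receives no gradient updates.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        # A is initialized with a small random Gaussian, B with zeros, so the
        # update Delta W = B @ A is zero at the start of training.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r  # constant scaling of the update

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Only `lora_A` and `lora_B` receive gradients, which is what shrinks the optimizer state and the per-task checkpoint relative to full fine-tuning.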
In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable p...
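As a back-of-the-envelope sketch of the resulting parameter budget, each adapted $d_{model} \times d_{model}$ matrix contributes $2 \times r \times d_{model}$ trainable parameters; the layer count and width below are the commonly reported GPT-3 175B dimensions and are included purely for illustration.

```python
def lora_param_count(num_layers, d_model, r, matrices_per_layer):
    # Each adapted d_model x d_model matrix adds r * d_model (for A)
    # plus d_model * r (for B) trainable parameters.
    return num_layers * matrices_per_layer * 2 * r * d_model

# Illustrative numbers in the spirit of GPT-3 175B (96 layers, d_model = 12288),
# adapting W_q and W_v with r = 4.
print(lora_param_count(num_layers=96, d_model=12288, r=4, matrices_per_layer=2))
# -> 18,874,368 trainable parameters, versus roughly 175B for full fine-tuning.
```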
We evaluate the downstream task performance of LoRA on RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and GPT-2 ...
RoBERTa (Liu et al., 2019) optimized the pre-training recipe originally proposed in BERT and boosted the latter's task p...
Having shown that LoRA can be a competitive alternative to full fine-tuning on NLU, we evaluate whether LoRA still preva...
Transformer Language Models: Transformer (Vaswani et al., 2017) is a sequence-to-sequence architecture that makes heavy ...
Given the empirical advantage of LoRA, we hope to further explain the properties of the low-rank adaptation learned from...
Given a limited parameter budget, which types of weights should we adapt with LoRA to obtain the best performance on dow...
We turn our attention to the effect of rank $r$ on model performance. We adapt $\{W_q, W_v\}$, $\{W_q, W_k, W_v, W_o\}$,...
We further investigate the relationship between $\Delta W$ and $W$. In particular, does $\Delta W$ highly correlate with...
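One way to sketch this comparison numerically is to project $W$ onto the top-$r$ singular directions of $\Delta W$ and compare Frobenius norms; the helper below is an illustrative assumption, not the analysis code used for the reported results.

```python
import torch

def amplification_factor(delta_w, w, r):
    # Top-r left/right singular directions of the low-rank update Delta W.
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    U_r, Vh_r = U[:, :r], Vh[:r, :]
    # Project W onto the r-dimensional subspace spanned by Delta W's singular vectors.
    w_projected = U_r.T @ w @ Vh_r.T
    # Ratio of ||Delta W||_F to ||U^T W V^T||_F: large values suggest the update
    # amplifies directions that are only weakly represented in W.
    return torch.linalg.norm(delta_w) / torch.linalg.norm(w_projected)
```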
Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage/switch...
Few-shot learning, or prompt engineering, is very advantageous when we only have a handful of training samples. However,...
Adapter layers are external modules added to a pre-trained model in a sequential manner, whereas our proposal, LoRA, can...
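One practical consequence is that the low-rank update can be folded back into the frozen weight before deployment, so serving the adapted model costs exactly one dense matrix multiply per layer, the same as the original model. A minimal sketch (the function name is assumed for illustration):

```python
import torch

def merge_lora(w0, lora_A, lora_B, alpha, r):
    # W = W0 + (alpha / r) * B @ A: after merging, inference uses a single
    # dense weight and incurs no additional latency.
    return w0 + (alpha / r) * (lora_B @ lora_A)
```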
GLUE Benchmark is a wide-ranging collection of natural language understanding tasks. It includes MNLI (inference), SST-2...
D.1 RoBERTa: We train using AdamW with a linear learning rate decay schedule. We sweep learning rate, number of training...
LoRA can be naturally combined with existing prefix-based approaches. LoRA+PrefixEmbed (LoRA+PE) combines LoRA with pref...
F.1 Additional Experiments on GPT-2: We repeat our experiment on DART and WebNLG following the setup of Li & Liang (2021...
In this paper we use the measure $\phi(A, B, i, j) = \psi(U_A^i, U_B^j) = \frac{\|U_A^{i\top} U_B^j\|_F^2}{\min\{i,j\}}$...
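Computed directly from this definition, the measure can be sketched as follows (an illustrative helper, not the analysis code used in the paper); values near 1 indicate large overlap between the top singular subspaces, and values near 0 indicate near-complete separation.

```python
import torch

def subspace_similarity(A, B, i, j):
    # phi(A, B, i, j) = ||U_A^{i,T} U_B^{j}||_F^2 / min(i, j), where U_A^i holds
    # the top-i left singular vectors of A and U_B^j the top-j of B.
    U_A = torch.linalg.svd(A, full_matrices=False).U[:, :i]
    U_B = torch.linalg.svd(B, full_matrices=False).U[:, :j]
    return torch.linalg.norm(U_A.T @ U_B) ** 2 / min(i, j)
```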
H.1 Correlation Between LoRA Modules: See Figure 6 and Figure 7 for how the results presented in Figure 3 and Figure 4 g...