Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par with or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multipl...
While our proposal is agnostic to training objective, we focus on language modeling as our motivating use case. Below is...
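For reference, the conditional language modeling objective referenced here can be written as follows; the notation (a training set $\mathcal{Z}$ of context-target pairs $(x, y)$, pretrained weights $\Phi_0$, and a low-dimensional parameterization $\Theta$ of the update $\Delta\Phi$) is reconstructed from the surrounding discussion. Full fine-tuning maximizes

$$\max_{\Phi} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log\big(P_{\Phi}(y_t \mid x, y_{<t})\big),$$

whereas LoRA keeps $\Phi_0$ frozen and optimizes only the much smaller $\Theta$:

$$\max_{\Theta} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log\big(p_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\big).$$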
The problem we set out to tackle is by no means new. Since the inception of transfer learning, dozens of works have soug...
We describe the simple design of LoRA and its practical benefits. The principles outlined here apply to any dense layers...
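To make this design concrete, below is a minimal PyTorch sketch of a LoRA-augmented linear layer; the class name, initialization scale, and default values of $r$ and $\alpha$ are illustrative assumptions, not the released implementation at the repository above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained weight W0 plus a trainable low-rank update B @ A."""

    def __init__(self, in_features, out_features, r=4, alpha=1.0):
        super().__init__()
        # Pretrained weight W0 stays frozen; it receives no gradient updates.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        # A is initialized with a small random Gaussian, B with zeros, so the
        # update Delta W = B @ A is zero at the start of training.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r  # constant scaling of the update

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Only `lora_A` and `lora_B` receive gradients, which is what shrinks the optimizer state and the per-task checkpoint relative to full fine-tuning.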
In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable p...
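As a back-of-the-envelope sketch of the resulting parameter budget, each adapted $d_{model} \times d_{model}$ matrix contributes $2 \times r \times d_{model}$ trainable parameters; the layer count and width below are the commonly reported GPT-3 175B dimensions and are included purely for illustration.

```python
def lora_param_count(num_layers, d_model, r, matrices_per_layer):
    # Each adapted d_model x d_model matrix adds r * d_model (for A)
    # plus d_model * r (for B) trainable parameters.
    return num_layers * matrices_per_layer * 2 * r * d_model

# Illustrative numbers in the spirit of GPT-3 175B (96 layers, d_model = 12288),
# adapting W_q and W_v with r = 4.
print(lora_param_count(num_layers=96, d_model=12288, r=4, matrices_per_layer=2))
# -> 18,874,368 trainable parameters, versus roughly 175B for full fine-tuning.
```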
We evaluate the downstream task performance of LoRA on RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and GPT-2 ...
RoBERTa (Liu et al., 2019) optimized the pre-training recipe originally proposed in BERT and boosted the latter's task p...
Having shown that LoRA can be a competitive alternative to full fine-tuning on NLU, we evaluate whether LoRA still preva...
Transformer Language Models: Transformer (Vaswani et al., 2017) is a sequence-to-sequence architecture that makes heavy ...
Given the empirical advantage of LoRA, we hope to further explain the properties of the low-rank adaptation learned from...
Given a limited parameter budget, which types of weights should we adapt with LoRA to obtain the best performance on dow...
We turn our attention to the effect of rank $r$ on model performance. We adapt $\{W_q, W_v\}$, $\{W_q, W_k, W_v, W_o\}$,...
We further investigate the relationship between $\Delta W$ and $W$. In particular, does $\Delta W$ highly correlate with...
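One way to sketch this comparison numerically is to project $W$ onto the top-$r$ singular directions of $\Delta W$ and compare Frobenius norms; the helper below is an illustrative assumption, not the analysis code used for the reported results.

```python
import torch

def amplification_factor(delta_w, w, r):
    # Top-r left/right singular directions of the low-rank update Delta W.
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    U_r, Vh_r = U[:, :r], Vh[:r, :]
    # Project W onto the r-dimensional subspace spanned by Delta W's singular vectors.
    w_projected = U_r.T @ w @ Vh_r.T
    # Ratio of ||Delta W||_F to ||U^T W V^T||_F: large values suggest the update
    # amplifies directions that are only weakly represented in W.
    return torch.linalg.norm(delta_w) / torch.linalg.norm(w_projected)
```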
Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage/switch...
Few-shot learning, or prompt engineering, is very advantageous when we only have a handful of training samples. However,...
Adapter layers are external modules added to a pre-trained model in a sequential manner, whereas our proposal, LoRA, can...
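One practical consequence is that the low-rank update can be folded back into the frozen weight before deployment, so serving the adapted model costs exactly one dense matrix multiply per layer, the same as the original model. A minimal sketch (the function name is assumed for illustration):

```python
import torch

def merge_lora(w0, lora_A, lora_B, alpha, r):
    # W = W0 + (alpha / r) * B @ A: after merging, inference uses a single
    # dense weight and incurs no additional latency.
    return w0 + (alpha / r) * (lora_B @ lora_A)
```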
GLUE Benchmark is a wide-ranging collection of natural language understanding tasks. It includes MNLI (inference), SST-2...
D.1 RoBERTa: We train using AdamW with a linear learning rate decay schedule. We sweep learning rate, number of training...
LoRA can be naturally combined with existing prefix-based approaches. LoRA+PrefixEmbed (LoRA+PE) combines LoRA with pref...
F.1 Additional Experiments on GPT-2: We repeat our experiment on DART and WebNLG following the setup of Li & Liang (2021...
In this paper we use the measure $\phi(A, B, i, j) = \psi(U_A^i, U_B^j) = \frac{\|U_A^{i\top} U_B^j\|_F^2}{\min\{i,j\}}$...
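Computed directly from this definition, the measure can be sketched as follows (an illustrative helper, not the analysis code used in the paper); values near 1 indicate large overlap between the top singular subspaces, and values near 0 indicate near-complete separation.

```python
import torch

def subspace_similarity(A, B, i, j):
    # phi(A, B, i, j) = ||U_A^{i,T} U_B^{j}||_F^2 / min(i, j), where U_A^i holds
    # the top-i left singular vectors of A and U_B^j the top-j of B.
    U_A = torch.linalg.svd(A, full_matrices=False).U[:, :i]
    U_B = torch.linalg.svd(B, full_matrices=False).U[:, :j]
    return torch.linalg.norm(U_A.T @ U_B) ** 2 / min(i, j)
```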
H.1 Correlation Between LoRA Modules: See Figure 6 and Figure 7 for how the results presented in Figure 3 and Figure 4 g...