LoRA (Low-Rank Adaptation)
LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large language models by freezing the original model weights and injecting small trainable matrices into each transformer layer.
Instead of updating billions of parameters during fine-tuning, LoRA trains only a small fraction of them (typically 0.1–1% of the original parameter count). It represents each weight update as the product of two small matrices (a low-rank decomposition), dramatically reducing memory usage and training time.
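The decomposition can be sketched in a few lines of NumPy. This is an illustrative toy (the dimensions, rank, and `alpha` scaling value are arbitrary choices, not from the original text), but it shows the core idea: the frozen weight `W` is never modified, and the trainable update is the product `B @ A` of two thin matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 1024, 1024  # shape of one frozen weight matrix in the model
r = 8              # LoRA rank, chosen so that r << min(d, k)
alpha = 16         # common scaling hyperparameter

W = rng.standard_normal((d, k))           # frozen pretrained weight (not trained)
A = rng.standard_normal((r, k)) * 0.01    # trainable, small random init
B = np.zeros((d, r))                      # trainable, zero init so the update starts at 0

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A, but it is never materialized:
    # the low-rank path costs two thin matmuls instead of a full d x k update.
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size              # 1,048,576
lora_params = A.size + B.size     # 16,384
print(f"trainable fraction: {lora_params / full_params:.2%}")  # 1.56%
```

Because `B` is initialized to zero, the adapted model is exactly the base model at the start of training, and only `A` and `B` receive gradients.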
Key advantages include: training on consumer GPUs (vs. requiring multiple A100s), storing multiple fine-tuned versions as small adapter files (typically a few megabytes to a few hundred megabytes, vs. copying the entire model), and easy switching between adapters at inference time.
QLoRA further reduces memory by quantizing the frozen base model to 4-bit precision while keeping the LoRA adapters in higher precision. This enables fine-tuning 65B-parameter models on a single 48 GB GPU.
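To make the quantization step concrete, here is a minimal blockwise absmax 4-bit quantizer in NumPy. Note this is a simplified sketch: QLoRA itself uses the NormalFloat (NF4) data type with double quantization, and the block size and signed range `[-7, 7]` below are illustrative assumptions.

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    """Blockwise absmax quantization to signed 4-bit integers in [-7, 7].

    Each block stores one float scale plus 4-bit codes, so memory drops
    roughly 4x relative to fp16 (8x relative to fp32).
    """
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q, scales, shape):
    """Reconstruct an approximate float weight from codes and per-block scales."""
    return (q * scales).astype(np.float32).reshape(shape)

w = np.random.default_rng(1).standard_normal((128, 64)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

During QLoRA fine-tuning, only this quantized, frozen base model sits in GPU memory; the weights are dequantized on the fly for the forward pass, and gradients flow only into the higher-precision LoRA adapter matrices.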