Andreas Kirsch 🇺🇦
@BlackHC
Jul 8 · 8 tweets
AI Summary

This thread explains GaLore, a new method for memory-efficient training of large language models. Unlike prior low-rank methods that constrain the weights, it projects gradients into a low-rank subspace and recomputes that projection every T steps, saving optimizer memory while adapting to the gradients as they change during training. The approach is promising and opens up new research directions for making AI training faster and cheaper.

1/ I just read the fascinating GaLore paper on memory-efficient LLM training using gradient low-rank projection. Kudos to the authors for this insightful work!

My TL;DR and some thoughts below (as a little paper review) 🧵

Quoting @AnimaAnandkumar:

"For the first time, we show that the Llama 7B LLM can be trained on a single consumer-grade GPU (RTX 4090) with only 24GB memory. This represents more than 82.5% reduction in memory for storing optimizer states during training."

Training LLMs from scratch currently requires huge…

2/ GaLore takes a novel approach compared to methods like LoRA and ReLoRA. Instead of doing a low-rank decomposition in weight space (W = W_0 + BA) and, as in ReLoRA, baking it into the weights every T steps, GaLore performs its low-rank projection in gradient space: G ≈ P @ G' @ Q^T.

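To make the contrast concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code; shapes, rank, and variable names are made up):

```python
import torch

m, n, r = 1024, 1024, 128  # illustrative weight shape and low rank

# LoRA / ReLoRA: low-rank decomposition in *weight* space; B and A are trained.
W0 = torch.randn(m, n)   # frozen pretrained weight
B = torch.zeros(m, r)    # trainable
A = torch.randn(r, n)    # trainable
W_eff = W0 + B @ A       # effective weight the model uses

# GaLore: low-rank projection in *gradient* space; W itself stays trainable.
G = torch.randn(m, n)    # stands in for the full gradient of W
P = torch.randn(m, r)    # projection matrices (obtained via SVD; see tweet 3)
Q = torch.randn(n, r)
G_compact = P.T @ G @ Q         # r x r gradient the optimizer actually sees
G_approx = P @ G_compact @ Q.T  # back to full shape for the weight update
```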

3/ One key aspect of GaLore is that it recomputes the projection matrices P and Q every T steps using the SVD of the full gradient. This allows the low-rank subspace to maximally adapt to the changing gradient distribution during training.

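A minimal sketch of that refresh, assuming PyTorch and a plain top-r SVD (variable names are mine):

```python
import torch

def refresh_projection(G: torch.Tensor, r: int) -> torch.Tensor:
    """Recompute the projector from the SVD of the current full gradient."""
    U, S, Vh = torch.linalg.svd(G, full_matrices=False)
    return U[:, :r]  # top-r left singular vectors become P

# In the training loop, every T steps:
#     if step % T == 0:
#         P = refresh_projection(W.grad, r)
```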

4/ The key design choice, for me, is that GaLore computes full gradients first and uses them to update the projection matrices every T steps. This means there is only one gradient matrix (G) instead of two (for A and B) as in LoRA. P and Q must still be stored, but they aren't optimized!

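To illustrate, here is a hedged sketch of a GaLore-style Adam step with a one-sided projection (just P); all names are mine and Adam's bias correction is omitted for brevity, so treat it as a cartoon of the idea rather than the paper's implementation:

```python
import torch

def galore_adam_step(W, G, P, m1, m2, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Project the full gradient down: (r x m) @ (m x n) -> r x n.
    R = P.T @ G
    # Adam moments are r x n instead of m x n -- this is the optimizer-state saving.
    m1.mul_(b1).add_(R, alpha=1 - b1)
    m2.mul_(b2).addcmul_(R, R, value=1 - b2)
    update = m1 / (m2.sqrt() + eps)
    # Project the update back up and apply it; P is stored but never optimized.
    W.add_(P @ update, alpha=-lr)
```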

5/ The memory savings in GaLore come from the reduced optimizer state and from not introducing any new trainable weights. In their implementation, they use only a single projection matrix P for further savings. This differs from LoRA, where the savings come solely from the low-rank approximation of the weight update.

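Some back-of-envelope arithmetic for a single m × n layer under Adam (which keeps two moment tensors); the numbers are illustrative, not taken from the paper:

```python
m, n, r = 4096, 4096, 256   # illustrative layer shape and rank
full_adam = 2 * m * n       # two full-rank moment tensors
galore = 2 * r * n + m * r  # rank-r moments plus the stored projector P
print(f"full: {full_adam:,} floats, galore: {galore:,} floats, "
      f"{100 * (1 - galore / full_adam):.1f}% smaller")
# -> full: 33,554,432 floats, galore: 3,145,728 floats, 90.6% smaller
```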

7/ While there are similarities to ReLoRA in terms of alternating between low-rank updates and full-rank updates, the specifics of how it's done in GaLore lead to different (better) training dynamics and memory characteristics.

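For contrast, a rough sketch (my own simplification; details vary across ReLoRA variants) of ReLoRA's periodic weight-space merge, which GaLore avoids by only re-deriving the projector:

```python
import torch

def relora_merge(W0: torch.Tensor, B: torch.Tensor, A: torch.Tensor) -> None:
    """ReLoRA-style restart (sketch): fold the adapter into the weight, then reset it.
    GaLore never touches W like this -- it only recomputes P from an SVD."""
    with torch.no_grad():
        W0.add_(B @ A)  # bake the low-rank update into the full weight
        B.zero_()       # B = 0 makes the adapter product zero again
        # ReLoRA also partially resets optimizer state here; omitted in this sketch.
```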

8/ The adaptive choice of low-rank subspace based on gradient SVD is a particularly novel and promising aspect of GaLore. There are exciting research questions one could work on here.


9/ I'm very excited to see further analysis and extensions of these ideas. The low-rank training landscape is a fascinating area of research with a lot of potential for more efficient LLM training.
