A JAX transform for LoRA-fying functions #15840
Replies: 4 comments 16 replies
-
This is really cool, thanks for sharing! It's a great idea, and an amazing package name too 😁 I looked through the source code – the jaxpr interpreter approach looks really solid. I have some very minor comments about how I'd structure things to make it more easily extensible (e.g. keep all the …). Do you see this as kind of a one-off experiment, or something that you're hoping to put more time and development effort into? If it's the latter, I'd be happy to do a deeper code review if you'd find it helpful. Thanks again for sharing!
-
I just want to ask a different question. I also implemented this feature. Materializing W + BA directly is simple to implement and doesn't require modifying the original code, but I'm unsure about its efficiency. Based on a multiplication-count analysis, when in_features * out_features < batch_size * (in_features + out_features), materializing W + BA is more efficient than keeping the product factored as Wx + B(Ax), even though the analysis is simplistic. Taking in_features = out_features = 4096, as is typical for billion-parameter transformers, the materialized version wins once batch_size > 2048. For the addition count, batch_size has to exceed 4096. I'd like to know whether this kind of analysis holds up in practice, and I'd appreciate hearing about your experience if you've implemented both approaches. Thank you
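As a concrete sketch of the comparison described above (plain JAX; the rank value and shapes are illustrative assumptions, not taken from the thread), counting only the multiplications added on top of the shared `x @ W`:

```python
import jax
import jax.numpy as jnp

# Hypothetical shapes: a 4096x4096 projection with a rank-16 LoRA update.
d_in = d_out = 4096
rank = 16
batch = 4096  # tokens processed per step

key = jax.random.PRNGKey(0)
k_w, k_a, k_b, k_x = jax.random.split(key, 4)
W = jax.random.normal(k_w, (d_in, d_out)) * 0.02
A = jax.random.normal(k_a, (d_in, rank)) * 0.01
B = jax.random.normal(k_b, (rank, d_out)) * 0.01
x = jax.random.normal(k_x, (batch, d_in))

# Approach 1: materialize the updated weight once, then do a single matmul.
y1 = x @ (W + A @ B)

# Approach 2: keep the update factored and add its contribution separately.
y2 = x @ W + (x @ A) @ B

# Extra multiplications on top of the x @ W both approaches share:
extra_mults_materialized = rank * d_in * d_out        # forming A @ B
extra_mults_factored = batch * rank * (d_in + d_out)  # (x @ A) @ B
# Materializing wins when d_in * d_out < batch * (d_in + d_out),
# i.e. batch > 2048 for d_in = d_out = 4096.
print(extra_mults_materialized, extra_mults_factored)
print(jnp.allclose(y1, y2, atol=1e-2))  # loose tolerance for float32 accumulation
```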
-
Hey @davisyoshida, I was in the early stages of working on a library with exactly the same goal in mind 😅 But I'm a JAX noob, so your implementation is leagues ahead of mine (and the library name is much cooler!), and I think I'll move on to something else. Super cool stuff and I'll study the code closely! Would you say this library is ready for general use, or are there still a few sharp edges? The API shown in the README looks pretty solid.
-
Hi @davisyoshida, thanks for sharing the library. In the LoRA paper, the authors mention that gradients and optimizer states don't need to be maintained for the frozen parameters.
I read through your implementation, but I could not figure out how you avoid storing optimizer states for the frozen parameters. I think your implementation focuses on transforming the original model, but I still need help understanding how the optimization part works in your examples.
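One way this typically works (a sketch, not necessarily how the Lorax examples do it; the parameter names and the use of optax here are assumptions): since the loss is differentiated only with respect to the trainable LoRA parameters, the optimizer is initialized on that pytree alone, so moments such as Adam's are never allocated for the frozen weights.

```python
import jax
import jax.numpy as jnp
import optax

# Hypothetical setup: frozen base weight W plus trainable LoRA factors A and B.
frozen_params = {"W": jnp.ones((512, 512)) * 0.02}
trainable_params = {
    "A": jax.random.normal(jax.random.PRNGKey(0), (512, 8)) * 0.01,
    "B": jnp.zeros((8, 512)),
}

def loss_fn(trainable, frozen, x):
    # Effective weight W + AB; only A and B receive gradients.
    y = x @ (frozen["W"] + trainable["A"] @ trainable["B"])
    return jnp.mean(y ** 2)

optimizer = optax.adam(1e-4)
# Optimizer state is created only for the trainable pytree, so no Adam
# moments are ever stored for the frozen weight W.
opt_state = optimizer.init(trainable_params)

@jax.jit
def train_step(trainable, opt_state, frozen, x):
    grads = jax.grad(loss_fn)(trainable, frozen, x)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(trainable, updates), opt_state

x = jnp.ones((4, 512))
trainable_params, opt_state = train_step(trainable_params, opt_state, frozen_params, x)
```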
-
I wrote a transformation to automate using LoRA for JAX models: Lorax (I didn't only do this because of the naming opportunity).
LoRA basically replaces products like `Wx` with `(W + BA)x`, where `A` and `B` are skinny, allowing you to save memory by not updating `W`. Lorax also supports some convs and gathers in addition to the basic matmul. Any time there's an op which Lorax doesn't know how to handle, it will raise a warning and just directly calculate the value of `W + BA`.
Minimal example:
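A minimal sketch of the idea in plain JAX (illustrative only, not the Lorax API; `forward`, `frozen_params`, and `trainable_params` are hypothetical names):

```python
import jax
import jax.numpy as jnp

d_in, d_out, rank = 1024, 1024, 8
key = jax.random.PRNGKey(0)
k_w, k_a, k_x = jax.random.split(key, 3)

# Frozen full-rank weight; only the skinny factors A and B are trained.
frozen_params = {"W": jax.random.normal(k_w, (d_out, d_in)) * 0.02}
trainable_params = {
    "A": jax.random.normal(k_a, (rank, d_in)) * 0.01,  # rank x d_in
    "B": jnp.zeros((d_out, rank)),                     # d_out x rank, zero-initialized
}

def forward(trainable_params, frozen_params, x):
    W = frozen_params["W"]
    A, B = trainable_params["A"], trainable_params["B"]
    # (W + B A) x, computed without materializing B A:
    return W @ x + B @ (A @ x)

x = jax.random.normal(k_x, (d_in,))
y = forward(trainable_params, frozen_params, x)

# Gradients flow only into the LoRA factors:
grads = jax.grad(lambda t: jnp.sum(forward(t, frozen_params, x) ** 2))(trainable_params)
```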
Afterwards, you can train by differentiating a loss function w.r.t. `trainable_params` only. I tested with my personal Haiku models while writing this, and have an example using it with a HuggingFace Flax model as well.
I'm generally open to any feedback, since I definitely felt like I was fumbling around a bit getting this working.