Stop using implicit-style differentiation #221

Closed
1 task
ablaom opened this issue Apr 17, 2023 · 18 comments · Fixed by #251
@ablaom
Collaborator

ablaom commented Apr 17, 2023

It seems the style used here is being deprecated and won't work with Flux 0.14:

gs = Flux.gradient(parameters) do


Edit: After discussion below, I suggest we wait on

and refactor to use an optimiser-based solution to weight regularisation, which will avoid the current limitations of explicit differentiation outlined in the discussion. Note, this will likely mean the reported training_loss must change, as it will no longer include the weight penalty. So this will be breaking.
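As a rough sketch, an optimiser-based version of a mixed L1/L2 penalty could be composed from decay rules. This is not a final design: the coefficients are illustrative, and `SignDecay` (the L1 counterpart of `WeightDecay`) only exists in newer Optimisers.jl releases.

```julia
using Flux, Optimisers

chain = Chain(Dense(3 => 5, relu), Dense(5 => 1))

# WeightDecay adds λ2 .* w to each gradient (an L2 penalty); SignDecay adds
# λ1 .* sign.(w) (an L1 penalty). Chaining them folds the regularisation
# into the update step, so it no longer appears in the reported loss.
rule = Optimisers.OptimiserChain(
    Optimisers.WeightDecay(0.01),   # illustrative L2 coefficient
    Optimisers.SignDecay(0.02),     # illustrative L1 coefficient
    Optimisers.Adam(),
)
opt_state = Flux.setup(rule, chain)
```

With this, the training loop computes only `loss(yhat, y[i])` inside `withgradient`, and the penalty is applied by `Flux.update!`.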

@mcabbott
Member

Relatedly, it would be nice if the MLJFlux models listed here https://github.com/FluxML/model-zoo#examples-elsewhere could be updated to use latest Flux, and avoid implicit gradients.

Examples of similar upgrades: https://github.com/FluxML/model-zoo/issues?q=is%3Aclosed+label%3Aupdate+explicit

In the end, Flux 0.14 did not drop support for implicit gradients, but 0.15 should.

@ablaom
Collaborator Author

ablaom commented Jul 30, 2023

@pat-alt Would you have any time and interest in addressing this issue?

@pat-alt
Collaborator

pat-alt commented Jul 31, 2023

That actually syncs well with some of my other outstanding issues, and I think I'll have to address this very same thing in CounterfactualExplanations.jl soon. So yes, please feel free to assign this one to me and I'll look at it in the coming weeks 👍

@pat-alt pat-alt self-assigned this Aug 1, 2023
@pat-alt pat-alt linked a pull request Aug 1, 2023 that will close this issue
2 tasks
@pat-alt
Collaborator

pat-alt commented Aug 1, 2023

I have added a draft for this with very minor changes in #230:

function train!(model::MLJFlux.MLJFluxModel, penalty, chain, optimiser, X, y)
    opt_state = Flux.setup(optimiser, chain)
    loss = model.loss
    n_batches = length(y)
    training_loss = zero(Float32)
    parameters = Flux.params(chain)
    for i in 1:n_batches
        batch_loss, gs = Flux.withgradient(chain) do m
            yhat = m(X[i])
            pen = penalty(parameters) / n_batches
            loss(yhat, y[i]) + pen
        end
        training_loss += batch_loss
        Flux.update!(opt_state, chain, gs[1])
    end
    return training_loss / n_batches
end

Currently, the following test fails:

[ Info: regularization has an effect:
[ Info: acceleration = CPU1{Nothing}(nothing)
regularization has an effect (typename(CPU1)): Test Failed at /Users/patrickaltmeyer/code/MLJFlux.jl/test/integration.jl:25
  Expression: !(loss2 ≈ loss3)
   Evaluated: !(0.8354643267207931 ≈ 0.8354643267207931)

I'm not quite sure what's happening. @mcabbott can you spot anything obviously wrong with this?

@ToucheSir
Member

ToucheSir commented Aug 1, 2023

That's because the regularization term is still using implicit params. Something like FluxML/Flux.jl#2040 (comment) will be needed for explicit params.

@mcabbott
Member

mcabbott commented Aug 1, 2023

parameters = Flux.params(chain) outside the gradient context will only work in the implicit style -- changing the explicit local m will not change pen. (Edit -- as ToucheSir says, while I was typing!)

What is penalty? For L2 it will be better to use WeightDecay like this: http://fluxml.ai/Flux.jl/stable/training/training/#Regularisation
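The linked docs section boils down to something like the following (a sketch, with the coefficient and learning rate illustrative):

```julia
using Flux

model = Chain(Dense(3 => 5, relu), Dense(5 => 1))

# Instead of adding sum(abs2, w) terms to the loss, compose WeightDecay
# with the base optimiser. WeightDecay(0.42) adds 0.42 .* w to each
# gradient, i.e. the derivative of a (0.42/2) * sum(abs2, w) penalty.
opt_state = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)
```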

@pat-alt
Collaborator

pat-alt commented Aug 1, 2023

Thanks both!

What is penalty? For L2 it will be better to use WeightDecay like this: http://fluxml.ai/Flux.jl/stable/training/training/#Regularisation

Currently, penalty functions are explicitly defined callable objects in MLJFlux (see here). I saw the note on WeightDecay in the Flux docs and was wondering if it's worth changing that.

In any case, I can't really get either of the approaches you suggest to work in this particular case, so we may indeed want to rethink the implementation of the penalty functions, for example by using WeightDecay instead. Will require a little extra work, but should be doable. @ablaom what do you think?

@ToucheSir
Member

I can't really get either of the approaches you suggest to work in this particular case

Can you elaborate? I'm not sure I understand why/how they wouldn't work.

@pat-alt
Collaborator

pat-alt commented Aug 1, 2023

Sure!

Moving the params call inside as follows

function train!(model::MLJFlux.MLJFluxModel, penalty, chain, optimiser, X, y)
    opt_state = Flux.setup(optimiser, chain)
    loss = model.loss
    n_batches = length(y)
    training_loss = zero(Float32)
    for i in 1:n_batches
        batch_loss, gs = Flux.withgradient(chain) do m
            yhat = m(X[i])
            pen = penalty(Flux.params(m)) / n_batches
            loss(yhat, y[i]) + pen
        end
        training_loss += batch_loss
        Flux.update!(opt_state, chain, gs[1])
    end
    return training_loss / n_batches
end

the tests just seem to get stuck at some point. I may try committing this now, but at least locally on my machine things hang.

Alternatively, using the approach in FluxML/Flux.jl#2040 (comment) as follows

function train!(model::MLJFlux.MLJFluxModel, penalty, chain, optimiser, X, y)
    opt_state = Flux.setup(optimiser, chain)
    loss = model.loss
    n_batches = length(y)
    training_loss = zero(Float32)
    for i in 1:n_batches
        batch_loss, gs = Flux.withgradient(chain) do m
            yhat = m(X[i])
            l = loss(yhat, y[i])
            reg = Functors.fmap(penalty, m; exclude=Flux.trainable)
            l + reg / n_batches
        end
        training_loss += batch_loss
        Flux.update!(opt_state, chain, gs[1])
    end
    return training_loss / n_batches
end

I get the following error:

[ Info: acceleration = CPU1{Nothing}(nothing)
┌ Warning: Layer with Float32 parameters got Float64 input.
│   The input will be converted, but any earlier layers may be very slow.
│   layer = Dense(5 => 15)      # 90 parameters
│   summary(x) = "5×20 Matrix{Float64}"
└ @ Flux ~/.julia/packages/Flux/n3cOc/src/layers/stateless.jl:60
fit! and dropout (typename(CPU1)): Error During Test at /Users/patrickaltmeyer/code/MLJFlux.jl/test/test_utils.jl:38
  Got exception outside of a @test
  TypeError: non-boolean (NamedTuple{(:layers,), Tuple{Tuple{Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dropout{Float64, Colon, Random.TaskLocalRNG}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}) used in boolean context

Perhaps it has to do with the fact that the penalizers aren't Functors?

@ToucheSir
Member

Yeah, I wouldn't try the first version you have there; I was referring to the second one, or @mcabbott's suggestion about moving things to the optimization step.

I get the following error: ...

Pretty sure that's due to a typo in the original example code snippet. See FluxML/Flux.jl#2040 (comment)

@pat-alt
Collaborator

pat-alt commented Aug 11, 2023

Hmm, in that case I get the following error: MethodError: no method matching Dense(::Float32, ::Float32, ::typeof(identity)). Any ideas?

@ablaom
Collaborator Author

ablaom commented Aug 13, 2023

Thanks @pat-alt for this work!

In any case, I can't really get either of the approaches you suggest to work in this particular case, so we may indeed want to rethink the implementation of the penalty functions, for example by using WeightDecay instead. Will require a little extra work, but should be doable. @ablaom what do you think?

WeightDecay only provides a mechanism for L2 regularisation, but the current implementation provides for a combination of both L1 regularisation (good for feature selection) and L2 regularisation. It seems a pity to drop support for a feature to accommodate the new explicit syntax.

I don't know what the source of your current issue is.

@ablaom
Collaborator Author

ablaom commented Sep 5, 2023

@pat-alt I don't think your use of Functors.fmap is valid here. The penalty function takes a tuple of matrices, as returned by Flux.params(chain), and returns a single aggregate number.

Your first suggestion (with params) actually works but is 3600 times slower than the implicit style code on the dev branch, when tested on a small model / dataset.

@ToucheSir To implement mixed L1/L2 penalties (not just L2 ones) I don't really see how to avoid the params in the withgradient block. (And this is after all a suggestion in the Flux documentation - second code block here). Am I to conclude that explicit-Zygote style AD is just no good on this problem?

@ToucheSir
Member

To implement mixed L1/L2 penalties (not just L2 ones) I don't really see how to avoid the params in the withgradient block. (And this is after all a suggestion in the Flux documentation - second code block here). Am I to conclude that explicit-Zygote style AD is just no good on this problem?

It's arguably better, but it requires some helper functionality that isn't currently nicely packaged up in a library. FluxML/Optimisers.jl#57 is one example of how to do this and how we're thinking about packaging it up going forwards, but the problem with general solutions is that they take time. For this work, you may be better served by implementing a similar but more constrained version on top of Functors.jl and Optimisers.jl which only includes as much as MLJFlux needs for regularization. If you do, feel free to ping me for input.

@ablaom
Collaborator Author

ablaom commented Sep 5, 2023

@ToucheSir Thanks for the prompt response and offer of help.

So, with the apparatus you describe (Functors.jl, etc ) what code replaces the following to avoid the params call, working for a generic Flux model, chain, and so that differentiating chain -> penalty is free of issues?

# function to return penalty on an array:
f(A) = 0.01*sum(abs2, A) + 0.02*sum(abs, A)

f(ones(2,3))
# 0.6000000000000001

chain = Chain(Dense(3=>5), Dense(5=>1, relu))
penalty = sum(f.(Flux.params(chain)))

@ablaom
Collaborator Author

ablaom commented Sep 5, 2023

Or if you prefer, how should the regularisation example in the Flux documentation be rewritten (without the weight-decay trick, which does not work for an L1 penalty)?

@ToucheSir
Member

f(A) = ...
penalty = mytotal(f, chain)

Where mytotal is a simplified form or direct copy of Optimisers.total as I mentioned earlier.
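For what it's worth, a minimal sketch of such a helper, assuming `Optimisers.trainables` (a later addition to Optimisers.jl that is designed to be differentiable, unlike `Flux.params`):

```julia
using Flux, Optimisers

# Simplified stand-in for the `mytotal` above: sum f over every trainable
# array in the model. It ignores parameter sharing, which a real helper
# would need to handle.
mytotal(f, model) = sum(f(A) for A in Optimisers.trainables(model))

# The mixed L1/L2 penalty from the earlier comment:
f(A) = 0.01 * sum(abs2, A) + 0.02 * sum(abs, A)

chain = Chain(Dense(3 => 5), Dense(5 => 1, relu))
penalty, grads = Flux.withgradient(m -> mytotal(f, m), chain)
```

Whether this composes cleanly with the train! loop above is worth testing, but it avoids any params call inside the withgradient block.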

...(without the weight-decay trick , which does not work for L1 penalty)?

Side note, but I remembered looking into this a few months back and coming across https://stackoverflow.com/questions/42704283/l1-l2-regularization-in-pytorch/66630301#66630301, which suggests that L1 could be implemented using a similar trick. Whether that would be compatible with MLJFlux's API I'm not sure, but we could consider adding it to Optimisers.jl.
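In that spirit, an L1 analogue of WeightDecay can be written as a custom Optimisers.jl rule. This is a sketch of the idea (essentially what later landed in Optimisers.jl as `SignDecay`), with an illustrative coefficient:

```julia
import Optimisers

# L1 decay: adds λ .* sign.(w) to the gradient, the subgradient of a
# λ * sum(abs, w) penalty. Mirrors how WeightDecay implements L2.
struct L1Decay{T} <: Optimisers.AbstractRule
    lambda::T
end

# This rule is stateless, so init returns nothing:
Optimisers.init(::L1Decay, x::AbstractArray) = nothing

function Optimisers.apply!(o::L1Decay, state, x, dx)
    return state, @. dx + o.lambda * sign(x)
end
```

Chained as `OptimiserChain(L1Decay(0.02), Adam())`, this applies the L1 penalty at the update step instead of inside the loss.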

@ablaom
Collaborator Author

ablaom commented Sep 7, 2023

Thanks for the help @ToucheSir . Unfortunately, Optimisers.total is not working for me. I've tried some variations on that approach but without any luck.

I suggest we wait on the WeightDecay extension referenced above and switch to that approach, which is likely more performant anyhow.
