Simplifying constants in result or during search #128

Open
johanbluecreek opened this issue Sep 10, 2022 · 4 comments
Labels
feature (importance: low) Low-importance feature

Comments

@johanbluecreek
Contributor

I have used this package for a workflow that is basically: solve a differential equation (DE) using DifferentialEquations.jl, and then give that solution to SymbolicRegression.jl to find the analytical expression. From the DE it is quite obvious that the resulting expression should only contain integers or rationals. The constant optimization carried out on the expressions gets very close, but it would be nice if there were either:

  • a function that can, post-search, fix numerical constants to some exact numerical constant, or
  • an option for the search that would act with some such function, as described above, on the candidate expressions during the search

Is there such functionality already in SymbolicRegression.jl that I have missed, or would it be useful to have?

A simple example would be a solution like 5.0000000002*x1 being rounded to 5.0 * x1; more advanced would be rationals, with 0.200000002 * x2 fixed to x2 / 5.0 or similar; more advanced still, sin(x1 * 0.31830988618454) becoming sin(x1 / π). In SymPy there is a function nsimplify that handles the numerical part of such functionality, and it works quite well, e.g.

>>> from sympy import nsimplify, GoldenRatio, pi, E
>>> nsimplify(0.865255979442955, [GoldenRatio, pi, E], full=True)
E/pi

but I don't know if there is a Julia package that does the same thing.
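
For a rough sense of what this could look like in Julia, here is a minimal nsimplify-like sketch using only Base's rationalize; the candidate constants, tolerance, and denominator cap are illustrative assumptions, not an existing package API:

function nsimplify_const(x::Real; tol=1e-6, maxden=50)
    # Try small rationals, and rational multiples of a few named constants:
    for (name, c) in (("", 1.0), ("π", float(π)), ("1/π", 1/π), ("e", float(ℯ)))
        q = rationalize(x / c; tol=tol)
        # Reject large denominators (e.g. 113//355), so that the
        # named-constant candidates get a chance to match instead:
        if denominator(q) <= maxden && isapprox(x, q * c; rtol=tol)
            return isempty(name) ? string(q) : "$q * $name"
        end
    end
    return string(x)  # no simple match; keep the float as-is
end

nsimplify_const(5.0000000002)      # "5//1"
nsimplify_const(0.200000002)       # "1//5"
nsimplify_const(0.31830988618454)  # "1//1 * 1/π"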

@MilesCranmer
Owner

MilesCranmer commented Sep 10, 2022

Interesting question. I see two options:

1. Post-process expressions.

You could use SymbolicRegression.jl like an nsimplify on steroids. For example, let's say we want to find a Ramanujan-like approximation to pi. To do this, you pass integers as features, and set complexity_of_constants=100, which makes real constants prohibitively complex to use.

options = Options(
    # Make constants prohibitively expensive:
    complexity_of_constants=100, 
    unary_operators=(sqrt, square), 
    binary_operators=(+, *, /, -), 
    mutationWeights=[0.0, 0.47, 0.79, 5.1, 1.7, 0.0020, 0.00023, 0.21],
    # ^ Set p(mutate_constant)=0.0
    shouldOptimizeConstants=false,
    # ^ Set constant optimization off (so we don't waste cycles)
    parsimony=0.001,
)
# Integers from 1 to 10:
X = reshape(collect(1.0:10.0), 10, 1)
y = [float(pi)]

EquationSearch(X, y; options=options, multithreading=true, niterations=1000)

This gives me as output:
[Screenshot of the search output, showing several recovered approximations of pi]
which is actually pretty good, recovering many approximations of pi! https://en.wikipedia.org/wiki/Approximations_of_%CF%80 (I wonder if some of these are even known...?)

You could do something similar for other constants you wish to include; you could also set varMap so that the printout shows the constant names in such a case.
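
For instance (a hedged sketch reusing the keyword API from the example above; the constant choices and varMap names are illustrative):

# Named constants passed as three "features" over a single row:
X = reshape(Float64[pi, ℯ, Base.MathConstants.golden], 3, 1)
y = [0.865255979442955]  # the target value from the nsimplify example
EquationSearch(X, y; options=options, varMap=["pi", "e", "phi"],
               multithreading=true, niterations=100)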

2. Search directly for integers/rationals.

One option is to apply solution 1 to the search itself, i.e., concatenate the constants/integers with the data features at each row.

More generally (say you want to include any integer), this is a bit tricky, especially because Optim.jl can only optimize float-like constants. But you could rely on the mutate_constant function to explore the space of integers.

Right now, the library assumes that Node{T} and Dataset{T} share the same type T; in other words, expression constants have the same type as the element type of the dataset. However, this isn't really necessary: one could rewrite the backend to allow Node{T1} with Dataset{T2}, which would let you store integers or rational numbers directly in the expressions while still computing with float precision. One could then look at using JuMP.jl in place of Optim.jl for explicit integer/rational optimization, if such a thing is possible.
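
As a toy illustration of that type split (this is not the library's internals, just the idea):

# Constant stored exactly as a Rational; evaluation done at float precision.
struct ToyConst{T}
    val::T
end

evaluate(c::ToyConst, ::Type{F}) where {F<:AbstractFloat} = convert(F, c.val)

c = ToyConst(1 // 5)  # stored exactly
evaluate(c, Float64)  # 0.2, computed in Float64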

Another option is to implement a function

function val(tree::Node{T})::Int where {T}
    # Read the stored constant, converted to the target constant type.
    return convert(Int, tree.val)
end

and make it so that every place tree.val is read, val(tree) is called instead. You could then set that output type (here, Int) to some user-specified type. I think the Node{T1} / Dataset{T2} solution is more elegant, though.
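
One caveat for that sketch: convert throws on non-integral floats, so a practical accessor would presumably round first:

convert(Int, 5.0000000002)  # throws InexactError
round(Int, 5.0000000002)    # 5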

@qwertyjl

This comment was marked as off-topic.

@MilesCranmer
Owner

Hi @qwertyjl - I think this question is unrelated to the use of SymbolicRegression.jl/PySR. Apologies, but I do not have time to answer general math/science questions.
Best,
Miles

@MilesCranmer MilesCranmer added the feature (importance: low) Low-importance feature label Sep 24, 2022
@dan-blank

dan-blank commented Jun 6, 2024

@MilesCranmer

Right now, the library assumes that Node{T} and Dataset{T} share the same type T; in other words, expression constants have the same type as the element type of the dataset. However, this isn't really necessary: one could rewrite the backend to allow Node{T1} with Dataset{T2}, which would let you store integers or rational numbers directly in the expressions while still computing with float precision. One could then look at using JuMP.jl in place of Optim.jl for explicit integer/rational optimization, if such a thing is possible.

That would indeed be more elegant, I think!

Nelder-Mead, the simplex optimization, looks like it could be made to work on integers (although I assume it will not be optimal). An alternative would be to implement a really simple optimization for integers ("add/subtract 1 to a constant") together with a warning, and offer users a way to supply a custom optimizer if they want (i.e., they can write a function that takes the same arguments as the call to Optim.jl); a minimal sketch follows below. What do you think about that?
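
For concreteness, here is a minimal sketch of that add/subtract-1 idea; loss is a hypothetical closure mapping a vector of integer constants to the dataset loss, and nothing here is existing SymbolicRegression.jl API:

function integer_coordinate_descent!(loss, consts::Vector{Int}; max_passes=100)
    best = loss(consts)
    for _ in 1:max_passes
        improved = false
        for i in eachindex(consts), step in (-1, 1)
            consts[i] += step  # trial move
            l = loss(consts)
            if l < best
                best, improved = l, true
            else
                consts[i] -= step  # revert
            end
        end
        improved || break  # no ±1 move improves: local minimum
    end
    return consts, best
end

# Toy usage, minimized at (3, -7):
integer_coordinate_descent!(c -> (c[1] - 3)^2 + (c[2] + 7)^2, [0, 0])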

In any case, the split into Node{T1}, Dataset{T2} needs to deal with the optimization too, I think, and also with simplification, since folding something like const1 / const2 might introduce rationals into a formula that should only contain integers.
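
To make the simplification concern concrete:

3 / 2   # 1.5, a non-integer constant if folded back into the tree
3 // 2  # 3//2, exact as a Rational but still not an Int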

I just came across Julia and this project, so I hope I do not talk too much nonsense! :)

Thank you for creating this great project!


@johanbluecreek Locally, I made SymbolicRegression.jl generate any integer by using these adaptations:

  • (Like Miles hinted) I changed mutate_constant to basically this:
    newval = node.val + convert(T, rand(rng, -5:5))
    node.val = newval
  • make_random_leaf now looks like this:
function make_random_leaf(
    nfeatures::Int, ::Type{T}, ::Type{N}, rng::AbstractRNG=default_rng()
) where {T<:DATA_TYPE,N<:AbstractExpressionNode}
    if rand(rng, Bool)
        return constructorof(N)(; val=convert(T, rand(rng, -50:50))) # changed to return integers
    else
        return constructorof(N)(T; feature=rand(rng, 1:nfeatures))
    end
end
  • Also set should_simplify=false, since otherwise constants might get optimised into non-integers. However, I noticed that one can sometimes leave it on, since those (mostly small) results of simplification can get dominated by newly introduced integer constants.
