[Feature]: Ensure constant optimization for output formulas #379

Open
Nithouson opened this issue Jul 12, 2023 · 12 comments
Labels: enhancement (New feature or request)

Comments

@Nithouson

Feature Request

Hi, thanks for developing this helpful tool.
I had assumed that the constants in hall-of-fame formulas are always optimized, but I found that this is not the case, even after setting optimize_probability=1.0. For example, one resulting formula on my data was
52.16375*x2*(x14 + 0.2914053*x24)/x0**2, loss=219.1645146
and it had been listed in the hall of fame for multiple iterations before the symbolic regression process finished. I wrote some code to optimize the constants a, b in lambda x: a*x[2]*(x[14] + b*x[24])/x[0]**2 using BFGS via scipy.optimize.minimize, with initial position [52.16375, 0.2914053], maxiter=8, and the same loss function as the symbolic regression. This found a better set of constants:
68.0186345*x2*(x14 + 0.1651678*x24)/x0**2, loss=217.6017091
I think this is caused by the age-based regularization: although constants are optimized after each iteration (when optimize_probability=1.0), a hall-of-fame formula may have been removed from the population without ever being optimized.
Hence, I suggest adding an option to re-optimize the constants of the hall-of-fame formulas after the symbolic regression process finishes. Users can implement this themselves, similarly to Discussion #255 (a sketch follows below), but I think it would be more convenient to have this feature built in.
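
For concreteness, here is a minimal sketch of the manual re-optimization described above (X and y stand in for my private dataset, so the numbers shown come from that data rather than from running this as-is):

import numpy as np
from scipy import optimize

# X, y: placeholders for my (private) dataset, shapes (n, 25) and (n,)

def expr(theta, x):
    a, b = theta
    return a * x[2] * (x[14] + b * x[24]) / x[0] ** 2

def l1_loss(theta):
    # the same elementwise L1 loss used in the symbolic regression
    return np.mean([abs(yi - expr(theta, xi)) for xi, yi in zip(X, y)])

res = optimize.minimize(l1_loss, np.array([52.16375, 0.2914053]),
                        method="BFGS", options={"maxiter": 8})
print(res.x, res.fun)  # on my data: [68.0186345, 0.1651678], 217.6017091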

@Nithouson added the enhancement label on Jul 12, 2023
@MilesCranmer (Owner)

Thanks for the detailed note on this; it is quite helpful!

Just to double-check: when the probability of optimizing is 1.0, does this issue go away? In other words, do you mean it's because the constants did not get optimized?

Or, if not, have you tried the other optimization hyperparameters? Maybe it is the specific optimizer used (BFGS with 8 steps) rather than the evolutionary strategy itself?

@Nithouson (Author)

This issue does not go away when the probability of optimization is 1.0; the case I reported happened with optimize_probability=1.0.

If I understand correctly, the default constant optimizer in PySR is BFGS with 8 iterations (which I didn't change). I used the same settings on the output formula and found an improvement, so I suspect this formula had not been optimized by PySR; otherwise, PySR would have found the better set of constants, too.
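
For reference, a sketch of the relevant settings (parameter names as in the PySR docs; the values marked as defaults are what I believe them to be):

from pysr import PySRRegressor

model = PySRRegressor(
    optimizer_algorithm="BFGS",  # believed default
    optimizer_iterations=8,      # believed default; left unchanged in my runs
    optimize_probability=1.0,    # raised from the believed default of 0.14
)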

@MilesCranmer (Owner)

What is the default optimizer in SciPy?

@MilesCranmer (Owner)

This is the function optimization call: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/0f47c6baf783b436ccecd5e635f692516c92d963/src/ConstantOptimization.jl#L29 (Julia code). It would be insightful to see whether it actually gets stuck at those incorrect constants, or whether it does find them given enough steps.

@Nithouson (Author)

The optimize.minimize() function in SciPy supports multiple optimizers. If no method is specified, SciPy chooses one of BFGS, L-BFGS-B, or SLSQP, depending on whether the problem has bounds or constraints.
In the case I reported, I specified BFGS to align with PySR.
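
For example, with a toy objective (just to show the method selection):

import numpy as np
from scipy import optimize

f = lambda t: (t[0] - 1.0) ** 2 + (t[1] + 2.0) ** 2  # toy objective

res = optimize.minimize(f, np.zeros(2))                 # SciPy picks BFGS here
res = optimize.minimize(f, np.zeros(2), method="BFGS")  # pinned explicitly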

@MilesCranmer (Owner)

Hm, so if the optimizer is the same, and if there is a 100% chance of optimization occurring, then what could make up the difference?

Do you have a full MWE (i.e., code) of this issue?

Note that you can set deterministic=True, random_state=0, multithreading=False, procs=0 for reproducible debugging.

@Nithouson (Author)

I cannot make my data public at the moment, but I will try to work out a reproducible example of this issue with synthetic data.

I suspect this issue is caused by the age-based regularization. PySR optimizes the constants after each iteration, and each iteration contains multiple rounds of population evolution, with the oldest expression replaced in every round. Thus, a hall-of-fame formula may already have been removed from the population by the time the iteration finishes, and so it never gets optimized.

@MilesCranmer (Owner)

Hm, it should be copied when saved to the hall of fame, i.e., there shouldn't be a data race. But if there is one, we need to fix it.

@MilesCranmer (Owner)

Yeah, mutations always occur on a copied tree: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/727493db3e9c5f17335313fc56f2612b7c82bc32/src/Mutate.jl#L100

So I'm not sure what's going on. If you follow up with an MWE, I can have a look.

@Nithouson (Author)

Here is a reproducible script where the constants in output formulas are not optimal.

from pysr import PySRRegressor
import numpy as np
from scipy import optimize

np.random.seed(130801)
X = np.abs(np.random.randn(1000, 5)) + 1.0
y = 13.0801 * (1 + 0.4 * np.random.randn(1000)) * X[:, 2] ** 1.5617 * X[:, 3] / X[:, 0] ** 1.7554

model = PySRRegressor(
    niterations=100,
    binary_operators=["+", "-", "*", "/", "^"],
    unary_operators=["exp", "log"],
    complexity_of_operators={"exp": 2, "log": 2},
    loss="loss(prediction, target) = abs(prediction - target)",
    optimize_probability=1.0,  # ensure constant optimization after each iteration
    deterministic=True,
    random_state=0,
    multithreading=False,
    procs=0,
)

model.fit(X, y)
df = model.equations_
print(df['sympy_format'][7], df['loss'][7])
# (x2**0.8403867/x0)**1.7233436*(x3*(x2 + 12.663361) - x4) 8.56536


def L1_loss(theta):
    # Refit the three constants of the id=7 expression under the same
    # elementwise L1 loss used in the search.
    a, b, c = theta
    expr = lambda x: (x[2]**a/x[0])**b*(x[3]*(x[2] + c) - x[4])
    return sum(abs(y[i] - expr(X[i])) for i in range(len(y))) / len(y)


init_param = [0.8403867, 1.7233436, 12.663361]
print(L1_loss(init_param))  # 8.565360387966996
res = optimize.minimize(L1_loss, np.asarray(init_param), method="BFGS",
                        options={"disp": True, "maxiter":8})
print(res.x, res.fun)
# [ 0.84992441  1.71813267 12.65474661] 8.562296280088715

In this example, the constants in most output equations (other than the one with id=7) seem to be fine. On my real dataset, however, the constants in most equations can be improved further, and the discrepancy in loss values is more pronounced.
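
To quantify this across the whole table, here is a rough sketch of a helper (hypothetical, not part of the PySR API) that refits the constants of every hall-of-fame equation produced by the script above:

import sympy

def reoptimize_constants(expr, X, y, maxiter=200):
    # Refit the numeric constants of a sympy expression with BFGS,
    # under the same L1 loss as above. A sketch, not a PySR feature.
    # Note: repeated constant values are tied to one parameter here.
    consts = sorted(expr.atoms(sympy.Float), key=float)
    if not consts:
        return None  # nothing to optimize
    params = sympy.symbols(f"c0:{len(consts)}")
    xs = sympy.symbols(f"x0:{X.shape[1]}")
    f = sympy.lambdify((params, xs), expr.subs(dict(zip(consts, params))), "numpy")

    def l1_loss(theta):
        return np.mean(np.abs(y - f(theta, tuple(X.T))))

    theta0 = np.array([float(c) for c in consts])
    return optimize.minimize(l1_loss, theta0, method="BFGS",
                             options={"maxiter": maxiter})

for i, row in model.equations_.iterrows():
    res = reoptimize_constants(row["sympy_format"], X, y)
    if res is not None:
        print(i, row["loss"], "->", res.fun)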

@MilesCranmer (Owner)

Thanks for putting in the time to make this, it is quite helpful!

I guess now we need to rule out various causes:

  1. Is it the case that the optimization simply does not occur as frequently as it should, as you suspect? Or, is it instead:
  2. The optimization routine used in the search differs in such a way that it finds non-optimal constants?

The other difference between scipy.optimize and the Julia backend is the use of a backtracking line search (here):

            algorithm = Optim.BFGS(; linesearch=LineSearches.BackTracking())

This backtracking is used to deal with nonlinearities and discontinuities in candidate expressions, but it is possible that it results in non-optimal constants in some cases.

To really test (2) we could simply see if calling Optim.jl with these settings differs from scipy in the constants it finds.

@Nithouson (Author)

Currently I assume the difference comes from the optimization routine (cause 2). One possibility is that the default optimizer_iterations=8 is inadequate for convergence to the optimal constants. It may also be the different settings of the BFGS algorithm, as you pointed out.
I think I previously misunderstood the workflow of PySR. The hall of fame is updated at the end of each iteration, after the "simplify and optimize" operation. Hence, if we set optimize_probability=1.0, every equation in the hall of fame should have been optimized (although the number of optimizer iterations may be inadequate). If this is correct, cause 1 should not happen.
Increasing optimizer_iterations may yield better constants, but it greatly increases the running time. Hence, I request an option that re-optimizes the constants in the final hall of fame before PySR exits, possibly with more iterations than the optimizer_iterations used during the search.
Also, I'm sorry that I cannot test or debug the Julia code for now, as I am not familiar with the language; the best I can offer is the Python-side check sketched below.
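
As a quick test of the iteration-count hypothesis (reusing L1_loss and init_param from my script above; the budgets are arbitrary):

for maxiter in (8, 64, 512):
    res = optimize.minimize(L1_loss, np.asarray(init_param),
                            method="BFGS", options={"maxiter": maxiter})
    print(maxiter, res.nit, res.fun)
# If the loss keeps dropping as maxiter grows, the per-call budget of 8 is a
# plausible culprit; if it plateaus already at 8 steps, the difference lies
# elsewhere (e.g., in the backtracking line search).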
