[Feature]: Ensure constant optimization for output formulas #379
Comments
Thanks for the detailed note on this, it is quite helpful! Just to double-check: when the probability of optimizing is 1.0, does this issue go away? In other words, do you mean it happens because the constants did not get optimized? Or, if not, have you tried the other optimization hyperparameters? Maybe it is the specific optimizer used (BFGS with 3 steps) rather than the evolutionary strategy itself?
This issue does not go away when the probability of optimization is 1.0. The case I reported happened with `optimize_probability=1.0`. If I understand correctly, the default constant optimizer in PySR is BFGS, with 8 iterations (which I didn't change in PySR). I applied the same settings to the output formula and found an improvement, so I suspect this formula had not been optimized by PySR; otherwise, PySR would have found a better set of constants, too.
What is the default optimizer in SciPy?
This is the function optimization call: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/0f47c6baf783b436ccecd5e635f692516c92d963/src/ConstantOptimization.jl#L29 (Julia code). It would be insightful to see whether it actually gets stuck at those incorrect constants, or if it does actually find them with enough steps. |
The `scipy.optimize.minimize()` function supports multiple optimizers. If no method is specified, SciPy chooses one of BFGS, L-BFGS-B, or SLSQP, depending on whether the problem has bounds or constraints.
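To make the default-method selection concrete, here is a minimal sketch (on a toy objective, not the reporter's loss function) showing the three cases SciPy distinguishes:

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

# Simple smooth objective with its minimum at (1, 2).
def f(p):
    return (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2

x0 = np.zeros(2)

# No bounds or constraints -> SciPy defaults to BFGS.
res_plain = minimize(f, x0)

# Bounds only -> SciPy defaults to L-BFGS-B.
res_bounded = minimize(f, x0, bounds=[(0.0, 5.0), (0.0, 5.0)])

# Constraints present -> SciPy defaults to SLSQP.
res_constrained = minimize(f, x0, constraints=[LinearConstraint(np.eye(2), 0.0, 5.0)])
```

So the reporter's unconstrained, unbounded run would have used BFGS, the same family of optimizer as the Julia backend.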
Hm, so if the optimizer is the same, and there is a 100% chance of optimization occurring, then what could make up the difference? Do you have a full MWE (i.e., code) of this issue? Note that there is
I cannot make my data public at the moment, but I will try to work out a reproducible script for this issue with synthetic data. I suspect this issue is caused by the age-based regularization: PySR optimizes the constants after each iteration, and one iteration contains multiple rounds of population evolution in which the oldest expression is replaced each round. Thus, the hall-of-fame formula may have been removed from the population by the time an iteration finishes, and so it never gets optimized.
Hm, it should be copied when saved to the hall of fame, i.e., there shouldn't be a data race. But if there is one, we need to fix it.
Yeah, mutations always occur on a copied tree: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/727493db3e9c5f17335313fc56f2612b7c82bc32/src/Mutate.jl#L100 So I'm not sure what's going on. If you follow up with an MWE, I can have a look.
Here is a reproducible script where the constants in output formulas are not optimal.
In this example, the constants in most output equations (other than the one with id=7) seem to be OK. On my real dataset, however, the constants in most equations can be further improved, and the discrepancy in loss values is more notable.
Thanks for putting in the time to make this, it is quite helpful! I guess now we need to rule out various causes:
The other difference between `scipy.optimize` and the Julia backend is the use of a backtracking line search (here): `algorithm = Optim.BFGS(; linesearch=LineSearches.BackTracking())`. This backtracking is used to deal with non-linearities and discontinuities in potential expressions, but it is possible that it results in non-optimal constants in some cases. To really test (2), we could simply check whether calling Optim.jl with these settings differs from SciPy in the constants it finds.
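As a sanity check on the SciPy side of that comparison, the suggestion from earlier in the thread (does BFGS get stuck at the reported constants, or does it find better ones with enough steps?) can be sketched on synthetic data, since the reporter's dataset is private. The expression skeleton and initial constants below are taken from the report; the data-generating constants are made up:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in columns for the only features the expression uses.
x0_, x2_, x14_, x24_ = (rng.uniform(0.5, 2.0, n) for _ in range(4))
# Made-up ground-truth constants, deliberately far from the initial guess.
y = 70.0 * x2_ * (x14_ + 0.15 * x24_) / x0_**2 + rng.normal(0.0, 0.01, n)

def loss(p):
    a, b = p
    pred = a * x2_ * (x14_ + b * x24_) / x0_**2
    return np.mean((pred - y) ** 2)

p_init = np.array([52.16375, 0.2914053])  # constants from the reported formula

# Same start, two iteration budgets: PySR-like 8 steps vs. run to convergence.
few = minimize(loss, p_init, method="BFGS", options={"maxiter": 8})
many = minimize(loss, p_init, method="BFGS")
```

If `many.fun` is clearly below `few.fun` on a given problem, the iteration budget (rather than the line search) would be enough to explain a gap of the size the reporter observed.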
Currently I assume the difference comes from the optimization routine (cause 2). One possibility is that the default
Feature Request
Hi, thanks for developing the helpful tool.
I thought that constants in hall-of-fame formulas were always optimized. However, I found that this is not the case. This happens even after I set `optimize_probability=1.0`. For example, one resulting formula on my data was `52.16375*x2*(x14 + 0.2914053*x24)/x0**2`, loss=219.1645146, and it had been listed in the hall of fame for multiple iterations before the symbolic regression process finished. I wrote some code to optimize the constants `a, b` in `lambda x: a*x[2]*(x[14] + b*x[24])/x[0]**2` using BFGS in `scipy.optimize.minimize`, with initial position `[52.16375, 0.2914053]`, `maxiter = 8`, and the same loss function as the symbolic regression. I got a better set of constants: `68.0186345*x2*(x14 + 0.1651678*x24)/x0**2`, loss=217.6017091.
I think this is because of the age-based regularization. Although constants are optimized after each iteration (when `optimize_probability=1.0`), the hall-of-fame formula may have been removed from the population without being optimized. Hence, I suggest adding an option to re-optimize the constants of the hall-of-fame formulas after the symbolic regression process finishes. Although users could implement this themselves in a way similar to Discussion #255, I think it would be more convenient to incorporate this feature.