Replies: 9 comments 7 replies
-
Hi Nico, Thanks for reaching out! For beginner tutorials, you could try the online tutorial here: https://colab.research.google.com/github/MilesCranmer/PySR/blob/master/examples/pysr_demo.ipynb
There are no convergence tests available (sometimes the model might look like it has converged, but then find a new branch of the evolutionary tree and continue from there). However, there are some ways you can trigger an early stop; see the "Stopping Criteria" section of the API reference page: https://astroautomata.com/PySR/api/#stopping-criteria
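For instance, a sketch of how the stopping-criteria parameters can be combined (the exact values here are illustrative; see the API reference above for the full set of options): `early_stop_condition` accepts a numeric loss threshold or a Julia-syntax condition string, and `timeout_in_seconds` caps the total wall-clock time.

```python
from pysr import PySRRegressor

# Sketch: stop early once an equation with loss < 1e-6 and complexity < 10
# is found, or after one hour of search, whichever comes first.
model = PySRRegressor(
    early_stop_condition="stop_if(loss, complexity) = loss < 1e-6 && complexity < 10",
    timeout_in_seconds=60 * 60,
)
```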
Not by default, although you can set up your own selection strategies. After the search, the Pareto front is stored in model.equations_, which is a pandas DataFrame with columns for the loss, complexity, and the equation. For example, to implement AIC, you could do this as follows:
equations = model.equations_
import re
# Regex matching numeric constants inside the equation strings:
number_matching_pattern = r"(?<![a-zA-Z0-9_.])[+-]?(\d+\.\d+|\.\d+|\d+\.|\d+)(?:[eE][-+]?\d+)?"
# Count the number of constants in each equation:
equations["number_constants"] = [len(re.findall(number_matching_pattern, eq)) for eq in equations["equation"]]
# Compute the log-likelihood (for example):
equations["log_like"] = -equations["loss"] * len(X)
# Compute AIC:
equations["aic"] = 2 * equations["number_constants"] - 2 * equations["log_like"]
# Find the row with the best (lowest) AIC:
best_row = equations["aic"].argmin()
# Use it in different contexts:
model.sympy(index=best_row)       # SymPy version
model.latex(index=best_row)       # LaTeX version
model.predict(X, index=best_row)  # Make predictions with this equation on some data `X`
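As an aside, the small-sample corrected AICc just adds a penalty term to the AIC above, with k the number of constants and n the number of data points. A minimal helper in plain Python (the function name `aicc` is just for illustration):

```python
def aicc(aic: float, k: int, n: int) -> float:
    """Small-sample corrected AIC: AICc = AIC + 2k(k+1) / (n - k - 1)."""
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# For example, with AIC = 10.0, k = 2 constants and n = 40 observations:
print(aicc(10.0, 2, 40))  # 10.0 + 12/37 ≈ 10.32
```

The correction vanishes as n grows, so for large datasets AIC and AICc rank equations almost identically.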
-
Hi Miles,
Thanks!!
Wonder if you would know tutorials with real data using docker and PySR?
Ex: importing a file, selecting predictive and response var, setting up the
model, run it, selecting best equations, ...
Cheers!
Nico
-
I gave a talk+tutorial here: https://www.youtube.com/watch?v=q6tjKXmhiMs, although it also ventures into some deep learning material. The accompanying tutorial code is here: https://github.com/MilesCranmer/pysr_tutorial, which uses Docker.
-
Fantastic, I will give it a try ASAP!!
Thanks a lot for the help, highly appreciated!
Cheers,
Nico
On Fri, Apr 21, 2023, at 7:29 p.m., Miles Cranmer wrote:
> I don't know what predictive and response variables are, but I assume predictive = X and response = y.
-
Hi Miles,
I think I made it work properly! However, I wonder how to avoid the search space getting stuck. With Eureqa, I repeated each run 10x, then kept the best formula based on AICc.
I also wonder how to improve the processing time, taking into account that I might have ~1000 observations and 5000 predictive variables.
To start, I used a simpler dataset with only 40 predictive variables and the following parameters:
from pysr import PySRRegressor
default_pysr_params = dict(populations=80, model_selection="best")
model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "*", "-", "/"],
    unary_operators=["exp", "inv(x) = 1/x"],
    extra_sympy_mappings={"inv": lambda x: 1 / x},
    **default_pysr_params,
)
model.fit(X, Y)
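The repeated-runs workflow from Eureqa can be mimicked by pooling the Pareto fronts of several searches and keeping the best loss at each complexity. A sketch in plain Python (the helper name `pool_pareto_fronts` is hypothetical; here each front is a plain list of `(complexity, loss, equation)` tuples, which in practice would be extracted from a separate run's `model.equations_`):

```python
def pool_pareto_fronts(fronts):
    """Keep the lowest-loss equation at each complexity across several runs."""
    best = {}
    for front in fronts:
        for complexity, loss, equation in front:
            if complexity not in best or loss < best[complexity][0]:
                best[complexity] = (loss, equation)
    return sorted((c, l, e) for c, (l, e) in best.items())

# Two toy fronts standing in for two independent searches:
run_1 = [(1, 0.9, "x0"), (3, 0.5, "x0 + x1")]
run_2 = [(1, 0.8, "x1"), (5, 0.1, "x0 * x1 + x2")]
print(pool_pareto_fronts([run_1, run_2]))
# [(1, 0.8, 'x1'), (3, 0.5, 'x0 + x1'), (5, 0.1, 'x0 * x1 + x2')]
```

The pooled front can then be fed into the same AIC/AICc selection as before.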
Thanks for your time and advice!!
Cheers,
Nico
-
Hi Miles,
Thanks a lot! Awesome 👌
Is there a way to get R2 for each equation?
I got really cool results for my dataset that were confirmed by other analyses :)
Nico
On Mon, Apr 24, 2023, at 10:46 a.m., Miles Cranmer wrote:
> Maybe have a look at https://astroautomata.com/PySR/tuning/ as well as the API reference page? There are some other discussions on improving performance which might be useful too.
-
Hi Miles,
Thanks!! It was not clear whether PySR automatically splits the dataset into training and testing sets.
If not, I should generate the split myself and then use y_pred with my testing dataframe, which would have the same number of columns as X.
Cheers!!
Nico
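Generating the split beforehand is straightforward. A minimal sketch in plain Python (the helper name `train_test_split_indices` is just for illustration; scikit-learn's `train_test_split` does the same job):

```python
import random

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_fraction)
    return idx[n_test:], idx[:n_test]

train_idx, test_idx = train_test_split_indices(100, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 80 20
```

Only the training rows would then be passed to model.fit; the held-out rows are kept for evaluating the discovered equations.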
On Tue, Apr 25, 2023, at 7:48 a.m., Miles Cranmer wrote:
> You can do this with sklearn:
> import sklearn.metrics
> equation_index = 10  # choose an equation
> y_pred = model.predict(X, equation_index)
> r2 = sklearn.metrics.r2_score(y, y_pred)
-
Hi Miles,
Hope all's good!
I would have 2 questions for you:
1/ When I change from niterations=100 to 200, the processing time goes from ~20 min to >1 day... Is that normal?
2/ When I compare R2 (with the best equation based on score) on the training data vs. the testing data, I observe a huge difference (e.g. 90% vs 2%). I thought SR was robust to overfitting?
Cheers,
Nico
from pysr import PySRRegressor
default_pysr_params = dict(populations=100, model_selection="best")
model = PySRRegressor(
    niterations=100,
    binary_operators=["+", "*", "-", "/"],
    unary_operators=["exp", "inv(x) = 1/x"],
    extra_sympy_mappings={"inv": lambda x: 1 / x},
    **default_pysr_params,
)
On Tue, Apr 25, 2023, at 7:48 p.m., Miles Cranmer wrote:
> PySR does not do this split. You should only give your training data to PySR.
-
The beginner materials are super awesome! I have really enjoyed the tutorials so far. I am trying to understand how PySR can do multiple-equation regression, for example a 2D pendulum, a falling sliding ladder, etc. I may be misunderstanding, but there should be two equations of motion, x(t) and y(t), for example. Yet all the examples I find seem to only ever solve for one equation. Maybe I am not looking at the right examples? If these examples do not exist but there is some code somewhere showing how to implement them, I would be happy to write the tutorial to make a 2D code example.
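One possible route (a sketch, assuming PySR's scikit-learn-style multi-output support: if y has shape (n_samples, n_targets), a separate search is run per target column, so x(t) and y(t) can be stacked as two columns):

```python
import numpy as np
from pysr import PySRRegressor

t = np.linspace(0, 10, 100).reshape(-1, 1)                # input: time
y = np.column_stack([np.cos(t[:, 0]), np.sin(t[:, 0])])   # targets: x(t), y(t)

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "*"],
    unary_operators=["cos", "sin"],
)
# One search per target column; for multiple outputs, model.equations_
# should then hold one Pareto front per output.
model.fit(t, y)
```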
-
Hi Miles,
Hope all's good! I would love to test PySR despite my pretty poor skills in Python/Julia (I work much more in R, but I should definitely change that). I was a Eureqa user and I really loved the interface/tool. I also got pretty cool results with Eureqa that were confirmed later with new data.
I would like to test PySR with my dataset (taxa abundance and gene frequency) to predict toxicity. I am trying to follow the PySR docs with Docker, but I am not sure how to define the predictive variables and the response variable. Is there any tutorial (for beginners) that you know of?
I can see how to set up the model and the search space. Not sure if there is a convergence or any search parameters to tell the model to stop searching (a bit what there was with Eureqa and convergence)?
Is there a way to select (after the search) the simplest model based on AIC for example?
Sorry if these are very naive questions!
Thanks for your help,
Nico