
AMBER2022 - DV/DL=0.000 with lambda scheduling #242

Open
DrDomenicoMarson opened this issue Sep 26, 2022 · 10 comments

@DrDomenicoMarson
Contributor

With the new AMBER2022 code, if the user adopts the new lambda scheduling, the DV/DL is supposed to be exactly 0.00000 at lambda==1.0 and lambda==0.0.

As far as I understand, a user will still want to run simulations at lambda==0 and lambda==1 to collect u_nk data, but our parsers and workflows currently have problems dealing with dHdl extraction/analysis at those windows.

a) The first problem is in the parser: AMBER doesn't output a DV/DL value when it is exactly 0.0, so files from simulations performed at lambda 0.0 and 1.0 will contain MBAR data but no DV/DL data.

Regarding this point, I already asked on the AMBER mailing list whether they could change this behavior and make AMBER print DV/DL values even when they are exactly 0.0, whenever a TI simulation is performed.
Otherwise, a solution could be to carefully check whether we are in a simulation with lambda scheduling at lambda==0.0 or 1.0 and, if so, add 0.0 as the dHdl value each time MBAR data is found.
The catch is that this check may not be simple: the assumption holds only when the standard lambda scheduling is used, and a custom schedule can produce different cases. Then again, advanced lambda scheduling should be used only by advanced users, so perhaps just emitting a warning stating that we are adding 0.0 to dHdl would be enough.

b) If we collect a Series of 0.000000 values at lambda==0 or 1, common workflows that use subsampling/equilibrium detection will break: when run on a series of all 0.0, those functions see a standard deviation of 0.0 and an exception is raised.
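One way around problem b) would be a small guard in front of whatever decorrelation routine the workflow uses. This is a sketch, not alchemlyb API: `safe_subsample` and its `subsample` argument are hypothetical names.

```python
import numpy as np
import pandas as pd

def safe_subsample(dhdl: pd.Series, subsample):
    """Hypothetical guard: skip decorrelation for a constant series.

    `subsample` stands for whatever decorrelation routine the workflow
    normally applies (e.g. a wrapper around a statistical-inefficiency
    estimator, which divides by the series variance).
    """
    if np.isclose(dhdl.std(), 0.0):
        # An all-zero endpoint series has zero variance, so
        # statistical-inefficiency estimators would blow up on it;
        # there is nothing to decorrelate, return it unchanged.
        return dhdl
    return subsample(dhdl)
```

The same check could equally be used to emit a warning, or to drop the endpoint windows from the subsampling step entirely.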

I ran into problems a)+b) just now and "resolved" them temporarily in a quick and very inelegant way: instead of 0.0, I add tiny random values (randint(-1000, 1000)/100000000) to dHdl at lambda==0 or 1 whenever MBAR data is collected. That way equilibrium detection and subsampling don't complain, and my current workflow produces results (results which, by the way, are really nice with this new method :-) ).
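For concreteness, the jitter hack described above looks roughly like this (a throwaway sketch of my workaround, not anything in alchemlyb):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2022)

def jittered_zeros(n: int) -> pd.Series:
    # Quick-and-dirty stand-in for the missing endpoint DV/DL values:
    # tiny random numbers instead of exact zeros, so equilibrium
    # detection / subsampling never see a zero standard deviation.
    return pd.Series(rng.integers(-1000, 1000, size=n) / 1e8)

dhdl_endpoint = jittered_zeros(500)
assert dhdl_endpoint.std() > 0.0           # subsampling no longer breaks
assert dhdl_endpoint.abs().max() <= 1e-5   # negligible effect on the TI integral
```

The magnitudes (at most 1e-5) are far below any meaningful dHdl signal, so the TI integral is effectively unchanged, but it is still a hack rather than a fix.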

Problems a) and b) (especially b)) are an issue only if we want to maintain compatibility with workflows users may already have implemented. A user who is aware of the edge cases at lambda==0 or 1 could easily adapt their workflow to simply skip equilibrium detection/subsampling for dHdl at these windows.

I don't know what the best route is at this point. It's not urgent, since we should also wait to see whether AMBER will print DV/DL==0 values in a future patch, but I wanted to draw attention to this now so we can at least discuss case b).

@DrDomenicoMarson
Contributor Author

A quick update: it seems the AMBER developers are interested in having AMBER output DV/DL data regardless of its value, so our parsers should work as they are.

If that turns out to be the case, we will only have to address the problem of Series of exactly-zero values being passed to the preprocessor, which should be much easier, I think!

@orbeckst
Member

Perhaps failing if we detect a problematic series?

Is there a way to detect an AMBER2022 output file and then fail? (possibly with a link to this issue?)
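A detection heuristic could be quite small. This is only a sketch of the idea, not the actual alchemlyb parser logic: the markers (`icfe` echoed in the output, `DV/DL` records) are assumptions about what a TI output file contains and would need to be verified against real AMBER22 files.

```python
def check_ti_output(lines):
    """Fail when a file looks like a TI run but carries no DV/DL records.

    `lines` is the text of an AMBER output file, one string per line.
    Marker strings are assumptions, not verified parser logic.
    """
    ti_requested = any("icfe" in line for line in lines)
    has_dvdl = any("DV/DL" in line for line in lines)
    if ti_requested and not has_dvdl:
        raise NotImplementedError(
            "TI output without DV/DL records (AMBER22 endpoint window?); "
            "see https://github.com/alchemistry/alchemlyb/issues/242"
        )
```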

@orbeckst orbeckst added parsers AMBER MD engine labels Oct 25, 2022
@orbeckst
Member

> With the new AMBER2022 code, if the user adopts the new lambda scheduling, the DV/DL is supposed to be exactly 0.00000 at lambda==1.0 and lambda==0.0.

I am not familiar with how AMBER does things. Can you explain why $\partial V/\partial\lambda$ should be exactly 0 at the end points? At least in my simple understanding of TI, this derivative is not guaranteed to be 0. For the trivial example of $H(\lambda) = (1-\lambda) H_0 + \lambda H_1$, the derivative is $\partial V/\partial\lambda = H_1 - H_0$ and that is pretty much guaranteed to be different if $H_0 \neq H_1$.

@DrDomenicoMarson DrDomenicoMarson mentioned this issue Oct 25, 2022
6 tasks
@DrDomenicoMarson
Contributor Author

The reason is the implementation of "smoothstep functions" and "lambda scheduling" in the recent version of AMBER (AMBER22). These are further changes to the soft-core potentials that should give smooth, easier-to-integrate curves. I can confirm their effectiveness in producing better curves, as I tried this method (with the default options) on a tricky system I was working on.

The best thing I can do is paste here some extracts from the relevant part of the Amber manual, but the take-home message (confirmed by the developer of these new functionalities, Taisung Lee) is that:

  • DV/DL values can be exactly 0.0 and, as such, are not printed in the output file
  • predicting where (in lambda-space) DV/DL is expected to be 0.0 is not trivial, as it depends on the kind of lambda scheduling requested by the user
  • with the default lambda scheduling, DV/DL is exactly 0.0 at the endpoints (I can certainly confirm that).
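The zero endpoint derivative follows directly from the shape of a smoothstep weighting function. A minimal illustration with the classic cubic smoothstep (the general idea, not necessarily the exact polynomial or schedule AMBER22 uses):

```python
def smoothstep(lam: float) -> float:
    # classic cubic smoothstep: s(0) = 0, s(1) = 1, s'(0) = s'(1) = 0
    return 3.0 * lam**2 - 2.0 * lam**3

def dsmoothstep(lam: float) -> float:
    # derivative 6*lam*(1 - lam) vanishes at both endpoints
    return 6.0 * lam * (1.0 - lam)

# An energy term mixed as V(lam) = (1 - s(lam))*V0 + s(lam)*V1 contributes
# ds/dlam * (V1 - V0) to DV/DL, which is exactly 0 at lam = 0 and lam = 1
# regardless of V0 and V1.
assert dsmoothstep(0.0) == 0.0 and dsmoothstep(1.0) == 0.0
assert smoothstep(0.0) == 0.0 and smoothstep(1.0) == 1.0
```

This is why, unlike linear mixing where dV/dlambda = V1 - V0 everywhere, the endpoint windows with smoothstep scheduling genuinely carry DV/DL == 0.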

In my opinion, it is bad behavior for the AMBER code not to print the DV/DL value when it's exactly 0.0 while TI/MBAR is running; if I'm running a TI calculation, I expect the code to print a DV/DL value. Looking into the code, the logic for whether to print the value seems to have been written in a "lazy" way; translated to pseudocode, it is simply
if dv_dl: add_dv_dl_to_output
relying on the fact that dv_dl != 0.0 when TI is run, whereas it should be
if running_TI: add_dv_dl_to_output
I asked the AMBER developers whether they could make this change in a bugfix; the answer was that it might be possible, but they were a little vague, so I think we should not rely on it.

@DrDomenicoMarson
Contributor Author

[Two screenshots of the AMBER manual sections describing the smoothstep functions and lambda scheduling]

@orbeckst
Member

This looks like an extension of the capabilities of the AMBER parser. That doesn't need to hold up a 1.0 release.

@DrDomenicoMarson
Contributor Author

I was trying to figure out what the use-case scenario could be.
In the "worst" case I have run multiple simulations for each lambda value (to comply, say, with the time limits imposed by HPC centers). So:

  • I read and concatenate all the output files for each lambda
  • I perform equilibrium detection / subsampling for each lambda
  • I concatenate the resulting dfs and fit the TI estimator

The safest thing we could do is probably just warn the user when an AMBER output file is detected that has TI/MBAR turned on but contains no DV/DL values (instead of raising an error, as happens right now).
We would then prompt the user to create by hand (after properly subsampling the other datasets) a df with just one row,
('time, lambda, 0.0'), to be concatenated with the others before continuing with the fitting of the estimator.
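The hand-made one-row df could look like this. The index-level names and column name here are assumptions about what the AMBER parser produces; a user would copy them from the dfs parsed from their other windows (and set the same attrs, as noted below):

```python
import pandas as pd

# Hypothetical single-row dHdl frame for an endpoint window at lambda = 0.0,
# sampled at time = 0.0 with DV/DL = 0.0
endpoint = pd.DataFrame(
    {"dHdl": [0.0]},
    index=pd.MultiIndex.from_tuples([(0.0, 0.0)], names=["time", "lambdas"]),
)

# then, with the subsampled dfs from the other windows:
# dhdl = pd.concat([endpoint, *other_subsampled_dfs])
```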

This is the safest approach for us, as it doesn't change the API and requires little effort on our part. The main problems I see are:

  • the user would also have to create the attrs for the new dfs correctly, consistent with the dfs obtained by parsing the output files
  • I'm not sure some of the other functions (like forward_backward_convergence and plot_convergence) will work with such dfs, since the lambdas with no DV/DL values will have just one 0.0 element.

Maybe I'm missing something, what do you think about this scenario?

@orbeckst
Member

Initially I'd just fail with "NotImplementedError" and a link to this issue.

Then we can figure out a way to handle the situation. Start with an example file and demonstrate that the work-around works. Once this piece of documentation exists, we could add a kwarg that switches from fail to warn and allows advanced users to manually change the df. Once there's a workaround, we have time to figure out a more robust solution.
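The fail-by-default, warn-on-request behavior proposed here could be as simple as the following sketch (the function and kwarg names are hypothetical, not existing alchemlyb API):

```python
import warnings

def on_missing_dvdl(strict: bool = True) -> None:
    """Fail by default when a TI output has no DV/DL records; advanced
    users can downgrade to a warning and patch the dataframe by hand."""
    msg = (
        "Found a TI output file without DV/DL records; "
        "see https://github.com/alchemistry/alchemlyb/issues/242"
    )
    if strict:
        raise NotImplementedError(msg)
    warnings.warn(msg)
```

The kwarg keeps the safe behavior as the default while leaving an escape hatch for users who have read the documented workaround.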

My opinion is (learned over many years) that in general, it's better not to guess or try to be too smart unless you really know that you can deal with everything coming your way. "In the face of uncertainty, resist the temptation to guess." is a good motto for robust libraries.

@DrDomenicoMarson
Contributor Author

Sorry for the late reply! Yes, I like this approach; I'll work on some examples as soon as I have some free time!

@xiki-tempula
Collaborator

I don't know much about AMBER, but I don't think subsampling/equilibrium detection will be an issue.
The reason for subsampling is that the estimator needs decorrelated data; with lambda scheduling, the endpoint dHdl data are constant, so there is no correlation structure to remove. Equilibrium detection is trickier, but pymbar's equilibrium detection works by maximizing the number of decorrelated samples, so it would not work on such data anyway.
