alchemical analysis features

alchemical-analysis announced in late 2017 that its features would be moved to alchemlyb. This is going to be a long and tedious process, but it will have the following advantages:

tested (all code coming into alchemlyb is tested to > 90% coverage, with 95% as a goal)
modular (functionality as library functions)
Python 3 (and Python 2)

User input

The alchemlyb team welcomes user input: please raise an issue in the Issue Tracker for any alchemical-analysis features that you would really like to have in alchemlyb.

Please also feel free to edit this wiki page and contribute to the discussion.

Desired alchemical-analysis features

Please describe the feature and make a case for why you want it included. Add your name/GitHub handle; feel free to add yourself to any existing entries, too. Popular features are more likely to be migrated. (See issue #54 to discuss the process.)

Possibility to access the statistical inefficiencies

alchemical-analysis reports the statistical inefficiencies of the input datasets. It would be nice to have this information accessible when using the alchemlyb implementation as well.
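To make concrete what quantity is being requested, here is a minimal sketch of a statistical inefficiency estimate, g = 1 + 2 Σ_t (1 − t/N) C(t), where C(t) is the normalized autocorrelation function of the series. It is a simplified illustration using only the Python standard library. The function name `estimate_statistical_inefficiency` is hypothetical, and the truncation rule (stop at the first non-positive C(t)) is an assumption made for this sketch. It is not the code used by alchemical-analysis or alchemlyb.

```python
def estimate_statistical_inefficiency(series):
    """Estimate g = 1 + 2 * sum_t (1 - t/N) * C(t) for a 1-D time series.

    Hypothetical helper for illustration only. The sum is truncated at the
    first lag where the normalized autocorrelation C(t) is not positive.
    """
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev) / n
    if var == 0.0:
        return 1.0  # a constant series carries no measurable correlation

    g = 1.0
    for t in range(1, n - 1):
        # Normalized autocorrelation at lag t.
        c = sum(dev[i] * dev[i + t] for i in range(n - t)) / ((n - t) * var)
        if c <= 0.0:
            break  # assumed truncation rule for this sketch
        g += 2.0 * c * (1.0 - t / n)
    return max(g, 1.0)
```

A value of g close to 1 means the samples are nearly independent; roughly N/g samples are effectively uncorrelated.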

Uncorrelation threshold

In alchemical-analysis it is possible to specify a threshold for the number of samples to keep in the uncorrelation process. This is currently not possible in alchemlyb.
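One plausible reading of such a threshold is sketched below: subsample with a stride derived from the statistical inefficiency g, but fall back to the full (correlated) data if too few samples would remain. The function name `subsample_with_threshold`, the fallback behaviour, and the default of 50 are assumptions for illustration, not a description of either package.

```python
import math


def subsample_with_threshold(samples, g, threshold=50):
    """Subsample every ceil(g)-th sample, unless too few would remain.

    Hypothetical helper: `g` is a statistical inefficiency (>= 1) and
    `threshold` is the minimum number of samples we insist on keeping.
    """
    stride = max(1, math.ceil(g))
    uncorrelated = samples[::stride]
    if len(uncorrelated) < threshold:
        # Assumed behaviour: keep all correlated samples rather than
        # continue with a very small uncorrelated set.
        return list(samples)
    return list(uncorrelated)
```

For example, 1000 samples with g = 4.2 are reduced to 200, while the same data with g = 40 would leave only 25 samples and is therefore returned unchanged.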

Implementation of the dHdl, dHdl_all, dE uncorrelation methods

In principle, these methods determine the series that is used in the statistical_inefficiency() function. They should take the datasets as an argument and return a series that can be used for the autocorrelation analysis. Reproducing alchemical-analysis calculations is much easier when these methods are implemented.

More estimators

@orbeckst

-m METHODS, --methods=METHODS
missing estimators
- ~~TI_CUBIC #21~~ LOW PRIORITY (closed)
- DEXP #22
- IEXP #23
- GINS #24
- GDEL #25
- ~~UBAR #26~~ LOW PRIORITY (closed)
- ~~RBAR #27~~ NOT REALLY NEEDED (closed)
- ~~BAR #28~~ DONE in PR #60
Are some of these estimators more important than others? [See @mrshirts remarks below; DLM agrees.]

@mrshirts

BAR is high priority, because sometimes MBAR can't be done if we don't have energies at all i+1's. The BAR solution can be made very fast (significantly faster than MBAR), as it is in the pyMBAR package. DONE in PR #60
DEXP and IEXP are single state perturbation. Worth including for comparison. There are two because there are essentially two ways to calculate if you have a series of lambda points.
TI-CUBIC is essentially a higher order integration of the <dH/dl> using cubic splines. Experience (non-exhaustive) has shown that it's not really much better than TI and has a larger chance of failing because of locally high curvature. I think this is lower priority, especially since it's a pain to handle the uncertainties correctly in the code. There could easily be better integration formulas. If the points are equally spaced, one could use Simpson's rule or Romberg integration, but there doesn't appear to be a general integration algorithm that works well for predefined spacing (as opposed to adaptive spacing). So could be cut. [DLM: Agree; propose cutting.]
GINS and GDEL are the Gaussian approximations to insertion and deletion FEP. We included them because people kept saying that the Gaussian versions worked, and they really only work for linear problems (charging, etc), and we had to have a testbed to show them. Low overhead to put in. [DLM: So low priority (not much value) but worth including.]
UBAR is BAR without optimizing the constant. The only reason one would ever do this is because you don't want to maintain a history to adaptively update everything each iteration, which would only happen if you were running this adaptively, i.e. maintaining the accumulated averages (O(1) operation) each step, so you have a cheap estimate each step without running a nonlinear optimization. BUT not very accurate in most cases. [DLM: Thus I propose we not do this unless someone needs it for something.]
RBAR is interesting, since you calculate the UBAR for a series of 'trial' free energies, and choose the one that best satisfies the equations. One can get a very accurate answer with no iteration each step if you know the range to start out with. PROBABLY not worth supporting, since one is not going to be using alchemlyb adaptively, in the sense that you would need to keep K sets of averages around in between alchemlyb runs. If one were implementing a code where it was tightly integrated, it could be very useful, but likely not in post-analysis code. [DLM: Propose skipping.]
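To illustrate the "two ways to calculate" mentioned for DEXP and IEXP, here is a minimal sketch of exponential averaging between two neighbouring lambda states, in reduced (kT) units. The forward estimate uses work values sampled in state i, and the reverse estimate uses work values sampled in state i+1 with the sign flipped. The function names are hypothetical, and associating the forward form with DEXP and the reverse form with IEXP is an assumption here; no uncertainty estimate is included.

```python
import math


def exp_estimate(work):
    """Return -ln < exp(-w) > for reduced work values `work`.

    The minimum is subtracted before exponentiating to avoid overflow.
    """
    n = len(work)
    shift = min(work)
    total = sum(math.exp(-(w - shift)) for w in work)
    return shift - math.log(total / n)


def forward_and_reverse(w_forward, w_reverse):
    """Free energy difference i -> i+1 from both perturbation directions.

    w_forward: u_{i+1}(x) - u_i(x) for samples x drawn from state i
    w_reverse: u_i(x) - u_{i+1}(x) for samples x drawn from state i+1
    """
    df_forward = exp_estimate(w_forward)
    df_reverse = -exp_estimate(w_reverse)
    return df_forward, df_reverse
```

With well-converged data the two numbers should agree; a large gap between them is a sign of poor overlap between the neighbouring states.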

Overlap matrix

Added in PR #107

@orbeckst

-w, --overlap Print out and plot the overlap matrix.
unique functionality, quite useful in visual analysis of the data quality

@mrshirts: yes, very useful. [@davidlmobley: agree] Very easy to implement once MBAR has been called, but it requires MBAR to be called first. How would that dependency be enforced? Try a call to see if the object exists, generate if it doesn't? [@davidlmobley: Implementation detail; can deal with separately.]

Other things available

@davidlmobley

Graphics: Visualization of TI/BAR free energy estimates (as a function of lambda and as a function of time); convergence graphs; ~~visualization of overlap matrix~~ (e.g. DOI 10.1007/s10822-015-9840-9)
Graphical visualization for comparison of forward and reverse estimates of free energies (see also #104)
Breakout of individual components of free energy (if not already available; not clear to me without trying)
Graphical cross-comparison of analysis techniques when multiple techniques are applied, e.g. DOI 10.1007/s10822-015-9840-9 figure 4.
Easy consistency checking across techniques (one of the very valuable things about running multiple analysis techniques is that they often agree, except when there is a problem)

Existing features

The following features already exist

MBAR and TI estimators
subsampling (with preprocessing.subsampling.statistical_inefficiency()) (Does this correspond to the -n UNCORR, --uncorr=UNCORR feature?)
discarding of initial time (-s EQUILTIME, --skiptime=EQUILTIME) and more flexible slicing with preprocessing.subsampling.slicing()
Extract the energy data from the backward direction (-e, --backward) can be done with preprocessing.subsampling.slicing() (... I think ... check!) [@davidlmobley: We would want to make sure it's obvious how to do this, and how to graphically visualize.]
overlap matrix and its visualization (since PR #107)

Features in alchemlyb but not in alchemical-analysis

The following features only exist in alchemlyb

equilibrium detection with preprocessing.subsampling.equilibrium_detection()

Features considered for alchemical-analysis that should go in alchemlyb

@mrshirts:

Estimation of uncertainties and covariances by bootstrapping. Very useful to diagnose if things go wrong in the error estimates, generally more reliable error estimates in regime of low sampling.
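As a rough illustration of the bootstrapping idea, the sketch below resamples the data with replacement, re-runs an estimator on each resample, and reports the spread of the results as the uncertainty. It uses only the Python standard library. `bootstrap_uncertainty` and its parameters are hypothetical names, the `estimator` argument stands in for any free energy estimator, and the input samples are assumed to be uncorrelated already.

```python
import random
import statistics


def bootstrap_uncertainty(samples, estimator, n_boot=200, seed=0):
    """Standard deviation of `estimator` over bootstrap resamples.

    Hypothetical helper: `samples` is a list of (assumed uncorrelated)
    data points and `estimator` maps such a list to a single number.
    """
    rng = random.Random(seed)  # fixed seed for reproducible results
    n = len(samples)
    estimates = []
    for _ in range(n_boot):
        # Draw n points with replacement from the original data.
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)
```

For example, `bootstrap_uncertainty(data, statistics.mean)` should come out close to the usual standard error of the mean. Covariances between several estimated quantities could be obtained in the same way by collecting all estimates from each resample.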
