-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallelize select_atoms by coordinates #134
Comments
Would you mind doing your test with the included test files from MDAnalysis.tests import datafiles as data
u = mda.Universe(data.GRO_MEMPROT, data.XTC_MEMPROT) so that we can easily check locally and also write a proper test? Thank you! |
Of course, it gives me this : Serial implementationu = mda.Universe(data.GRO_MEMPROT, data.XTC_MEMPROT)
atom_groups_through_frames = [get_atoms_below10(u.atoms) for trj in u.trajectory]
print(atom_groups_through_frames[0][0].position)
print(atom_groups_through_frames[4][5].position)
>> [ 9.83 48.320004 97.73001 ]
>> [13.18 46.05 48.47] PMDA implementationWith n_jobs > 1, I still have the RuntimeError about Universe hashes. parallel_groups = AnalysisFromFunction(get_atoms_below10, u, u.atoms).run(n_blocks = 1, n_jobs = 1)
atom_groups_through_frames = parallel_groups.results
print(atom_groups_through_frames[0][0].position)
print(atom_groups_through_frames[4][5].position)
>> [16.64 49.6 88.55]
>>[ 8.56 40.610004 48.300003] parallel_groups = AnalysisFromFunction(get_atoms_below10, u, u.atoms).run(n_blocks = 2, n_jobs = 1)
atom_groups_through_frames = parallel_groups.results
print(atom_groups_through_frames[0][0].position)
print(atom_groups_through_frames[4][5].position)
>> [16.760002 50.43 89.3 ]
>> [ 8.56 40.610004 48.300003] For n_blocks = 3 and n_blocks = 4, I have the following error :
And for n_blocks = 5, this one :
The first error with n_blocks doesn't occur if I use the NewAnalysis class instead of AnalysisFromFunction with _conclude redefinition. I also check the position values during _single_frame computation : class NewAnalysis(ParallelAnalysisBase):
def __init__(self, universe, atomgroup):
self._ag = atomgroup
super(NewAnalysis, self).__init__(universe,
self._ag)
def _single_frame(self, ts, agroups):
print(f"= single frame {ts.frame}")
selection = agroups[0].select_atoms(f"prop x >= 0 and prop x < 10")
print(selection[0].position)
return agroups[0].select_atoms(f"prop x >= 0 and prop x < 10")
def _conclude(self):
self.results = [frame for block in self._results for frame in block]
na = NewAnalysis(u, [u.atoms]).run(n_jobs=1, n_blocks = 3)
atom_groups_through_frames = na.results
for i in range(5):
print(f"= final frame {i}")
print(atom_groups_through_frames[i][0].position) Returns :
This behavior seems to appear as soon as _single_frame returns AtomGroup. |
For the "wrong position" error with n_jobs = 1, a solution for my specific problem is to return just atoms positions instead of all AtomGroup. I got the same kind of problem in my own pipeline with Pool, with AtomGroup saved inside objects and wrong positions when reading at the end. I guess it's something related with universe iteration and "not be at the good frame" when reading. Since I just need atoms positions I didn't check further. |
Thank you for the detailed error report. I agree that something is not right. I don't immediately see what the problem is, though. I currently don't have time to dig into this. Help is welcome. |
There's an another very interesting question (aside from how PMDA does wrong), which I guess we should make it clearer-- u = mda.Universe(data.GRO_MEMPROT, data.XTC_MEMPROT)
atom_groups_through_frames = [get_atoms_below10(u.atoms) for trj in u.trajectory] # the u.trajectory.ts will be reset to 0 after iteration.
print(atom_groups_through_frames[0][0].position)
print(atom_groups_through_frames[4][5].position)
>> [ 9.83 48.320004 97.73001 ]
>> [13.18 46.05 48.47] # not correct. It prints the position of the atom[5] in timestep[**0**]. The for i, trj in enumerate(u.trajectory):
print(atom_groups_through_frames[i][0].position) which I think is confusing. Maybe there's a better way to do it? I mean only returning the position is ofc one option. EDIT: Another interesting thing related to # how `AtomGroup` behaves with `numpy` funcs:
>>> atomgroup_list = [u.atoms]
>>> atomgroup_array = np.asarray(atomgroup_list) # such conversion will happen any numpy funcs.
>>> atomgroup_array
array([[<Atom 1: N of type N of resname TYR, resid 7 and segid SYSTEM>,
<Atom 2: H1 of type H of resname TYR, resid 7 and segid SYSTEM>,
<Atom 3: H2 of type H of resname TYR, resid 7 and segid SYSTEM>,
...,
<Atom 43478: H16X of type H of resname POPG, resid 572 and segid SYSTEM>,
<Atom 43479: H16Y of type H of resname POPG, resid 572 and segid SYSTEM>,
<Atom 43480: H16Z of type H of resname POPG, resid 572 and segid SYSTEM>]],
dtype=object) I would expect it to be |
Thanks you're right I always forget this AtomGroup/Universe link. Here |
As a temporary fix, I found my similar issue was resolved by rolling back to dask=2.9.2 |
Hello,
To simplify my problem, I want to select atoms based on coordinates through all frames. I tried to do this with pmda but it doesn't work with n_jobs > 1 and it gives results for n_jobs = 1 but atoms positions aren't the expected ones.
Expected behaviour
The expected behaviour is this one, provided by the serial implementation.
Gives me this :
Actual behaviour
Here is the code I tested with pmda
I got the following error
If I try with n_jobs = 1 :
I got results but not as expected :
I noticed that if we change the number of blocks, results change too. I think all atoms at position i in the same block end with the same coordinates during _reduce step, or something like that.
Finally, I managed to do my parallel computation without pmda with python multiprocessing.Pool, by slicing the trajectories into n chunks, with one copy of the universe per chunk.
Currently version of MDAnalysis:
mda version : 1.0.0
pmda version : 0.3.0+17.g13fa3b5
dask version : 2.21.0
The text was updated successfully, but these errors were encountered: