
randomstats not cleaning up #67

Open
brentp opened this issue Nov 30, 2012 · 9 comments
@brentp
Contributor

brentp commented Nov 30, 2012

I don't know why this is happening: a look at the code shows that it is calling close_or_delete, but randomstats is leaving a ton of pybedtools.tmp* files in my tmp dir, and calling cleanup() does not remove them.
Perhaps what's getting sent to close_or_delete is a filehandle?
I've tried calling randomstats with an object and with object.fn, and it never cleans up the files.

My call looks like this:

        res = bed.randomstats(loh.fn, 100, processes=25)
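A quick way to confirm the leak is to count the leftover files before and after the call. A sketch (the glob pattern and temp-dir location are assumptions about where pybedtools writes its temp files):

```python
import glob
import os
import tempfile

def count_tmpfiles(tmpdir=None, pattern="pybedtools.*"):
    """Count files matching pybedtools' temp-file naming in tmpdir."""
    tmpdir = tmpdir or tempfile.gettempdir()
    return len(glob.glob(os.path.join(tmpdir, pattern)))

before = count_tmpfiles()
# res = bed.randomstats(loh.fn, 100, processes=25)  # the leaking call
after = count_tmpfiles()
# after - before should be 0 once cleanup works; here it keeps growing
```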
@daler
Owner

daler commented Nov 30, 2012

A while ago I did a major overhaul of the randomization code, implementing a new method (BedTool._randomintersection rather than BedTool.randomintersection) that fixed this.

Looks like I never made this method the default for BedTool.randomstats().

To use the new method, you can specify new=True and provide a genome_fn to BedTool.randomstats. To see the difference (both in syntax and cluttering of the temp dir), check out test/prevent_open_file_regression.

So for your example, this should do the trick:

gfn = pybedtools.chromsizes_to_file(pybedtools.chromsizes('hg19'))
res = bed.randomstats(loh.fn, 100, processes=25, new=True, genome_fn=gfn)

(Side note: if you take a look at the leftover temp files, I think they should all be genome files.)

@brentp
Contributor Author

brentp commented Dec 4, 2012

That does the trick. Can genome_fn be made a required argument to avoid this?

@daler
Owner

daler commented Dec 4, 2012

Yeah, that's probably best. I still need to do a little more cleaning up and "officially" deprecate the old randomstats method; when that happens the genome_fn will be required.

@brentp
Contributor Author

brentp commented Dec 4, 2012

Got it.

Would you consider adding an _orig_pool kwarg to random_op? It'd be nice to be able to keep re-using a pool when I'm running this across multiple pairs of BED files.

@daler
Owner

daler commented Dec 4, 2012

Sure.

Implementation-wise, would you rather create your own pool and use it for various parallel calls like

mypool = multiprocessing.Pool(25)
bt.randomstats(_orig_pool=mypool, *args, **kwargs)
bt.random_op(_orig_pool=mypool, *args, **kwargs)
bt.random_jaccard(_orig_pool=mypool, *args, **kwargs)

or have a BedTool._pool instance variable that, if None, will initialize with n processes, but subsequent calls (when _orig_pool=True) re-use that auto-created one?

# initializes a pool, BedTool._pool = multiprocessing.Pool(25)
bt.randomstats(_orig_pool=True, processes=25, *args, **kwargs)

# subsequent calls re-use BedTool._pool
bt.randomstats(_orig_pool=True, processes=25, *args, **kwargs)

# set to None to re-initialize w/ different nprocs
bt._pool = None
bt.randomstats(_orig_pool=True, processes=500, *args, **kwargs)
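The second option could be sketched like this (illustrative names only, not actual pybedtools code; a ThreadPool stands in for multiprocessing.Pool just to keep the snippet import-safe):

```python
from multiprocessing.pool import ThreadPool  # stand-in for multiprocessing.Pool

class BedToolSketch:
    """Hypothetical stand-in for BedTool, showing the shared-pool pattern."""
    def __init__(self):
        self._pool = None

    def _get_pool(self, orig_pool, processes):
        if not orig_pool:
            # one-shot pool; the caller is responsible for closing it
            return ThreadPool(processes)
        if self._pool is None:
            # first call with _orig_pool=True lazily creates the shared pool
            self._pool = ThreadPool(processes)
        return self._pool

bt = BedToolSketch()
p1 = bt._get_pool(True, 2)
p2 = bt._get_pool(True, 2)   # re-uses the same pool; `processes` is ignored
bt._pool = None              # reset to re-initialize with a different size
p1.close(); p1.join()
```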

@brentp
Contributor Author

brentp commented Dec 4, 2012

I much prefer the former.

@brentp
Contributor Author

brentp commented Dec 4, 2012

Sorry for putting this in this thread, but it's another open-file error. If I stream, it must be leaving the process open?

from pybedtools import BedTool

a = BedTool('chr1 1 2', from_string=True)
b = BedTool('chr1 1 2', from_string=True)

for i in range(10000):
    print i
    c = a.intersect(b, stream=True)

is that expected to leak?

@daler
Owner

daler commented Dec 5, 2012

In this case, I think the answer is yes:

The way streaming BedTools are closed is by hitting a StopIteration (see cbedtools.IntervalIterator). Since c in this example is never iterated over, it never gets a chance to raise StopIteration and close the stream.

But it would be nice if the garbage collector saw that the streaming BedTool from iteration i-1 no longer has any references and cleaned it up (would a __del__ method be called then?). This starts to get into the reference-counting part of Python & Cython that I don't have a handle on yet. Any ideas?
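One possibility, in modern Python at least, is `weakref.finalize` (added in 3.4), which fires when the wrapper is collected and tends to be more predictable than `__del__`. A plain-Python sketch, not pybedtools code (`StreamingResult` is a hypothetical stand-in):

```python
import io
import weakref

class StreamingResult:
    """Hypothetical wrapper around a streaming result."""
    def __init__(self, stream):
        self.stream = stream
        # fires when this object is garbage-collected, even if the
        # iterator was never exhausted
        self._finalizer = weakref.finalize(self, stream.close)

    def __iter__(self):
        return iter(self.stream)

s = io.StringIO("chr1\t1\t2\n")
r = StreamingResult(s)
del r            # last reference dropped; CPython collects immediately
print(s.closed)  # → True
```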

@brentp
Contributor Author

brentp commented Dec 5, 2012

I tried a number of things, including __del__, but can't get it to work; it doesn't collect them until the program terminates...
Streaming over the results does prevent the error in this case.
I'm getting another open-file-handles error that I haven't been able to create a small test case for.
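The middle point above ("streaming over the results does prevent the error") works because exhausting the iterator is what lets the stream close. A self-contained analogue of the mechanism (plain Python, not pybedtools):

```python
import io

def stream_lines(f):
    # control only reaches the close() below when the consumer exhausts
    # the generator (StopIteration), mirroring the streaming iterator
    for line in f:
        yield line
    f.close()

f = io.StringIO("chr1\t1\t2\n")
it = stream_lines(f)
assert not f.closed        # an un-consumed stream stays open: the leak
lines = list(it)           # draining it reaches the close()
print(f.closed)            # → True
```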
