Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BedTool.sequence hangs on a BedTool made with generator? #77

Closed
yarden opened this issue Mar 11, 2013 · 3 comments
Closed

BedTool.sequence hangs on a BedTool made with generator? #77

yarden opened this issue Mar 11, 2013 · 3 comments

Comments

@yarden
Copy link

yarden commented Mar 11, 2013

I am following Ryan's template for creating a BedTool on the fly to avoid temporary files (as outlined here #55). I'm parsing through a BedTool corresponding to a BED file a. As I iterate through the file, I'm creating a new BedTool b using create_interval_from_list. The new BedTool b represents some shuffles of each BED interval in a. Once the new BedTool b is created, I call b.sequence() to generate a FASTA from it.

I found that the sequence() call hangs after outputting a large portion of the FASTA, probably because of some issue with the way I'm making the iterator. Strangely, if I call len(b) before b.sequence(...), then the problem goes away. I've looked at bedtool.py and noticed that __len__() calls self.count(), which is apparently what resolves the iterator. Any idea what might be causing this halting condition? Do I need to raise a StopIteration when feeding a generator to BedTool?

Here is my code for making the generator, it's extremely simple:

def get_shuffled_clusters_intervals(clusters,
                                    gene_ids_to_coords,
                                    num_shuffles):
    # 'clusters' is a BedTool
    for cluster_bed in clusters:
        ...
        # Make some intervals related to the current one
        intervals = \
            bedtools_utils.sample_intervals(...)
        if intervals is None:
            continue
        # Record these as BED intervals
        for shuffle_num, curr_interval in enumerate(intervals):
            ...
            curr_interval_bed = \
                pybedtools.create_interval_from_list([cluster_bed.chrom,
                                                      str(curr_interval[0]),
                                                      str(curr_interval[1]),
                                                      interval_name,
                                                      ".",
                                                      str(cluster_bed.strand)])
            yield curr_interval_bed
    return

Now I create a BedTool using this function and call sequence():

    sampled_clusters_intervals = \
        get_shuffled_clusters_intervals(clusters,
                                        gene_ids_to_coords,
                                        num_shuffles)
    # Make BEDTool out of cluster intervals
    sampled_clusters_bed = pybedtools.BedTool(sampled_clusters_intervals)
    # KEY LINE: without len(sampled_clusters_bed), this hangs
    num_sampled_clusters = len(sampled_clusters_bed)
    sampled_clusters_bed.sequence(fi=genome_seq_fname,
                                                 fo=sampled_clusters_fname,
                                                 name=True,
                                                 s=True)

Any thoughts on what might cause sequence to hang? The BedTool seems to be made fine from the generator, it's the sequence line that actually makes use of the generator that hangs. Thanks!

Update: further confirmation that it's a problem somewhere in the sequence() call or downstream of it - if I take the BedTool that I made using the generator (sampled_clusters_bed) and write it to a .bed first:

with open("test.bed", "w") as file_out:
  for bed in sampled_clusters_bed: file_out.write(str(bed))

And then load test.bed as a BedTool instance, and call that object's .sequence() method, then it all works. So it must be something about how sequence() parses the generator it seems.

@brentp
Copy link
Contributor

brentp commented Mar 11, 2013

calling len() will cause the iterator to exhaust itself and then be empty by the time it gets to the .sequence() call.

if you really want to iterate over it multiple times, then use:

sampled_cluster_intervals = list(get_shuffled_clusters_intervals(...))

maybe if you used "raise StopIteration", the error would be more helpful.

@yarden
Copy link
Author

yarden commented Mar 12, 2013

Sure about len, didn't meant to imply that it fixes the issue, but I just found it surprising that len can exhaust the iterator (thus ending in an empty sequence output as you write) whereas sequence cannot. I definitely don't want to load it in to memory so I avoided list - there's no need for me to iterate through it multiple times, it's only made into a BedTool as a generator so that it can be written to a file line by line.

I tried using raise StopIteration at the end of the function but it gets stuck prior to that. Where were you thinking of placing it? Thanks.

@daler
Copy link
Owner

daler commented Mar 13, 2013

I think this boils down to a question about whether .sequence() can handle an iterator. Here's a standalone example using the pybedtools example files:

import pybedtools

def generator():
    seqsize = 3
    for i in range(100):
        strand = '+'
        if i % 2 == 0:
            strand = '-'
        yield pybedtools.create_interval_from_list([
            'chr1',
            str(seqsize * i),
            str(seqsize * i + seqsize),
            'region_%s' % i,
            '.',
            strand
        ])
    return

x = pybedtools.BedTool(generator())

x.sequence(
    fi=pybedtools.example_filename('test.fa'),
    fo='example',
    name=True,
    s=True)

print open('example').read()

This works fine. Are you able to modify this (perhaps by using your bedtools_utils.sample_intervals) so that it hangs?

It's possible that if bedtools_utils.sample_intervals is using subprocess.Popen and reading from subprocess.PIPE, you may be running into something like what's described in #49. If this is the case, the solution is simply to write to an intermediate file that can be deleted at the end of the generator function.

@daler daler closed this as completed Nov 18, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants