make file_type a settable attribute #170

daler · 2016-04-27T00:57:34Z

Imagine we have two bed files, and we do this:

z = x.intersect(y, wao=True)

The resulting file could look like this, which is incorrectly guessed to be SAM format:

chr1  0      11447  0  none  0  0  0  chr1  0      11447  0  11447
chr1  11447  11502  1  1     1  0  0  chr1  11447  11502  2  55
chr1  11502  11675  0  none  0  0  0  chr1  11502  11675  0  173
chr1  31291  31431  0  none  0  0  0  .     -1     -1     .  0

When we try to iterate over it, we get an OverflowError. Currently the fix is to make the name field a non-integer before doing the intersection:

def fix(f):
    f.name = f.name + '.'
    return f

z = x.each(fix).intersect(y, wao=True)

While we could get fancier with detecting SAM, I don't want to go the route of checking every field in every line of a file against a regex just to handle pathological cases like this. Instead, it would be useful to set the file type on the BedTool object and use that to short-circuit the create_interval_from_list heuristics. So then you could do this:

z = x.intersect(y, wao=True)
z.file_type = 'bed'
print(z)  # no longer raises OverflowError
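
To make the short-circuit idea concrete, here is a minimal pure-Python sketch. It does not use pybedtools: `ToyBedTool`, `guess_file_type` and `line_types` are hypothetical names, and the SAM check is a deliberately naive heuristic invented for this example, not the library's actual detection logic.

```python
class ToyBedTool:
    """Hypothetical stand-in for a BedTool-like object (not pybedtools code)."""

    def __init__(self, lines):
        self.lines = list(lines)
        self.file_type = None  # None means "not set; guess per line"

    @staticmethod
    def guess_file_type(fields):
        # Invented, deliberately naive heuristic: many columns plus integers
        # in columns 2, 4 and 5 are taken to mean SAM.
        looks_like_sam = (
            len(fields) >= 11
            and fields[1].isdigit()
            and fields[3].isdigit()
            and fields[4].isdigit()
        )
        return "sam" if looks_like_sam else "bed"

    def line_types(self):
        # An explicitly set file_type short-circuits the guessing entirely.
        return [
            self.file_type or self.guess_file_type(line.split())
            for line in self.lines
        ]


z = ToyBedTool([
    "chr1 0 11447 0 none 0 0 0 chr1 0 11447 0 11447",
    "chr1 11447 11502 1 1 1 0 0 chr1 11447 11502 2 55",
])
guessed = z.line_types()  # the second line trips the naive SAM check
z.file_type = "bed"
forced = z.line_types()   # every line is now treated as BED
```

The point of the sketch is only that a user-supplied type is consulted before any per-line guessing happens, so no extra regex work is needed for the common case.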


gpratt · 2016-04-27T01:03:12Z

Why not take a Biopython SeqIO approach (http://biopython.org/wiki/SeqIO), where the constructor for the BedTool has an optional file_type arg and the type is simply inherited down through modifications?

Something like:

pybedtools.BedTool(fn, file_type="bed")

Might take slightly more engineering for a very rare edge case though.
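
A rough sketch of what "inherit down through modifications" could look like, in plain Python. `TypedTool`, `_derive` and `filter` here are hypothetical names for illustration; this is not how pybedtools is implemented, and the file_type constructor argument is only the suggestion made above.

```python
class TypedTool:
    """Hypothetical BedTool-like wrapper with a declared file type."""

    def __init__(self, lines, file_type=None):
        self.lines = list(lines)
        self.file_type = file_type  # None would mean "fall back to guessing"

    def _derive(self, lines):
        # Results of a modification inherit the parent's declared type.
        return TypedTool(lines, file_type=self.file_type)

    def filter(self, predicate):
        return self._derive(line for line in self.lines if predicate(line))


parent = TypedTool(
    ["chr1 0 11447 0", "chr1 11447 11502 1", "chr2 5 10 7"],
    file_type="bed",
)
child = parent.filter(lambda line: line.startswith("chr1"))
```

Every method that returns a new object would have to go through something like `_derive`, which is where the extra engineering comes from.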

Gabriel Pratt
Bioinformatics Graduate Student, Yeo Lab
University of California San Diego


daler · 2016-04-27T01:37:10Z

I like that idea, and should probably add it. For it to work for this example, the results from each BedTool method would inherit the file type of self. Is that always true though? Actually now that I think about it, it's not -- take for example intersect(bed=True), where the input is BAM but the output is BED. To handle cases like that, you'd have to keep track of the kwargs and which ones result in which kinds of files. I'd be worried about that getting out of date with actual BEDTools commands.
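
To illustrate the bookkeeping this would need, here is a hedged sketch of a lookup from (method, flag) to output type. The table and the `result_file_type` helper are hypothetical; the only entry comes from the intersect(bed=True) case mentioned above, and a real table would have to track every BEDTools option that changes the output format.

```python
# Hypothetical table: (method name, flag) pairs that force an output type.
OUTPUT_TYPE_OVERRIDES = {
    ("intersect", "bed"): "bed",  # BAM in, BED out when bed=True
}


def result_file_type(input_type, method, **kwargs):
    """Return the file type a result would have under this toy model."""
    for (name, flag), out_type in OUTPUT_TYPE_OVERRIDES.items():
        if name == method and kwargs.get(flag):
            return out_type
    # No override matched: the result inherits the input's type.
    return input_type


bam_to_bed = result_file_type("bam", "intersect", bed=True)
bam_to_bam = result_file_type("bam", "intersect")
bed_wao = result_file_type("bed", "intersect", wao=True)
```

Any flag missing from the table silently falls through to plain inheritance, which is exactly the way such a table would drift out of date.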

But the case where you have that problematic file already on disk and then want to make a BedTool out of it, that's where the constructor would come in handy.

gpratt · 2016-04-28T01:15:10Z

Hmm... I was about to agree with you: chasing down all the places where you'd need to manually define changes in type is going to be a huge pain.

But I actually just ran into a bug that would break your proposed solution. I'm running an intersectBed command with -wo which, long story short, outputs the BED line below, and pybedtools thinks it is a SAM file.

chr1 876577 876589 0 -0.963772824612645 + chr1 876524 876686 ENSG00000187634.6 0 + 12

Both parent BED files are fine, but the child breaks on implicit construction. Can you define the type of the object after creation, but before running through the create_interval_from_list function?

daler · 2016-04-28T02:21:58Z

Yep, that's what the original proposal was: you could set the file_type after creation but before create_interval_from_list.
