make file_type a settable attribute #170

Open
daler opened this issue Apr 27, 2016 · 4 comments

Owner

daler commented Apr 27, 2016

Imagine we have two bed files, and we do this:

z = x.intersect(y, wao=True)

The resulting file could look like this, and it is incorrectly guessed to be SAM format:

chr1  0      11447  0  none  0  0  0  chr1  0      11447  0  11447
chr1  11447  11502  1  1     1  0  0  chr1  11447  11502  2  55
chr1  11502  11675  0  none  0  0  0  chr1  11502  11675  0  173
chr1  31291  31431  0  none  0  0  0  .     -1     -1     .  0

When we try to iterate over it, we get an OverflowError. Currently the fix is to make the name field a non-integer before doing the intersection:

def fix(f):
    f.name = f.name + '.'
    return f

z = x.each(fix).intersect(y, wao=True)

While we could get fancier with detecting SAM, I don't want to go the route of checking every field in every line of a file against a regex just for pathological cases like this. Instead, it would be useful to set the file type on the BedTool object and use that to short-circuit the create_interval_from_fields heuristics. So then you could do this:

z = x.intersect(y, wao=True)
z.file_type = 'bed'
print(z)  # no longer raises OverflowError
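
To make the short-circuit idea concrete, here is a minimal sketch in plain Python. It does not use pybedtools, and `guess_file_type` and `resolve_file_type` are hypothetical names; the guessing rule is a deliberately simplified stand-in, not the library's actual heuristic. The only point is that an explicit file type, when given, wins over any guess.

```python
def guess_file_type(fields):
    """Hypothetical, simplified guesser (NOT pybedtools's real heuristic).

    Calls a line SAM when it has at least 11 fields and the fields that
    would hold FLAG, POS and MAPQ in SAM are all non-negative integers.
    """
    looks_like_sam = (
        len(fields) >= 11
        and fields[1].isdigit()
        and fields[3].isdigit()
        and fields[4].isdigit()
    )
    return "sam" if looks_like_sam else "bed"


def resolve_file_type(fields, file_type=None):
    """Use the caller's explicit file type if set; otherwise fall back to guessing."""
    return file_type if file_type is not None else guess_file_type(fields)


# Second line of the -wao output shown above: an integer name and score
# make a BED line pass the naive SAM test.
fields = "chr1 11447 11502 1 1 1 0 0 chr1 11447 11502 2 55".split()

guessed = resolve_file_type(fields)
forced = resolve_file_type(fields, file_type="bed")
```

With this rule the guess for that line is "sam", while passing file_type="bed" skips the guess entirely, with no per-field regex needed.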

gpratt commented Apr 27, 2016

Why not take a Biopython SeqIO approach (http://biopython.org/wiki/SeqIO), where the constructor for the BedTool has an optional file_type arg and you just inherit it down through modifications?

Something like:

pybedtools.BedTool(fn, file_type="bed")

Might take slightly more engineering for a very rare edge case though.

Gabriel Pratt
Bioinformatics Graduate Student, Yeo Lab
University of California San Diego


Owner Author

daler commented Apr 27, 2016

I like that idea, and should probably add it. For it to work for this example, the results from each BedTool method would inherit the file type of self. Is that always true though? Actually now that I think about it, it's not -- take for example intersect(bed=True), where the input is BAM but the output is BED. To handle cases like that, you'd have to keep track of the kwargs and which ones result in which kinds of files. I'd be worried about that getting out of date with actual BEDTools commands.

But the case where you have that problematic file already on disk and then want to make a BedTool out of it, that's where the constructor would come in handy.
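
The bookkeeping worry about inheriting file types can be sketched roughly as follows. This is plain Python with hypothetical names (`OUTPUT_TYPE_OVERRIDES`, `infer_result_type`); the table is an illustrative assumption with a single entry, and keeping a complete version in sync with BEDTools is exactly the maintenance burden described above.

```python
# Hypothetical and intentionally incomplete: maps (method, flag) to the
# output type that flag forces. Every BEDTools option that changes the
# output format would need an entry here.
OUTPUT_TYPE_OVERRIDES = {
    ("intersect", "bed"): "bed",  # intersect(bed=True): BAM in, BED out
}


def infer_result_type(method, input_type, kwargs):
    """Inherit the input's file type unless a known flag changes it."""
    for (known_method, flag), output_type in OUTPUT_TYPE_OVERRIDES.items():
        if known_method == method and kwargs.get(flag):
            return output_type
    return input_type


inherited = infer_result_type("intersect", "bam", {})
overridden = infer_result_type("intersect", "bam", {"bed": True})
```

Plain inheritance covers the first call, but the second only comes out right because someone remembered to add the table entry.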


gpratt commented Apr 28, 2016

Hmm... I was about to agree with you: chasing down all the places where you'd need to manually define changes in type is going to be a huge pain.

I actually just ran into a bug that would break your proposed solution. I'm running an intersect command with -wo which, long story short, outputs a BED line that pybedtools thinks is SAM:

chr1 876577 876589 0 -0.963772824612645 + chr1 876524 876686 ENSG00000187634.6 0 + 12

Both of the parent BED files are fine, but the child breaks on implicit construction. Can you define the type of the object after creation, but before running through the create_interval_from_list function?

Owner Author

daler commented Apr 28, 2016

Yep, that's what the original proposal was: you could set the file_type after creation but before create_interval_from_list.
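
As a rough illustration of why setting the attribute after creation is enough, here is a sketch of a lazily parsed object. `LazyTable` is a hypothetical stand-in, not pybedtools code, and its `_guess` method is a deliberately naive placeholder for the real heuristics. Because the file type is only consulted when lines are iterated, assigning it after construction still takes effect.

```python
class LazyTable:
    """Hypothetical stand-in for a BedTool-like object (not pybedtools code)."""

    def __init__(self, lines, file_type=None):
        self.lines = lines
        self.file_type = file_type  # plain settable attribute

    def _guess(self, fields):
        # Naive placeholder heuristic: many fields plus integer columns 2 and 4.
        if len(fields) >= 11 and fields[1].isdigit() and fields[3].isdigit():
            return "sam"
        return "bed"

    def __iter__(self):
        for line in self.lines:
            fields = line.split()
            # The override is read here, at iteration time, so it can be
            # set any time before the object is iterated.
            kind = self.file_type or self._guess(fields)
            yield kind, fields


# The -wo output line reported above.
z = LazyTable([
    "chr1 876577 876589 0 -0.963772824612645 + "
    "chr1 876524 876686 ENSG00000187634.6 0 + 12"
])

before = [kind for kind, _ in z]  # guessed by the naive heuristic
z.file_type = "bed"               # set after creation, before parsing
after = [kind for kind, _ in z]
```

The same attribute could also be filled in from an optional constructor argument, which would cover the case of a problematic file already on disk.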
