Chunking for uBAM #140

rhpvorderman · 2024-10-07T14:14:12Z

Currently chunking only works for FASTQ. See marcelm/cutadapt#811

marcelm · 2024-10-07T14:20:14Z

Oh, interesting, I guess this needs to be done on the bgzip-level?

rhpvorderman · 2024-10-07T17:14:15Z

No, not really. Bgzip output is just a series of concatenated gzip members. There is no requirement for the bgzip blocks to be split at BAM record boundaries. A BAM record can start in one block and end in another, even if it could fit entirely in a block of its own. Nanopore records will often exceed the maximum size of a bgzip block (64 KiB).
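To illustrate the point about records straddling blocks, here is a minimal sketch using only the Python standard library. It assumes plain gzip members as a stand-in for real bgzip blocks (which additionally carry a block-size extra field), and the "record" is a made-up length-prefixed payload, not a valid BAM record.

```python
import gzip
import struct

# A fake "record": 4-byte little-endian length prefix followed by a payload.
# This is only a stand-in for a BAM record, not a valid one.
record = struct.pack("<I", 8) + b"ACGTACGT"

# Split the record in the middle and compress each half as its own gzip
# member, mimicking a record that starts in one block and ends in the next.
stream = gzip.compress(record[:6]) + gzip.compress(record[6:])

# gzip.decompress handles concatenated members and returns one byte string,
# so the block boundary is invisible once the data is decompressed.
data = gzip.decompress(stream)
```

After decompression the two halves are contiguous again, which is why the block layout does not need to line up with the records.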

So we can just decompress the whole thing as one big file stream and parse the records out. We already do this for single-end. For chunking we can make use of the fact that each BAM record stores its block size at the beginning, so there is no need to parse the entire record to find where the next one starts. Chunking should be much faster than for FASTQ.
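A rough sketch of what such chunking could look like, assuming the buffer holds decompressed data that starts at a record boundary (that is, the BAM header and reference list have already been skipped). The function name `complete_records_end` is hypothetical and not taken from any existing code base.

```python
import struct


def complete_records_end(buf: bytes) -> int:
    """Return the offset just past the last complete record in buf.

    Assumes buf starts at a record boundary in the decompressed stream.
    Each record is a 4-byte little-endian block_size followed by
    block_size bytes of record data.
    """
    pos = 0
    size = len(buf)
    while pos + 4 <= size:
        # Only the length prefix is read; the record body is skipped.
        (block_size,) = struct.unpack_from("<I", buf, pos)
        next_pos = pos + 4 + block_size
        if next_pos > size:
            # The record is incomplete; it belongs to the next chunk.
            break
        pos = next_pos
    return pos
```

Everything before the returned offset can be handed off as one chunk, and the remaining bytes are carried over and prepended to the next read. Because only the length prefixes are inspected, no per-byte scanning for newlines is needed, unlike with FASTQ.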
