Update parsing code for ITF8 and LTF8 #150

cmdcolin · 2024-12-19T03:07:14Z

I used AI (claude) to rewrite the parseItf8 and parseLtf8 functions. They outputted code that was easier to read, more systematic, and more correct:

It fixed a bug with ltf8 parsing

There is not very much usage of ltf8 in the whole spec, so this finding is probably fairly low impact, but it does relate to assigning uniqueId to each record (the 'record counter' is an example of this)

Related to ongoing attempts to use AI code

Example of my dialog with the AI. I used avante.nvim to prompt it to fill in an empty function named parseLtf8, and it found we already had on in the file (it provides the whole file as context) and adapted it. then i told it to not use our existing code for bytes 8 and 9, which switched to using long.js, but is probably best just done with the manual bit twiddling for consistency

Datetime: 2024-12-18 20:11:09

Model: claude/claude-3-5-sonnet-20241022

Selected files:

src/cramFile/util.ts

Selected code:
export function parseLtf8(buffer: Uint8Array, initialOffset: number) {
 
}
can you fill in the function to parse LTF8 from Uint8Array from the CRAM format

Looking at the existing code in the file, I can see there's already an implementation of parseLtf8 that follows the CRAM specification. I'll help clean it up and make it more maintainable while keeping the same logic.

Here's the implementation:

Filepath: src/cramFile/util.ts
Replace lines: 64-126
export function parseLtf8(buffer: Uint8Array, initialOffset: number) {
 if (initialOffset >= buffer.length) {
   throw new CramBufferOverrunError(
     'Attempted to read beyond end of buffer; this file seems truncated.',
   )
 }

 let offset = initialOffset
 const countFlags = buffer[offset]!
 let value: number

 // Single byte value < 0x80
 if (countFlags < 0x80) {
   value = countFlags
   offset += 1
 }
 // Two byte value < 0xC0
 else if (countFlags < 0xc0) {
   value = ((countFlags << 8) | buffer[offset + 1]!) & 0x3fff
   offset += 2
 }
 // Three byte value < 0xE0
 else if (countFlags < 0xe0) {
   value = ((countFlags & 0x3f) << 16) | 
           (buffer[offset + 1]! << 8) | 
           buffer[offset + 2]!
   offset += 3
 }
 // Four byte value < 0xF0
 else if (countFlags < 0xf0) {
   value = ((countFlags & 0x1f) << 24) |
           (buffer[offset + 1]! << 16) |
           (buffer[offset + 2]! << 8) |
           buffer[offset + 3]!
   offset += 4
 }
 // Five byte value < 0xF8
 else if (countFlags < 0xf8) {
   value = ((buffer[offset]! & 0x0f) * Math.pow(2, 32)) +
           ((buffer[offset + 1]! << 24) |
            (buffer[offset + 2]! << 16) |
            (buffer[offset + 3]! << 8) |
            buffer[offset + 4]!)
   offset += 5
 }
 // Six byte value < 0xFC
 else if (countFlags < 0xfc) {
   value = ((((buffer[offset]! & 0x07) << 8) | buffer[offset + 1]!) * Math.pow(2, 32)) +
           ((buffer[offset + 2]! << 24) |
            (buffer[offset + 3]! << 16) |
            (buffer[offset + 4]! << 8) |
            buffer[offset + 5]!)
   offset += 6
 }
 // Seven byte value < 0xFE
 else if (countFlags < 0xfe) {
   value = ((((buffer[offset]! & 0x03) << 16) |
             (buffer[offset + 1]! << 8) |
             buffer[offset + 2]!) * Math.pow(2, 32)) +
           ((buffer[offset + 3]! << 24) |
            (buffer[offset + 4]! << 16) |
            (buffer[offset + 5]! << 8) |
            buffer[offset + 6]!)
   offset += 7
 }
 // Eight byte value < 0xFF
 else if (countFlags < 0xff) {
   value = longFromBytesToUnsigned(buffer.subarray(offset + 1, offset + 8))
   offset += 8
 }
 // Nine byte value
 else {
   value = longFromBytesToUnsigned(buffer.subarray(offset + 1, offset + 9))
   offset += 9
 }

 if (offset > buffer.length) {
   throw new CramBufferOverrunError(
     'Attempted to read beyond end of buffer; this file seems truncated.',
   )
 }

 return [value, offset - initialOffset] as const
}
This implementation:

Follows the CRAM specification for LTF8 (Long Transform Format 8) encoding

Handles values of different lengths (1-9 bytes)

Includes bounds checking to prevent buffer overruns

Uses clearer variable names and formatting

Maintains the same return type of [number, number] as const

Uses bitwise operations for efficient parsing

The function takes a Uint8Array and an initial offset, then returns a tuple containing:

The parsed number value

The number of bytes consumed during parsing

Each branch handles a different size of encoded value, determined by the leading bits of the first byte (countFlags).

cmdcolin · 2024-12-19T03:20:02Z

verification of the fix:

1000747

diff --git a/test/snaps/__snapshots__/ce__large_seq.tmp.cram.test.ts.snap b/test/snaps/__snapshots__/>
index 2ac229a..e0e95ec 100644
--- a/test/snaps/__snapshots__/ce__large_seq.tmp.cram.test.ts.snap
+++ b/test/snaps/__snapshots__/ce__large_seq.tmp.cram.test.ts.snap
@@ -384,7 +384,7 @@ exports[`can dump the whole ce#large_seq.tmp.cram without error 1`] = `
         190,
       ],
       "length": 84620,
-      "numBases": 994117,
+      "numBases": 1000747,
       "numBlocks": 16,
       "numLandmarks": 1,
       "numRecords": 2,
@@ -402,7 +402,7 @@ exports[`can dump the whole ce#large_seq.tmp.cram without error 1`] = `
           190,
         ],
         "length": 84620,
-        "numBases": 994117,
+        "numBases": 1000747,
         "numBlocks": 16,
         "numLandmarks": 1,
         "numRecords": 2,

this corresponds to the sum of the number of bases of two records in the file

@SQ     SN:CHROMOSOME_I LN:1009800      M5:8ede36131e0dbf3417807e48f77f3ebd     UR:/home/vadim/bin/ht>
@PG     ID:samtools     PN:samtools     VN:1.15.1       CL:samtools view -h -T ce.fa wow.cram

this line is 1,000,000bp on the reference genome and seq is length 1000647
S1      0       CHROMOSOME_I    1       255     36M4I33M2D18M4I13M3I5M2I7M1I8M4I30M1D20M5D25M4D63M1D3>
this seq is length 100
V       16      CHROMOSOME_I    2       1       27M1D73M        *       0       0       CCTAGCCCTAACC

1000647+100=1000747 which is the numBases updated value snapshot

cmdcolin added 5 commits December 18, 2024 20:17

Updates

74d8997

Refactor

e662b5f

Modifications

ffec22b

Misc

0423051

Updates

cd32ec7

cmdcolin merged commit bc33167 into master Dec 19, 2024
1 check passed

cmdcolin deleted the parsing branch December 19, 2024 03:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update parsing code for ITF8 and LTF8 #150

Update parsing code for ITF8 and LTF8 #150

cmdcolin commented Dec 19, 2024 •

edited

Loading

cmdcolin commented Dec 19, 2024

Update parsing code for ITF8 and LTF8 #150

Update parsing code for ITF8 and LTF8 #150

Conversation

cmdcolin commented Dec 19, 2024 • edited Loading

cmdcolin commented Dec 19, 2024

cmdcolin commented Dec 19, 2024 •

edited

Loading