Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update parsing code for ITF8 and LTF8 #150

Merged
merged 5 commits into from
Dec 19, 2024
Merged

Update parsing code for ITF8 and LTF8 #150

merged 5 commits into from
Dec 19, 2024

Conversation

cmdcolin
Copy link
Contributor

@cmdcolin cmdcolin commented Dec 19, 2024

I used AI (claude) to rewrite the parseItf8 and parseLtf8 functions. They outputted code that was easier to read, more systematic, and more correct:

It fixed a bug with ltf8 parsing

There is not very much usage of ltf8 in the whole spec, so this finding is probably fairly low impact, but it does relate to assigning uniqueId to each record (the 'record counter' is an example of this)

Related to ongoing attempts to use AI code

Example of my dialog with the AI. I used avante.nvim to prompt it to fill in an empty function named parseLtf8, and it found we already had on in the file (it provides the whole file as context) and adapted it. then i told it to not use our existing code for bytes 8 and 9, which switched to using long.js, but is probably best just done with the manual bit twiddling for consistency

  • Datetime: 2024-12-18 20:11:09

  • Model: claude/claude-3-5-sonnet-20241022

  • Selected files:

  • src/cramFile/util.ts

  • Selected code:

export function parseLtf8(buffer: Uint8Array, initialOffset: number) {
 
}

can you fill in the function to parse LTF8 from Uint8Array from the CRAM format

Looking at the existing code in the file, I can see there's already an implementation of parseLtf8 that follows the CRAM specification. I'll help clean it up and make it more maintainable while keeping the same logic.

Here's the implementation:

Filepath: src/cramFile/util.ts
Replace lines: 64-126

export function parseLtf8(buffer: Uint8Array, initialOffset: number) {
 if (initialOffset >= buffer.length) {
   throw new CramBufferOverrunError(
     'Attempted to read beyond end of buffer; this file seems truncated.',
   )
 }

 let offset = initialOffset
 const countFlags = buffer[offset]!
 let value: number

 // Single byte value < 0x80
 if (countFlags < 0x80) {
   value = countFlags
   offset += 1
 }
 // Two byte value < 0xC0
 else if (countFlags < 0xc0) {
   value = ((countFlags << 8) | buffer[offset + 1]!) & 0x3fff
   offset += 2
 }
 // Three byte value < 0xE0
 else if (countFlags < 0xe0) {
   value = ((countFlags & 0x3f) << 16) | 
           (buffer[offset + 1]! << 8) | 
           buffer[offset + 2]!
   offset += 3
 }
 // Four byte value < 0xF0
 else if (countFlags < 0xf0) {
   value = ((countFlags & 0x1f) << 24) |
           (buffer[offset + 1]! << 16) |
           (buffer[offset + 2]! << 8) |
           buffer[offset + 3]!
   offset += 4
 }
 // Five byte value < 0xF8
 else if (countFlags < 0xf8) {
   value = ((buffer[offset]! & 0x0f) * Math.pow(2, 32)) +
           ((buffer[offset + 1]! << 24) |
            (buffer[offset + 2]! << 16) |
            (buffer[offset + 3]! << 8) |
            buffer[offset + 4]!)
   offset += 5
 }
 // Six byte value < 0xFC
 else if (countFlags < 0xfc) {
   value = ((((buffer[offset]! & 0x07) << 8) | buffer[offset + 1]!) * Math.pow(2, 32)) +
           ((buffer[offset + 2]! << 24) |
            (buffer[offset + 3]! << 16) |
            (buffer[offset + 4]! << 8) |
            buffer[offset + 5]!)
   offset += 6
 }
 // Seven byte value < 0xFE
 else if (countFlags < 0xfe) {
   value = ((((buffer[offset]! & 0x03) << 16) |
             (buffer[offset + 1]! << 8) |
             buffer[offset + 2]!) * Math.pow(2, 32)) +
           ((buffer[offset + 3]! << 24) |
            (buffer[offset + 4]! << 16) |
            (buffer[offset + 5]! << 8) |
            buffer[offset + 6]!)
   offset += 7
 }
 // Eight byte value < 0xFF
 else if (countFlags < 0xff) {
   value = longFromBytesToUnsigned(buffer.subarray(offset + 1, offset + 8))
   offset += 8
 }
 // Nine byte value
 else {
   value = longFromBytesToUnsigned(buffer.subarray(offset + 1, offset + 9))
   offset += 9
 }

 if (offset > buffer.length) {
   throw new CramBufferOverrunError(
     'Attempted to read beyond end of buffer; this file seems truncated.',
   )
 }

 return [value, offset - initialOffset] as const
}

This implementation:

  1. Follows the CRAM specification for LTF8 (Long Transform Format 8) encoding
  2. Handles values of different lengths (1-9 bytes)
  3. Includes bounds checking to prevent buffer overruns
  4. Uses clearer variable names and formatting
  5. Maintains the same return type of [number, number] as const
  6. Uses bitwise operations for efficient parsing

The function takes a Uint8Array and an initial offset, then returns a tuple containing:

  • The parsed number value
  • The number of bytes consumed during parsing

Each branch handles a different size of encoded value, determined by the leading bits of the first byte (countFlags).

@cmdcolin
Copy link
Contributor Author

verification of the fix:

1000747

diff --git a/test/snaps/__snapshots__/ce__large_seq.tmp.cram.test.ts.snap b/test/snaps/__snapshots__/>
index 2ac229a..e0e95ec 100644
--- a/test/snaps/__snapshots__/ce__large_seq.tmp.cram.test.ts.snap
+++ b/test/snaps/__snapshots__/ce__large_seq.tmp.cram.test.ts.snap
@@ -384,7 +384,7 @@ exports[`can dump the whole ce#large_seq.tmp.cram without error 1`] = `
         190,
       ],
       "length": 84620,
-      "numBases": 994117,
+      "numBases": 1000747,
       "numBlocks": 16,
       "numLandmarks": 1,
       "numRecords": 2,
@@ -402,7 +402,7 @@ exports[`can dump the whole ce#large_seq.tmp.cram without error 1`] = `
           190,
         ],
         "length": 84620,
-        "numBases": 994117,
+        "numBases": 1000747,
         "numBlocks": 16,
         "numLandmarks": 1,
         "numRecords": 2,

this corresponds to the sum of the number of bases of two records in the file

@SQ     SN:CHROMOSOME_I LN:1009800      M5:8ede36131e0dbf3417807e48f77f3ebd     UR:/home/vadim/bin/ht>
@PG     ID:samtools     PN:samtools     VN:1.15.1       CL:samtools view -h -T ce.fa wow.cram

this line is 1,000,000bp on the reference genome and seq is length 1000647
S1      0       CHROMOSOME_I    1       255     36M4I33M2D18M4I13M3I5M2I7M1I8M4I30M1D20M5D25M4D63M1D3>
this seq is length 100
V       16      CHROMOSOME_I    2       1       27M1D73M        *       0       0       CCTAGCCCTAACC

1000647+100=1000747 which is the numBases updated value snapshot

@cmdcolin cmdcolin merged commit bc33167 into master Dec 19, 2024
1 check passed
@cmdcolin cmdcolin deleted the parsing branch December 19, 2024 03:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant