
Down::ChunkedIO#pos returns wildly incorrect values #73

Open
pond opened this issue Jun 14, 2022 · 2 comments
pond commented Jun 14, 2022

We are using Shrine to handle large CSV uploads that are then stream-processed. We have a progress meter for this which works off the underlying IO object's #pos values. For local files this works perfectly, but once we moved to our staging environment with S3 as the storage engine - which uses Down under the hood - it all broke. It seems that after the first 1K of data, Down::ChunkedIO#pos starts returning values much, much higher than they should be - far beyond the end of the file.

For a particular test file of only 3669 bytes comprising around 55 CSV rows plus header, the size reported by the IO object was consistently correct. However, inside the CSV row iterator, the results of #pos were:

0
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
3736
6268
8732
11134
13466
15730
17923
20045
22103
24087
26017
27888
29698
31455
33155
34794
36363
37878
39313
40687
41998
43249
44431
45549
46598
47581
48498
49349
50137
50861
51519
52117
52656
53138
53562
53924
54220
54465
54647
54774
54840
54840

The start offset is 0. I presumed the 1024 offset was a chunk size used by the CSV processor, but when I rewound to zero and read 1024 bytes, I actually got a rather strange 1057 bytes back, perfectly aligned to a row end. In any event, #pos sits at 1024 for a while, and once the CSV parsing has moved past that first "chunk" - be it 1024 or 1057 bytes - the reported positions become, as you can see, very wrong.

The above was generated with no rewinding or other shenanigans; in pseudocode we have:

# shrine_file is our Shrine subclass instance representing the S3 object. The
# encoding specifier is typically UTF-8.
#
# Inside the iterator, io_obj is the Down::ChunkedIO instance.
#
options = { headers: true, header_converters: :symbol, liberal_parsing: true }

shrine_file.open(encoding: encoding_specifier) do |io_obj|
  csv = CSV.new(io_obj, **options)

  csv.each do |row|
    puts io_obj.pos
  end
end

janko (Owner) commented Jun 15, 2022

Could you write a self-contained example that reproduces the problem? Does it happen only when used with Shrine::Storage::S3, or can you reproduce it with vanilla Down (without Shrine)?

pond (Author) commented Jun 16, 2022

Found the issue and can create a PR with test cases, but there's a bit of discussion needed about how to handle it, since we now have two overlapping issues.

Detail of problem

As far as I can see, the bug I've encountered here is specific to #gets and only occurs when both a separator and a limit are given.

We know now from #74 that the entire Ruby CSV system tries to be fully encoding-aware, even down to looking for line-ending characters in the correct encoding. It uses IO#gets rather than IO#read because #gets is meant to work on characters, not bytes, and were it not for a single incorrect word in the standard library docs, there would probably have been less confusion here.

It transpires that the CSV parser only reads 1K during its initial line-ending scan. It is trying to determine the file's line-ending type automatically, reading in 1K chunks - and of course it usually finds an end-of-line well inside that amount, so it only needs to read the first 1K. During that scan, it passes nil as the separator character to #gets.

In Down::ChunkedIO, #gets calls down to #readpartial. When there's no separator character, both methods mean "read the exact amount given in the limit parameter" and - bytes vs characters aside - they agree, so it all works. Thereafter, however, CSV switches mode - it really does seem a wildly overcomplicated implementation! - and reads in 8K chunks. It retains the first 1K, but then starts calling #gets with an 8K limit and the separator character it found during the initial 1K scan, typically "\n".
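
As a quick way to observe that switch, here is a small instrumentation sketch of my own - LoggingIO is a made-up wrapper, not part of Down or CSV - which simply logs the arguments CSV passes to #gets before delegating:

require "csv"
require "stringio"
require "delegate"

# Throwaway wrapper: log every #gets call, then delegate it unchanged.
class LoggingIO < SimpleDelegator
  def gets(*args)
    puts "gets(#{args.map(&:inspect).join(', ')})"
    __getobj__.gets(*args)
  end
end

io = LoggingIO.new(StringIO.new("a,b\n1,2\n3,4\n"))
CSV.new(io, headers: true).each { |row| }
# If the behaviour described above holds, this prints a gets(nil, 1024) call
# for the line-ending scan, then gets("\n", 8192) calls for the actual parse.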

This is where we go wrong.

  • readpartial just adds an up-to-8K amount to @position.
  • gets does line, extra = data.split(separator, 2), which is a bit expensive but means we definitely have two chunks of data - the part we want, and some leftover at the end which is of no interest to this call.
  • gets fails to subtract extra.bytesize from @position, although it does adjust a cache position if present, and does something with @buffer that I haven't dug into and will assume is correct. (A small standalone sketch of the resulting drift follows below.)
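
Here is a tiny standalone illustration of that drift with made-up data - this is not Down's code, just the same arithmetic:

chunk    = "alpha\nbeta\ngamma\n"   # imagine this is what one readpartial call returned
position = 0
position += chunk.bytesize           # readpartial advances the position by the whole chunk

line, extra = chunk.split("\n", 2)   # what gets hands back vs. the leftover
line << "\n"

puts position                        # => 17, where #pos claims we are
puts line.bytesize                   # => 6, how far the caller has actually consumed
puts position - extra.bytesize       # => 6, what #pos arguably should report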

At first glance, I appear to have fixed the position offset problem with:

      line, extra = data.split(separator, 2)
      line << separator if data.include?(separator)

      data.clear # deallocate data

      if extra
        @position -= extra.bytesize # <--- Fix "position"
        # ...rest of the implementation as-is...

...though I realise that there may be deeper subtleties of the implementation that I've overlooked. In a test case with CSV parsing, the offset previously climbed into the hundreds of thousands; with the fix above, it progresses steadily and eventually matches the input file size exactly once the CSV system has read all the data.
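
For the PR, the self-contained test I have in mind is roughly as follows - vanilla Down with no S3 or Shrine involved, which should also answer the earlier reproduction question; the data, the UTF-8 encoding and the 256-byte chunk size are arbitrary choices of mine:

require "down/chunked_io"
require "csv"

data   = "id,name\n" + (1..55).map { |i| "#{i},row-#{i}\n" }.join
chunks = data.each_char.each_slice(256).map(&:join).each

io = Down::ChunkedIO.new(chunks: chunks, size: data.bytesize, encoding: "UTF-8")

positions = CSV.new(io, headers: true).map { io.pos }

# Without the fix I'd expect #pos to overshoot data.bytesize, as in the
# original report; with the fix it should finish exactly at the data size.
puts positions.max <= data.bytesize
puts positions.last == data.bytesize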

Question arising

While I can open a PR with that one-line fix and some tests, the trouble is that #gets's use of #readpartial is itself problematic: #readpartial works in bytes, but #gets really needs to work in characters. Fixing that would resolve a lot of issues specifically with encoding-aware CSV parsing, since it would become impossible for CSV to ever receive a partial character from a multibyte split at a buffer boundary.
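
As a toy illustration of that hazard - nothing to do with Down's internals, just what a byte-limited read can do to UTF-8 data:

str   = "é" * 3                                      # three characters, six bytes of UTF-8
bytes = str.b                                        # the same data viewed as raw bytes

partial = bytes[0, 3]                                # a byte limit landing mid-character
puts partial.force_encoding("UTF-8").valid_encoding? # => false: the parser would see a broken character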

We thus have something that overlaps: this issue and #74 are closely related. Would you prefer me to:
