
Down::ChunkedIO#pos returns wildly incorrect values #73

Open
pond opened this issue Jun 14, 2022 · 2 comments
pond commented Jun 14, 2022

We are using Shrine to handle large CSV uploads that are then stream-processed. We have a progress meter for this which works off the underlying IO object's #pos values. For local files this works perfectly, but once we moved to our staging environment with S3 as the storage engine - which uses Down under the hood - it all broke. It seems that after the first 1K of data, Down::ChunkedIO#pos starts returning values much, much higher than they should be - far beyond the end of the file.

For a particular test file of only 3669 bytes comprising around 55 CSV rows plus header, the size reported by the IO object was consistently correct. However, inside the CSV row iterator, the results of #pos were:

0
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
3736
6268
8732
11134
13466
15730
17923
20045
22103
24087
26017
27888
29698
31455
33155
34794
36363
37878
39313
40687
41998
43249
44431
45549
46598
47581
48498
49349
50137
50861
51519
52117
52656
53138
53562
53924
54220
54465
54647
54774
54840
54840

The start offset is 0. I presumed the 1024 offset was a chunk size used by the CSV processor, but when I rewound to zero and read 1024 bytes, I actually got a rather strange 1057 bytes back, perfectly aligned to a row end. In any event, #pos sits at 1024 for a while, and once the CSV parsing has moved past that first "chunk" - be it 1024 or 1057 bytes - the reported positions become, as you can see, very wrong.

The above was generated with no rewinding or other shenanigans; in pseudocode we have:

# shrine_file is our Shrine subclass instance representing the S3 object. The
# encoding specifier is typically UTF-8.
#
# Inside the iterator, io_obj is the Down::ChunkedIO instance.
#
options = { headers: true, header_converters: :symbol, liberal_parsing: true }

shrine_file.open(encoding: encoding_specifier) do |io_obj|
  csv = CSV.new(io_obj, **options)

  csv.each do |row|
    puts io_obj.pos
  end
end

janko (Owner) commented Jun 15, 2022

Could you write a self-contained example that reproduces the problem? Does it happen only when used with Shrine::Storage::S3, or can you reproduce it with vanilla Down (without Shrine)?

pond (Author) commented Jun 16, 2022

Found the issue and can create a PR with test cases, but there's a bit of discussion needed about how to handle it, since we now have two overlapping issues.

Detail of problem

As far as I can see, the bug I've encountered here is specific to #gets and only occurs when both a separator and a limit are given.

We know now from #74 that the entire Ruby CSV system tries to be fully encoding-aware, even down to looking for line-ending characters in the correct encoding. It uses IO#gets rather than IO#read because #gets is meant to work on characters, not bytes, and were it not for a single incorrect word in the standard library docs, there would probably have been less confusion here.

It transpires that the CSV parser only reads 1K during its initial line-ending scan. It is trying to determine the file's line-ending type automatically, reading in 1K chunks - and of course it usually finds an end-of-line well inside that amount, so it only needs to read the first 1K. During that scan, it passes nil as the separator character to #gets.

In Down::ChunkedIO, #gets calls down to #readpartial. When there's no separator character, both methods mean "read the exact amount given in the limit parameter" and - bytes vs characters aside - they agree, so it all works. Thereafter, however, CSV switches mode - it really does seem a wildly overcomplicated implementation! - and reads in 8K chunks. It retains the first 1K, but then starts calling #gets with an 8K limit and the separator character it found during the initial 1K scan, typically "\n".
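
As a quick way to observe that switch, here is a small instrumentation sketch of my own - LoggingIO is a made-up wrapper, not part of Down or CSV - which simply logs the arguments CSV passes to #gets before delegating:

require "csv"
require "stringio"
require "delegate"

# Throwaway wrapper: log every #gets call, then delegate it unchanged.
class LoggingIO < SimpleDelegator
  def gets(*args)
    puts "gets(#{args.map(&:inspect).join(', ')})"
    __getobj__.gets(*args)
  end
end

io = LoggingIO.new(StringIO.new("a,b\n1,2\n3,4\n"))
CSV.new(io, headers: true).each { |row| }
# If the behaviour described above holds, this prints a gets(nil, 1024) call
# for the line-ending scan, then gets("\n", 8192) calls for the actual parse.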

This is where we go wrong.

  • readpartial just adds an up-to-8K amount to @position.
  • gets does line, extra = data.split(separator, 2), which is a bit expensive but means we definitely have two chunks of data - the part we want, and some leftover at the end which is of no interest to this call.
  • gets fails to subtract extra.bytesize from @position, although it does adjust a cache position if present, and does something with @buffer that I haven't dug into and will assume is correct. (A small standalone sketch of the resulting drift follows below.)
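
Here is a tiny standalone illustration of that drift with made-up data - this is not Down's code, just the same arithmetic:

chunk    = "alpha\nbeta\ngamma\n"   # imagine this is what one readpartial call returned
position = 0
position += chunk.bytesize           # readpartial advances the position by the whole chunk

line, extra = chunk.split("\n", 2)   # what gets hands back vs. the leftover
line << "\n"

puts position                        # => 17, where #pos claims we are
puts line.bytesize                   # => 6, how far the caller has actually consumed
puts position - extra.bytesize       # => 6, what #pos arguably should report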

At first glance, I appear to have fixed the position offset problem with:

      line, extra = data.split(separator, 2)
      line << separator if data.include?(separator)

      data.clear # deallocate data

      if extra
        @position -= extra.bytesize # <--- Fix "position"
        # ...rest of the implementation as-is...

...though I realise that there may be deeper subtleties of the implementation that I've overlooked. In a test case with CSV parsing, the offset previously climbed into the hundreds of thousands; with the fix above, it progresses steadily and eventually matches the input file size exactly once the CSV system has read all the data.
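
For the PR, the self-contained test I have in mind is roughly as follows - vanilla Down with no S3 or Shrine involved, which should also answer the earlier reproduction question; the data, the UTF-8 encoding and the 256-byte chunk size are arbitrary choices of mine:

require "down/chunked_io"
require "csv"

data   = "id,name\n" + (1..55).map { |i| "#{i},row-#{i}\n" }.join
chunks = data.each_char.each_slice(256).map(&:join).each

io = Down::ChunkedIO.new(chunks: chunks, size: data.bytesize, encoding: "UTF-8")

positions = CSV.new(io, headers: true).map { io.pos }

# Without the fix I'd expect #pos to overshoot data.bytesize, as in the
# original report; with the fix it should finish exactly at the data size.
puts positions.max <= data.bytesize
puts positions.last == data.bytesize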

Question arising

While I can open a PR with that one-line fix and some tests, the trouble is that #gets's use of #readpartial is itself problematic: #readpartial works in bytes, but #gets really needs to work in characters. Fixing that would resolve a lot of issues specifically with encoding-aware CSV parsing, since it would become impossible for CSV to ever receive a partial character from a multibyte split at a buffer boundary.
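
As a toy illustration of that hazard - nothing to do with Down's internals, just what a byte-limited read can do to UTF-8 data:

str   = "é" * 3                                      # three characters, six bytes of UTF-8
bytes = str.b                                        # the same data viewed as raw bytes

partial = bytes[0, 3]                                # a byte limit landing mid-character
puts partial.force_encoding("UTF-8").valid_encoding? # => false: the parser would see a broken character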

We thus have something that overlaps: this issue and #74 are closely related. Would you prefer me to:
