
Do not ignore encoding just because a read length is given #74

Open
wants to merge 1 commit into master
Conversation

@pond commented Jun 15, 2022

In the current implementation, Down::ChunkedIO has a very curious piece of behaviour where it deliberately ignores the requested character encoding in #read if and only if a read length is given. This breaks our application.

I'm not sure why this is done. Perhaps there was a suspicion that Ruby's #force_encoding method might change the byte length of the data, leading to the length parameter not being honoured - but this is 100% not the case:

"First, it is possible to set the Encoding of a string to a new Encoding without changing the internal byte representation of the string, with String#force_encoding."

It would make no sense for #force_encoding to change the data itself, since the whole point is merely to indicate via metadata how that identical byte representation is to be interpreted for character presentation. The documentation for Down::ChunkedIO#read specifically says bytes, not characters:

"With length argument returns exactly that number of bytes if they're available."

...so encoding can and should be respected; #force_encoding is the right way to do it; and the presence of a length parameter should not influence that behaviour.
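
As a quick irb-style check (illustrative only, not from the PR itself), forcing an encoding changes only the metadata, never the bytes:

s = "caf\xC3\xA9".b              # raw bytes, tagged ASCII-8BIT
s.bytes                          #=> [99, 97, 102, 195, 169]
s.force_encoding(Encoding::UTF_8)
s.bytes                          #=> [99, 97, 102, 195, 169]  (identical bytes)
s.encoding                       #=> #<Encoding:UTF-8>
s                                #=> "café"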

This PR changes behaviour accordingly, with appropriately updated tests.

@jrochkind (Collaborator)

I don't know the original motivation, but keep in mind that if you read an exact number of bytes from a multi-byte encoding, you may wind up with invalid data in your read -- the first byte only of a multi-byte UTF-8 sequence, for instance. That string will be ! valid_encoding? and the final character will be unintelligible until/unless the next byte is read and joined to it.
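
For instance (a quick irb illustration, not code from the gem):

str = "é"                                      # two bytes in UTF-8: 0xC3 0xA9
head = str.byteslice(0, 1)                     # still tagged UTF-8, but truncated
head.valid_encoding?                           #=> false
(head + str.byteslice(1, 1)).valid_encoding?   #=> true once the next byte is joined on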

@pond (Author) commented Jun 15, 2022

I don't know the original motivation, but keep in mind that if you read an exact number of bytes from a multi-byte encoding, you may wind up with invalid data in your read -- the first byte only of a multi-byte UTF-8 sequence, for instance. That string will be ! valid_encoding? and the final character will be unintelligible until/unless the next byte is read and joined to it.

Yes. If I'm reading in chunks and I ask for a multibyte encoding, a chunk might end in the middle of a character, but never in the middle of a byte, and the exact byte count will still be read correctly.

Ignoring the encoding is catastrophic when stream processing CSV files with UTF-8 BOM markers. The BOM gets read in as binary and the CSV parser then falls over. See also shrinerb/shrine#585.

This is critical to our workflow. There is no reason to ignore an encoding if one is specified; the developer who specifies it is aware of the issues. Silently ignoring the encoding in an undocumented way is IMHO a really bad thing to do! It leaves the developer powerless to implement their required stream-processing workflow knowledgeably, and forces them to dive deep into the gem's source to spot a very-hard-to-see little unless length inside its workings.

I would implore the devs to please merge this - ignoring encoding is really bad.

(EDIT: Or of course, if there's another way that Shrine and Down can work together to respect encoding when loading files, that'd work too. The use case is to get the Shrine IO object and pass it to CSV for reading; the CSV input has a BOM; a "\xEF" from ASCII-8BIT to UTF-8 exception results from the incorrect encoding handling.)
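
A minimal reproduction of that failure mode (illustrative only, not our actual application code):

chunk = "\xEF\xBB\xBFname,value\n".b     # what #read hands back: BOM + data as raw binary
chunk.encode(Encoding::UTF_8)            # what the CSV converters then attempt
# Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8

chunk.force_encoding(Encoding::UTF_8).valid_encoding?
#=> true - honouring the requested encoding would have been fine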

@janko (Owner) commented Jun 15, 2022

This is actually how Ruby's File class behaves: it returns content in binary encoding when you read a certain number of bytes:

file = File.open("Gemfile")
file.read.encoding #=> #<Encoding:UTF-8>

file = File.open("Gemfile")
file.read(10).encoding #=> #<Encoding:ASCII-8BIT>

With Down::ChunkedIO I tried to retain the same behaviour, so that's why the requested encoding is ignored. Wouldn't using a File object instead of Down::ChunkedIO break your app as well then?

@pond (Author) commented Jun 16, 2022

A very good question, and after a few hours I finally figured it out.

The TL;DR (ish)

CSV uses IO#gets. This appears to obey encoding with files because it is trying to find separator characters, and so it apparently needs to interpret both the separator character and the input data in the context of the specified encoding (we could argue that it could convert the separator to an 8-bit byte sequence and just look for that in the input byte stream, but if it did, there would be a risk of being very unlucky and mis-matching on a byte sequence that isn't aligned at a character boundary).

In particular, CSV uses gets(nil, chunk_size), meaning that the gets call is going to ignore any separators and return what the Ruby IO docs describe as a number of bytes, but which is actually a number of characters. If I create a test file with a multibyte character split across the 1024-byte boundary, then (NB: f.rewind calls between the snippets below are omitted for brevity):

f=File.open("text.csv", rewindable: false, encoding: Encoding::UTF_8)
f.size
=> 1188
f.gets(nil).encoding
=> #<Encoding:UTF-8>
f.gets(nil).bytes.size
=> 1188

...so far so good...

f.gets(nil,1024).bytes.size
=> 1026

...oops! Compare with:

f.read(1024).encoding
=> #<Encoding:ASCII-8BIT>
f.read(1024).bytes.size
=> 1024

So, contrary to a strict reading of the Ruby core library documentation, IO#gets with a length limit treats that limit as characters and works correctly with the requested encoding, while IO#read ignores the requested encoding and returns literal 8-bit bytes.

Down::ChunkedIO implements gets as a call to read, so behaves differently.

This suggests that my PR needs changes to match the IO behaviour. Do you agree?
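
Something along these lines is what I have in mind - a rough sketch only, with a hypothetical method name, not the actual PR diff: read the requested number of bytes, then top up a byte at a time until the trailing character is complete.

# Sketch; assumes the only reason for an invalid trailing sequence is a
# character split across the read boundary, and that `io` responds to
# #read and #eof? in the usual IO-like way.
def gets_like_limited_read(io, limit, encoding)
  data = io.read(limit)
  return nil if data.nil?

  data.force_encoding(encoding)
  until data.valid_encoding? || io.eof?
    extra = io.read(1) or break
    data << extra.force_encoding(encoding)
  end
  data
end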

Side note

That 1026 is more interesting than it looks. At offsets 0 to 2 inclusive are the BOM bytes. At offset 1023 - the last byte of the first 1024 of the file - is the first byte of a 3-byte UTF-8 character sequence. When we examine the string that gets returns, we see that the BOM is present as an invisible first character, yet the string ends precisely at the fully-decoded character which was split. If gets had stripped the BOM before trying to read 1024 bytes, that last character would have fitted within a 1024-byte chunk and been followed by a single-byte character afterwards. Since we see no such character in the returned string, we can deduce that the 1024-byte limit was enforced including the BOM, but gets was happy to read beyond it until reaching the end of a full character sequence.

The investigation

  • Part of the problem was that I used header converters. The built-in CSV converters force encoding to UTF-8 and that's actually where my original binary->UTF-8 BOM-related exception came from. Hard to track down as Ruby's caller backtrace is rather unhelpful when an exception arises within an invoked lambda.

  • Encodings based around something other than 8-bit bytes, multibyte or otherwise, are unlikely to work in Ruby core CSV. UTF-16 is a case in point. Rationale below if you're interested.

Once I'd given up on letting CSV anywhere near encodings, and accepted that we're only going to be able to process encodings based on 8-bit bytes where "\r" is always 0x0D and "\n" is always 0x0A - which is the case for the vast majority of common legacy 8-bit encodings and for UTF-8 - then things can be made to work:

  • Open the Shrine stream with file.open(encoding: Encoding::BINARY) to obtain io_obj - which is of course how your original code would behave anyway, given that it ignores encoding when Shrine's S3 back-end causes #read to be invoked with a fixed length.
  • Provide encoding: Encoding::BINARY to CSV.new along with the io_obj (whether Down::ChunkedIO or otherwise)
  • Ensure that CSV.new either has no header or value converters, or converts as described below
  • Process the file (see the rough sketch after this list, and the CSV-processing notes below)
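
Roughly, that setup looks like this (file and the io_obj name stand in for whatever Shrine object the app already holds; this is an illustrative sketch, not our exact production code):

require 'csv'

file.open(encoding: Encoding::BINARY) do |io_obj|
  csv = CSV.new(io_obj, encoding: Encoding::BINARY, liberal_parsing: true)
  csv.each do |row|
    # values arrive as binary strings; encoding coercion happens in a
    # converter step (see "If interested: CSV processing" below)
  end
end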

If interested: CSV processing

Our CSV system regularly encounters files in random encodings, including Windows CP1252, where top-bit-set characters need to be handled specially in order to convert successfully to UTF-8. I found it unavoidable to have a preprocessing step which reads the CSV file data up to some defined byte limit (at which point it gives up with whatever it has determined is the correct encoding and just hopes that's OK for the whole file). The process, in pseudocode (a rough Ruby sketch follows the list), is:

  • Open everything as a byte stream
  • Headerless liberal CSV parsing with no converters
  • Iterate over every row and column
  • Take the value there, force its encoding to whatever I'm trying to check - it cascades through UTF-8, CP1252, ISO8859-1 - and ask if the result is #valid_encoding?; if not, restart with the next thing to try. (I suspect the code in practice never gets past CP1252 and moves on to 8859-1, because I doubt Ruby ever returns invalid-encoding for CP1252; it certainly doesn't for undefined characters such as 0x93 in ISO8859-1 - the smart double opening quote in CP1252 - so that's why I put the more-commonly-encountered CP1252 first.)
  • I use #pos to determine the bail-out-early condition (and, yes, I'm still trying to figure out how to reproduce "Down::ChunkedIO#pos returns wildly incorrect values" (#73) cleanly for you there; the call is working reliably during the encoding scan - I suspect it's something to do with rewindable and seeking, but I'm not sure yet)
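
In rough Ruby, the scan described above looks something like this (simplified: it assumes the stream can be rewound or re-opened between attempts, the real code also threads through the caller's CSV options, and detect_internal_encoding is a hypothetical helper name):

require 'csv'

def detect_internal_encoding(io, byte_limit: 1024 * 1024)
  [Encoding::UTF_8, Encoding::CP1252, Encoding::ISO_8859_1].each do |candidate|
    io.rewind
    csv = CSV.new(io, encoding: Encoding::BINARY, liberal_parsing: true)
    ok  = true

    csv.each do |row|
      break if io.pos > byte_limit            # bail out early, accept this candidate
      row.each do |field|
        next if field.nil?
        next if field.dup.force_encoding(candidate).valid_encoding?
        ok = false
        break
      end
      break unless ok
    end

    return candidate if ok
  end

  Encoding::UTF_8                             # last resort; convert with :replace later
end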

Once the internal encoding is determined as best we can, the process can be repeated, but now we take any header or value converters specified in the CSV options, flatten them into an array, and unshift into first place an encoding coercion step which looks something like this:

lambda do |value|
  if value.is_a?(String)
    value.force_encoding(internal_encoding).encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  else
    value
  end
end
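
That step is then wired in roughly like this (csv_options and io_obj being whatever the caller supplies, and assuming the lambda above has been assigned to coerce; illustrative only):

converters        = [coerce, *Array(csv_options[:converters])]
header_converters = [coerce, *Array(csv_options[:header_converters])]

CSV.new(io_obj, **csv_options,
        encoding:          Encoding::BINARY,
        converters:        converters,
        header_converters: header_converters)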

If interested: On Ruby CSV and UTF-16

The CSV code in the Ruby core library "looks ahead" for line endings via this code:

    def resolve_row_separator(separator)
      if separator == :auto
        cr = "\r".encode(@encoding)
        lf = "\n".encode(@encoding)

...so it is encoding the line terminators into whatever the prevailing encoding might be, and it was seeing this that reminded me that I was incorrectly expecting certain control characters to be single-byte in every encoding. The code then goes on to do this:

        if @input.is_a?(StringIO)
          # ...not our use case...
        elsif @input.respond_to?(:gets)
          if @input.is_a?(File)
            chunk_size = 32 * 1024  # <--- My test file is only around 4K, so this would read the whole file
          else
            chunk_size = 1024 # <--- This is what happens with S3 / Down::ChunkedIO in play
          end
          begin
            while separator == :auto
              #
              # if we run out of data, it's probably a single line
              # (ensure will set default value)
              #
              break unless sample = @input.gets(nil, chunk_size)
              # ...etc...
          rescue IOError
            # do nothing:  ensure will set default
          end
        end

This at first glance looks correct, if the stream encoding is set (ignoring multibyte boundaries for a moment). The CSV parser does encode CR and LF, and if the inbound data stream is encoded correctly you'd think the comparison would work. Otherwise it fails, because Ruby has no way to compare an 8-bit binary string with UTF-16:

irb(main):030:0> lf = "\n".encode("UTF-16")
=> "\uFEFF\n"
irb(main):031:0> "This\nand that\n".force_encoding("ASCII-8BIT").end_with?(lf)
(irb):32:in `end_with?': incompatible character encodings: ASCII-8BIT and UTF-16 (Encoding::CompatibilityError)

...we see here that if the developer was expecting UTF-16 multibyte input, then processing a CSV file via a stream object would be impossible unless the stream data was being read with the correct encoding. You don't see the error with UTF-8, because ASCII-8BIT and UTF-8 are compatible when the bytes involved are plain ASCII:

irb(main):032:0> "This\nand that\n".force_encoding("ASCII-8BIT").end_with?(lf.encode("UTF-8"))
=> true

...but it gets even worse. When Ruby encodes to UTF-16, it adds in the 0xFEFF BOM. Always. So in the core library CSV parser, when @encoding is UTF-16, then:

irb(main):033:0> lf.bytes
=> [254, 255, 0, 10]

...and in that case, even if the data we read in were correctly tagged as UTF-16 too, it doesn't work, because Ruby's String#end_with? thinks it is literally being asked whether the string ends with a BOM followed by the end-of-line character, which can never be the case except for a possible end-of-line character at the very start of the entire file.

irb(main):034:0> "This\nand that\n".encode("UTF-16").bytes
=> [254, 255, 0, 84, 0, 104, 0, 105, 0, 115, 0, 10, 0, 97, 0, 110, 0, 100, 0, 32, 0, 116, 0, 104, 0, 97, 0, 116, 0, 10]
irb(main):036:0> "This\nand that\n".encode("UTF-16").end_with?(lf)
=> false  <--- (facepalm)

...so we come full circle. At least for some encodings, Ruby CSV will simply be incapable of processing the file, because Ruby string operations and the way CSV drives them with its encoded line-ending types just break. We fortunately haven't encountered anyone in the real world rash enough to give us a UTF-16 encoded CSV file yet! 😂

@pond (Author) commented Jun 16, 2022

As far as the error in the Ruby documentation goes: https://bugs.ruby-lang.org/issues/18833

@janko (Owner) commented Jun 16, 2022

This suggests that my PR needs changes to match the IO behaviour. Do you agree?

I really appreciate your research into this. Yes, I agree that Down::ChunkedIO#gets should match Ruby's IO behavior; I'm just not clear whether it should match the documented behavior (returning at most length+1 bytes) or the actual behavior.

@jrochkind (Collaborator) commented Jun 16, 2022

(to return at most length+1 bytes)

Wait, is this it?

I understood the report as different -- that it would return length codepoints, not length bytes.

That could easily exceed length bytes by more than 1, no? It depends on the nature of the encoding and the codepoints involved. There could be more than one multi-byte codepoint in the source, and note that even in UTF-8, some codepoints can be 3 or even 4 bytes, not just 2.
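
For example (quick irb check):

"é".bytesize     #=> 2
"€".bytesize     #=> 3
"😀".bytesize    #=> 4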

@pond (Author) commented Jun 16, 2022

@jrochkind:

it would return length codepoints, not length bytes.

Exactly right. The IO#gets core library documentation on the limit parameter has only one word which is incorrect - it should say "characters", not "bytes". And it must mean characters, because that's the only way it can correctly implement the sep parameter. Meanwhile, IO#read is working in bytes.

@janko:

I'm just not clear whether it should match the documented (to return at most length+1 bytes) or the actual behavior

The documentation is definitely wrong, relatively seriously so for such a low-level method, which is why I submitted https://bugs.ruby-lang.org/issues/18833 and will try to figure out a PR soon if there's no traction on the issue. When I look at the number of bugs in the Ruby bug tracker, it worries me greatly - they don't seem to be keeping on top of things (1791 open issues).

@pond (Author) commented Jun 20, 2022

The Ruby issue got rejected because they say that gets etc. do read in bytes, but do not split a multibyte character across chunks, and it seems that this will happily lead to more bytes being read than requested (but only up to the end of an already-started multibyte character). This is of course impossible to implement without being encoding-aware. They refer us to:

IMHO the information is very hard to notice because it's hidden at the start of the file without subsequent reference, but I don't have the energy to try and wade through the Ruby PR process for that.
