Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aws::Xml::Parser::ParsingError: xmlParseCharRef: invalid xmlChar value 8 #3081

Closed
arianf opened this issue Aug 4, 2024 · 12 comments
Closed
Labels
p2 This is a standard priority issue wontfix We have determined that we will not resolve the issue.

Comments

@arianf
Copy link
Contributor

arianf commented Aug 4, 2024

Describe the bug

When trying to upload to a key with a \b in it, I get a parser error after upload.

object.upload_file(file_path)

Aws::Xml::Parser::ParsingError: xmlParseCharRef: invalid xmlChar value 8

from /usr/local/bundle/ruby/3.2.0/gems/aws-sdk-core-3.201.3/lib/aws-sdk-core/xml/parser/stack.rb:49:in `error'

Here is the full trace:

-> "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n<InitiateMultipartUploadResult>\n    <Bucket>v01</Bucket>\n    <Key>ABC/1234/1234/0000/12345678/file_name&#8;.tar.gz</Key>\n    <UploadId>8b5b9c90-5221-11ef-9ea3-620ac6840503</UploadId>\n</InitiateMultipartUploadResult>\n"
read 265 bytes
Conn keep-alive
Aws::Xml::Parser::ParsingError: xmlParseCharRef: invalid xmlChar value 8

I'm using the NokogiriEngine it seems:

Aws::Xml::Parser.engine
=> Aws::Xml::Parser::NokogiriEngine

Expected Behavior

When I put the same text through Nokogiri, it parses it just fine:

[11] pry(main)> Nokogiri::XML _
=> #(Document:0x7788 {
  name = "document",
  children = [
    #(Element:0x779c {
      name = "InitiateMultipartUploadResult",
      children = [
        #(Text "\n    "),
        #(Element:0x77b0 { name = "Bucket", children = [ #(Text "v01")] }),
        #(Text "\n    "),
        #(Element:0x77c4 { name = "Key", children = [ #(Text "ABC/1234/1234/0000/12345678/file_name.tar.gz")] }),
        #(Text "\n    "),
        #(Element:0x77d8 { name = "UploadId", children = [ #(Text "8b5b9c90-5221-11ef-9ea3-620ac6840503")] }),
        #(Text "\n")]
      })]
  })

Current Behavior

The following error:

Aws::Xml::Parser::ParsingError: xmlParseCharRef: invalid xmlChar value 8

Reproduction Steps

response_xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n<InitiateMultipartUploadResult>\n    <Bucket>v01</Bucket>\n    <Key>ABC/1234/1234/0000/12345678/file_name&#8;.tar.gz</Key>\n    <UploadId>8b5b9c90-5221-11ef-9ea3-620ac6840503</UploadId>\n</InitiateMultipartUploadResult>\n"

Nokogiri::XML::SAX::Parser.new(Aws::Xml::Parser::NokogiriEngine.new(Aws::Xml::Parser::Stack.new(Seahorse::Model::Shapes::ShapeRef.new))).parse(response_xml)

# => Aws::Xml::Parser::ParsingError: xmlParseCharRef: invalid xmlChar value 8

The above is mimicking what this line is doing:

Nokogiri::XML::SAX::Parser.new(self).parse(xml)

Possible Solution

Not sure

Additional Information/Context

nokogiri (1.14.0-x86_64-linux)

Gem name ('aws-sdk', 'aws-sdk-resources' or service gems like 'aws-sdk-s3') and its version

aws-sdk-core (3.201.3)

Environment details (Version of Ruby, OS environment)

Ruby 3.2

@arianf arianf added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 4, 2024
@mullermp
Copy link
Contributor

mullermp commented Aug 4, 2024

Thanks for opening an issue. I will investigate. Are you able to get it working purely using the SAX parser in nokogiri? If not then I think this may be a bug report for that project. I saw your note about it parsing just fine but that appears to be a different parser (not using SAX)?

@arianf
Copy link
Contributor Author

arianf commented Aug 4, 2024

Hmm your right, I just tried to recreate it using SAX:

class MyDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts "Start element: #{name}"
  end

  def end_element(name)
    puts "End element: #{name}"
  end
end


good_xml_string = <<-XML
<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
<SomeKeys>
  <A>AAA</A>
  <B>BBB</B>
  <C>CCC</C>
</SomeKeys>
XML

Nokogiri::XML::SAX::Parser.new(MyDocument.new).parse(good_xml_string)

# Start element: SomeKeys
# Start element: A
# End element: A
# Start element: B
# End element: B
# Start element: C
# End element: C
# End element: SomeKeys

bad_xml_string = <<-XML
<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
<SomeKeys>
  <A>AAA</A>
  <B>BB\bB</B>
  <C>CCC</C>
</SomeKeys>
XML

Nokogiri::XML::SAX::Parser.new(MyDocument.new).parse(bad_xml_string)

# Start element: SomeKeys
# Start element: A
# End element: A
# Start element: B

And the parser definitely fails after hitting that /b character. I'll open an issue with Nokogiri.

@arianf
Copy link
Contributor Author

arianf commented Aug 4, 2024

Reported here: sparklemotion/nokogiri#3299

@stevecheckoway
Copy link

As I mentioned in that bug report, I think the XML is simply malformed as backspace characters aren't allowed by the standard and libxml2 stopped parsing the input and reported an error when it encountered one.

@mullermp
Copy link
Contributor

mullermp commented Aug 4, 2024

Is the \b character critical in your application? If it's causing issues in parsing, and since s3 is an xml service, perhaps avoid it as a key?

@flavorjones
Copy link

I commented in the Nokogiri issue, but if y'all want to put the SAX parser into recovery mode (or give callers the option to do so), it's possible to set recovery = true on the parser context like this:

p = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
p.parse(bad_xml_string) do |context|
  context.recovery = true
end

In this case, the parser will continue after calling the error callback. By default, however, parse errors are treated as fatal errors.

@mullermp
Copy link
Contributor

mullermp commented Aug 5, 2024

I tried adding recovery and I'm not seeing that work. I started a PR here #3082

@RanVaknin RanVaknin self-assigned this Aug 5, 2024
@RanVaknin RanVaknin removed the needs-triage This issue or PR still needs to be triaged. label Aug 5, 2024
@RanVaknin RanVaknin removed their assignment Aug 5, 2024
@RanVaknin RanVaknin added the p2 This is a standard priority issue label Aug 5, 2024
@flavorjones
Copy link

@mullermp Just to clarify: the recovery option means the parser continues to parse after the error callback is invoked. The error callback is still invoked.

Looking at the test failures it seems like this gem is likely raising an exception when the callback is invoked. Up to you what you want to do there, really.

@mullermp
Copy link
Contributor

mullermp commented Aug 5, 2024

Ah. I see. You are correct. I'm actually not sure I should change this behavior - it doesn't seem consistent with the 5 other parsers. It seems like this is just "bad data" to begin with. Oga, Ox and Rexml seem to handle this case though. LibXML doesn't have a recovery option that I can see.

@mullermp
Copy link
Contributor

mullermp commented Aug 5, 2024

After discussing with the team, I don't think it makes sense to handle this. It's inconsistent across all parsers and also with JRuby's nokogiri. I think it's best (and safest) not to handle this, and keys should not include this character.

@mullermp mullermp closed this as completed Aug 5, 2024
Copy link

github-actions bot commented Aug 5, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

@mullermp mullermp added wontfix We have determined that we will not resolve the issue. and removed bug This issue is a bug. labels Aug 5, 2024
@arianf
Copy link
Contributor Author

arianf commented Aug 6, 2024

Thanks for all the responses, and looking into it. Yea, I agree we should probably be scrubbing illegal XML characters, it's interesting that ActiveStorage also doesn't scrub any control characters except for \u{202E}

https://api.rubyonrails.org/classes/ActiveStorage/Filename.html#method-i-sanitized

ActiveStorage::Filename.new("\b").sanitized
=> "\b"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p2 This is a standard priority issue wontfix We have determined that we will not resolve the issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants