Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When XML contains \b or  it breaks the SAX Parser #3299

Closed
arianf opened this issue Aug 4, 2024 · 2 comments
Closed

When XML contains \b or  it breaks the SAX Parser #3299

arianf opened this issue Aug 4, 2024 · 2 comments
Labels
state/needs-triage Inbox for non-installation-related bug reports or help requests

Comments

@arianf
Copy link

arianf commented Aug 4, 2024

Please describe the bug

I reported a bug to aws/aws-sdk-ruby#3081 that I think is actually an issue with Nokogiri.

Basically when an XML contains the backspace character, it seems to cause the SAX parser to early exit

# Nokogiri::VERSION
# => "1.16.7" 
# RUBY_VERSION
#  => "3.2.3" 

class MyDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts "Start element: #{name}"
  end

  def end_element(name)
    puts "End element: #{name}"
  end
end


good_xml_string = <<-XML
<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
<SomeKeys>
  <A>AAA</A>
  <B>BBB</B>
  <C>CCC</C>
</SomeKeys>
XML

Nokogiri::XML::SAX::Parser.new(MyDocument.new).parse(good_xml_string)

# Start element: SomeKeys
# Start element: A
# End element: A
# Start element: B
# End element: B
# Start element: C
# End element: C
# End element: SomeKeys

bad_xml_string = <<-XML
<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
<SomeKeys>
  <A>AAA</A>
  <B>BB\bB</B>
  <C>CCC</C>
</SomeKeys>
XML

Nokogiri::XML::SAX::Parser.new(MyDocument.new).parse(bad_xml_string)

# Start element: SomeKeys
# Start element: A
# End element: A
# Start element: B

Help us reproduce what you're seeing

#! /usr/bin/env ruby

require 'nokogiri'
require 'minitest/autorun'

class MyDocument < Nokogiri::XML::SAX::Document

  attr_reader :cached_values

  def start_element(name, attrs = [])
    puts "Start element: #{name.inspect}"

    @cached_values ||= []
    @cached_values << name
  end

  def end_element(name)
    puts "End element: #{name.inspect}"

    @cached_values ||= []
    @cached_values << name
  end

  def characters(string)
    puts "Characters: #{string.inspect}"

    @cached_values ||= []
    @cached_values << string
  end
end

class Test < Minitest::Spec
  describe "SAX#parse" do
    it "should call each element (works)" do
      xml = <<~XML
        <?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
        <SomeKeys><A>AAA</A><B>BBB</B><C>CCC</C></SomeKeys>
      XML

      my_document = MyDocument.new
      Nokogiri::XML::SAX::Parser.new(my_document).parse(xml)

      assert_equal ['SomeKeys', 'A', 'AAA', 'A', 'B', 'BBB', 'B', 'C', 'CCC', 'C', 'SomeKeys'], my_document.cached_values
    end

    it "should call each element (does not work)" do
      xml = <<~XML
        <?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
        <SomeKeys><A>AAA</A><B>B\bBB</B><C>CCC</C></SomeKeys>
      XML

      my_document = MyDocument.new
      Nokogiri::XML::SAX::Parser.new(my_document).parse(xml)

      assert_equal ['SomeKeys', 'A', 'AAA', 'A', 'B', "B\bBB", 'B', 'C', 'CCC', 'C', 'SomeKeys'], my_document.cached_values
    end
  end
end

Expected behavior

To handle \b or &#8; without early exiting

Environment

# Nokogiri (1.16.7)
    ---
    warnings: []
    nokogiri:
      version: 1.16.7
      cppflags:
      - "-I/Users/arian/.rvm/gems/ruby-3.2.3/gems/nokogiri-1.16.7-arm64-darwin/ext/nokogiri"
      - "-I/Users/arian/.rvm/gems/ruby-3.2.3/gems/nokogiri-1.16.7-arm64-darwin/ext/nokogiri/include"
      - "-I/Users/arian/.rvm/gems/ruby-3.2.3/gems/nokogiri-1.16.7-arm64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.2.3
      platform: arm64-darwin23
      gem_platform: arm64-darwin-23
      description: ruby 3.2.3 (2024-01-18 revision 52bb2ac0a6) [arm64-darwin23]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      - 0010-update-config.guess-and-config.sub-for-libxml2.patch
      - 0011-rip-out-libxml2-s-libc_single_threaded-support.patch
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.12.9
      loaded: 2.12.9
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-config.guess-and-config.sub-for-libxslt.patch
      datetime_enabled: true
      compiled: 1.1.39
      loaded: 1.1.39
    other_libraries:
      zlib: 1.3.1
      libiconv: '1.17'
      libgumbo: 1.0.0-nokogiri
@arianf arianf added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Aug 4, 2024
@stevecheckoway
Copy link
Contributor

Backspace characters aren't allowed in well-formed XML documents.

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

I think libxml is doing the right thing here according to the standard.

Indeed, if you add an error method, you can see that libxml has reported the backspace character as invalid

#! /usr/bin/env ruby

require 'nokogiri'
require 'minitest/autorun'

class MyDocument < Nokogiri::XML::SAX::Document

  attr_reader :cached_values

  def start_element(name, attrs = [])
    puts "Start element: #{name.inspect}"

    @cached_values ||= []
    @cached_values << name
  end

  def end_element(name)
    puts "End element: #{name.inspect}"

    @cached_values ||= []
    @cached_values << name
  end

  def characters(string)
    puts "Characters: #{string.inspect}"

    @cached_values ||= []
    @cached_values << string
  end

  def error(err)
    puts "Error: #{err}"
  end
end

class Test < Minitest::Spec
  describe "SAX#parse" do
    it "should call each element (works)" do
      xml = <<~XML
        <?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
        <SomeKeys><A>AAA</A><B>BBB</B><C>CCC</C></SomeKeys>
      XML

      my_document = MyDocument.new
      Nokogiri::XML::SAX::Parser.new(my_document).parse(xml)

      assert_equal ['SomeKeys', 'A', 'AAA', 'A', 'B', 'BBB', 'B', 'C', 'CCC', 'C', 'SomeKeys'], my_document.cached_values
    end

    it "should call each element (does not work)" do
      xml = <<~XML
        <?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
        <SomeKeys><A>AAA</A><B>B\bBB</B><C>CCC</C></SomeKeys>
      XML

      my_document = MyDocument.new
      Nokogiri::XML::SAX::Parser.new(my_document).parse(xml)

      assert_equal ['SomeKeys', 'A', 'AAA', 'A', 'B', "B\bBB", 'B', 'C', 'CCC', 'C', 'SomeKeys'], my_document.cached_values
    end
  end
end

This prints

Error: PCDATA invalid Char value 8

I'm not an expert on XML, so it's possible I'm missing something.

@flavorjones
Copy link
Member

Everything @stevecheckoway said is correct.

You can put the SAX parser into "recovery mode" and keep parsing after errors by setting recovery = true on the parser context like this:

p = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
p.parse(bad_xml_string) do |context|
  context.recovery = true
end

In which case the output in your original example is:

Start element: SomeKeys
Start element: A
End element: A
Start element: B
End element: B
Start element: C
End element: C
End element: SomeKeys

I hope this helps! Let me know if you've got more questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state/needs-triage Inbox for non-installation-related bug reports or help requests
Projects
None yet
Development

No branches or pull requests

3 participants