When XML contains \b
or 
it breaks the SAX Parser
#3299
Comments
Backspace characters aren't allowed in well-formed XML documents.
I think libxml is doing the right thing here according to the standard. Indeed, if you add an error handler to the document class:

#! /usr/bin/env ruby

require 'nokogiri'
require 'minitest/autorun'

# SAX handler that records every callback it receives, in order.
class MyDocument < Nokogiri::XML::SAX::Document
  attr_reader :cached_values

  def start_element(name, attrs = [])
    puts "Start element: #{name.inspect}"
    @cached_values ||= []
    @cached_values << name
  end

  def end_element(name)
    puts "End element: #{name.inspect}"
    @cached_values ||= []
    @cached_values << name
  end

  def characters(string)
    puts "Characters: #{string.inspect}"
    @cached_values ||= []
    @cached_values << string
  end

  # Called by the parser when it hits a well-formedness error.
  def error(err)
    puts "Error: #{err}"
  end
end

class Test < Minitest::Spec
  describe "SAX#parse" do
    it "should call each element (works)" do
      xml = <<~XML
        <?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
        <SomeKeys><A>AAA</A><B>BBB</B><C>CCC</C></SomeKeys>
      XML
      my_document = MyDocument.new
      Nokogiri::XML::SAX::Parser.new(my_document).parse(xml)
      assert_equal ['SomeKeys', 'A', 'AAA', 'A', 'B', 'BBB', 'B', 'C', 'CCC', 'C', 'SomeKeys'], my_document.cached_values
    end

    it "should call each element (does not work)" do
      # The heredoc interpolates escapes, so \b below is a literal backspace (U+0008).
      xml = <<~XML
        <?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n
        <SomeKeys><A>AAA</A><B>B\bBB</B><C>CCC</C></SomeKeys>
      XML
      my_document = MyDocument.new
      Nokogiri::XML::SAX::Parser.new(my_document).parse(xml)
      assert_equal ['SomeKeys', 'A', 'AAA', 'A', 'B', "B\bBB", 'B', 'C', 'CCC', 'C', 'SomeKeys'], my_document.cached_values
    end
  end
end

This prints
I'm not an expert on XML, so it's possible I'm missing something.
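To make the well-formedness rule above concrete, here is a minimal sketch that lists the characters in a string falling outside the XML 1.0 "Char" production. The constant `INVALID_XML_CHAR` and the helper `invalid_xml_chars` are hypothetical names used for illustration only; they are not part of Nokogiri or libxml2, and the sketch assumes the input is a valid UTF-8 string.

```ruby
# Matches any character outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHAR = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/

# Hypothetical helper (not a Nokogiri API): returns [index, char] pairs
# for every character that XML 1.0 does not allow.
def invalid_xml_chars(str)
  str.each_char.with_index
     .select { |char, _index| char.match?(INVALID_XML_CHAR) }
     .map { |char, index| [index, char] }
end

p invalid_xml_chars("<B>BBB</B>")     # => []
p invalid_xml_chars("<B>B\bBB</B>")   # => [[4, "\b"]]
```

Running a check like this on the input before parsing shows where the backspace sits, which is the position the parser's error refers to.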
Everything @stevecheckoway said is correct. You can put the SAX parser into "recovery mode" and keep parsing after errors by setting recovery on the parser context:

p = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
p.parse(bad_xml_string) do |context|
  context.recovery = true
end

In which case the output in your original example is:
I hope this helps! Let me know if you've got more questions.
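As an alternative to recovery mode that was not proposed in this thread, the input could be sanitized before it reaches the parser. The sketch below drops every character outside the XML 1.0 "Char" production. `strip_invalid_xml_chars` is a hypothetical helper, not a Nokogiri API, and removing characters changes the data, so this only fits cases where losing the backspace is acceptable.

```ruby
# Any character outside the XML 1.0 "Char" production.
INVALID_XML_CHAR = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/

# Hypothetical helper (not a Nokogiri API): returns a copy of the string
# with all characters that XML 1.0 disallows removed.
def strip_invalid_xml_chars(str)
  str.gsub(INVALID_XML_CHAR, "")
end

xml = "<SomeKeys><B>B\bBB</B></SomeKeys>"
puts strip_invalid_xml_chars(xml)   # => <SomeKeys><B>BBB</B></SomeKeys>
```

The cleaned string can then be passed to `Nokogiri::XML::SAX::Parser#parse` as in the examples above. Recovery mode keeps the original bytes and reports the error, while this approach silently alters the content.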
Please describe the bug
I reported a bug to aws/aws-sdk-ruby#3081 that I think is actually an issue with Nokogiri.
Basically, when an XML document contains the backspace character, it seems to cause the SAX parser to exit early.
Help us reproduce what you're seeing
Expected behavior
To handle
\b
or
without exiting early
Environment