Add DOCTYPE support #27

Open
wants to merge 1 commit into
base: master
5 changes: 4 additions & 1 deletion README.md
@@ -197,6 +197,10 @@ Emitted when a closing tag is parsed. An object containing the `name` of the tag

Emitted when a processing instruction (such as `<? contents ?>`) is parsed. An object with the `contents` of the processing instruction is passed.

#### `documenttypedefinition`

Emitted when a document type definition (such as `<!DOCTYPE contents >`) is parsed. An object with the `contents` of the document type definition is passed.
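A minimal sketch of listening for this event. To keep it self-contained, a plain Node.js `EventEmitter` stands in for a `Saxophone` instance (an assumption for illustration); with the real parser the listener would be registered the same way, and the event would fire while the parser consumes the XML.

```js
const {EventEmitter} = require('events');

// Stand-in for a Saxophone instance (assumption: the real parser emits
// the same event name with the same {contents} payload).
const parser = new EventEmitter();

// Collect the raw contents of every DOCTYPE node that is reported.
const doctypes = [];
parser.on('documenttypedefinition', node => {
    doctypes.push(node.contents);
});

// Simulate what would be emitted for `<!DOCTYPE html>`.
parser.emit('documenttypedefinition', {contents: 'html'});
```

The `contents` string is passed through unparsed, so splitting it into a name, an external ID, and an internal subset is left to the listener.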

#### `text`

Emitted when a text node between two tags is parsed. An object with the `contents` of the text node is passed. You might need to expand XML entities inside the contents of the text node, using `Saxophone.parseEntities`.
@@ -213,7 +217,6 @@ Emitted when a comment (such as `<!-- contents -->`) is parsed. An object with t

Emitted when a parsing error is encountered while reading the XML stream such that the rest of the XML cannot be correctly interpreted:

* when a DOCTYPE node is found (not supported yet);
* when a comment node contains the `--` sequence;
* when opening and closing tags are mismatched or missing;
* when a tag name starts with white space;
75 changes: 74 additions & 1 deletion lib/Saxophone.js
@@ -65,6 +65,22 @@ const {findIndexOutside} = require('./util');
* @type {ProcessingInstructionNode}
*/

/**
* Information about the document type definition node
* (<!DOCTYPE ... >).
*
* @typedef DocumentTypeDefinitionNode
* @type {object}
* @prop {string} contents The definition contents
*/

/**
* Emitted whenever a document type definition node is encountered.
*
* @event Saxophone#documenttypedefinition
* @type {DocumentTypeDefinitionNode}
*/

/**
* Information about an opened tag
* (<tag attr="value">).
@@ -111,6 +127,7 @@ const Node = {
comment: 'comment',
markupDeclaration: 'markupDeclaration',
processingInstruction: 'processinginstruction',
documentTypeDefinition: 'documenttypedefinition',
tagOpen: 'tagopen',
tagClose: 'tagclose',
};
@@ -320,7 +337,63 @@ class Saxophone extends Writable {
continue;
}

// TODO: recognize DOCTYPEs here
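// Match 'DOCTYPE ' or a prefix of it (the rest may arrive in a
// later chunk); a substring such as 'TYPE ' must not match
if (
'DOCTYPE '.startsWith(input.slice(
chunkPos,
chunkPos + 8
))
) {
chunkPos += 8;
let dtdPos = chunkPos;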

// Per the XML spec, the DOCTYPE keyword is followed by the
// root element name, then optionally by an external ID
// (SYSTEM or PUBLIC, with one or two quoted strings) and an
// internal subset in [], and finally by the terminating >.
// We handle whichever of these delimiters comes first.
for (;;) {
let nextDtdPos = -1;
for (const dtdChar of '\'"[>') {
const candidate = input.indexOf(dtdChar, dtdPos);
if (candidate !== -1 && (nextDtdPos === -1 || candidate < nextDtdPos)) {
nextDtdPos = candidate;
}
}
dtdPos = nextDtdPos;

// We are done or need to wait for more data
if (dtdPos === -1 || input[dtdPos] === '>') {
break;
}

// Search for the matching string end '" or ]
dtdPos = input.indexOf(input[dtdPos] === '['
? ']' : input[dtdPos], dtdPos + 1);

if (dtdPos === -1) {
break;
} else {
dtdPos++;
}
}

// Incomplete DTD, we need to wait for upcoming data
if (dtdPos === -1) {
this._wait(
Node.documentTypeDefinition,
input.slice(chunkPos - 10)
);
break;
}

this.emit(
Node.documentTypeDefinition,
{contents: input.slice(chunkPos, dtdPos)}
);

chunkPos = dtdPos + 1;
continue;
}

callback(new Error('Unrecognized sequence: <!' + nextNextChar));
return;
}
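
The DOCTYPE scan in the hunk above can be read as a small standalone routine: repeatedly find the earliest of the delimiters `'`, `"`, `[` and `>`, skip over quoted strings and the bracketed section, and stop at the first `>` outside them. The sketch below illustrates that idea under stated assumptions; `findDoctypeEnd` is a hypothetical helper name, not part of Saxophone, and it works on a complete string rather than on streamed chunks.

```js
/**
 * Find the index of the `>` that terminates a DOCTYPE node.
 * `start` must point just after `<!DOCTYPE `.
 * Returns -1 when the node is incomplete in `input`.
 * (Hypothetical helper for illustration only.)
 */
function findDoctypeEnd(input, start) {
    let pos = start;

    for (;;) {
        // Locate the earliest of the four delimiters from `pos` on
        let next = -1;
        for (const ch of '\'"[>') {
            const found = input.indexOf(ch, pos);
            if (found !== -1 && (next === -1 || found < next)) {
                next = found;
            }
        }

        if (next === -1) return -1;
        if (input[next] === '>') return next;

        // Skip to the matching closing quote or bracket
        const closer = input[next] === '[' ? ']' : input[next];
        const end = input.indexOf(closer, next + 1);
        if (end === -1) return -1;
        pos = end + 1;
    }
}

const xml = '<!DOCTYPE note SYSTEM "a>b.dtd" [ <!ELEMENT note (#PCDATA)> ]><note/>';
const start = '<!DOCTYPE '.length;
const end = findDoctypeEnd(xml, start);
const contents = xml.slice(start, end);
```

A `-1` result corresponds to the case where the parser has to wait for more data. Like the hunk above, the sketch treats the first `]` after `[` as the end of the bracketed section, so a `]` inside a quoted string or comment in that section would end it early.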
104 changes: 97 additions & 7 deletions lib/Saxophone.test.js
@@ -145,6 +145,94 @@ test('should not parse unclosed processing instructions', assert => {
);
});

test('should parse minimal document type definition', assert => {
expectEvents(assert,
'<!DOCTYPE DocType>',
[['documenttypedefinition', {contents: 'DocType'}]]
);
});

test('should parse document type definition variation 1', assert => {
expectEvents(assert,
'<!DOCTYPE DocType SYSTEM "file.dtd">',
[['documenttypedefinition', {contents: 'DocType SYSTEM "file.dtd"'}]]
);
});

test('should parse document type definition variation 2', assert => {
expectEvents(assert,
'<!DOCTYPE DocType SYSTEM \'file.dtd\'>',
[['documenttypedefinition', {contents: 'DocType SYSTEM \'file.dtd\''}]]
);
});

test('should parse document type definition variation 3', assert => {
expectEvents(assert,
'<!DOCTYPE DocType SYSTEM \'file.dtd\' [ any content ] >',
[['documenttypedefinition', {contents: 'DocType SYSTEM \'file.dtd\' [ any content ] '}]]
);
});

test('should parse document type definition variation 4', assert => {
expectEvents(assert,
'<!DOCTYPE DocType PUBLIC "Public Identifier" \'file.dtd\'>',
[['documenttypedefinition', {contents: 'DocType PUBLIC "Public Identifier" \'file.dtd\''}]]
);
});

test('should parse document type definition variation 5', assert => {
expectEvents(assert,
'<!DOCTYPE DocType PUBLIC \'Public Identifier\' "file.dtd" [ any content ]>',
[['documenttypedefinition', {contents: 'DocType PUBLIC \'Public Identifier\' "file.dtd" [ any content ]'}]]
);
});

test('should parse document type definition variation 6', assert => {
expectEvents(assert,
'<!DOCTYPE DocType [ any content ] >',
[['documenttypedefinition', {contents: 'DocType [ any content ] '}]]
);
});

test('should parse complex document type definition', assert => {
expectEvents(assert,
`<!DOCTYPE DocType PUBLIC "Public Identifier" 'file.dtd' [

<!ELEMENT DocType


(#PCDATA)>
<!-- here is a comment space -->

<!ATTLIST DocType attr


CDATA #REQUIRED>


]

>`,
[['documenttypedefinition', {contents: `DocType PUBLIC "Public Identifier" 'file.dtd' [

<!ELEMENT DocType


(#PCDATA)>
<!-- here is a comment space -->

<!ATTLIST DocType attr


CDATA #REQUIRED>


]

`}]]
);
});

test('should parse simple tags', assert => {
expectEvents(assert,
'<tag></tag>',
@@ -180,13 +268,6 @@ test('should not parse unclosed tags 3', assert => {
);
});

test('should not parse DOCTYPEs', assert => {
expectEvents(assert,
'<!DOCTYPE html>',
[['error', new Error('Unrecognized sequence: <!D')]]
);
});

test('should not parse invalid tags', assert => {
expectEvents(assert,
'< invalid>',
@@ -266,6 +347,10 @@ test('should parse a complete document', assert => {
expectEvents(assert,
tags.stripIndent`
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE PersonType [
<!ELEMENT persons (#PCDATA)>
<!ELEMENT person (#PCDATA)>
]>
<persons>
<!-- List of persons -->
<person name="Priscilla Z. Holden" address="320-2518 Taciti Street" />
@@ -275,6 +360,11 @@ test('should parse a complete document', assert => {
`,
[
['processinginstruction', {contents: 'xml version="1.0" encoding="UTF-8" '}],
['text', {contents: '\n'}],
['documenttypedefinition', {contents: `PersonType [
<!ELEMENT persons (#PCDATA)>
<!ELEMENT person (#PCDATA)>
]`}],
['text', {contents: '\n'}],
['tagopen', {name: 'persons', attrs: '', isSelfClosing: false}],
['text', {contents: '\n '}],