How to get only text from parsed document not the markdown code #975

robopzet · 2023-04-20T12:40:02Z

I want to get only the text from a document. The nodes of a parsed document contain Text nodes, but they expose the original text, including markup. Is there a way to get more basic tokens, for example for a link get a node with the URL and one with the content?

This is how the text is parsed, based on the docs:

$parser = new MarkdownParser(new Environment());
$doc    = $parser->parse($this->getBody());

foreach ($doc->iterator() as $nod) {
    if ($nod instanceof Text) {
        echo 'Node: ' . get_class($nod) . ': ' . $nod->getLiteral() . "\n";
    }
}

Answered by colinodell

colinodell · 2023-04-20T13:26:05Z

Maintainer

The new Environment() in your code isn't configured with any parsers or extensions, so the engine doesn't know how to parse those tokens and treats everything it sees as plain text. Try adding the CommonMarkCoreExtension:

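use League\CommonMark\Environment\Environment;
use League\CommonMark\Extension\CommonMark\CommonMarkCoreExtension;
use League\CommonMark\Node\Inline\Text;
use League\CommonMark\Parser\MarkdownParser;

// Register the core CommonMark parsers so links, emphasis, etc. are recognized
$environment = new Environment();
$environment->addExtension(new CommonMarkCoreExtension());

$parser = new MarkdownParser($environment);
$doc    = $parser->parse($this->getBody());

// Text nodes now hold only the plain text, without the surrounding markup
foreach ($doc->iterator() as $nod) {
    if ($nod instanceof Text) {
        echo 'Node: ' . get_class($nod) . ': ' . $nod->getLiteral() . "\n";
    }
}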
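
For the link case raised in the question, here is a rough sketch of how the same loop could collect the plain text and the link URLs separately. It assumes league/commonmark 2.x installed via Composer (hence the vendor/autoload.php path); the sample input string and the $text and $urls variable names are made up for illustration.

```php
<?php
// Assumption: league/commonmark 2.x is installed via Composer in ./vendor
require 'vendor/autoload.php';

use League\CommonMark\Environment\Environment;
use League\CommonMark\Extension\CommonMark\CommonMarkCoreExtension;
use League\CommonMark\Extension\CommonMark\Node\Inline\Link;
use League\CommonMark\Node\Inline\Text;
use League\CommonMark\Parser\MarkdownParser;

$environment = new Environment();
$environment->addExtension(new CommonMarkCoreExtension());

$parser = new MarkdownParser($environment);

// Hypothetical sample input, for illustration only
$doc = $parser->parse('Read the [docs](https://example.com) for **details**.');

$text = '';   // concatenated literals of all Text nodes
$urls = [];   // destination URLs of all Link nodes

foreach ($doc->iterator() as $node) {
    if ($node instanceof Text) {
        // The link label and the bold word are themselves Text nodes,
        // nested inside the Link and Strong nodes
        $text .= $node->getLiteral();
    } elseif ($node instanceof Link) {
        $urls[] = $node->getUrl();
    }
}

echo $text . "\n";
print_r($urls);
```

The link's label arrives as an ordinary Text child of the Link node, so the URL and the visible content end up in separate nodes, which is what the question was after.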

0 replies

robopzet · 2023-04-21T07:14:40Z

Author

@colinodell OK, I see. I thought the default configuration would treat the input as Markdown.

1 reply

colinodell Apr 21, 2023
Maintainer

Ah, gotcha. It works that way because parsers can only be added, not removed, and some people might not want to support things like blockquotes or lists in their applications. (They might instead use something like the inlines-only extension: https://commonmark.thephpleague.com/2.4/extensions/inlines-only/#usage)
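
To illustrate the point about opting in to parsers, here is a minimal sketch that registers only the inlines-only extension instead of the core one. It assumes league/commonmark 2.x installed via Composer; the sample input and the $hasBlockQuote and $strongCount variables are hypothetical and just there to show what gets parsed.

```php
<?php
// Assumption: league/commonmark 2.x is installed via Composer in ./vendor
require 'vendor/autoload.php';

use League\CommonMark\Environment\Environment;
use League\CommonMark\Extension\CommonMark\Node\Block\BlockQuote;
use League\CommonMark\Extension\CommonMark\Node\Inline\Strong;
use League\CommonMark\Extension\InlinesOnly\InlinesOnlyExtension;
use League\CommonMark\Parser\MarkdownParser;

// Only inline syntax (emphasis, links, code spans, ...) is registered;
// block-level syntax such as blockquotes and lists is not
$environment = new Environment();
$environment->addExtension(new InlinesOnlyExtension());

$parser = new MarkdownParser($environment);

// Hypothetical sample input: a blockquote marker followed by bold text
$doc = $parser->parse('> a **bold** word');

$hasBlockQuote = false;
$strongCount   = 0;

foreach ($doc->iterator() as $node) {
    if ($node instanceof BlockQuote) {
        $hasBlockQuote = true;
    } elseif ($node instanceof Strong) {
        $strongCount++;
    }
}

var_dump($hasBlockQuote); // no blockquote parser was registered
var_dump($strongCount);   // emphasis is still handled as inline syntax
```

With this setup the leading ">" should stay in the paragraph as ordinary text, while the bold span is still turned into a Strong node.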
