Select inside <noscript> #123

pitdicker · 2023-03-19T18:24:55Z

Elements within a <noscript> tag seem to be ignored by default. Is there a workaround?

Example

use scraper::{Html, Selector};

fn main() {
    let fragment = Html::parse_fragment("<noscript><h1>Hello, world!</h1></noscript>");
    let selector = Selector::parse("h1").unwrap();

    let h1 = fragment.select(&selector).next().unwrap();

    assert_eq!("<h1>Hello, world!</h1>", h1.html());
}

current output:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:7:48
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The text was updated successfully, but these errors were encountered:

oovm · 2023-04-18T12:55:49Z

The parsed result is:

Html {
    errors: [],
    quirks_mode: NoQuirks,
    tree: Tree { Fragment => { Element(<html>) => { Element(<noscript>) => { Text(Tendril<UTF8>(owned: "<h1>Hello, world!</h1>")) } } } },
}

It seems that the content in the script will be treated as Text, even though it is legal html

teymour-aldridge · 2023-04-18T13:06:57Z

Might this be an issue in how we configure html5ever?

nathaniel-daniel · 2023-09-27T02:06:32Z

https://docs.rs/html5ever/0.26.0/html5ever/tree_builder/struct.TreeBuilderOpts.html#structfield.scripting_enabled, probably. The parse_* functions would need to take an extra argument though.

LoZack19 · 2024-09-01T14:31:01Z

~~The scripting_enabled flag doesn't solve this issue. At the moment, it is unclear what it does.~~ I don't necessarely see this as a problem, considering that most <noscript> tags just enclose "please enable js" or something like that, which isn't much useful to scrape at all. Anyways, this code will do the job:

use scraper::{Html, Selector};

fn main() {
    // Parse html
    let fragment = Html::parse_fragment("<noscript><h1>Hello, world!</h1></noscript>");

    // Extract noscript inner html
    let noscript_sel = Selector::parse("noscript").unwrap();
    let noscripts: Vec<Html> = fragment
        .select(&noscript_sel)
        .filter_map(|elem| {
            elem.html()
                .strip_prefix("<noscript>")
                .and_then(|str| str.strip_suffix("</noscript>"))
                .map(String::from)
        })
        .map(|frag| Html::parse_fragment(&frag))
        .collect();

    let noscript = noscripts.get(0).expect("<noscript> tag wasn't found");

    // Extract h1 from it
    let h1_sel = Selector::parse("h1").unwrap();
    let h1 = noscript.select(&h1_sel).next().unwrap();

    // Prove that it works
    assert_eq!("<h1>Hello, world!</h1>", h1.html());
}

EDIT: For this solution to work, it is needed to have scripting enabled, otherwise the html method will interpret the tags inside the <noscript> as text. Thus we can choose whether we want to enable it or not.

nathaniel-daniel · 2024-09-05T18:48:50Z

~~The scripting_enabled flag doesn't solve this issue. At the moment, it is unclear what it does.~~ I don't necessarely see this as a problem, considering that most <noscript> tags just enclose "please enable js" or something like that, which isn't much useful to scrape at all. Anyways, this code will do the job:

use scraper::{Html, Selector};

fn main() {
    // Parse html
    let fragment = Html::parse_fragment("<noscript><h1>Hello, world!</h1></noscript>");

    // Extract noscript inner html
    let noscript_sel = Selector::parse("noscript").unwrap();
    let noscripts: Vec<Html> = fragment
        .select(&noscript_sel)
        .filter_map(|elem| {
            elem.html()
                .strip_prefix("<noscript>")
                .and_then(|str| str.strip_suffix("</noscript>"))
                .map(String::from)
        })
        .map(|frag| Html::parse_fragment(&frag))
        .collect();

    let noscript = noscripts.get(0).expect("<noscript> tag wasn't found");

    // Extract h1 from it
    let h1_sel = Selector::parse("h1").unwrap();
    let h1 = noscript.select(&h1_sel).next().unwrap();

    // Prove that it works
    assert_eq!("<h1>Hello, world!</h1>", h1.html());
}

EDIT: For this solution to work, it is needed to have scripting enabled, otherwise the html method will interpret the tags inside the <noscript> as text. Thus we can choose whether we want to enable it or not.

Disabling the scripting_enabled flag definitely fixes the given example (just tested), causing a tree to be built for the contents of the noscript node instead of treating it as text.

Also, your example doesn't seem to work, as it escapes the html. Instead of trying to serialize the node to html and extract the inner text, you should just probably just extract the text directly.

I still think it would be nice to have a config option for disabling scripting, as I'd imagine there would be some cases where parse_fragment would fail while parsing the elements in the context of the page would succeed.

LoZack19 · 2024-09-05T19:06:39Z

Disabling the scripting_enabled flag definitely fixes the given example (just tested), causing a tree to be built for the contents of the noscript node instead of treating it as text.

Interesting, I must have done something wrong, since it didn't work when I first tried it. Thanks for the contribution.

Also, your example doesn't seem to work, as it escapes the html. Instead of trying to serialize the node to html and extract the inner text, you should just probably just extract the text directly.

My example does work if you enable scripting, but if you say that by doing this, you can directly parse the content inside the <noscript> into the tree, that is pretty pointless. I wasn't able to replicate that behavior just by toggling the flag though.

nathaniel-daniel · 2024-09-05T19:16:27Z

My example does work if you enable scripting, but if you say that by doing this, you can directly parse the content inside the into the tree, that is pretty pointless. I wasn't able to replicate that behavior just by toggling the flag though.

With your example, I get:

thread 'main' panicked at src/main.rs:24:46:
called `Option::unwrap()` on a `None` value

Cargo.toml:

[package]
name = "scraper-test"
version = "0.1.0"
edition = "2021"

[dependencies]
scraper = "0.20.0"

Changing .map(|frag| Html::parse_fragment(&frag)) to .map(|frag| Html::parse_fragment(dbg!(&frag))) shows:

[src/main.rs:17:42] &frag = "&lt;h1&gt;Hello, world!&lt;/h1&gt;"

The html function appears to escape the inner text. I think you meant to access the inner text directly:

fn main() {
    let fragment = Html::parse_document("<noscript><h1>Hello, world!</h1>");
    let noscript_sel = Selector::parse("noscript").unwrap();
    let noscripts: Vec<Html> = fragment
        .select(&noscript_sel)
        .filter_map(|elem| {
           elem.text().next()
        })
        .map(|frag| Html::parse_fragment(&frag))
        .collect();
    
    let noscript = noscripts.get(0).expect("<noscript> tag wasn't found");
    
    // Extract h1 from it
    let h1_sel = Selector::parse("h1").unwrap();
    let h1 = noscript.select(&h1_sel).next().unwrap();

    // Prove that it works
    assert_eq!("<h1>Hello, world!</h1>", h1.html());
}

But yeah, I agree that adding a way to change the scripting_enabled flag is the ideal solution.

LoZack19 · 2024-09-07T11:31:00Z

I want to restate that my code works only if scripting is enabled, which is not the case if you used the default version of scraper which you provided in the Cargo.toml.

I agree that you code works even without this hypothesis, since it extracts raw text.

I agree that enabling scripting allows to handle this case better. When I tried, it did not parse the html inside the noscript directly inside the tree, though. But maybe I did something wrong. We could enable this html5ever feature, since I don't see any downside in doing so.

nathaniel-daniel · 2024-09-08T00:55:10Z

I want to restate that my code works only if scripting is enabled, which is not the case if you used the default version of scraper which you provided in the Cargo.toml.

That shouldn't be the case. html5ever sets this to true by default, and scraper just uses the default settings. In fact, all I had to do to get this working is disable the flag:

pub fn parse_document(document: &str) -> Self {
    let parser = driver::parse_document(
        Self::new_document(),
        driver::ParseOpts {
            tree_builder: TreeBuilderOpts {
                scripting_enabled: false,
                ..Default::default()
            },
            ..Default::default()
        },
    );
    parser.one(document)
}

/// Parses a string of HTML as a fragment.
pub fn parse_fragment(fragment: &str) -> Self {
    let parser = driver::parse_fragment(
        Self::new_fragment(),
         driver::ParseOpts {
            tree_builder: TreeBuilderOpts {
                scripting_enabled: false,
                ..Default::default()
            },
            ..Default::default()
        },
        QualName::new(None, ns!(html), local_name!("body")),
        Vec::new(),
    );
    parser.one(fragment)
}

Furthermore, I don't really understand how enabling scripting changes the output format of the html method.

I agree that enabling scripting allows to handle this case better. When I tried, it did not parse the html inside the noscript directly inside the tree, though. But maybe I did something wrong. We could enable this html5ever feature, since I don't see any downside in doing so.

Yeah, I guess that might be easier than adding an option to control it. Is there ever a scenario where someone would want to treat the interior of a <noscript> as text?

LoZack19 added C-wontfix C-feature-request Category: feature request and removed C-wontfix labels Sep 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Select inside <noscript> #123

Select inside <noscript> #123

pitdicker commented Mar 19, 2023

oovm commented Apr 18, 2023

teymour-aldridge commented Apr 18, 2023

nathaniel-daniel commented Sep 27, 2023

LoZack19 commented Sep 1, 2024 •

edited

Loading

nathaniel-daniel commented Sep 5, 2024 •

edited

Loading

LoZack19 commented Sep 5, 2024

nathaniel-daniel commented Sep 5, 2024

LoZack19 commented Sep 7, 2024 •

edited

Loading

nathaniel-daniel commented Sep 8, 2024

Select inside <noscript> #123

Select inside <noscript> #123

Comments

pitdicker commented Mar 19, 2023

Example

oovm commented Apr 18, 2023

teymour-aldridge commented Apr 18, 2023

nathaniel-daniel commented Sep 27, 2023

LoZack19 commented Sep 1, 2024 • edited Loading

nathaniel-daniel commented Sep 5, 2024 • edited Loading

LoZack19 commented Sep 5, 2024

nathaniel-daniel commented Sep 5, 2024

LoZack19 commented Sep 7, 2024 • edited Loading

nathaniel-daniel commented Sep 8, 2024

LoZack19 commented Sep 1, 2024 •

edited

Loading

nathaniel-daniel commented Sep 5, 2024 •

edited

Loading

LoZack19 commented Sep 7, 2024 •

edited

Loading