Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Select inside <noscript> #123

Open
pitdicker opened this issue Mar 19, 2023 · 9 comments
Open

Select inside <noscript> #123

pitdicker opened this issue Mar 19, 2023 · 9 comments
Labels
C-feature-request Category: feature request

Comments

@pitdicker
Copy link

Elements within a <noscript> tag seem to be ignored by default. Is there a workaround?

Example

use scraper::{Html, Selector};

fn main() {
    let fragment = Html::parse_fragment("<noscript><h1>Hello, world!</h1></noscript>");
    let selector = Selector::parse("h1").unwrap();

    let h1 = fragment.select(&selector).next().unwrap();

    assert_eq!("<h1>Hello, world!</h1>", h1.html());
}

current output:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:7:48
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
@oovm
Copy link

oovm commented Apr 18, 2023

The parsed result is:

Html {
    errors: [],
    quirks_mode: NoQuirks,
    tree: Tree { Fragment => { Element(<html>) => { Element(<noscript>) => { Text(Tendril<UTF8>(owned: "<h1>Hello, world!</h1>")) } } } },
}

It seems that the content in the script will be treated as Text, even though it is legal html

@teymour-aldridge
Copy link
Collaborator

Might this be an issue in how we configure html5ever?

@nathaniel-daniel
Copy link

https://docs.rs/html5ever/0.26.0/html5ever/tree_builder/struct.TreeBuilderOpts.html#structfield.scripting_enabled, probably. The parse_* functions would need to take an extra argument though.

@LoZack19
Copy link
Contributor

LoZack19 commented Sep 1, 2024

The scripting_enabled flag doesn't solve this issue. At the moment, it is unclear what it does. I don't necessarely see this as a problem, considering that most <noscript> tags just enclose "please enable js" or something like that, which isn't much useful to scrape at all. Anyways, this code will do the job:

use scraper::{Html, Selector};

fn main() {
    // Parse html
    let fragment = Html::parse_fragment("<noscript><h1>Hello, world!</h1></noscript>");

    // Extract noscript inner html
    let noscript_sel = Selector::parse("noscript").unwrap();
    let noscripts: Vec<Html> = fragment
        .select(&noscript_sel)
        .filter_map(|elem| {
            elem.html()
                .strip_prefix("<noscript>")
                .and_then(|str| str.strip_suffix("</noscript>"))
                .map(String::from)
        })
        .map(|frag| Html::parse_fragment(&frag))
        .collect();

    let noscript = noscripts.get(0).expect("<noscript> tag wasn't found");

    // Extract h1 from it
    let h1_sel = Selector::parse("h1").unwrap();
    let h1 = noscript.select(&h1_sel).next().unwrap();

    // Prove that it works
    assert_eq!("<h1>Hello, world!</h1>", h1.html());
}

EDIT: For this solution to work, it is needed to have scripting enabled, otherwise the html method will interpret the tags inside the <noscript> as text. Thus we can choose whether we want to enable it or not.

@LoZack19 LoZack19 added C-wontfix C-feature-request Category: feature request and removed C-wontfix labels Sep 1, 2024
@nathaniel-daniel
Copy link

nathaniel-daniel commented Sep 5, 2024

The scripting_enabled flag doesn't solve this issue. At the moment, it is unclear what it does. I don't necessarely see this as a problem, considering that most <noscript> tags just enclose "please enable js" or something like that, which isn't much useful to scrape at all. Anyways, this code will do the job:

use scraper::{Html, Selector};

fn main() {
    // Parse html
    let fragment = Html::parse_fragment("<noscript><h1>Hello, world!</h1></noscript>");

    // Extract noscript inner html
    let noscript_sel = Selector::parse("noscript").unwrap();
    let noscripts: Vec<Html> = fragment
        .select(&noscript_sel)
        .filter_map(|elem| {
            elem.html()
                .strip_prefix("<noscript>")
                .and_then(|str| str.strip_suffix("</noscript>"))
                .map(String::from)
        })
        .map(|frag| Html::parse_fragment(&frag))
        .collect();

    let noscript = noscripts.get(0).expect("<noscript> tag wasn't found");

    // Extract h1 from it
    let h1_sel = Selector::parse("h1").unwrap();
    let h1 = noscript.select(&h1_sel).next().unwrap();

    // Prove that it works
    assert_eq!("<h1>Hello, world!</h1>", h1.html());
}

EDIT: For this solution to work, it is needed to have scripting enabled, otherwise the html method will interpret the tags inside the <noscript> as text. Thus we can choose whether we want to enable it or not.

Disabling the scripting_enabled flag definitely fixes the given example (just tested), causing a tree to be built for the contents of the noscript node instead of treating it as text.

Also, your example doesn't seem to work, as it escapes the html. Instead of trying to serialize the node to html and extract the inner text, you should just probably just extract the text directly.

I still think it would be nice to have a config option for disabling scripting, as I'd imagine there would be some cases where parse_fragment would fail while parsing the elements in the context of the page would succeed.

@LoZack19
Copy link
Contributor

LoZack19 commented Sep 5, 2024

Disabling the scripting_enabled flag definitely fixes the given example (just tested), causing a tree to be built for the contents of the noscript node instead of treating it as text.

Interesting, I must have done something wrong, since it didn't work when I first tried it. Thanks for the contribution.

Also, your example doesn't seem to work, as it escapes the html. Instead of trying to serialize the node to html and extract the inner text, you should just probably just extract the text directly.

My example does work if you enable scripting, but if you say that by doing this, you can directly parse the content inside the <noscript> into the tree, that is pretty pointless. I wasn't able to replicate that behavior just by toggling the flag though.

@nathaniel-daniel
Copy link

My example does work if you enable scripting, but if you say that by doing this, you can directly parse the content inside the into the tree, that is pretty pointless. I wasn't able to replicate that behavior just by toggling the flag though.

With your example, I get:

thread 'main' panicked at src/main.rs:24:46:
called `Option::unwrap()` on a `None` value

Cargo.toml:

[package]
name = "scraper-test"
version = "0.1.0"
edition = "2021"

[dependencies]
scraper = "0.20.0"

Changing .map(|frag| Html::parse_fragment(&frag)) to .map(|frag| Html::parse_fragment(dbg!(&frag))) shows:

[src/main.rs:17:42] &frag = "&lt;h1&gt;Hello, world!&lt;/h1&gt;"

The html function appears to escape the inner text. I think you meant to access the inner text directly:

fn main() {
    let fragment = Html::parse_document("<noscript><h1>Hello, world!</h1>");
    let noscript_sel = Selector::parse("noscript").unwrap();
    let noscripts: Vec<Html> = fragment
        .select(&noscript_sel)
        .filter_map(|elem| {
           elem.text().next()
        })
        .map(|frag| Html::parse_fragment(&frag))
        .collect();
    
    let noscript = noscripts.get(0).expect("<noscript> tag wasn't found");
    
    // Extract h1 from it
    let h1_sel = Selector::parse("h1").unwrap();
    let h1 = noscript.select(&h1_sel).next().unwrap();

    // Prove that it works
    assert_eq!("<h1>Hello, world!</h1>", h1.html());
}

But yeah, I agree that adding a way to change the scripting_enabled flag is the ideal solution.

@LoZack19
Copy link
Contributor

LoZack19 commented Sep 7, 2024

I want to restate that my code works only if scripting is enabled, which is not the case if you used the default version of scraper which you provided in the Cargo.toml.

I agree that you code works even without this hypothesis, since it extracts raw text.

I agree that enabling scripting allows to handle this case better. When I tried, it did not parse the html inside the noscript directly inside the tree, though. But maybe I did something wrong. We could enable this html5ever feature, since I don't see any downside in doing so.

@nathaniel-daniel
Copy link

I want to restate that my code works only if scripting is enabled, which is not the case if you used the default version of scraper which you provided in the Cargo.toml.

That shouldn't be the case. html5ever sets this to true by default, and scraper just uses the default settings. In fact, all I had to do to get this working is disable the flag:

pub fn parse_document(document: &str) -> Self {
    let parser = driver::parse_document(
        Self::new_document(),
        driver::ParseOpts {
            tree_builder: TreeBuilderOpts {
                scripting_enabled: false,
                ..Default::default()
            },
            ..Default::default()
        },
    );
    parser.one(document)
}

/// Parses a string of HTML as a fragment.
pub fn parse_fragment(fragment: &str) -> Self {
    let parser = driver::parse_fragment(
        Self::new_fragment(),
         driver::ParseOpts {
            tree_builder: TreeBuilderOpts {
                scripting_enabled: false,
                ..Default::default()
            },
            ..Default::default()
        },
        QualName::new(None, ns!(html), local_name!("body")),
        Vec::new(),
    );
    parser.one(fragment)
}

Furthermore, I don't really understand how enabling scripting changes the output format of the html method.

I agree that enabling scripting allows to handle this case better. When I tried, it did not parse the html inside the noscript directly inside the tree, though. But maybe I did something wrong. We could enable this html5ever feature, since I don't see any downside in doing so.

Yeah, I guess that might be easier than adding an option to control it. Is there ever a scenario where someone would want to treat the interior of a <noscript> as text?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature-request Category: feature request
Projects
None yet
Development

No branches or pull requests

5 participants