Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait in click() #413

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# rvest (development version)

* New example vignette displays the same starwars data but rendered dynamically using JS, so you need to use `read_html_live()` to get the data.
* The `click()` method for `LiveHTML` objects gains a `new_page` argument to deal with situations where a click loads a new web page (@jonthegeek, #405).

# rvest 1.0.4

Expand Down
42 changes: 37 additions & 5 deletions R/live.R
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,7 @@ LiveHTML <- R6::R6Class(
self$session$Page$navigate(url, wait_ = FALSE)
self$session$wait_for(p)

private$root_id <- self$session$DOM$getDocument(0)$root$nodeId
private$refresh_root()
},

#' @description Called when `print()`ed
Expand Down Expand Up @@ -129,10 +129,21 @@ LiveHTML <- R6::R6Class(
#' @description Simulate a click on an HTML element.
#' @param css CSS selector or xpath expression.
#' @param n_clicks Number of clicks
click = function(css, n_clicks = 1) {
#' @param new_page Whether to wait for a new page to load, such as after
#' clicking a link.
click = function(css, n_clicks = 1, new_page = FALSE) {
private$check_active()
check_number_whole(n_clicks, min = 1)

# Wait for new page, #405.
if (new_page) {
p <- self$session$Page$loadEventFired(wait_ = FALSE)
on.exit({
self$session$wait_for(p)
private$refresh_root()
}, add = TRUE)
}

# Implementation based on puppeteer as described in
# https://medium.com/@aslushnikov/automating-clicks-in-chromium-a50e7f01d3fb
# With code from https://github.com/puppeteer/puppeteer/blob/b53de4e0942e93c/packages/puppeteer-core/src/cdp/Input.ts#L431-L459
Expand Down Expand Up @@ -170,6 +181,7 @@ LiveHTML <- R6::R6Class(
button = "left"
)
}

invisible(self)
},

Expand Down Expand Up @@ -224,6 +236,7 @@ LiveHTML <- R6::R6Class(
deltaX = left,
deltaY = top
)

invisible(self)
},

Expand Down Expand Up @@ -268,14 +281,14 @@ LiveHTML <- R6::R6Class(
if (new_chromote && !self$session$is_active()) {
suppressMessages({
self$session <- self$session$respawn()
private$root_id <- self$session$DOM$getDocument(0)$root$nodeId
private$refresh_root()
})
}
},

wait_for_selector = function(css, timeout = 5) {
done <- now() + timeout
while(now() < done) {
while (now() < done) {
nodes <- private$find_nodes(css)
if (length(nodes) > 0) {
return(nodes)
Expand All @@ -289,7 +302,22 @@ LiveHTML <- R6::R6Class(
find_nodes = function(css, xpath) {
check_exclusive(css, xpath)
if (!missing(css)) {
unlist(self$session$DOM$querySelectorAll(private$root_id, css)$nodeIds)
node_ids <- try_fetch(
self$session$DOM$querySelectorAll(private$root_id, css)$nodeIds,
error = function(cnd) {
if (grepl("-32000", cnd_message(cnd))) {
cli::cli_abort(
c(
"Can't find root node.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather make refresh_root() public and then suggest that the user call it. But I'd also want to be 100% sure we shouldn't just call refresh_root() on their behalf.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is it might not be ready yet. I feel like there has to be a way to essentially wait for the loadEventFired only if the page is navigating or reloading, but I couldn't get there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should always be waiting on the loadEventFired so we can update the root node.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did put refresh_root() into loadEventFired as a callback at one point, and that's probably worth re-implementing. The problem occurs when the user tries to do something that results in a find_nodes() after a click, but before everything has processed. If the click results in a loadEventFired() and we don't wait before find_nodes(), the root will be wrong (which triggered a thought, see below). But if we do wait when they click somewhere that doesn't navigate/reload, the loadEventFired() never occurs and times out, so we make them wait for nothing (and we throw a confusing warning or error).

But... something knows that the page is in an invalid state, because there's an error when you try to find_nodes() too soon. So in theory at THAT point waiting on loadEventFired() should work. I feel like that's getting into the issues that are discussed in https://rstudio.github.io/chromote/#loading-a-page-reliably though, so that might not solve it.

Perhaps create the promise right away on click (and anything else that might change the page), but only wait for it to resolve when there's an indication that it hasn't resolved? Ie, inside click we'd do something like this:

private$load_promise <- self$session$Page$loadEventFired(wait_ = FALSE)

And then in find_nodes() we could wait for that promise to resolve and try again (inside this try_fetch(), probably). Async programming still always feels weird to me, so I can't decide if this is hacky, correct, or somewhere in-between.

We might also have to capture the timeout message if the promise never resolves, to avoid random confusing warnings showing up for the user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation! I think that approach sounds like a way forward, and then I can double check it with Joe/Winston.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's closer now, I think, but it fails ungracefully for "normal" clicks, and I haven't yet found a way to clean this up. To see what I mean, run the "can click a button" test in test-live.R a few times. Or, more specifically:

sess <- read_html_live(html_test_path("click"))
sess$click("button")
Sys.sleep(10)
rm(sess)  

I can see where that happens in {chromote}, and it makes sense that it does so, but the only way I can think of to avoid the error message would be to set the timeout to Inf, and that just means we could stack up promises that never resolve. Or maybe line 149 of live.R should do something fancier when it sets up private$loadEvent_promise, to capture (and silently discard) the promise if it times out.

i = "Did you issue a {.code click()} without waiting for a {.arg new_page}?"
),
class = "rvest_error-missing_node",
parent = cnd
)
}
}
)
unlist(node_ids)
} else {
search <- glue::glue("
(function() {{
Expand Down Expand Up @@ -324,6 +352,10 @@ LiveHTML <- R6::R6Class(
object_id = function(node_id) {
# https://chromedevtools.github.io/devtools-protocol/tot/DOM/#method-resolveNode
self$session$DOM$resolveNode(node_id)$object$objectId
},

refresh_root = function() {
private$root_id <- self$session$DOM$getDocument(0)$root$nodeId
}
)
)
Expand Down
5 changes: 4 additions & 1 deletion man/LiveHTML.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions tests/testthat/_snaps/session.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,12 +7,12 @@
<session> https://hadley.nz/
Status: 200
Type: text/html; charset=utf-8
Size: 821273
Size: 821905
Code
expect_true(is.session(s))
s <- session_follow_link(s, css = "p a")
Message
Navigating to <http://rstudio.com>.
Navigating to <https://posit.co>.
Code
session_history(s)
Output
Expand Down
8 changes: 8 additions & 0 deletions tests/testthat/html/navigate1.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
<!DOCTYPE html>
<html>
<head>
<title>Navigate 1</title>
</head>
<body>
<a href="navigate2.html">Navigate to Page 2</a>
</body>
8 changes: 8 additions & 0 deletions tests/testthat/html/navigate2.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
<!DOCTYPE html>
<html>
<head>
<title>Navigate 2</title>
</head>
<body>
<p>Success!</p>
</body>
18 changes: 18 additions & 0 deletions tests/testthat/test-live.R
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,14 @@ test_that("can click a button", {
expect_equal(html_text(html_element(sess, "p")), "double clicked")
})

test_that("can find elements after click that navigates", {
skip_if_no_chromote()

sess <- read_html_live(html_test_path("navigate1"))
sess$click("a", new_page = TRUE)
expect_equal(html_text2(html_element(sess, "p")), "Success!")
})

test_that("can scroll in various ways", {
skip_if_no_chromote()

Expand Down Expand Up @@ -88,6 +96,16 @@ test_that("can press special keys",{
expect_equal(html_text(html_element(sess, "#keyInfo")), "]/BracketRight")
})

test_that("gracefully errors on missing root node", {
skip_if_no_chromote()

sess <- read_html_live(html_test_path("navigate1"))
sess$click("a")
expect_error(
html_element(sess, "p"),
class = "rvest_error-missing_node"
)
})

# as_key_desc -------------------------------------------------------------

Expand Down
Loading