How to specify CHROME arg "--user-agent" in extra_args in order to mimic a non-headless browser? #319

Open
dendrit1 opened this issue Sep 28, 2023 · 3 comments


dendrit1 commented Sep 28, 2023

Hello,

I use R version 4.3.1 and RStudio 2023.09.0 Build 463 under Windows 10 Enterprise.

The following R code downloads a website as a PDF:

library(pagedown)

# Locate the local Chrome/Chromium executable
pathToChrome <- find_chrome()
# Ask for the output folder (choose.dir() is Windows-only)
downloadPath <- choose.dir(default = "", caption = "Select FOLDER (where to save results)")

chrome_print(
  "https://www.whatsmybrowser.org/",
  output = file.path(downloadPath, "test6.pdf"),
  wait = 4,      # seconds to wait for the page to load before printing
  browser = pathToChrome,
  format = "pdf", # "pdf", "png", "jpeg"; see https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-printToPDF
  options = list(),
  scale = 1,
  timeout = 30,  # seconds before the document generation is cancelled
  extra_args = c("--disable-gpu")
)

When I run this code against the test page https://www.whatsmybrowser.org/, it correctly reports the browser as "You are using Headless Chrome 117."

PROBLEM:
Some websites block headless browsers, and the request then fails with "HTTP status code: 404".

I know that "--user-agent" has to be added to extra_args somehow in order to mimic a normal, i.e. non-headless, browser; see also https://useragentstring.com/

Could anyone show me how to code this?
The goal is for the test website to no longer return "You are using Headless Chrome 117." but, e.g., "You are using Chrome 117".

Thanks a lot!

@dendrit1 dendrit1 changed the title How to specify CHROME arg "--user-agent" in extra_args ? How to specify CHROME arg "--user-agent" in extra_args in order to mimic a non-headless browser? Sep 28, 2023
Collaborator

cderv commented Sep 29, 2023

Is this a pagedown question about how to pass arguments in extra_args? Or did you try something that is not working?

If this is a question about which user agent to set to bypass the restriction on the website you want to reach, I bet you'll find an answer on the web.

Also, if a website has a policy and has put restrictions in place, you should read the website's rules carefully to make sure you are still authorized to do what you want to do. Security measures such as preventing headless access are usually there for a reason.

Anyhow, we are happy to help with any pagedown bug; for the rest, we are not experts in headless Chrome usage. pagedown::chrome_print() is made in the first place to print documents produced by pagedown to PDF and is not a dedicated headless Chrome tool. Note that there are R packages dedicated to this, such as chromote, and webshot2, which uses it.

Hope it helps

Author

dendrit1 commented Sep 29, 2023

@cderv
Yes, I tried something, and it is not working. Specifically, I added:
extra_args = c('--disable-gpu', '--user-agent ="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"')

The browser is still recognized as headless on the test page. Do I have to escape something, or is the syntax perhaps wrong?

Anyway, I have meanwhile solved the problem using chromote.
But I would still like to know whether and how this could also be done using pagedown::chrome_print().
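
The thread does not show the chromote code that was actually used, so the following is only a sketch of what such a workaround might look like. It assumes the chromote package and a local Chrome installation, and it overrides the user agent through the DevTools method Network.setUserAgentOverride. The function name print_with_user_agent and its parameters are hypothetical.

```r
# Hypothetical helper (not from the thread): print a URL to PDF with chromote
# after overriding the user agent through the DevTools protocol.
print_with_user_agent <- function(url, output, user_agent, wait = 4) {
  if (!requireNamespace("chromote", quietly = TRUE)) {
    stop("The chromote package is required for this sketch.")
  }
  b <- chromote::ChromoteSession$new()
  on.exit(b$close(), add = TRUE)  # always close the session when done

  # Override the user agent for this session before navigating
  b$Network$setUserAgentOverride(userAgent = user_agent)
  b$Page$navigate(url)
  Sys.sleep(wait)  # crude fixed wait, similar in spirit to wait = 4 above

  # Write the rendered page to a PDF file
  b$screenshot_pdf(filename = output)
  invisible(output)
}

# Example call (not run here; needs Chrome and network access):
# print_with_user_agent(
#   "https://www.whatsmybrowser.org/",
#   output = file.path(tempdir(), "test6.pdf"),
#   user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
# )
```

Changing the user agent only alters what the browser reports about itself; a site may still detect headless access by other means, and the site's rules continue to apply.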

Thanks a lot !

Collaborator

cderv commented Sep 29, 2023

You could try removing the space between --user-agent and the = sign, maybe?
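
As a sketch of that suggestion: the flag can be built as a single element of the extra_args character vector, with no space around the = sign. The sketch also drops the embedded double quotes, on the assumption (not confirmed in this thread) that each element of extra_args reaches Chrome as one argument without passing through a shell, in which case the quotes would become part of the user agent string. The helper name user_agent_arg is hypothetical, and whether this alone stops the test page from reporting Headless Chrome is untested here.

```r
# Hypothetical helper (not part of pagedown): build one "--user-agent=..." flag.
user_agent_arg <- function(ua) {
  stopifnot(is.character(ua), length(ua) == 1L, nzchar(ua))
  # No space around "=" and no embedded quotes: the whole flag is a single
  # element of the character vector (assumption: elements of extra_args are
  # handed to Chrome as-is, so spaces inside `ua` need no escaping).
  paste0("--user-agent=", ua)
}

ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
extra_args <- c("--disable-gpu", user_agent_arg(ua))

# Only attempt the print in an interactive session with pagedown installed.
if (interactive() && requireNamespace("pagedown", quietly = TRUE)) {
  pagedown::chrome_print(
    "https://www.whatsmybrowser.org/",
    output = file.path(tempdir(), "test6.pdf"),
    wait = 4,
    extra_args = extra_args
  )
}
```

If the test page still reports a headless browser with this flag, the chromote or webshot2 packages mentioned earlier in the thread are the more direct route.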
