Chromium Headless is Detected by Cloudflare (Error 1010) even with Custom User Agent #997

Open
Morpheus0x opened this issue Jul 12, 2022 · 5 comments
Labels
expected: maybe someday
help wanted
size: medium
status: idea-phase (Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet)
touches: configuration
touches: dependencies/packaging (Issues or changes that add/remove/affect dependencies)
type: enhancement
why: functionality (Intended to improve ArchiveBox functionality or features)

Comments

@Morpheus0x

Describe the bug

Unable to make a snapshot of a website that uses Cloudflare.
Error 1010: The owner of this website has banned your access based on your browser's signature.
This occurs even with a custom user agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36"
I have also set the following Chromium option: --disable-blink-features=AutomationControlled, but the browser is still detected by Cloudflare. I even set CHROME_HEADLESS to False, but it still doesn't work.
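
For reference, here is a minimal sketch of the kind of Chromium command line this configuration corresponds to. The helper name `build_chromium_args` and the default binary name are assumptions made for illustration; this is not ArchiveBox's actual command construction, and as reported above these flags alone did not get past the block.

```python
def build_chromium_args(url, user_agent, headless=True, binary="chromium"):
    """Build an argv list for a one-shot Chromium page load.

    Hypothetical helper for illustration only; ArchiveBox assembles its
    own command internally and may pass additional flags.
    """
    args = [binary]
    if headless:
        # --dump-dom prints the serialized DOM and only works in headless mode.
        args += ["--headless", "--dump-dom"]
    args += [
        # Override the default user agent string.
        f"--user-agent={user_agent}",
        # Turn off the Blink feature that marks the browser as automation-controlled.
        "--disable-blink-features=AutomationControlled",
        url,
    ]
    return args


if __name__ == "__main__":
    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36")
    print(" ".join(build_chromium_args("https://example.com", ua)))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell-quoting problems with the spaces in the user agent string.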

Searching for a way to make Chrome Headless not be detected, I found several promising sites:
https://blog.m157q.tw/posts/2020/09/11/bypass-cloudflare-detection-while-using-selenium-with-chromedriver/
https://intoli.com/blog/making-chrome-headless-undetectable/
https://github.com/ultrafunkamsterdam/undetected-chromedriver
https://stackoverflow.com/questions/65760004/making-chrome-headless-undetectable-in-python
https://stackoverflow.com/questions/65994908/selenium-cant-open-a-second-page/65998533#65998533

Most of these require WebDriver, Selenium, or something similar. As far as I can tell, ArchiveBox only uses the Chromium executable, without any scraper wrapper. Given that, the best option I found would be to inject JavaScript into every scraped website, as explained here. This uses mitmproxy, which isn't an ideal option.
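
To make the injection idea concrete, here is a rough sketch of the HTML rewriting step only. The function `inject_script` and the `STEALTH_JS` payload are assumptions for illustration: the `navigator.webdriver` override is one commonly cited patch, and nothing here has been verified against Cloudflare. In a mitmproxy addon, a function like this would presumably be applied to the response body inside the response hook; that wiring is not shown or tested here.

```python
import re

# Illustrative payload (assumption): hide the navigator.webdriver flag.
STEALTH_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)


def inject_script(html: str, js: str) -> str:
    """Return html with an inline <script> placed as early as possible.

    The script goes right after the opening <head> tag so that it runs
    before the page's own scripts. If there is no <head> tag, it is
    prepended to the document instead.
    """
    tag = f"<script>{js}</script>"
    # \b keeps this from matching <header>.
    match = re.search(r"<head\b[^>]*>", html, flags=re.IGNORECASE)
    if match:
        return html[:match.end()] + tag + html[match.end():]
    return tag + html


if __name__ == "__main__":
    page = "<html><head><title>demo</title></head><body></body></html>"
    print(inject_script(page, STEALTH_JS))
```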

I propose switching to Selenium in order to take advantage of selenium-stealth.

Steps to reproduce

Archive any Quora or Discord link.
Look at the resulting snapshot; it should contain only the Cloudflare Error 1010 message quoted above.
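
When checking many snapshots, the second step can be automated with a simple text check. This is only a sketch: the function name `looks_like_cloudflare_1010` is made up, and it matches only the wording quoted in this report, so a differently worded block page would slip through.

```python
import re


def looks_like_cloudflare_1010(html: str) -> bool:
    """Heuristic: does this saved page look like the Error 1010 block page?

    Matches only the error number and the sentence quoted in this issue.
    """
    # Collapse whitespace and normalize curly apostrophes before matching.
    text = re.sub(r"\s+", " ", html).replace("\u2019", "'").lower()
    return (
        "error 1010" in text
        or "banned your access based on your browser's signature" in text
    )


if __name__ == "__main__":
    sample = (
        "Error 1010: The owner of this website has banned your access "
        "based on your browser's\nsignature."
    )
    print(looks_like_cloudflare_1010(sample))
```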

Screenshots or log output

screenshot

ArchiveBox version

ArchiveBox v0.6.3
Cpython Linux Linux-5.10.0-11-amd64-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.6.3 valid /usr/local/bin/archivebox
√ PYTHON_BINARY v3.10.4 valid /usr/local/bin/python3.10
√ DJANGO_BINARY v3.1.14 valid /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.74.0 valid /usr/bin/curl
√ WGET_BINARY v1.21 valid /usr/bin/wget
√ NODE_BINARY v17.9.0 valid /usr/bin/node
√ SINGLEFILE_BINARY v0.3.16 valid /node/node_modules/single-file/cli/single-file
√ READABILITY_BINARY v0.0.2 valid /node/node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid /node/node_modules/@postlight/mercury-parser/cli.js
√ GIT_BINARY v2.30.2 valid /usr/bin/git
√ YOUTUBEDL_BINARY v2022.04.08 valid /usr/local/bin/yt-dlp
√ CHROME_BINARY v101.0.4951.41 valid /usr/bin/chromium
√ RIPGREP_BINARY v12.1.1 valid /usr/bin/rg

[i] Source-code locations:
√ PACKAGE_DIR 24 files valid /app/archivebox
√ TEMPLATES_DIR 4 files valid /app/archivebox/templates

  • CUSTOM_TEMPLATES_DIR - disabled

[i] Secrets locations:

  • CHROME_USER_DATA_DIR - disabled
  • COOKIES_FILE - disabled

[i] Data locations:
√ OUTPUT_DIR 5 files valid /data
√ SOURCES_DIR 20 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 49 files valid ./archive
√ CONFIG_FILE 81.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 660.0 KB valid ./index.sqlite3

@pirate
Member

pirate commented Jul 12, 2022

Definitely not going to switch to Selenium; we're partway through a refactor to Pyppeteer. This is just generally a hard problem, and it is forever going to be a cat-and-mouse game with providers like Cloudflare trying to block bots.

@derRichter

Same problem here.
With headless mode off and no user agent set, it works.
(I get a normal user agent like
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36)

With headless mode off and the same real user agent as above set in the Chrome driver options, Cloudflare blocks!

What is happening? What is different about the user agent when I set it manually, compared with leaving the option unset and getting the same user agent?

Greetings

@pirate added the labels status: idea-phase, size: medium, touches: configuration, why: functionality, help wanted, touches: dependencies/packaging, type: enhancement, and expected: maybe someday on Nov 9, 2023
@pirate
Member

pirate commented Jan 19, 2024

I figured out one potential reason why Cloudflare blocks ppl beyond USER_AGENT detection! They do a sneaky thing where they set a cookie on every request, and if later requests don't have that cookie set, they assume it's a headless browser without persistent state and block it.

Going to be very tricky for us to solve this since we intentionally use ephemeral contexts for every archive method. But I'm working on integrating browsertrix, which may help: #1327

In the meantime, if you set up a Chrome profile and browse with that profile normally for 20 minutes, maybe this will go away, as you'll collect enough of these tracer cookies to satisfy their human-detection algorithm.
https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
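
To illustrate why ephemeral contexts would trip a check like this while a persistent profile would not, here is a toy model of the cookie behaviour described in this comment. Everything in it is invented for illustration (`ToyTracerGate`, the `tracer` cookie name, the 403 status, keying clients by an ID); it says nothing about how Cloudflare actually implements its detection.

```python
import secrets


class ToyTracerGate:
    """Toy server-side check: issue a tracer cookie, expect it back later."""

    def __init__(self):
        self._issued = {}  # client_id -> token previously handed out

    def handle(self, client_id, cookies):
        """Return (status, cookies_to_set) for a single request."""
        expected = self._issued.get(client_id)
        if expected is not None and cookies.get("tracer") != expected:
            # A known client came back without its cookie: treat as stateless bot.
            return 403, {}
        token = expected or secrets.token_hex(8)
        self._issued[client_id] = token
        return 200, {"tracer": token}


def crawl(gate, client_id, n_requests, persist):
    """Make n_requests, keeping cookies between requests only if persist is True."""
    jar = {}
    statuses = []
    for _ in range(n_requests):
        if not persist:
            jar = {}  # ephemeral context: start every request with no state
        status, set_cookies = gate.handle(client_id, jar)
        jar.update(set_cookies)
        statuses.append(status)
    return statuses


if __name__ == "__main__":
    print(crawl(ToyTracerGate(), "profile-user", 3, persist=True))     # [200, 200, 200]
    print(crawl(ToyTracerGate(), "ephemeral-user", 3, persist=False))  # [200, 403, 403]
```

In this model the persistent client keeps passing because it echoes the cookie back, while the ephemeral client is blocked from its second request onward, which mirrors the situation described above.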

@pirate
Member

pirate commented Jan 20, 2024

worth looking into: https://github.com/FlareSolverr/FlareSolverr

@pirate
Member

pirate commented Mar 29, 2024

I've verified FlareSolverr works great, got it working doing puppeteer-powered archiving. It's for a paying client but I'm just noting it here because I plan to come back later and add it to ArchiveBox.

Also found these alternatives to keep an eye on in the future:


3 participants