Chromium Headless is Detected by Cloudflare (Error 1010) even with Custom User Agent #997

Morpheus0x · 2022-07-12T09:01:12Z

Describe the bug

Unable to make a snapshot of a website that uses Cloudflare.
Error 1010: The owner of this website has banned your access based on your browser's signature.
This occurs even with a custom user agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36"
I have also set the Chromium option --disable-blink-features=AutomationControlled, but the browser is still detected by Cloudflare. I even set CHROME_HEADLESS to False, but it still doesn't work.
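
For anyone trying to reproduce this outside ArchiveBox, the settings above can be combined into a single Chromium command line. This is only a rough sketch of the invocation, not ArchiveBox's actual code: the helper name `build_chromium_args` is made up for illustration, the binary path is taken from the version output below, and `--dump-dom` is used only so that a headless run prints something.

```python
import shlex

# The user agent string quoted in the report above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36"
)


def build_chromium_args(url, binary="/usr/bin/chromium", headless=True):
    """Return an argv list for a one-shot Chromium run (hypothetical helper)."""
    args = [binary]
    if headless:
        # --dump-dom prints the rendered DOM to stdout in headless mode.
        args += ["--headless", "--dump-dom"]
    args += [
        f"--user-agent={USER_AGENT}",
        "--disable-blink-features=AutomationControlled",
        url,
    ]
    return args


if __name__ == "__main__":
    # Print a copy-pasteable shell command instead of launching the browser.
    print(shlex.join(build_chromium_args("https://example.com")))
```

Running the printed command against a Cloudflare-protected URL should show whether the Error 1010 page comes back with these flags alone.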

Searching for a way to make Chrome Headless not be detected, I found several promising sites:
https://blog.m157q.tw/posts/2020/09/11/bypass-cloudflare-detection-while-using-selenium-with-chromedriver/
https://intoli.com/blog/making-chrome-headless-undetectable/
https://github.com/ultrafunkamsterdam/undetected-chromedriver
https://stackoverflow.com/questions/65760004/making-chrome-headless-undetectable-in-python
https://stackoverflow.com/questions/65994908/selenium-cant-open-a-second-page/65998533#65998533

Most of these require webdriver, Selenium or something similar. As far as I can tell, ArchiveBox only uses the chromium executable without any scraper wrapper. For that setup, the best option I found would be to inject JavaScript into every scraped website, as explained here. This uses mitmproxy, which isn't an ideal option.
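
To make the mitmproxy idea concrete, here is a minimal sketch of what such an injection script might look like. It is an assumption-laden illustration, not a working bypass: the file name, the `inject_script` helper and the single `navigator.webdriver` patch are chosen for the example, and real detection looks at much more than that one property. Pages with a strict Content-Security-Policy may also refuse to run the injected inline script.

```python
# Hypothetical mitmproxy script, e.g. run with: mitmdump -s inject_stealth.py
# mitmproxy calls the module-level response() hook for every HTTP response.
import re

# Illustrative patch only: hides navigator.webdriver from page scripts.
STEALTH_JS = (
    "<script>"
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
    "</script>"
)

# Matches <head> or <head ...attrs>, but not <header>.
_HEAD_TAG = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def inject_script(html, snippet=STEALTH_JS):
    """Insert snippet right after the opening <head> tag, if there is one."""
    match = _HEAD_TAG.search(html)
    if match is None:
        return html
    return html[: match.end()] + snippet + html[match.end():]


def response(flow):
    """mitmproxy event hook: rewrite HTML responses before the browser sees them."""
    content_type = flow.response.headers.get("content-type", "")
    if "text/html" in content_type:
        flow.response.text = inject_script(flow.response.text)
```

Chromium would then need to be started with its proxy pointed at mitmproxy and with the mitmproxy CA certificate trusted, which is part of why this is not an ideal option.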

I propose switching to Selenium in order to take advantage of selenium-stealth.

Steps to reproduce

Archive any Quora or Discord link.
Look at the resulting snapshot; it should contain only the Cloudflare Error 1010 message above.

Screenshots or log output

ArchiveBox version

ArchiveBox v0.6.3
Cpython Linux Linux-5.10.0-11-amd64-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.6.3 valid /usr/local/bin/archivebox
√ PYTHON_BINARY v3.10.4 valid /usr/local/bin/python3.10
√ DJANGO_BINARY v3.1.14 valid /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.74.0 valid /usr/bin/curl
√ WGET_BINARY v1.21 valid /usr/bin/wget
√ NODE_BINARY v17.9.0 valid /usr/bin/node
√ SINGLEFILE_BINARY v0.3.16 valid /node/node_modules/single-file/cli/single-file
√ READABILITY_BINARY v0.0.2 valid /node/node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid /node/node_modules/@postlight/mercury-parser/cli.js
√ GIT_BINARY v2.30.2 valid /usr/bin/git
√ YOUTUBEDL_BINARY v2022.04.08 valid /usr/local/bin/yt-dlp
√ CHROME_BINARY v101.0.4951.41 valid /usr/bin/chromium
√ RIPGREP_BINARY v12.1.1 valid /usr/bin/rg

[i] Source-code locations:
√ PACKAGE_DIR 24 files valid /app/archivebox
√ TEMPLATES_DIR 4 files valid /app/archivebox/templates

CUSTOM_TEMPLATES_DIR - disabled

[i] Secrets locations:

CHROME_USER_DATA_DIR - disabled
COOKIES_FILE - disabled

[i] Data locations:
√ OUTPUT_DIR 5 files valid /data
√ SOURCES_DIR 20 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 49 files valid ./archive
√ CONFIG_FILE 81.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 660.0 KB valid ./index.sqlite3


pirate · 2022-07-12T19:33:40Z

Definitely not going to switch to Selenium; we're partway through a refactor to Pyppeteer. This is just generally a hard problem, and it is forever going to be a cat-and-mouse game with providers like Cloudflare trying to block bots.

derRichter · 2023-11-08T17:04:44Z

Same problem here.
If headless mode is off and no user agent is set, it works.
(I get a normal user agent like
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36)

With headless mode off and the same real user agent as above set in the Chrome driver options, Cloudflare blocks!

What is happening here? What is the difference between setting the user agent manually and leaving the option unset, if the resulting user agent is the same?

Regards

pirate · 2024-01-19T04:53:38Z

I figured out one potential reason why Cloudflare blocks people beyond USER_AGENT detection! They do a sneaky thing where they set a cookie on every request, and if later requests don't have that cookie set, they assume it's a headless browser without persistent state and block it.
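
Here is a toy model of that heuristic, just to illustrate why ephemeral contexts would trip it. This is not Cloudflare's actual logic: the class name, the cookie name `tracer` and the idea of keying on the client address are all assumptions made for the sketch.

```python
import secrets


class TracerCookieGate:
    """Toy server-side check: repeat visitors must send back a cookie."""

    def __init__(self):
        # client address -> cookie value handed out on the first visit
        self.issued = {}

    def handle(self, client_addr, cookies):
        """Return (status_code, cookies_to_set) for one request."""
        expected = self.issued.get(client_addr)
        if expected is None:
            # First visit: let it through and hand out a tracer cookie.
            token = secrets.token_hex(8)
            self.issued[client_addr] = token
            return 200, {"tracer": token}
        if cookies.get("tracer") == expected:
            return 200, {}
        # Seen before, but the cookie did not come back: treat as a bot.
        return 403, {}


def crawl(gate, client_addr, n_requests, persist_cookies):
    """Send n_requests through the gate and return the status codes."""
    jar, statuses = {}, []
    for _ in range(n_requests):
        status, new_cookies = gate.handle(client_addr, dict(jar))
        statuses.append(status)
        if persist_cookies:
            jar.update(new_cookies)
        # Without persistence the jar stays empty, like an ephemeral context.
    return statuses


print(crawl(TracerCookieGate(), "10.0.0.1", 3, persist_cookies=True))   # [200, 200, 200]
print(crawl(TracerCookieGate(), "10.0.0.2", 3, persist_cookies=False))  # [200, 403, 403]
```

The client that keeps its cookies passes every time, while the one that starts fresh on each request is blocked from the second request onward.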

Going to be very tricky for us to solve this since we intentionally use ephemeral contexts for every archive method. But I'm working on integrating browsertrix which may help #1327

In the meantime if you set up a chrome profile and browse with that profile normally for 20min maybe this will go away, as you'll collect enough of these tracer cookies to satisfy their human-detection algorithm.
https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
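
At the Chromium level, reusing a profile comes down to pointing the browser at the same user data directory on every run. A minimal sketch, assuming a profile directory you have already browsed with; the helper name and the example path are hypothetical, and whether the collected cookies are enough to satisfy Cloudflare is untested here.

```python
from pathlib import Path


def chromium_args_with_profile(url, profile_dir, binary="/usr/bin/chromium"):
    """Return an argv list that reuses a persistent Chromium profile.

    Hypothetical helper: cookies and other state stored in profile_dir are
    kept between runs instead of being thrown away with each fresh context.
    """
    profile = Path(profile_dir).expanduser()
    return [
        binary,
        "--headless",
        f"--user-data-dir={profile}",
        "--dump-dom",
        url,
    ]


if __name__ == "__main__":
    print(chromium_args_with_profile("https://example.com", "~/chrome-profile"))
```

In ArchiveBox the profile directory is configured through the CHROME_USER_DATA_DIR setting, which appears as disabled in the version output at the top of this issue.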

pirate · 2024-01-20T03:15:45Z

worth looking into: https://github.com/FlareSolverr/FlareSolverr

pirate · 2024-03-29T06:54:38Z

I've verified FlareSolverr works great, got it working doing puppeteer-powered archiving. It's for a paying client but I'm just noting it here because I plan to come back later and add it to ArchiveBox.
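
For anyone who wants to try it in the meantime, here is a rough sketch of calling a FlareSolverr instance from Python. It assumes FlareSolverr is already running locally on its default port 8191; the helper names are made up for this example, and this is not the integration planned for ArchiveBox.

```python
import json
import urllib.request

# Assumption: a local FlareSolverr instance listening on the default port.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_flaresolverr_request(target_url, max_timeout_ms=60000):
    """Build the POST request asking FlareSolverr to fetch target_url."""
    payload = {
        "cmd": "request.get",
        "url": target_url,
        "maxTimeout": max_timeout_ms,
    }
    return urllib.request.Request(
        FLARESOLVERR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_html(target_url):
    """Return the page HTML as solved by FlareSolverr, or raise on failure."""
    request = build_flaresolverr_request(target_url)
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    if body.get("status") != "ok":
        raise RuntimeError(body.get("message", "FlareSolverr request failed"))
    # The solution also carries cookies and the user agent that was used.
    return body["solution"]["response"]
```

The `solution` object in the reply includes the cookies and user agent from the solved session, which could in principle be passed on to other extractors.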

Also found these alternatives to keep an eye on in the future:
