Add new Python template - Scrapy & Playwright #252
Comments
Hmm, I think that since the Playwright integration doesn't support a proxy per request, only a proxy per browser context, the correct implementation would probably be to rotate browser contexts with proxies for Playwright-enabled requests as part of |
Note: If this template ever exists, it should contain |
So obviously I have no idea what I'm doing, but today I invented this and it seems like it could be working. It's hard to verify, but it looks like I might be successfully sending Playwright requests over Apify Proxy. This is how I override Apify settings:

...
settings = apply_apify_settings(settings=settings, proxy_config=proxy_config)
# use custom proxy middleware
priority = settings["DOWNLOADER_MIDDLEWARES"].pop(
    "apify.scrapy.middlewares.ApifyHttpProxyMiddleware"
)
settings["DOWNLOADER_MIDDLEWARES"][
    "jg.plucker.scrapers.PlaywrightApifyHttpProxyMiddleware"
] = priority
...

And this is the actual implementation of my custom middleware:

import hashlib
from typing import Self

from apify import Actor
from apify.scrapy.middlewares import ApifyHttpProxyMiddleware
from scrapy import Request, Spider
from scrapy.crawler import Crawler


class PlaywrightApifyHttpProxyMiddleware(ApifyHttpProxyMiddleware):
    @classmethod
    def from_crawler(cls, crawler: Crawler) -> Self:
        Actor.log.info("Using customized ApifyHttpProxyMiddleware.")
        return cls(super().from_crawler(crawler)._proxy_settings)

    async def process_request(self, request: Request, spider: Spider):
        if request.meta.get("playwright"):
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: playwright=True, request={request}, spider={spider}"
            )
            url = await self._get_new_proxy_url()
            if not (url.username and url.password):
                raise ValueError(
                    "Username and password must be provided in the proxy URL."
                )
            proxy = url.geturl()
            # one browser context per distinct proxy URL
            proxy_hash = hashlib.sha1(proxy.encode()).hexdigest()[0:8]
            context_name = f"proxy_{proxy_hash}"
            Actor.log.info(f"Using Playwright context {context_name}")
            request.meta.update(
                {
                    # use context_name as-is so it matches the name logged above
                    "playwright_context": context_name,
                    "playwright_context_kwargs": {
                        "proxy": {
                            "server": proxy,
                            "username": url.username,
                            "password": url.password,
                        },
                    },
                }
            )
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: updated request.meta={request.meta}"
            )
        else:
            # non-Playwright requests keep the stock Apify proxy handling
            await super().process_request(request, spider)

I'll see in the following days whether it performs reasonably. Also, FWIW, adding

RUN echo "Python version:" \
&& python --version \
&& echo "Pip version:" \
&& pip --version \
&& echo "Installing Poetry:" \
&& pip install --no-cache-dir poetry~=1.7.1 \
&& echo "Installing dependencies:" \
&& poetry config cache-dir /tmp/.poetry-cache \
&& poetry config virtualenvs.in-project true \
&& poetry install --only=main --no-interaction --no-ansi \
&& rm -rf /tmp/.poetry-cache \
&& echo "All installed Python packages:" \
&& pip freeze \
&& echo "Installing Playwright dependencies:" \
&& poetry run playwright install firefox --with-deps |
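
For anyone who wants to try the context-per-proxy idea outside the middleware, here is a minimal standalone sketch of just the meta-building step. It is not the author's code: the helper name `playwright_proxy_meta` and the example proxy URL are made up for illustration, and it assumes you want the credentials stripped from the `server` value and passed only through the separate `username` and `password` keys. The `playwright`, `playwright_context`, and `playwright_context_kwargs` meta keys are the ones scrapy-playwright reads.

```python
import hashlib
from urllib.parse import urlsplit


def playwright_proxy_meta(proxy_url: str) -> dict:
    """Hypothetical helper: build request.meta entries for a per-proxy context."""
    parts = urlsplit(proxy_url)
    if not (parts.username and parts.password):
        raise ValueError("Username and password must be provided in the proxy URL.")

    # The same proxy URL always hashes to the same name, so its context is reused.
    digest = hashlib.sha1(proxy_url.encode()).hexdigest()[:8]
    context_name = f"proxy_{digest}"

    # Assumption: the server value carries no credentials.
    host = parts.hostname or ""
    netloc = f"{host}:{parts.port}" if parts.port else host
    server = f"{parts.scheme}://{netloc}"

    return {
        "playwright": True,
        "playwright_context": context_name,
        "playwright_context_kwargs": {
            "proxy": {
                "server": server,
                "username": parts.username,
                "password": parts.password,
            },
        },
    }


# Made-up proxy URL, for illustration only.
meta = playwright_proxy_meta("http://user:secret@proxy.example.com:8000")
```

Because the name is derived from the full proxy URL, two requests that get the same proxy share one browser context, while a new proxy URL results in a new context name.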
scrapy-playwright provides a Scrapy component, ScrapyPlaywrightDownloadHandler, which needs to be added to the project.
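
As a rough sketch of what "added to the project" could look like, the function below returns the Scrapy settings that scrapy-playwright documents for enabling its download handler. The function name `playwright_settings` is hypothetical, and the Firefox default is only an assumption based on the `playwright install firefox` step mentioned in this thread; how these values get merged into a template's settings is left open.

```python
def playwright_settings(browser_type: str = "firefox") -> dict:
    """Hypothetical helper: settings needed to route requests via scrapy-playwright."""
    handler = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
    return {
        # Register the handler for both schemes; only requests with
        # meta["playwright"] set are actually rendered in a browser.
        "DOWNLOAD_HANDLERS": {"http": handler, "https": handler},
        # scrapy-playwright requires the asyncio-based Twisted reactor.
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        # Assumption: Firefox, matching the browser installed in the Dockerfile.
        "PLAYWRIGHT_BROWSER_TYPE": browser_type,
    }


overrides = playwright_settings()
```

In a plain Scrapy project the same keys could simply go into `settings.py`; returning a dict here just keeps the sketch independent of any particular project layout.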