Add new Python template - Scrapy & Playwright #252

vdusek · 2023-12-07T16:25:48Z

Some JavaScript-heavy websites (e.g. https://tripadvisor.com) cannot be scraped with Scrapy alone.

Can you check why our Beautiful Soup template fails on tripadvisor.com? https://console.apify.com/actors/jWYbXHu32SvZf1Cgb/runs/0IYh4rWH9Ig2vIUSM#output

Solution: We can provide a new Scrapy Actor template that uses a headless browser driven by Playwright.
PyPI packages: scrapy and scrapy-playwright.
Integrating Playwright into a Scrapy project is pretty simple: scrapy-playwright provides a Scrapy component, ScrapyPlaywrightDownloadHandler, which needs to be added to the project.
Check the Web scraping with Scrapy blog post for more information and inspiration.
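A minimal sketch of what "adding the handler to the project" could look like, written as a plain function over a settings dict so that it runs without Scrapy installed. The helper name with_playwright is hypothetical; the setting keys and dotted paths follow the scrapy-playwright README as I understand it, so please double-check them against the version you install.

```python
# Dotted path to the download handler shipped by the scrapy-playwright package.
PLAYWRIGHT_HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"


def with_playwright(settings: dict) -> dict:
    """Return a copy of a Scrapy settings dict with the Playwright handler enabled.

    Hypothetical helper for illustration; a real project would usually put
    these keys directly into settings.py.
    """
    updated = dict(settings)
    handlers = dict(updated.get("DOWNLOAD_HANDLERS", {}))
    # Route both schemes through Playwright's download handler.
    handlers["http"] = PLAYWRIGHT_HANDLER
    handlers["https"] = PLAYWRIGHT_HANDLER
    updated["DOWNLOAD_HANDLERS"] = handlers
    # scrapy-playwright relies on Twisted's asyncio-based reactor.
    updated["TWISTED_REACTOR"] = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    return updated
```

With these settings in place, individual requests would still need to opt in, e.g. by setting meta={"playwright": True}; other requests keep going through the handler without a browser.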


honzajavorek · 2024-04-15T08:51:05Z

I see the main challenge in setting PLAYWRIGHT_LAUNCH_OPTIONS correctly to respect APIFY_PROXY_SETTINGS (docs). Or maybe passing it like this, not sure.
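For illustration, here is a rough sketch of how a single proxy URL could be turned into a value for PLAYWRIGHT_LAUNCH_OPTIONS. It assumes the proxy URL carries the credentials inline; the function name and the example URL are made up, and how the URL is obtained from the Apify proxy settings is deliberately left out. Playwright's launch options accept a proxy dict with server, username and password keys, and this would apply one proxy to the whole browser.

```python
from urllib.parse import urlsplit


def launch_options_for_proxy(proxy_url: str) -> dict:
    """Build a PLAYWRIGHT_LAUNCH_OPTIONS-style dict from a proxy URL.

    Hypothetical helper; assumes a URL such as http://user:password@host:port.
    """
    parts = urlsplit(proxy_url)
    if not (parts.hostname and parts.username and parts.password):
        raise ValueError("Proxy URL must contain a host, username and password.")
    # Keep the credentials out of the server URL and pass them separately.
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    return {
        "proxy": {
            "server": server,
            "username": parts.username,
            "password": parts.password,
        }
    }
```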

honzajavorek · 2024-04-15T08:57:40Z

Hmm, I think that since the Playwright integration doesn't support a proxy per request, only a proxy per browser context, the correct implementation would probably be to rotate browser contexts with proxies for Playwright-enabled requests as part of ApifyHttpProxyMiddleware 🤔 That's hard to implement on my own as part of the spider code.
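To make the "rotate browser contexts" idea a bit more concrete, here is a rough sketch that cycles through a fixed list of proxies and produces the request.meta fragment for each Playwright-enabled request. The generator name and the proxy_N naming scheme are made up for illustration; playwright_context and playwright_context_kwargs are the scrapy-playwright meta keys, and as far as I know the kwargs are only used when a context with that name does not exist yet.

```python
from itertools import cycle
from typing import Iterator


def rotating_context_meta(proxies: list[dict]) -> Iterator[dict]:
    """Yield request.meta fragments, cycling through one named context per proxy.

    Hypothetical helper; each item in `proxies` is a Playwright proxy dict,
    e.g. {"server": ..., "username": ..., "password": ...}.
    """
    if not proxies:
        raise ValueError("At least one proxy is required.")
    for index, proxy in cycle(enumerate(proxies)):
        yield {
            "playwright": True,
            # Same proxy always maps to the same context name, so contexts get reused.
            "playwright_context": f"proxy_{index}",
            "playwright_context_kwargs": {"proxy": proxy},
        }
```

A middleware could call next() on such a generator and merge the result into request.meta. The number of open contexts then stays bounded by the number of proxies.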

honzajavorek · 2024-04-15T10:18:00Z

Note: If this template ever exists, it should contain playwright install --with-deps somewhere in the Dockerfile. This has just bitten me.

honzajavorek · 2024-04-17T10:18:05Z

So obviously I have no idea what I'm doing, but today I invented this and it seems like it could be working. It's hard to verify, but it looks like I might be successfully sending Playwright requests over Apify proxy. This is how I override Apify settings:

...
settings = apply_apify_settings(settings=settings, proxy_config=proxy_config)

# use custom proxy middleware
priority = settings["DOWNLOADER_MIDDLEWARES"].pop(
    "apify.scrapy.middlewares.ApifyHttpProxyMiddleware"
)
settings["DOWNLOADER_MIDDLEWARES"][
    "jg.plucker.scrapers.PlaywrightApifyHttpProxyMiddleware"
] = priority
...
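The swap above can also be expressed as a small standalone function, which makes it easy to check in isolation that the replacement middleware inherits the original priority. This is only a sketch: swap_middleware is a hypothetical name, and the priority number in the usage example is arbitrary.

```python
def swap_middleware(middlewares: dict, old: str, new: str) -> dict:
    """Return a copy of a DOWNLOADER_MIDDLEWARES-style dict with `old` replaced by `new`.

    The new entry takes over the priority of the old one.
    Raises KeyError if `old` is not present.
    """
    updated = dict(middlewares)
    updated[new] = updated.pop(old)
    return updated


# Usage example; the priority value 950 is arbitrary.
middlewares = swap_middleware(
    {"apify.scrapy.middlewares.ApifyHttpProxyMiddleware": 950},
    old="apify.scrapy.middlewares.ApifyHttpProxyMiddleware",
    new="jg.plucker.scrapers.PlaywrightApifyHttpProxyMiddleware",
)
```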

And this is the actual implementation of my custom middleware:

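import hashlib
from typing import Self  # requires Python 3.11+

from apify import Actor
from apify.scrapy.middlewares import ApifyHttpProxyMiddleware
from scrapy import Request, Spider
from scrapy.crawler import Crawler


class PlaywrightApifyHttpProxyMiddleware(ApifyHttpProxyMiddleware):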
    @classmethod
    def from_crawler(cls, crawler: Crawler) -> Self:
        Actor.log.info("Using customized ApifyHttpProxyMiddleware.")
        return cls(super().from_crawler(crawler)._proxy_settings)

    async def process_request(self, request: Request, spider: Spider):
        if request.meta.get("playwright"):
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: playwright=True, request={request}, spider={spider}"
            )
            url = await self._get_new_proxy_url()

            if not (url.username and url.password):
                raise ValueError(
                    "Username and password must be provided in the proxy URL."
                )

            proxy = url.geturl()
            proxy_hash = hashlib.sha1(proxy.encode()).hexdigest()[0:8]
            context_name = f"proxy_{proxy_hash}"
            Actor.log.info(f"Using Playwright context {context_name}")
            request.meta.update(
                {
                    # Use the same name that was logged above (avoids a doubled "proxy_" prefix).
                    "playwright_context": context_name,
                    "playwright_context_kwargs": {
                        "proxy": {
                            "server": proxy,
                            "username": url.username,
                            "password": url.password,
                        },
                    },
                }
            )
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: updated request.meta={request.meta}"
            )
        else:
            await super().process_request(request, spider)

I'll see over the following days whether it performs reasonably. Also, FWIW, adding playwright install --with-deps to my Dockerfile has caused my builds to take quite a while to finish. If you know about a more efficient approach, that would be awesome:

RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing Poetry:" \
 && pip install --no-cache-dir poetry~=1.7.1 \
 && echo "Installing dependencies:" \
 && poetry config cache-dir /tmp/.poetry-cache \
 && poetry config virtualenvs.in-project true \
 && poetry install --only=main --no-interaction --no-ansi \
 && rm -rf /tmp/.poetry-cache \
 && echo "All installed Python packages:" \
 && pip freeze \
 && echo "Installing Playwright dependencies:" \
 && poetry run playwright install firefox --with-deps

vdusek added the enhancement (New feature or request) and t-tooling (Issues with this label are in the ownership of the tooling team) labels on Dec 7, 2023

vdusek self-assigned this Dec 7, 2023
