Add new Python template - Scrapy & Playwright #252

Open
vdusek opened this issue Dec 7, 2023 · 4 comments

vdusek commented Dec 7, 2023

Can you check why our Beautiful Soup template fails on tripadvisor.com? https://console.apify.com/actors/jWYbXHu32SvZf1Cgb/runs/0IYh4rWH9Ig2vIUSM#output

  • Solution: We can provide a new Scrapy Actor template using a headless browser like Playwright.
  • PyPI packages: scrapy and scrapy-playwright.
  • Integrating Playwright into a Scrapy project is pretty simple: scrapy-playwright provides a Scrapy component, ScrapyPlaywrightDownloadHandler, which needs to be added to the project.
  • Check the Web scraping with Scrapy blog post for more information and inspiration.
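
For reference, the wiring described in the bullets above usually amounts to a few lines of project settings. A minimal sketch, assuming the handler path and reactor setting documented in the scrapy-playwright README; the function name playwright_settings is made up for illustration:

```python
def playwright_settings() -> dict:
    """Return the Scrapy settings that route downloads through Playwright."""
    handler = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
    return {
        # Use the Playwright download handler for both schemes.
        "DOWNLOAD_HANDLERS": {"http": handler, "https": handler},
        # scrapy-playwright needs the asyncio-based Twisted reactor.
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }
```

These would be merged into settings.py (or into the settings object before the crawler starts); individual requests then opt in with meta={"playwright": True}.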
@vdusek vdusek added enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. labels Dec 7, 2023
@vdusek vdusek self-assigned this Dec 7, 2023
@honzajavorek

I see the main challenge in setting PLAYWRIGHT_LAUNCH_OPTIONS correctly to respect APIFY_PROXY_SETTINGS (docs). Or maybe passing it like this, not sure.
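
To make the first option concrete: Playwright's launch options accept a proxy entry with server, username and password, so a proxy URL has to be split into those parts. A rough sketch, assuming the URL has the usual http://user:password@host:port shape; the helper name launch_options_from_proxy_url and the example URL are made up, and this still gives one proxy for the whole browser:

```python
from urllib.parse import urlparse


def launch_options_from_proxy_url(proxy_url: str) -> dict:
    """Build a PLAYWRIGHT_LAUNCH_OPTIONS dict from a proxy URL (hypothetical helper)."""
    parsed = urlparse(proxy_url)
    if not (parsed.username and parsed.password):
        raise ValueError("Username and password must be provided in the proxy URL.")
    # Playwright takes the credentials separately from the server address.
    server = f"{parsed.scheme}://{parsed.hostname}"
    if parsed.port:
        server += f":{parsed.port}"
    return {
        "proxy": {
            "server": server,
            "username": parsed.username,
            "password": parsed.password,
        }
    }


# Placeholder URL, not a real proxy.
options = launch_options_from_proxy_url("http://user:secret@proxy.example.com:8000")
```

The resulting dict could be assigned to settings["PLAYWRIGHT_LAUNCH_OPTIONS"], at the cost of every Playwright request sharing the same proxy.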

@honzajavorek

Hmm, I think that since the Playwright integration doesn't support a proxy per request, only a proxy per browser context, the correct implementation would probably be to rotate browser contexts with proxies for Playwright-enabled requests as part of ApifyHttpProxyMiddleware 🤔 That's hard to implement on my own as part of the spider code.

honzajavorek commented Apr 15, 2024

Note: If this template ever exists, it should contain playwright install --with-deps somewhere in the Dockerfile. This has just bitten me.

honzajavorek commented Apr 17, 2024

So obviously I have no idea what I'm doing, but today I invented this and it seems like it could be working. It's hard to verify, but it looks like I might be successfully sending Playwright requests over Apify proxy. This is how I override Apify settings:

...
settings = apply_apify_settings(settings=settings, proxy_config=proxy_config)

# use custom proxy middleware
priority = settings["DOWNLOADER_MIDDLEWARES"].pop(
    "apify.scrapy.middlewares.ApifyHttpProxyMiddleware"
)
settings["DOWNLOADER_MIDDLEWARES"][
    "jg.plucker.scrapers.PlaywrightApifyHttpProxyMiddleware"
] = priority
...

And this is the actual implementation of my custom middleware:

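import hashlib
from typing import Self  # Python 3.11+

from apify import Actor
from apify.scrapy.middlewares import ApifyHttpProxyMiddleware
from scrapy import Request, Spider
from scrapy.crawler import Crawler


class PlaywrightApifyHttpProxyMiddleware(ApifyHttpProxyMiddleware):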
    @classmethod
    def from_crawler(cls, crawler: Crawler) -> Self:
        Actor.log.info("Using customized ApifyHttpProxyMiddleware.")
        return cls(super().from_crawler(crawler)._proxy_settings)

    async def process_request(self, request: Request, spider: Spider):
        if request.meta.get("playwright"):
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: playwright=True, request={request}, spider={spider}"
            )
            url = await self._get_new_proxy_url()

            if not (url.username and url.password):
                raise ValueError(
                    "Username and password must be provided in the proxy URL."
                )

            proxy = url.geturl()
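            # One browser context per distinct proxy URL, named after a short hash of it.
            proxy_hash = hashlib.sha1(proxy.encode()).hexdigest()[0:8]
            context_name = f"proxy_{proxy_hash}"
            Actor.log.info(f"Using Playwright context {context_name}")
            request.meta.update(
                {
                    # Use the same name that was logged above (no doubled "proxy_" prefix).
                    "playwright_context": context_name,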
                    "playwright_context_kwargs": {
                        "proxy": {
                            "server": proxy,
                            "username": url.username,
                            "password": url.password,
                        },
                    },
                }
            )
            Actor.log.debug(
                f"ApifyHttpProxyMiddleware.process_request: updated request.meta={request.meta}"
            )
        else:
            await super().process_request(request, spider)

I'll see over the following days whether it performs reasonably. Also, FWIW, adding playwright install --with-deps to my Dockerfile has made my builds take quite a while to finish. If you know of a more efficient approach, that would be awesome:

RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing Poetry:" \
 && pip install --no-cache-dir poetry~=1.7.1 \
 && echo "Installing dependencies:" \
 && poetry config cache-dir /tmp/.poetry-cache \
 && poetry config virtualenvs.in-project true \
 && poetry install --only=main --no-interaction --no-ansi \
 && rm -rf /tmp/.poetry-cache \
 && echo "All installed Python packages:" \
 && pip freeze \
 && echo "Installing Playwright dependencies:" \
 && poetry run playwright install firefox --with-deps
