Add session cookies to crawling context #710

Open
Mantisus opened this issue Nov 18, 2024 · 7 comments
Labels
enhancement (New feature or request.), t-tooling (Issues with this label are in the ownership of the tooling team.)

Comments

@Mantisus
Collaborator

Mantisus commented Nov 18, 2024

Add the cookies of the session from which the request was made to the crawling context, for both HTTP crawlers and Playwright.

@github-actions bot added the t-tooling label Nov 18, 2024
@vdusek added the enhancement label Nov 18, 2024
@vdusek changed the title from "Add response headers in context" to "Add response headers to crawling context" Nov 19, 2024
@janbuchar
Collaborator

Could you elaborate please? For plain HTTP crawlers, you can use context.http_response.headers. Is this about accessing the request in Playwright?

@Mantisus
Collaborator Author

I missed that. For HTTP crawlers we have context.http_response.headers; for Playwright this is not as relevant.

@janbuchar
Collaborator

It is true that it would be hard to reach the headers via PlaywrightCrawlingContext though.

@Mantisus
Collaborator Author

It can be useful for Playwright to have access to the cookies of the session from which the request was made, but not directly to the headers.

@janbuchar
Collaborator

Feel free to rephrase the issue title and description then 🙂

@Mantisus changed the title from "Add response headers to crawling context" to "Add session cookies to crawling context" Nov 19, 2024
@oldsiks

oldsiks commented Dec 25, 2024

When using PlaywrightCrawler in Crawlee for web scraping, how can I add cookies? Could you provide an example?

@Mantisus
Collaborator Author

Mantisus commented Dec 25, 2024

Hey @oldsiks

Thank you for your interest in crawlee.

Here is an example of setting a cookie at the request-header level:

import asyncio

from crawlee import Request
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        print(await context.page.content())

    await crawler.run([
        Request.from_url(url='https://httpbin.org/get', headers={'cookie': 'my_cookies'}),
    ])


asyncio.run(main())
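
In the example above, 'my_cookies' is a placeholder for a real Cookie header value. As a minimal sketch of how such a value could be assembled from name/value pairs, the snippet below uses only the Python standard library. The helper name build_cookie_header and the cookie names are hypothetical and are not part of Crawlee.

```python
from http.cookies import SimpleCookie


def build_cookie_header(cookies: dict[str, str]) -> str:
    """Join name/value pairs into one Cookie request-header value.

    Hypothetical helper for illustration; it is not a Crawlee API.
    """
    jar = SimpleCookie()
    for name, value in cookies.items():
        jar[name] = value
    # `coded_value` is the wire form of the value (quoted when necessary).
    return '; '.join(f'{morsel.key}={morsel.coded_value}' for morsel in jar.values())


# Example cookie names and values are made up.
cookie_header = build_cookie_header({'session_id': 'abc123', 'theme': 'dark'})
print(cookie_header)
```

The resulting string could then be passed in place of 'my_cookies' in the headers dictionary.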

Also, after the 0.5 release it will be possible to set cookies in PlaywrightBrowserPlugin using browser_new_context_options.

You can try this with the pre-release version 0.5.0b30.

Example for 0.5:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin


async def main() -> None:
    user_plugin = PlaywrightBrowserPlugin(
        browser_new_context_options={'extra_http_headers': {'cookie': 'my_cookies'}},
    )

    browser_pool = BrowserPool(plugins=[user_plugin])

    crawler = PlaywrightCrawler(browser_pool=browser_pool)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        print(await context.page.content())

    await crawler.run(['https://httpbin.org/get'])

asyncio.run(main())
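
Both examples send the cookies as a raw header. Playwright itself also accepts structured cookies through BrowserContext.add_cookies, where each entry needs a name, a value, and either a url or a domain/path pair. As a rough sketch, assuming you start from a Cookie header string, the hypothetical helper below (not a Crawlee or Playwright API) reshapes it into that structure using only the standard library. How best to hand the result to the browser context from within Crawlee is not covered in this thread.

```python
from http.cookies import SimpleCookie


def cookie_header_to_playwright(header: str, url: str) -> list[dict[str, str]]:
    """Convert a Cookie header string into Playwright-style cookie dicts.

    Hypothetical helper for illustration. Each dict carries the
    `name`, `value`, and `url` keys that `add_cookies` expects.
    """
    jar = SimpleCookie()
    jar.load(header)
    return [{'name': m.key, 'value': m.value, 'url': url} for m in jar.values()]


# Example header and URL are made up.
playwright_cookies = cookie_header_to_playwright(
    'session_id=abc123; theme=dark', 'https://httpbin.org'
)
print(playwright_cookies)
```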
