Add session cookies to crawling context #710
Could you elaborate please? For plain HTTP crawlers, you can use

I missed that, for HTTP crawlers we have

It is true that it would be hard to reach the headers via

It can be useful for Playwright to have access to the cookies of the session from which the request was made, but not directly to the headers.

Feel free to rephrase the issue title and description then 🙂
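Until session cookies are exposed on the crawling context directly, one workaround is to convert a session's name-to-value cookie mapping into the list format that Playwright's `BrowserContext.add_cookies` expects. A minimal sketch — the helper name and the sample cookie dict are my own, not part of crawlee:

```python
def session_cookies_to_playwright(cookies: dict[str, str], url: str) -> list[dict]:
    """Convert a name -> value cookie mapping into Playwright's cookie format.

    Playwright's BrowserContext.add_cookies expects a list of dicts, each
    with at least 'name', 'value', and a 'url' (or 'domain' and 'path').
    """
    return [{'name': name, 'value': value, 'url': url} for name, value in cookies.items()]


# Cookies as they might be stored on a crawlee Session (hypothetical values).
session_cookies = {'sessionid': 'abc123', 'csrftoken': 'xyz'}
playwright_cookies = session_cookies_to_playwright(session_cookies, 'https://httpbin.org')
print(playwright_cookies)
```

Inside a Playwright request handler, the resulting list could then be passed to the page's browser context via `add_cookies` before navigation.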
When using PlaywrightCrawler in Crawlee for web scraping, how can I add cookies? Could you provide an example? |
Hey @oldsiks, thank you for your interest in crawlee. Here is an example of setting a cookie at the request-header level:

```python
import asyncio

from crawlee import Request
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        print(await context.page.content())

    await crawler.run([
        Request.from_url(url='https://httpbin.org/get', headers={'cookie': 'my_cookies'})
    ])


asyncio.run(main())
```

Also, after release 0.5 it will be possible to set cookies via `browser_new_context_options`. You can try this using the pre-release version 0.5.0b30. Example for 0.5:

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin


async def main() -> None:
    user_plugin = PlaywrightBrowserPlugin(
        browser_new_context_options={'extra_http_headers': {'cookie': 'my_cookies'}}
    )
    browser_pool = BrowserPool(plugins=[user_plugin])
    crawler = PlaywrightCrawler(browser_pool=browser_pool)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        print(await context.page.content())

    await crawler.run(['https://httpbin.org/get'])


asyncio.run(main())
```
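When passing cookies at the request-header level as in the examples above, multiple cookies must be joined into a single `Cookie` header value. A small helper for that — the function name is my own, not part of crawlee:

```python
def build_cookie_header(cookies: dict[str, str]) -> str:
    """Serialize a name -> value mapping into a single Cookie header value,
    e.g. {'a': '1', 'b': '2'} -> 'a=1; b=2' (RFC 6265 pair syntax)."""
    return '; '.join(f'{name}={value}' for name, value in cookies.items())


# The result can be passed as headers={'cookie': ...} on a crawlee Request.
print(build_cookie_header({'sessionid': 'abc123', 'theme': 'dark'}))
```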
Add the cookies of the session from which the request was made to the crawling context, for both HTTP crawlers and Playwright.