Releases: apify/crawlee-python

0.1.0

09 Jul 06:49

Features

Why is Crawlee the preferred choice for web scraping and crawling?

Why use Crawlee instead of just a random HTTP library with an HTML parser?

  • Unified interface for HTTP & headless browser crawling.
  • Automatic parallel crawling based on available system resources.
  • Written in Python with type hints, which improves developer experience (IDE autocompletion) and reduces bugs (static type checking).
  • Automatic retries on errors or when you’re getting blocked.
  • Integrated proxy rotation and session management.
  • Configurable request routing - direct URLs to the appropriate handlers.
  • Persistent queue for URLs to crawl.
  • Pluggable storage of both tabular data and files.
  • Robust error handling.
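The request-routing idea in the list above can be illustrated with a small standalone sketch. It does not use Crawlee's own API. The `LabelRouter` class, its `handler` and `default_handler` decorators, the handler functions, and the `DETAIL` label are all hypothetical names chosen for illustration, and the example URLs are placeholders. It only shows the general pattern of dispatching each request to a handler chosen by its label, with a fallback for unlabeled or unknown requests.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

# A handler receives a URL and returns a short description of what it did.
Handler = Callable[[str], Awaitable[str]]


class LabelRouter:
    """Hypothetical router: maps a request label to an async handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._default: Optional[Handler] = None

    def default_handler(self, func: Handler) -> Handler:
        """Register the fallback handler for unlabeled or unknown requests."""
        self._default = func
        return func

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        """Register a handler for requests carrying the given label."""
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func
        return register

    async def dispatch(self, url: str, label: Optional[str] = None) -> str:
        """Call the handler registered for the label, else the default one."""
        func = self._handlers.get(label) if label is not None else None
        func = func or self._default
        if func is None:
            raise LookupError(f"no handler registered for label {label!r}")
        return await func(url)


router = LabelRouter()


@router.default_handler
async def handle_listing(url: str) -> str:
    return f"listing:{url}"


@router.handler("DETAIL")
async def handle_detail(url: str) -> str:
    return f"detail:{url}"


async def route_all(requests: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Dispatch (url, label) pairs one by one and collect the results."""
    return [await router.dispatch(url, label) for url, label in requests]
```

Calling `asyncio.run(route_all([("https://example.com/", None), ("https://example.com/item/1", "DETAIL")]))` sends the first URL to the default handler and the second to the `DETAIL` handler.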

Why use Crawlee rather than Scrapy?

  • Crawlee has out-of-the-box support for headless browser crawling (Playwright).
  • Crawlee has a minimalistic & elegant interface - set up your scraper with fewer than 10 lines of code.
  • Complete type hint coverage.
  • Based on the standard asyncio library.

0.0.7

27 Jun 15:00
fdea3d1

Fixes

  • Selector handling for RETRY_CSS_SELECTORS in _handle_blocked_request in BeautifulSoupCrawler
  • Selector handling in enqueue_links in BeautifulSoupCrawler
  • Improved AutoscaledPool state management

0.0.6

25 Jun 13:26
a67b72f

Adds

  • BREAKING: BasicCrawler.export_data helper method which replaces BasicCrawler.export_to
  • Configuration.get_global_configuration method
  • Automatic logging setup
  • Context helper for logging (context.log)

Fixes

  • Handling of relative URLs in add_requests
  • Graceful exit in BasicCrawler.run
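As a rough illustration of what handling relative URLs involves, the sketch below resolves links against the URL of the page they were found on, using only the standard library. It is not the code from this fix. The `resolve_links` function and the example URLs are hypothetical, and dropping non-HTTP schemes is an assumption made for the example.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlsplit


def resolve_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against base_url and keep only http(s) results."""
    resolved = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        absolute = urljoin(base_url, href.strip())
        # Assumption for this sketch: skip schemes such as mailto: or tel:.
        if urlsplit(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved
```

For a page at `https://example.com/docs/intro`, the href `guide` resolves to `https://example.com/docs/guide` and `/pricing` resolves to `https://example.com/pricing`.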

0.0.5

21 Jun 13:35
96ceef6

Adds

  • Add explicit error messages for missing package extras during import
  • Better browser abstraction:
    • BrowserController - Wraps a single browser instance and maintains its state.
    • BrowserPlugin - Manages the browser automation framework, and basically acts as a factory for controllers.
  • Browser rotation with a maximum number of pages opened per browser.
  • Add persist state event emission to the event manager
  • Add batched request addition in RequestQueue
  • Add start requests option to BasicCrawler
  • Add storage-related helpers get_data, push_data and export_to to BasicCrawler and BasicContext
  • Add enqueue_links helper to PlaywrightCrawler
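The browser rotation item above can be pictured with a toy model. This sketch does not launch real browsers or use Crawlee's classes. `FakeBrowser`, `RotatingBrowserPool`, and `max_pages_per_browser` are hypothetical names, and the retire-then-launch policy is an assumption about one reasonable way to enforce a per-browser page limit.

```python
from dataclasses import dataclass
from itertools import count
from typing import List


@dataclass
class FakeBrowser:
    """Stand-in for a real browser instance (hypothetical)."""
    browser_id: int
    pages_opened: int = 0


class RotatingBrowserPool:
    """Hands out pages and rotates to a new browser once the limit is hit."""

    def __init__(self, max_pages_per_browser: int) -> None:
        if max_pages_per_browser < 1:
            raise ValueError("max_pages_per_browser must be at least 1")
        self._max_pages = max_pages_per_browser
        self._ids = count(1)
        self._active = self._launch()
        self.retired: List[FakeBrowser] = []

    def _launch(self) -> FakeBrowser:
        # A real implementation would start a browser process here.
        return FakeBrowser(browser_id=next(self._ids))

    def new_page(self) -> int:
        """Open a page and return the id of the browser that hosts it."""
        if self._active.pages_opened >= self._max_pages:
            # The active browser reached its limit: retire it and rotate.
            self.retired.append(self._active)
            self._active = self._launch()
        self._active.pages_opened += 1
        return self._active.browser_id
```

With `RotatingBrowserPool(2)`, five consecutive `new_page()` calls are served by browsers 1, 1, 2, 2 and 3, and two browsers end up retired.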

Fixes

  • Fix type error in persist state of statistics

0.0.4

30 May 09:18
598a266

Another internal release, adding statistics capturing, proxy configuration,
and the initial version of browser management and PlaywrightCrawler.

Adds

  • Statistics
  • ProxyConfiguration
  • BrowserPool
  • PlaywrightCrawler

0.0.3

15 May 09:30
851042f

Another internal release, adding mainly session management and BeautifulSoupCrawler.

Adds

  • HttpxClient
  • SessionPool
  • BeautifulSoupCrawler
  • BaseStorageClient
  • Storages and MemoryStorageClient were refactored

Added in 0.0.2

  • EventManager & LocalEventManager
  • Snapshotter
  • AutoscaledPool
  • MemoryStorageClient
  • Storages
  • BasicCrawler & HttpCrawler