Releases · apify/crawlee-python

09 Jul 06:49

vdusek

v0.1.0

b13b89a

0.1.0

Crawlee is a web scraping and browser automation library.
See the blog post: Launching Crawlee for Python.

Features

Why is Crawlee the preferred choice for web scraping and crawling?

Why use Crawlee instead of just a random HTTP library with an HTML parser?

Unified interface for HTTP & headless browser crawling.
Automatic parallel crawling based on available system resources.
Written in Python with type hints - enhances DX (IDE autocompletion) and reduces bugs (static type checking).
Automatic retries on errors or when you’re getting blocked.
Integrated proxy rotation and session management.
Configurable request routing - direct URLs to the appropriate handlers.
Persistent queue for URLs to crawl.
Pluggable storage of both tabular data and files.
Robust error handling.
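
To make the "configurable request routing" point concrete, here is a minimal sketch of label-based routing using only the standard library. It is not Crawlee's implementation or API: the `Router` class, its `handler` and `dispatch` methods, and the `DETAIL` label are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]


class Router:
    """Toy label-based router (hypothetical; not Crawlee's own class)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._default: Optional[Handler] = None

    def default_handler(self, func: Handler) -> Handler:
        # Register the fallback used when no label matches.
        self._default = func
        return func

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        # Register a handler for requests carrying the given label.
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func

        return register

    def dispatch(self, url: str, label: Optional[str] = None) -> str:
        # Pick the labelled handler if one exists, otherwise the default.
        func = self._handlers.get(label) if label is not None else None
        func = func or self._default
        if func is None:
            raise LookupError(f"no handler registered for label {label!r}")
        return func(url)


router = Router()


@router.default_handler
def handle_listing(url: str) -> str:
    return f"listing:{url}"


@router.handler("DETAIL")
def handle_detail(url: str) -> str:
    return f"detail:{url}"
```

Calling `router.dispatch(url, label="DETAIL")` runs `handle_detail`, while an unlabelled or unknown-label request falls back to `handle_listing`.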

Why use Crawlee rather than Scrapy?

Crawlee has out-of-the-box support for headless browser crawling (Playwright).
Crawlee has a minimalistic & elegant interface - set up your scraper with fewer than 10 lines of code.
Complete type hint coverage.
Based on standard asyncio.
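
As a rough illustration of what parallel crawling on top of asyncio looks like, the sketch below caps the number of in-flight requests with a semaphore. It uses only the standard library; `fetch`, `crawl`, and the fixed `max_concurrency` value are assumptions for this example, and unlike Crawlee it does not adjust concurrency to available system resources.

```python
import asyncio
from typing import List


async def fetch(url: str) -> str:
    # Stand-in for a real HTTP request: sleeps briefly instead of doing I/O.
    await asyncio.sleep(0.01)
    return f"fetched {url}"


async def crawl(urls: List[str], max_concurrency: int = 2) -> List[str]:
    # The semaphore caps how many fetches run at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


if __name__ == "__main__":
    pages = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
    print(pages)
```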

27 Jun 15:00

vdusek

v0.0.7

fdea3d1

0.0.7

Fixes

Selector handling for RETRY_CSS_SELECTORS in _handle_blocked_request in BeautifulSoupCrawler
Selector handling in enqueue_links in BeautifulSoupCrawler
Improved AutoscaledPool state management

25 Jun 13:26

vdusek

v0.0.6

a67b72f

0.0.6

Adds

BREAKING: BasicCrawler.export_data helper method, which replaces BasicCrawler.export_to
Configuration.get_global_configuration method
Automatic logging setup
Context helper for logging (context.log)

Fixes

Handling of relative URLs in add_requests
Graceful exit in BasicCrawler.run
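
For context on the relative-URL fix, the sketch below shows one common way to resolve relative links against the page they were found on, using `urllib.parse.urljoin` from the standard library. The `resolve_urls` helper is hypothetical and is not taken from Crawlee's source; it only illustrates the kind of resolution involved.

```python
from typing import List
from urllib.parse import urljoin


def resolve_urls(base_url: str, hrefs: List[str]) -> List[str]:
    """Resolve each href against base_url; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


links = resolve_urls(
    "https://example.com/docs/intro",
    ["/about", "guide", "https://other.example/x"],
)
# links == ["https://example.com/about",
#           "https://example.com/docs/guide",
#           "https://other.example/x"]
```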

21 Jun 13:35

vdusek

v0.0.5

96ceef6

0.0.5

Adds

Add explicit error messages for missing package extras during import
Better browser abstraction:
- BrowserController - Wraps a single browser instance and maintains its state.
- BrowserPlugin - Manages the browser automation framework and acts as a factory for controllers.
Browser rotation with a maximum number of pages opened per browser.
Add emitting of the persist state event to the event manager
Add batched request addition in RequestQueue
Add start requests option to BasicCrawler
Add storage-related helpers get_data, push_data and export_to to BasicCrawler and BasicContext
Add PlaywrightCrawler's enqueue links helper
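
The idea behind browser rotation with a per-browser page limit can be sketched as follows. This is a simplified model that opens no real browser; `FakeBrowser`, `RotatingPool`, and `max_pages_per_browser` are hypothetical names, not the BrowserController, BrowserPlugin, or BrowserPool interfaces themselves.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a browser controller; opens no real browser."""

    browser_id: int
    max_pages: int
    pages_opened: int = 0

    @property
    def has_capacity(self) -> bool:
        return self.pages_opened < self.max_pages

    def new_page(self) -> str:
        self.pages_opened += 1
        return f"browser-{self.browser_id}/page-{self.pages_opened}"


class RotatingPool:
    """Starts a fresh browser once the current one reaches its page limit."""

    def __init__(self, max_pages_per_browser: int) -> None:
        self._max_pages = max_pages_per_browser
        self.browsers: List[FakeBrowser] = []

    def new_page(self) -> str:
        # Rotate when there is no browser yet or the newest one is full.
        if not self.browsers or not self.browsers[-1].has_capacity:
            next_id = len(self.browsers) + 1
            self.browsers.append(FakeBrowser(next_id, self._max_pages))
        return self.browsers[-1].new_page()
```

With a limit of two pages per browser, the third call to `new_page` launches a second (fake) browser instead of reusing the first.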

Fixes

Fix type error in persist state of statistics

30 May 09:18

vdusek

v0.0.4

598a266

0.0.4

Another internal release, adding statistics capturing, proxy configuration,
and the initial version of browser management and PlaywrightCrawler.

Adds

Statistics
ProxyConfiguration
BrowserPool
PlaywrightCrawler

15 May 09:30

vdusek

v0.0.3

851042f

0.0.3

Another internal release, adding mainly session management and BeautifulSoupCrawler.

Adds

HttpxClient
SessionPool
BeautifulSoupCrawler
BaseStorageClient
Storages and MemoryStorageClient were refactored

Added in 0.0.2

EventManager & LocalEventManager
Snapshotter
AutoscaledPool
MemoryStorageClient
Storages
BasicCrawler & HttpCrawler
