Releases: apify/crawlee-python
0.1.0
- Crawlee is a web scraping and browser automation library.
- Launching Crawlee for Python blog post
Features
Why is Crawlee the preferred choice for web scraping and crawling?
Why use Crawlee instead of just a random HTTP library with an HTML parser?
- Unified interface for HTTP & headless browser crawling.
- Automatic parallel crawling based on available system resources.
- Written in Python with type hints - enhances DX (IDE autocompletion) and reduces bugs (static type checking).
- Automatic retries on errors or when you’re getting blocked.
- Integrated proxy rotation and session management.
- Configurable request routing - direct URLs to the appropriate handlers.
- Persistent queue for URLs to crawl.
- Pluggable storage of both tabular data and files.
- Robust error handling.
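To make the request-routing idea in the list above concrete, here is a minimal standard-library sketch of label-based dispatch. It is not Crawlee's API: `LabelRouter`, `handler`, `default_handler`, `dispatch`, and the example URLs are hypothetical names chosen for illustration only.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

# A handler receives a URL and returns some result asynchronously.
Handler = Callable[[str], Awaitable[str]]


class LabelRouter:
    """Hypothetical router that maps a request label to an async handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._default: Optional[Handler] = None

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        """Decorator factory: register a handler for one label."""
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func
        return register

    def default_handler(self, func: Handler) -> Handler:
        """Decorator: register the fallback for unlabeled or unknown requests."""
        self._default = func
        return func

    async def dispatch(self, url: str, label: Optional[str] = None) -> str:
        """Call the handler for `label`, falling back to the default handler."""
        func = self._handlers.get(label) if label is not None else None
        func = func or self._default
        if func is None:
            raise LookupError(f"no handler registered for label {label!r}")
        return await func(url)


router = LabelRouter()


@router.default_handler
async def handle_listing(url: str) -> str:
    return f"listing:{url}"


@router.handler("detail")
async def handle_detail(url: str) -> str:
    return f"detail:{url}"
```

Calling `asyncio.run(router.dispatch("https://example.com/p/1", "detail"))` runs `handle_detail`, while a request without a label falls through to `handle_listing`.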
Why use Crawlee rather than Scrapy?
- Crawlee has out-of-the-box support for headless browser crawling (Playwright).
- Crawlee has a minimalistic & elegant interface - set up your scraper with fewer than 10 lines of code.
- Complete type hint coverage.
- Based on standard Asyncio.
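As a rough illustration of what building on standard asyncio can look like, the sketch below runs several fetches concurrently under a fixed limit. It is an assumption-laden toy, not Crawlee code: `fake_fetch` and `crawl_all` are hypothetical, no network access happens, and the fixed `max_concurrency` stands in for the resource-based autoscaling the release notes describe.

```python
import asyncio
from typing import List


async def fake_fetch(url: str) -> str:
    """Stand-in for a network request; yields to the event loop once."""
    await asyncio.sleep(0)
    return url.upper()


async def crawl_all(urls: List[str], max_concurrency: int = 2) -> List[str]:
    """Fetch all URLs with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str) -> str:
        # The semaphore blocks here once the concurrency limit is reached.
        async with semaphore:
            return await fake_fetch(url)

    # gather() preserves the order of the input URLs in its result list.
    return await asyncio.gather(*(worker(url) for url in urls))
```

Running `asyncio.run(crawl_all(["a", "b", "c"]))` returns the results in input order.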
0.0.7
0.0.6
Adds
- BREAKING: BasicCrawler.export_data helper method, which replaces BasicCrawler.export_to
- Configuration.get_global_configuration method
- Automatic logging setup
- Context helper for logging (context.log)
Fixes
- Handling of relative URLs in add_requests
- Graceful exit in BasicCrawler.run
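The relative-URL fix mentioned above concerns turning links such as /about into absolute URLs before they are queued. How Crawlee does this internally is not stated here; the following is only a sketch of the general technique using the standard library's `urljoin`, with `resolve_urls` being a hypothetical helper name.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def resolve_urls(base_url: str, links: Iterable[str]) -> List[str]:
    """Resolve each link against the page it was found on.

    Absolute links pass through unchanged; relative ones are joined
    with `base_url` following the usual URL resolution rules.
    """
    return [urljoin(base_url, link) for link in links]
```

For a page at `https://example.com/docs/intro`, the link `guide` resolves to `https://example.com/docs/guide`, and `/about` resolves to `https://example.com/about`.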
0.0.5
Adds
- Add explicit error messages for missing package extras during import
- Better browser abstraction:
  - BrowserController - Wraps a single browser instance and maintains its state.
  - BrowserPlugin - Manages the browser automation framework, and basically acts as a factory for controllers.
- Browser rotation with a maximum number of pages opened per browser.
- Add emit persist state event to event manager
- Add batched request addition in RequestQueue
- Add start requests option to BasicCrawler
- Add storage-related helpers get_data, push_data and export_to to BasicCrawler and BasicContext
- Add PlaywrightCrawler's enqueue links helper
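The browser rotation item above can be pictured with a small bookkeeping sketch: once a browser has opened its maximum number of pages, the next page goes to a fresh browser. This does not use Playwright or Crawlee's real classes; `FakeBrowser`, `RotatingBrowserPool`, and `max_pages_per_browser` are hypothetical names, and the real implementation may well differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeBrowser:
    """Stand-in for a real browser instance; only counts opened pages."""
    browser_id: int
    pages_opened: int = 0


class RotatingBrowserPool:
    """Hypothetical pool that retires a browser after N opened pages."""

    def __init__(self, max_pages_per_browser: int) -> None:
        if max_pages_per_browser < 1:
            raise ValueError("max_pages_per_browser must be at least 1")
        self.max_pages_per_browser = max_pages_per_browser
        self.browsers: List[FakeBrowser] = []

    def new_page(self) -> int:
        """Open a page and return the id of the browser that owns it."""
        needs_rotation = (
            not self.browsers
            or self.browsers[-1].pages_opened >= self.max_pages_per_browser
        )
        if needs_rotation:
            # Launch a fresh browser instead of reusing the exhausted one.
            self.browsers.append(FakeBrowser(browser_id=len(self.browsers)))
        current = self.browsers[-1]
        current.pages_opened += 1
        return current.browser_id
```

With a limit of two pages per browser, five consecutive `new_page()` calls are served by three browsers.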
Fixes
- Fix type error in persist state of statistics
0.0.4
0.0.3
Another internal release, adding mainly session management and BeautifulSoupCrawler.
Adds
- HttpxClient
- SessionPool
- BeautifulSoupCrawler
- BaseStorageClient
- Storages and MemoryStorageClient were refactored
Added in 0.0.2
- EventManager & LocalEventManager
- Snapshotter
- AutoscaledPool
- MemoryStorageClient
- Storages
- BasicCrawler & HttpCrawler