…equestList`, accept any iterable in `RequestList` constructor (#777)
> Tandem, or in tandem, is an arrangement in which two or more animals,
machines, or people are lined up one behind another, all facing in the
same
direction.[[1]](https://en.wikipedia.org/wiki/Tandem#cite_note-OED-1)
Tandem can also be used more generally to refer to any group of persons
or objects working together, not necessarily in
line.[[1]](https://en.wikipedia.org/wiki/Tandem#cite_note-OED-1)
(https://en.wikipedia.org/wiki/Tandem)
- Inspired by
https://github.com/apify/crawlee/blob/4c95847d5cedd6514620ccab31d5b242ba76de80/packages/basic-crawler/src/internals/basic-crawler.ts#L1154-L1177
and related code in the same class
- In my opinion, this implements the feature more cleanly and without
polluting `BasicCrawler` any further
- The motivation for the feature is twofold:
1. Apify Actor development - an Actor commonly receives a
`requestListSources` input from the user, which may be fairly complex
(regexp-based extraction from remote URL lists) and which is usually
parsed using `apify.RequestList.open`. At the same time, the Actor wants
to use the built-in `RequestQueue`.
2. Sitemap parsing (#248) - similar to 1, but not coupled to the Apify
platform: we want to read URLs from a sitemap in the background, but
the URLs should still go through the standard request queue.
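To make the idea concrete, here is a minimal, library-free sketch of the tandem pattern described above: a read-only source is drained into a writable queue, and the crawler only ever talks to the combined object. The names `ReadOnlyLoader`, `SimpleQueue`, `Tandem` and `crawl_all` are hypothetical stand-ins invented for this illustration. They are not crawlee's classes and do not mirror the real `RequestManagerTandem` implementation.

```python
import asyncio
from collections import deque
from typing import Iterable, List, Optional


class ReadOnlyLoader:
    """Hypothetical read-only source, e.g. a URL list or a sitemap reader."""

    def __init__(self, urls: Iterable[str]) -> None:
        self._urls = iter(urls)  # any iterable is accepted

    async def fetch_next(self) -> Optional[str]:
        return next(self._urls, None)


class SimpleQueue:
    """Hypothetical writable queue that drops URLs it has already seen."""

    def __init__(self) -> None:
        self._pending = deque()
        self._seen = set()

    async def add(self, url: str) -> None:
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    async def fetch_next(self) -> Optional[str]:
        return self._pending.popleft() if self._pending else None


class Tandem:
    """Forwards requests from the loader into the queue and serves from the queue."""

    def __init__(self, loader: ReadOnlyLoader, queue: SimpleQueue) -> None:
        self._loader = loader
        self._queue = queue

    async def add(self, url: str) -> None:
        # New requests discovered during crawling go straight to the queue.
        await self._queue.add(url)

    async def fetch_next(self) -> Optional[str]:
        while True:
            # Move one request from the loader into the queue, if any remain.
            url = await self._loader.fetch_next()
            if url is not None:
                await self._queue.add(url)
            nxt = await self._queue.fetch_next()
            # Stop when the queue produced something or the loader is exhausted.
            if nxt is not None or url is None:
                return nxt


async def crawl_all(urls: Iterable[str]) -> List[str]:
    tandem = Tandem(ReadOnlyLoader(urls), SimpleQueue())
    visited = []
    while (url := await tandem.fetch_next()) is not None:
        visited.append(url)
        if url.endswith("/a"):
            # Simulate a link found on the page being enqueued.
            await tandem.add(url + "/child")
    return visited
```

Calling `asyncio.run(crawl_all(urls))` returns the URLs in the order they were served. The point of the design is that the list stays read-only while deduplication, ordering and newly discovered links are all handled by the queue.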
## Breaking changes
- `RequestList` does not support `.drop()`, `.reclaim_request()`,
`.add_request()` and `.add_requests_batched()` anymore
- `RequestManagerTandem` with a `RequestQueue` should be used for this
use case; `await list.to_tandem()` can be used as a shortcut
- The `RequestProvider` interface has been renamed to `RequestManager`
and moved to the `crawlee.request_loaders` package
- `RequestList` has been moved to the `crawlee.request_loaders` package
- The `BasicCrawler.get_request_provider` method has been renamed to
`BasicCrawler.get_request_manager` and it does not accept the `id` and
`name` arguments anymore
- The `request_provider` parameter of `BasicCrawler.__init__` has been
renamed to `request_manager`
## TODO
- [x] new tests
- [x] fix existing tests
---------
Co-authored-by: Vlada Dusek <[email protected]>
similar to what we're implementing in JS crawlee