Sitemap-based request provider #248

Open
janbuchar opened this issue Jun 28, 2024 · 0 comments
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@janbuchar
Collaborator

Similar to what we're implementing in the JS version of Crawlee.

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Jun 28, 2024
@vdusek vdusek added the enhancement New feature or request. label Jul 15, 2024
janbuchar added a commit that referenced this issue Dec 19, 2024
…equestList`, accept any iterable in `RequestList` constructor (#777)

> Tandem, or in tandem, is an arrangement in which two or more animals,
> machines, or people are lined up one behind another, all facing in the
> same direction.[[1]](https://en.wikipedia.org/wiki/Tandem#cite_note-OED-1)
> Tandem can also be used more generally to refer to any group of persons
> or objects working together, not necessarily in
> line.[[1]](https://en.wikipedia.org/wiki/Tandem#cite_note-OED-1)
> (https://en.wikipedia.org/wiki/Tandem)

- Inspired by
https://github.com/apify/crawlee/blob/4c95847d5cedd6514620ccab31d5b242ba76de80/packages/basic-crawler/src/internals/basic-crawler.ts#L1154-L1177
and related code in the same class
- In my opinion, it implements the feature more cleanly and without
polluting `BasicCrawler` (...any further)
- The motivation for the feature is twofold:
  1. Apify Actor development: an Actor commonly receives a
  `requestListSources` input from the user, which may be fairly complex
  (regexp-based extraction from remote URL lists) and which is usually
  parsed using `apify.RequestList.open`. At the same time, the Actor wants
  to use the built-in `RequestQueue`.
  2. Sitemap parsing (#248): similar to 1, but not coupled to the Apify
  platform. We want to read URLs from a sitemap in the background, but
  the URLs should still go through the standard request queue.
## Breaking changes

- `RequestList` no longer supports `.drop()`, `.reclaim_request()`,
`.add_request()` and `.add_requests_batched()`
  - `RequestManagerTandem` with a `RequestQueue` should be used for this
  use case; `await list.to_tandem()` can be used as a shortcut
- The `RequestProvider` interface has been renamed to `RequestManager`
and moved to the `crawlee.request_loaders` package
- `RequestList` has been moved to the `crawlee.request_loaders` package
- The `BasicCrawler.get_request_provider` method has been renamed to
`BasicCrawler.get_request_manager` and it does not accept the `id` and
`name` arguments anymore
- The `request_provider` parameter of `BasicCrawler.__init__` has been
renamed to `request_manager`
 
## TODO
- [x] new tests
- [x] fix existing tests

---------

Co-authored-by: Vlada Dusek <[email protected]>