
feat: Add keep_alive flag to crawler.__init__ #921

Merged

merged 9 commits into master on Jan 22, 2025
Conversation

Pijukatel
Contributor

@Pijukatel Pijukatel commented Jan 20, 2025

Description

Add keep_alive flag to crawler.__init__

If True, this flag keeps the crawler alive even when there are no more requests in the queue. The crawler then waits for more requests to be added, or to be stopped explicitly via crawler.stop().

Adds a test and a code example in the docs.
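
For context, a minimal sketch of how the flag is meant to be used (assumed names from the crawlee Python API; the actual docs example may differ):

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext


async def main() -> None:
    # keep_alive=True keeps the crawler running even when the queue is empty.
    crawler = BasicCrawler(keep_alive=True)

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    crawler_task = asyncio.create_task(crawler.run())

    # The crawler idles here instead of finishing, waiting for new requests.
    await crawler.add_requests(['https://crawlee.dev'])

    # Give it a moment to pick the request up, then shut down explicitly.
    await asyncio.sleep(5)
    crawler.stop()
    await crawler_task


asyncio.run(main())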

Issues

@Pijukatel Pijukatel added the enhancement and t-tooling labels Jan 20, 2025
@github-actions github-actions bot added this to the 106th sprint - Tooling team milestone Jan 20, 2025
@github-actions github-actions bot added the tested label Jan 20, 2025
@Pijukatel Pijukatel marked this pull request as ready for review January 20, 2025 09:22
@Pijukatel Pijukatel requested review from vdusek and janbuchar January 20, 2025 09:26
Comment on lines 1129 to 1136
@pytest.mark.parametrize(
('keep_alive', 'max_requests_per_crawl', 'should_process_added_request'),
[
pytest.param(True, 1, True, id='keep_alive'),
pytest.param(True, 0, False, id='keep_alive, but max_requests_per_crawl achieved'),
pytest.param(False, 1, False, id='Crawler without keep_alive (default)'),
],
)
Collaborator

Could you add a test case with max_requests_per_crawl > 1?

Contributor Author

Done
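
(Presumably a case along these lines; this exact parameter set is a guess, not the committed diff:)

pytest.param(True, 2, True, id='keep_alive with max_requests_per_crawl > 1'),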

crawler_run_task = asyncio.create_task(crawler.run())

# Give the crawler some time to finish (or reach the keep_alive state), then add a new request.
await asyncio.sleep(1)
Collaborator

Isn't 1 second like... a lot?

Contributor Author

Well, any time-related test is tricky when there is no event to wait for. How do I make sure that the crawler is alive because keep_alive=True, and not just because it is randomly slow and takes time to shut down?
I could wrap basic_crawler.__is_finished_function in a mock and wait until it is called at least once instead of waiting a fixed time. The test would be faster, but it would leak implementation details. Do you prefer that, or some other option?
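
(A rough sketch of that mock-based option; the name-mangled attribute access and the helper are hypothetical, not part of this PR:)

import asyncio


async def wait_for_finished_check(crawler) -> None:
    # Wrap the crawler's private finish check so the test can await an event
    # instead of sleeping for a fixed interval. Assumes BasicCrawler looks the
    # method up via self, so an instance attribute shadows the class method.
    checked = asyncio.Event()
    original = crawler._BasicCrawler__is_finished_function

    async def wrapped() -> bool:
        checked.set()  # Signal that the crawler polled its finish condition.
        return await original()

    crawler._BasicCrawler__is_finished_function = wrapped
    await checked.wait()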

Collaborator

Hip shot, but maybe we could add a method for checking the activity of the crawler - basically "is it working on something right now?" Then you could check for that and whether the queue is empty.

Contributor Author

The crawler already has a bunch of internal private state flags. This method could inspect those flags, and maybe also the _autoscaled_pool state, and report back some sort of public state assessment?
I don't mind the idea, but that looks very much like a standalone PR. Then I could modify this test with the newly added "state" method.

Just a quick guess at possible states:
"Starting"
"Processing requests"
"Waiting in keep alive"
"Shutting down due to unexpected stop"
"Aborted"
"Finished"
"Stopped"
....
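
(As an enum, this hypothetical public state could look like the following; the names are lifted from the list above, none of this is actual crawlee API:)

from enum import Enum, auto


class CrawlerState(Enum):
    STARTING = auto()
    PROCESSING_REQUESTS = auto()
    WAITING_IN_KEEP_ALIVE = auto()
    SHUTTING_DOWN = auto()  # due to an unexpected stop
    ABORTED = auto()
    FINISHED = auto()
    STOPPED = auto()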

Collaborator

Yeah, with the introduction of the keep_alive flag, being able to inspect the crawler state makes a lot of sense IMO. Feel free to make an issue and add a TODO to the test that references it.

Collaborator

@vdusek vdusek left a comment

Could we mention it in the documentation, please? Either a guide or an example. It might fit well alongside the crawler.stop method, and maybe elsewhere.

src/crawlee/crawlers/_basic/_basic_crawler.py (outdated, resolved)
@Pijukatel Pijukatel requested a review from vdusek January 21, 2025 10:40
Remove redundant dot
Collaborator

@vdusek vdusek left a comment

Nice, thanks.

@Pijukatel Pijukatel merged commit 7a82d0c into master Jan 22, 2025
23 checks passed
@Pijukatel Pijukatel deleted the keep-alive branch January 22, 2025 12:00
Development

Successfully merging this pull request may close these issues:

Add a keep_alive flag to BasicCrawler