
feat: Add keep_alive flag to crawler.__init__ #921

Merged

merged 9 commits into master on Jan 22, 2025
Conversation

Pijukatel
Contributor

@Pijukatel Pijukatel commented Jan 20, 2025

Description

Add keep_alive flag to crawler.__init__

If True, this flag keeps the crawler alive even when there are no more requests in the queue. The crawler then waits for more requests to be added, or to be stopped explicitly via crawler.stop().

Adds a test and a code example in the docs.
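
For context, a minimal sketch of how the flag is meant to be used (assumed names from the crawlee Python API; the actual docs example may differ):

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext


async def main() -> None:
    # keep_alive=True keeps the crawler running even when the queue is empty.
    crawler = BasicCrawler(keep_alive=True)

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    crawler_task = asyncio.create_task(crawler.run())

    # The crawler idles here instead of finishing, waiting for new requests.
    await crawler.add_requests(['https://crawlee.dev'])

    # Give it a moment to pick the request up, then shut down explicitly.
    await asyncio.sleep(5)
    crawler.stop()
    await crawler_task


asyncio.run(main())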

Issues

@Pijukatel Pijukatel added the enhancement and t-tooling labels Jan 20, 2025
@github-actions github-actions bot added this to the 106th sprint - Tooling team milestone Jan 20, 2025
@github-actions github-actions bot added the tested label Jan 20, 2025
@Pijukatel Pijukatel marked this pull request as ready for review January 20, 2025 09:22
@Pijukatel Pijukatel requested review from vdusek and janbuchar January 20, 2025 09:26
Comment on lines 1129 to 1136
@pytest.mark.parametrize(
('keep_alive', 'max_requests_per_crawl', 'should_process_added_request'),
[
pytest.param(True, 1, True, id='keep_alive'),
pytest.param(True, 0, False, id='keep_alive, but max_requests_per_crawl achieved'),
pytest.param(False, 1, False, id='Crawler without keep_alive (default)'),
],
)
Collaborator

Could you add a test case with max_requests_per_crawl > 1?

Contributor Author

Done
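
(Presumably a case along these lines; this exact parameter set is a guess, not the committed diff:)

pytest.param(True, 2, True, id='keep_alive with max_requests_per_crawl > 1'),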

crawler_run_task = asyncio.create_task(crawler.run())

# Give the crawler some time to finish (or reach the keep_alive state), then add a new request.
await asyncio.sleep(1)
Collaborator

Isn't 1 second like... a lot?

Contributor Author

Well, any time-related test is tricky when there is no event to wait for. How do I make sure that the crawler is alive because keep_alive=True, and not just because it is randomly slow and takes time to shut down?
I could wrap basic_crawler.__is_finished_function in a mock and wait until it is called at least once instead of waiting a fixed time. The test would be faster, but it would leak implementation details. Do you prefer that, or some other option?
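
(A rough sketch of that mock-based option; the name-mangled attribute access and the helper are hypothetical, not part of this PR:)

import asyncio


async def wait_for_finished_check(crawler) -> None:
    # Wrap the crawler's private finish check so the test can await an event
    # instead of sleeping for a fixed interval. Assumes BasicCrawler looks the
    # method up via self, so an instance attribute shadows the class method.
    checked = asyncio.Event()
    original = crawler._BasicCrawler__is_finished_function

    async def wrapped() -> bool:
        checked.set()  # Signal that the crawler polled its finish condition.
        return await original()

    crawler._BasicCrawler__is_finished_function = wrapped
    await checked.wait()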

Collaborator

Hip shot, but maybe we could add a method for checking the activity of the crawler - basically "is it working on something right now?" Then you could check for that and whether the queue is empty.

Contributor Author

The crawler already has a bunch of internal private state flags. This method could inspect those flags, and maybe also the _autoscaled_pool state, and report back some sort of public state assessment?
I don't mind the idea, but that looks very much like a standalone PR. Then I could modify this test with the newly added "state" method.

Just a quick guess at possible states:
"Starting"
"Processing requests"
"Waiting in keep alive"
"Shutting down due to unexpected stop"
"Aborted"
"Finished"
"Stopped"
....
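
(As an enum, this hypothetical public state could look like the following; the names are lifted from the list above, none of this is actual crawlee API:)

from enum import Enum, auto


class CrawlerState(Enum):
    STARTING = auto()
    PROCESSING_REQUESTS = auto()
    WAITING_IN_KEEP_ALIVE = auto()
    SHUTTING_DOWN = auto()  # due to an unexpected stop
    ABORTED = auto()
    FINISHED = auto()
    STOPPED = auto()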

Collaborator

Yeah, with the introduction of the keep_alive flag, being able to inspect the crawler state makes a lot of sense IMO. Feel free to make an issue and add a TODO to the test that references it.

Collaborator

@vdusek vdusek left a comment

Could we mention it in the documentation, please? Either a guide or an example. It might fit well alongside the crawler.stop method, and maybe elsewhere.

src/crawlee/crawlers/_basic/_basic_crawler.py (outdated, resolved)
@Pijukatel Pijukatel requested a review from vdusek January 21, 2025 10:40
Remove redundant dot
Collaborator

@vdusek vdusek left a comment

Nice, thanks.

@Pijukatel Pijukatel merged commit 7a82d0c into master Jan 22, 2025
23 checks passed
@Pijukatel Pijukatel deleted the keep-alive branch January 22, 2025 12:00
Development

Successfully merging this pull request may close these issues:

Add a keep_alive flag to BasicCrawler