feat: Add keep_alive flag to crawler.__init__
#921
Conversation
Add a test for this use case.
@pytest.mark.parametrize(
    ('keep_alive', 'max_requests_per_crawl', 'should_process_added_request'),
    [
        pytest.param(True, 1, True, id='keep_alive'),
        pytest.param(True, 0, False, id='keep_alive, but max_requests_per_crawl achieved'),
        pytest.param(False, 1, False, id='Crawler without keep_alive (default)'),
    ],
)
Could you add a test case with max_requests_per_crawl > 1?
Done
crawler_run_task = asyncio.create_task(crawler.run())

# Give the crawler some time to finish (or stay in the keep_alive state) and add a new request.
await asyncio.sleep(1)
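For context, a rough sketch of how the whole test might look around the quoted lines; the handler, the processed set, the URLs, and the import path are assumptions for illustration, not the actual diff:

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext  # import path may differ by crawlee version


async def test_keep_alive_processes_late_request(
    keep_alive: bool, max_requests_per_crawl: int, should_process_added_request: bool
) -> None:
    crawler = BasicCrawler(keep_alive=keep_alive, max_requests_per_crawl=max_requests_per_crawl)
    processed: set[str] = set()

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        processed.add(context.request.url)

    crawler_run_task = asyncio.create_task(crawler.run())

    # Give the crawler some time to finish (or stay in the keep_alive state) and add a new request.
    await asyncio.sleep(1)
    await crawler.add_requests(['https://example.com/added-later'])
    await asyncio.sleep(1)

    crawler.stop()
    await crawler_run_task

    assert ('https://example.com/added-later' in processed) == should_process_added_request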
Isn't 1 second like... a lot?
Well, any time-related test is tricky if there is no event to wait for. How do I make sure that the crawler is alive because keep_alive=True, and not just because it is randomly slow and takes time to shut down?
I could wrap basic_crawler.__is_finished_function in a mock and wait until it is called at least once instead of waiting a fixed time. That would make the test faster, but it would leak implementation details. Do you prefer that or some other option?
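For illustration, this is roughly what that mock-based variant could look like; the name-mangled attribute and its signature are assumptions about the private internals, which is exactly the implementation leak mentioned above:

# Sketch only - relies on a private, name-mangled attribute of BasicCrawler.
finished_check_called = asyncio.Event()
original_is_finished = crawler._BasicCrawler__is_finished_function

async def spying_is_finished() -> bool:
    finished_check_called.set()
    return await original_is_finished()

crawler._BasicCrawler__is_finished_function = spying_is_finished

crawler_run_task = asyncio.create_task(crawler.run())

# Wait for the crawler to evaluate "am I finished?" at least once instead of sleeping a fixed time.
await asyncio.wait_for(finished_check_called.wait(), timeout=5)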
Hip shot, but maybe we could add a method for checking the activity of the crawler - basically "is it working on something right now?" Then you could check for that and whether the queue is empty.
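Something like the following, presumably (both is_busy and the queue check are hypothetical names here, not existing crawlee API):

# Hypothetical API - neither is_busy nor request_queue_is_empty exists today.
while crawler.is_busy or not await crawler.request_queue_is_empty():
    await asyncio.sleep(0.05)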
The crawler already has a bunch of internal private state flags. This method could inspect those flags, and maybe also the _autoscaled_pool state, and report back some sort of public state assessment.
I don't mind the idea, but that looks very much like a standalone PR. I could then modify this test to use the newly added "state" method.
Just a quick guess at possible states (a rough enum sketch follows the list):
"Starting"
"Processing requests"
"Waiting in keep alive"
"Shutting down due to unexpected stop"
"Aborted"
"Finished"
"Stopped"
....
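A rough sketch of how that public state could look, using the guesses above (nothing here exists in crawlee yet):

from enum import Enum, auto

# Hypothetical public crawler state - names taken from the list above.
class CrawlerState(Enum):
    STARTING = auto()
    PROCESSING_REQUESTS = auto()
    WAITING_IN_KEEP_ALIVE = auto()
    SHUTTING_DOWN_UNEXPECTEDLY = auto()
    ABORTED = auto()
    FINISHED = auto()
    STOPPED = auto()

# The crawler could then expose e.g. a `state` property derived from its internal flags
# and the _autoscaled_pool, and the test could wait for WAITING_IN_KEEP_ALIVE.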
Yeah, with the introduction of the keep_alive flag, being able to inspect the crawler state makes a lot of sense IMO. Feel free to make an issue and add a TODO to the test that references it.
Could we mention it in the documentation, please? Either a guide or an example. It might fit well alongside the crawler.stop method, and maybe more.
Nice, thanks.
Description
Add keep_alive flag to crawler.__init__.
If True, this flag keeps the crawler alive even when there are no more requests in the queue. The crawler then waits for more requests to be added, or to be stopped explicitly via crawler.stop().
Add a test and a code example in the docs.
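A rough sketch of what the added docs example might show; the concrete crawler class, URLs, sleep durations, and import path are placeholder assumptions:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext  # import path may differ by crawlee version


async def main() -> None:
    # keep_alive=True keeps the crawler running even when its request queue is empty.
    crawler = HttpCrawler(keep_alive=True)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    run_task = asyncio.create_task(crawler.run(['https://crawlee.dev']))

    # The initial queue drains, but the crawler stays alive and picks up requests added later.
    await asyncio.sleep(5)
    await crawler.add_requests(['https://crawlee.dev/docs'])

    # Shut the crawler down explicitly once no more requests will be added.
    await asyncio.sleep(5)
    crawler.stop()
    await run_task


if __name__ == '__main__':
    asyncio.run(main())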
Issues
Add keep_alive flag to BasicCrawler #891