
Scrapyrt scrape multiple spiders asynchronously at once instead of overwhelming the server with requests #130

Open
xaander1 opened this issue Jun 16, 2021 · 2 comments

@xaander1

@pawelmhm requesting the ability to scrape multiple spiders asynchronously at once instead of overwhelming the server with requests.
Here is what I mean:

{
    "request": {
        "url":["https://www.site1.com","https://www.site2.com","https://www.site3.com"] ,
        "callback": "parse_product",
        "dont_filter": "True"
    },
    "spider_name": ["Site1","Site2","Site3"]
}

Enabling the ability to scrape multiple spiders at once in real-time.

The alternative would be to write an API, using requests, that programmatically sends these requests one by one asynchronously and then combines the results, which I feel is a bit messy and resource intensive... built-in support would be nice.
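
To be concrete, the workaround looks roughly like this (a minimal sketch, assuming a Scrapyrt instance running locally on the default port 9080, aiohttp for the concurrent calls, and placeholder spider names and URLs):

import asyncio

import aiohttp

SCRAPYRT_URL = "http://localhost:9080/crawl.json"

# Placeholder spider names and start URLs.
JOBS = [
    ("Site1", "https://www.site1.com"),
    ("Site2", "https://www.site2.com"),
    ("Site3", "https://www.site3.com"),
]

async def crawl(session, spider_name, url):
    # One Scrapyrt request per spider; the caller gathers and merges the results.
    params = {"spider_name": spider_name, "url": url}
    async with session.get(SCRAPYRT_URL, params=params) as resp:
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        responses = await asyncio.gather(
            *(crawl(session, spider, url) for spider, url in JOBS)
        )
    # Combine the per-spider "items" lists into one result set.
    items = [item for resp in responses for item in resp.get("items", [])]
    print(f"collected {len(items)} items from {len(JOBS)} spiders")

if __name__ == "__main__":
    asyncio.run(main())

It works, but every client has to reimplement this fan-out/merge step; that's why built-in batch support seems worthwhile.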

@pawelmhm
Member

It sounds interesting. I think some sort of batch processing would be good here. In your example it will be difficult to know which spider should crawl which URL, but maybe we could support something like this:

{ "request": [
    {"url": "http://example1", "spider": "spider1"}, 
   {"url2": "http://example2", "spider": "spider2"}
]

so essentially request as a list. But we'd have to think about how to do it; changes would have to be made in CrawlManager and CrawlResources.
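
For illustration, a client could then submit the whole batch in one call, something like this sketch (the list-valued "request" field is only the proposal here, it is not supported yet; the spider names and URLs are placeholders):

import requests

# Proposed (not yet supported) batch payload: "request" as a list of
# url/spider pairs instead of a single request object.
payload = {
    "request": [
        {"url": "http://example1", "spider": "spider1"},
        {"url": "http://example2", "spider": "spider2"},
    ]
}

response = requests.post("http://localhost:9080/crawl.json", json=payload)
response.raise_for_status()
print(response.json())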

@xaander1
Author

xaander1 commented Jan 20, 2022

@pawelmhm How long until this is implemented?
