avoid having 2 times the same spider running at the same time #228

Closed
rolele opened this issue Apr 12, 2017 · 5 comments

Comments

rolele commented Apr 12, 2017

My use case is that I parametrized my spider to crawl only part of the website.

  • For instance, one crawl handles one category and another crawl handles a different category, using the same spider.

The problem is that there is no option to prevent scrapyd from starting those crawls in parallel.
I would like those crawls to run sequentially.
Is this possible? I saw the max_proc_per_cpu option, which limits the number of concurrent crawls but does not provide the functionality I would like.
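
For context, the relevant knobs are in scrapyd.conf (a sketch; the values shown are, as far as I know, the defaults, and neither setting distinguishes between jobs of the same spider):

```ini
# scrapyd.conf (sketch; default-ish values)
[scrapyd]
# 0 means no absolute cap; the effective limit is max_proc_per_cpu * number of CPUs
max_proc = 0
# caps concurrent Scrapy processes per CPU, regardless of which spider they run
max_proc_per_cpu = 4
```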

Digenis (Member) commented Apr 12, 2017

@rolele, see #140 (note that this is a per-project limit).

Can you explain your use case in more detail?

Why do you crawl each part of the site in a different run?
Are there dependencies, like foreign keys in a database, that you need to fulfil first?
See if https://github.com/rolando/scrapy-inline-requests can help
(scrapy may integrate it one day, scrapy/scrapy#1144)

Why can't the jobs run together?
Are there locks in the pipeline?

Are you worried about total traffic exceeding what the configured download delay allows,
or about parallel requests per IP/domain?
See #221

rolele commented Apr 12, 2017

Thanks @Digenis for your answer.

I do not have any pipeline besides sending the whole page to Kafka.

It is not about parallel requests per IP/domain: I set that to 1 on my spider and add a delay of 0.5 s per request, so I am not being very aggressive, I think.
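
Roughly, the spider-level throttling looks like this (a sketch; the class and spider names are placeholders, the settings are Scrapy's standard ones):

```python
# Sketch of the per-spider throttling described above; names are placeholders.
import scrapy

class CategorySpider(scrapy.Spider):
    name = "category_spider"

    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request to the site at a time
        "DOWNLOAD_DELAY": 0.5,                # 0.5 s between requests
    }

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.category = category  # which part of the site this job should crawl
```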
I do not want to crawl the whole website every time. I want to detect the categories that change and crawl them more often than the ones that rarely change. I will basically schedule some cron jobs that curl the schedule.json endpoint.
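
Each cron entry would boil down to a POST like the one sketched below (project, spider and category names are placeholders; the same request can be sent with curl, and any extra parameters are passed through to the spider as arguments):

```python
# Sketch of scheduling one category crawl via scrapyd's schedule.json endpoint.
import requests

resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={
        "project": "myproject",        # scrapyd project name (placeholder)
        "spider": "category_spider",   # the parametrized spider (placeholder)
        "category": "special-sales",   # forwarded to the spider as an argument
    },
)
print(resp.json())                     # e.g. {"status": "ok", "jobid": "..."}
```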

I am basically crawling e-commerce sites that have categories that change once every few months, categories such as "special sales" that change every week, and categories that fall somewhere in between.

Because I do not want to hit the website too hard, I would prefer that scrapyd had an option letting me schedule all my jobs in the queue and then run them sequentially.

Also, I do not want to waste time crawling unchanged categories when I could use that computing power to crawl other websites whose categories have changed.

I looked at the per-project limit and it is not what I wish scrapyd had; what I am after is a per-spider job limit.

Dean-Christian-Armada commented Nov 7, 2017

This is also currently a concern for me. I have the same case as @rolele: I wish to run multiple jobs of a single spider in parallel.

My current solution is setting poll_interval to 1.0 seconds, but it would be better if parallelism could be implemented.

Digenis (Member) commented Nov 8, 2017

@Dean-Christian-Armada,
Your issue is different. You want to run more spiders in parallel.
See #187 and #173. Also know that you can have a sub-second polling interval.
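
For example, in scrapyd.conf (an illustrative value; the default is 5.0 seconds):

```ini
# scrapyd.conf -- poll_interval controls how often the scheduler polls the job queues
[scrapyd]
# sub-second values are accepted
poll_interval = 0.5
```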

@rolele,
Unfortunately, scrapyd is still a long way from this feature.
You can, however, solve your use case in the spider itself.
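
For instance, one way to do that (a sketch of one possible approach, assuming the categories can be passed as a single spider argument) is to crawl every category that is due within one job, so a second instance of the same spider is never needed:

```python
# Sketch: crawl several categories inside one job so that only a single
# instance of the spider ever runs; names and the URL pattern are placeholders.
import scrapy

class CategorySpider(scrapy.Spider):
    name = "category_spider"

    def __init__(self, categories="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # e.g. scheduled with categories=special-sales,shoes
        self.categories = [c for c in categories.split(",") if c]

    def start_requests(self):
        for category in self.categories:
            # placeholder URL pattern for a category listing page
            yield scrapy.Request(
                "https://example.com/%s" % category,
                callback=self.parse_category,
                meta={"category": category},
            )

    def parse_category(self, response):
        category = response.meta["category"]
        self.logger.info("crawling category %s: %s", category, response.url)
        # parse items / follow pagination here
```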

jpmckinney (Contributor) commented

Duplicates #153 (which is also a bit more general).
