issue(graphmanager): Issue when waiting on already scheduled jobs #53

Open · curita (Member) opened this issue Oct 23, 2023 · 0 comments

Situation

  1. A task was supposed to be run.
  2. The request to schedule it failed and was retried several times (apparently because of a ReadTimeout).
  3. The task was scheduled nonetheless; apparently the server processed one of the requests even though its response timed out.
  4. Eventually, one of the scheduling attempts got a response back.
  5. The response said the job was already scheduled.
  6. Because of this, the returned job_id is None (see BaseScript._schedule_job() and Task.run()).
  7. This None job_id is then tracked among the running jobs (see GraphManager.run_pending_jobs()).
  8. GraphManager waits for it to finish (see GraphManager.check_running_jobs()).
  9. Since None is not an actual job key, it never finishes, so GraphManager is stuck and doesn't move on to the next task (see the sketch right after this list).
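
To make the flow concrete, here is a condensed reconstruction of that path. This is not the actual shub_workflow code: the class and method names mirror the real ones, but the bodies are stand-ins, and DuplicateJobError here is a local stub for the client error behind the "already scheduled" message:

```python
# Condensed reconstruction of the failure path. NOT the actual
# shub_workflow code: names mirror the real ones, bodies are stand-ins.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shub_workflow.graph")


class DuplicateJobError(Exception):
    """Local stub for the client error behind 'already scheduled'."""


class BaseScript:
    def _schedule_job(self, spider):
        try:
            return self._do_schedule(spider)  # real job key on success
        except DuplicateJobError:
            logger.error("%s already scheduled", spider)
            return None  # the "already scheduled" path yields no job key

    def _do_schedule(self, spider):
        # Simulate the reported scenario: the job already exists upstream.
        raise DuplicateJobError(spider)


class GraphManager(BaseScript):
    def __init__(self):
        self.running_jobs = {}

    def run_pending_jobs(self):
        job_id = self._schedule_job("py:hcfmanager.py")  # None here
        self.running_jobs["consumers"] = job_id  # None is tracked anyway

    def check_running_jobs(self):
        for task_id, job_id in self.running_jobs.items():
            # None never matches any real job key, so in the real manager
            # this branch repeats forever on every check cycle.
            logger.info('Job %s (%s) still running', task_id, job_id)


manager = GraphManager()
manager.run_pending_jobs()    # tracks the None job_id
manager.check_running_jobs()  # logs "Job consumers (None) still running"
```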

Logs

Affected jobs:

2023-10-16 05:51:07.032000 INFO [shub_workflow.graph] Job discovery (265529/71/10349) still running
2023-10-16 05:53:07.597000 INFO [shub_workflow.graph] Job "graph_manager/discovery" (265529/71/10349) finished
2023-10-16 05:53:07.598000 INFO [shub_workflow.graph.task] Will start task "graph_manager/consumers"
2023-10-16 05:54:07.726000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:54:07.727000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
[... the same "Failed to schedule job" / "Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds" pair repeats every two minutes through 2023-10-16 06:06:08 ...]
2023-10-16 06:08:10.015000 ERROR [shub_workflow.script] py:hcfmanager.py already scheduled
2023-10-16 06:10:10.333000 INFO [shub_workflow.graph] Job consumers (None) still running
2023-10-16 06:12:10.659000 INFO [shub_workflow.graph] Job consumers (None) still running
2023-10-16 06:14:10.986000 INFO [shub_workflow.graph] Job consumers (None) still running

Proposal

I'm not sure what the best way to address this is. On one hand, it's a transient issue in ScrapyCloud/JobQ, so it's external to shub_workflow. On the other hand, this API has been unstable for quite a while, so it could be worth handling, at least to keep jobs from getting stuck.

I think that if self.run_job() returns None in GraphManager, we should probably not enter the waiting loop for that task, because we won't have its associated job ID to track either way (see the sketch below).
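
A minimal sketch of that guard, reusing the illustrative names from the reconstruction above (pending_tasks() and task.task_id are placeholders, not real shub_workflow attributes):

```python
def run_pending_jobs(self):
    for task in self.pending_tasks():  # placeholder iterator
        job_id = self.run_job(task)
        if job_id is None:
            # Scheduling gave us no key (e.g. "already scheduled"): don't
            # track None, or check_running_jobs() will wait on it forever.
            logger.warning(
                "Task %s returned no job id; not waiting on it.", task.task_id
            )
            continue
        self.running_jobs[task.task_id] = job_id
```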

I also wonder whether we could retrieve the job ID of the already scheduled job, so GraphManager can recover from this situation without having to be restarted manually.
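
One way this might work, given that the manager tags every job it schedules (the FLOW_ID/PARENT_NAME/TASK_ID tags visible in the logs above): look the job up by those tags with the python-scrapinghub client. find_already_scheduled_job() below is a hypothetical helper, not an existing shub_workflow API:

```python
from scrapinghub import ScrapinghubClient


def find_already_scheduled_job(apikey, project_id, spider, flow_id, task_id):
    """Hypothetical recovery helper: find the job that the timed-out
    scheduling call actually created, using the tags the manager attaches.
    Returns its job key (e.g. '265529/71/10350'), or None if not found."""
    project = ScrapinghubClient(apikey).get_project(project_id)
    wanted = {f"FLOW_ID={flow_id}", f"TASK_ID={task_id}"}
    for state in ("pending", "running"):
        for job in project.jobs.iter(spider=spider, state=state,
                                     meta=["tags"]):
            if wanted.issubset(job.get("tags", [])):
                return job["key"]
    return None
```

_schedule_job() could then call something like this when it hits the "already scheduled" error and return the recovered key instead of None, letting GraphManager track the real job.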
