-
The problem is that the handler for failed requests timed out, not the request handler. Do you have a custom one in place? Could you show us your crawler initialization? Also, are you running this locally, on your computer, or using the Apify platform?
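For reference, a custom `failedRequestHandler` is passed next to the `requestHandler` in the crawler options. A minimal sketch, assuming Crawlee v3's `PlaywrightCrawler` (the handler bodies here are illustrative):

```ts
import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  async requestHandler({ request, log }) {
    log.info(`Processing ${request.loadedUrl}...`);
  },
  // Called only after a request has exhausted all of its retries;
  // the timeout in the reported error refers to this handler,
  // not to requestHandler.
  async failedRequestHandler({ request, log }) {
    log.error(`Request ${request.url} failed too many times.`);
  },
});
```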
-
Below is my code for crawler initialization. I didn't use the failed request handler, and I was running this on my computer.

```ts
crawler = new PlaywrightCrawler(
  {
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      const title = await page.title();
      pageCounter++;
      log.info(
        `Crawling: Page ${pageCounter} / ${config.maxPagesToCrawl} - URL: ${request.loadedUrl}...`,
      );

      // Use custom handling for XPath selectors
      if (config.selector) {
        if (config.selector.startsWith("/")) {
          await waitForXPath(
            page,
            config.selector,
            config.waitForSelectorTimeout ?? 1000,
          );
        } else {
          await page.waitForSelector(config.selector, {
            timeout: config.waitForSelectorTimeout ?? 1000,
          });
        }
      }

      const html = await getPageHtml(page, config.selector);

      // Save results as JSON to ./storage/datasets/default
      await pushData({ title, url: request.loadedUrl, html });

      if (config.onVisitPage) {
        await config.onVisitPage({ page, pushData });
      }

      // Extract links from the current page
      // and add them to the crawling queue.
      await enqueueLinks({
        regexps:
          typeof config.rex === "string"
            ? [RegExp(config.rex)]
            : config.rex !== undefined
              ? config.rex.map((rex) => RegExp(rex))
              : [],
        globs:
          typeof config.match === "string" ? [config.match] : config.match,
        exclude:
          typeof config.exclude === "string"
            ? [config.exclude]
            : config.exclude ?? [],
      });
    },
    // Comment out this option to scrape the full website.
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    maxRequestsPerMinute: 100,
    // Uncomment this option to see the browser window.
    // headless: false,
    preNavigationHooks: [
      // Abort requests for certain resource types
      async ({ request, page, log }) => {
        // If there are no resource exclusions, return
        const RESOURCE_EXCLUSIONS = config.resourceExclusions ?? [];
        if (RESOURCE_EXCLUSIONS.length === 0) {
          return;
        }
        if (config.cookie) {
          const cookies = (
            Array.isArray(config.cookie) ? config.cookie : [config.cookie]
          ).map((cookie) => {
            return {
              name: cookie.name,
              value: cookie.value,
              url: request.loadedUrl,
            };
          });
          await page.context().addCookies(cookies);
        }
        await page.route(
          `**/*.{${RESOURCE_EXCLUSIONS.join()}}`,
          (route) => route.abort("aborted"),
        );
        log.info(`Aborting requests for excluded resource types`);
      },
    ],
  },
  new Configuration({
    purgeOnStart: true,
  }),
);
```

Also, the same code can crawl a different number of pages from run to run (all of them successful runs, with no failed pages); is this normal? Thank you again for this amazing tool. Maybe I should learn more about TypeScript to use it more naturally.
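Note that `waitForXPath` and `getPageHtml` above are project helpers that aren't shown in the post. Hypothetical sketches of what they might look like, using Playwright's `xpath=` selector-engine prefix (the actual implementations may differ):

```ts
import type { Page } from "playwright";

// Hypothetical helper: wait for an element matched by an XPath expression.
async function waitForXPath(page: Page, xpath: string, timeout: number) {
  // Playwright accepts the "xpath=" selector-engine prefix.
  await page.waitForSelector(`xpath=${xpath}`, { timeout });
}

// Hypothetical helper: return the HTML of the selected element,
// or the whole document when no selector is configured.
async function getPageHtml(page: Page, selector?: string) {
  if (!selector) return page.content();
  const handle = await page.$(
    selector.startsWith("/") ? `xpath=${selector}` : selector,
  );
  return (await handle?.innerHTML()) ?? "";
}
```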
-
Can you make sure that you're running crawlee 3.9.2? It fixes a bug that could have caused the error.
-
I upgraded it, then ran `npm start`, and this error came out:
-
Sounds like an outdated TS version; upgrade to the latest. Also, you need to use
-
I got the following error after crawling 1000 pages:

Node.js v20.12.2

Is there any workaround to make this more stable, that is, to give up on the bad web pages and move on instead of throwing an ERROR and exiting? Thanks for your attention.
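One way to make a run more tolerant of bad pages is to cap retries, bound the handler time, and swallow per-page errors so a single failure cannot abort the crawl. A sketch assuming Crawlee v3's standard `PlaywrightCrawler` options (`maxRequestRetries`, `requestHandlerTimeoutSecs`, `failedRequestHandler`); the handler bodies are illustrative:

```ts
import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  maxRequestRetries: 1, // retry a bad page once, then give up on it
  requestHandlerTimeoutSecs: 60, // fail slow pages instead of hanging
  async requestHandler({ request, page, log, pushData }) {
    try {
      const title = await page.title();
      await pushData({ title, url: request.loadedUrl });
    } catch (err) {
      // Swallow per-page errors so one bad page cannot end the whole run.
      log.warning(`Skipping ${request.url}: ${(err as Error).message}`);
    }
  },
  // Called once a request has exhausted its retries: log it and move on.
  async failedRequestHandler({ request, log }) {
    log.error(`Giving up on ${request.url}.`);
  },
});
```

If the crash comes from the process itself (for example, running out of memory) rather than from an individual request, this on its own won't be enough.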