-
The problem is that the handler for failed requests timed out, not the request handler. Do you have a custom one in place? Could you show us your crawler initialization? Also, are you running this locally, on your computer, or using the Apify platform?
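For reference, a custom `failedRequestHandler` is passed next to the `requestHandler` in the crawler options. A minimal sketch, assuming Crawlee v3's `PlaywrightCrawler` (the handler bodies here are illustrative):

```ts
import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  async requestHandler({ request, log }) {
    log.info(`Processing ${request.loadedUrl}...`);
  },
  // Called only after a request has exhausted all of its retries;
  // the timeout in the reported error refers to this handler,
  // not to requestHandler.
  async failedRequestHandler({ request, log }) {
    log.error(`Request ${request.url} failed too many times.`);
  },
});
```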
-
Below is my code for crawler initialization. I didn't use the failed request handler, and I was running this on my computer.

```ts
crawler = new PlaywrightCrawler(
  {
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      const title = await page.title();
      pageCounter++;
      log.info(
        `Crawling: Page ${pageCounter} / ${config.maxPagesToCrawl} - URL: ${request.loadedUrl}...`,
      );

      // Use custom handling for XPath selectors
      if (config.selector) {
        if (config.selector.startsWith("/")) {
          await waitForXPath(
            page,
            config.selector,
            config.waitForSelectorTimeout ?? 1000,
          );
        } else {
          await page.waitForSelector(config.selector, {
            timeout: config.waitForSelectorTimeout ?? 1000,
          });
        }
      }

      const html = await getPageHtml(page, config.selector);

      // Save results as JSON to ./storage/datasets/default
      await pushData({ title, url: request.loadedUrl, html });

      if (config.onVisitPage) {
        await config.onVisitPage({ page, pushData });
      }

      // Extract links from the current page
      // and add them to the crawling queue.
      await enqueueLinks({
        regexps:
          typeof config.rex === "string"
            ? [RegExp(config.rex)]
            : config.rex !== undefined
              ? config.rex.map((rex) => RegExp(rex))
              : [],
        globs:
          typeof config.match === "string" ? [config.match] : config.match,
        exclude:
          typeof config.exclude === "string"
            ? [config.exclude]
            : config.exclude ?? [],
      });
    },
    // Comment out this option to scrape the full website.
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    maxRequestsPerMinute: 100,
    // Uncomment this option to see the browser window.
    // headless: false,
    preNavigationHooks: [
      // Abort requests for certain resource types
      async ({ request, page, log }) => {
        // If there are no resource exclusions, return
        const RESOURCE_EXCLUSIONS = config.resourceExclusions ?? [];
        if (RESOURCE_EXCLUSIONS.length === 0) {
          return;
        }
        if (config.cookie) {
          const cookies = (
            Array.isArray(config.cookie) ? config.cookie : [config.cookie]
          ).map((cookie) => {
            return {
              name: cookie.name,
              value: cookie.value,
              url: request.loadedUrl,
            };
          });
          await page.context().addCookies(cookies);
        }
        await page.route(
          `**/*.{${RESOURCE_EXCLUSIONS.join()}}`,
          (route) => route.abort("aborted"),
        );
        log.info(`Aborting requests for excluded resource types`);
      },
    ],
  },
  new Configuration({
    purgeOnStart: true,
  }),
);
```

Also, the same code can crawl a different number of pages from run to run (all of them successful runs, with no failed pages); is this normal? Thank you again for this amazing tool. Maybe I should learn more about TypeScript to use it more naturally.
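Note that `waitForXPath` and `getPageHtml` above are project helpers that aren't shown in the post. Hypothetical sketches of what they might look like, using Playwright's `xpath=` selector-engine prefix (the actual implementations may differ):

```ts
import type { Page } from "playwright";

// Hypothetical helper: wait for an element matched by an XPath expression.
async function waitForXPath(page: Page, xpath: string, timeout: number) {
  // Playwright accepts the "xpath=" selector-engine prefix.
  await page.waitForSelector(`xpath=${xpath}`, { timeout });
}

// Hypothetical helper: return the HTML of the selected element,
// or the whole document when no selector is configured.
async function getPageHtml(page: Page, selector?: string) {
  if (!selector) return page.content();
  const handle = await page.$(
    selector.startsWith("/") ? `xpath=${selector}` : selector,
  );
  return (await handle?.innerHTML()) ?? "";
}
```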
-
Can you make sure that you're running crawlee 3.9.2? It fixes a bug that could have caused the error.
-
I upgraded it, then ran `npm start`, and this error came out:
-
Sounds like an outdated TS version; upgrade to the latest. Also, you need to use
-
I got the following error after crawling 1000 pages:

Node.js v20.12.2

Is there any workaround to make this more stable, that is, to give up on the bad web pages and move on instead of throwing an ERROR and exiting? Thanks for your attention.
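One way to make a run more tolerant of bad pages is to cap retries, bound the handler time, and swallow per-page errors so a single failure cannot abort the crawl. A sketch assuming Crawlee v3's standard `PlaywrightCrawler` options (`maxRequestRetries`, `requestHandlerTimeoutSecs`, `failedRequestHandler`); the handler bodies are illustrative:

```ts
import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  maxRequestRetries: 1, // retry a bad page once, then give up on it
  requestHandlerTimeoutSecs: 60, // fail slow pages instead of hanging
  async requestHandler({ request, page, log, pushData }) {
    try {
      const title = await page.title();
      await pushData({ title, url: request.loadedUrl });
    } catch (err) {
      // Swallow per-page errors so one bad page cannot end the whole run.
      log.warning(`Skipping ${request.url}: ${(err as Error).message}`);
    }
  },
  // Called once a request has exhausted its retries: log it and move on.
  async failedRequestHandler({ request, log }) {
    log.error(`Giving up on ${request.url}.`);
  },
});
```

If the crash comes from the process itself (for example, running out of memory) rather than from an individual request, this on its own won't be enough.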