There should be no changes needed apart from upgrading your Node.js version to >= 15.10. If you encounter issues with cheerio
, read their CHANGELOG. We bumped it from rc.3
to rc.10
.
There are a lot of breaking changes in the v1.0.0 release, but we're confident that updating your code will be a matter of minutes. Below, you'll find examples how to do it and also short tutorials how to use many of the new features.
If you hadn't yet, we suggest reading the CHANGELOG for a high level view of the changes.
Many of the new features are made with power users in mind, so don't worry if something looks complicated. You don't need to use it.
- Installation
- Running on Apify Platform
- Handler arguments are now Crawling Context
- Replacement of
PuppeteerPool
withBrowserPool
- Updated
PuppeteerCrawlerOptions
- Launch functions
Previous versions of the SDK bundled the puppeteer
package, so you did not have to install
it. SDK v1 supports also playwright
and we don't want to force users to install both.
To install SDK v1 with Puppeteer (same as previous versions), run:
npm install apify puppeteer
To install SDK v1 with Playwright run:
npm install apify playwright
While we tried to add the most important functionality in the initial release, you may find that there are still some utilities or options that are only supported by Puppeteer and not Playwright.
If you want to make use of Playwright on the Apify Platform, you need to use a Docker image that supports Playwright. We've created them for you, so head over to the new Docker image guide and pick the one that best suits your needs.
Note that your package.json
MUST include puppeteer
and/or playwright
as dependencies.
If you don't list them, the libraries will be uninstalled from your node_modules
folder
when you build your actors.
Previously, arguments of user provided handler functions were provided in separate objects. This made it difficult to track values across function invocations.
const handlePageFunction = async (args1) => {
args1.hasOwnProperty('proxyInfo') // true
}
const handleFailedRequestFunction = async (args2) => {
args2.hasOwnProperty('proxyInfo') // false
}
args1 === args2 // false
This happened because a new arguments object was created for each function. With SDK v1 we now have a single object called Crawling Context.
const handlePageFunction = async (crawlingContext1) => {
crawlingContext1.hasOwnProperty('proxyInfo') // true
}
const handleFailedRequestFunction = async (crawlingContext2) => {
crawlingContext2.hasOwnProperty('proxyInfo') // true
}
// All contexts are the same object.
crawlingContext1 === crawlingContext2 // true
Now that all the objects are the same, we can keep track of all running crawling contexts.
We can do that by working with the new id
property of crawlingContext
This is useful when you need cross-context access.
let masterContextId;
const handlePageFunction = async ({ id, page, request, crawler }) => {
if (request.userData.masterPage) {
masterContextId = id;
// Prepare the master page.
} else {
const masterContext = crawler.crawlingContexts.get(masterContextId);
const masterPage = masterContext.page;
const masterRequest = masterContext.request;
// Now we can manipulate the master data from another handlePageFunction.
}
}
To prevent bloat and to make access to certain key objects easier, we exposed a crawler
property on the handle page arguments.
const handePageFunction = async ({ request, page, crawler }) => {
await crawler.requestQueue.addRequest({ url: 'https://example.com' });
await crawler.autoscaledPool.pause();
}
This also means that some shorthands like puppeteerPool
or autoscaledPool
were
no longer necessary.
const handePageFunction = async (crawlingContext) => {
crawlingContext.autoscaledPool // does NOT exist anymore
crawlingContext.crawler.autoscaledPool // <= this is correct usage
}
BrowserPool
was created to extend PuppeteerPool
with the ability to manage other
browser automation libraries. The API is similar, but not the same.
Only PuppeteerCrawler
and PlaywrightCrawler
use BrowserPool
. You can access it
on the crawler
object.
const crawler = new Apify.PlaywrightCrawler({
handlePageFunction: async ({ page, crawler }) => {
crawler.browserPool // <-----
}
});
crawler.browserPool // <-----
And they're equal to crawlingContext.id
which gives you access to full crawlingContext
in hooks. See Lifecycle hooks below.
const pageId = browserPool.getPageId
The most important addition with BrowserPool
are the
lifecycle hooks.
You can access them via browserPoolOptions
in both crawlers. A full list of browserPoolOptions
can be found in browser-pool
readme.
const crawler = new Apify.PuppeteerCrawler({
browserPoolOptions: {
retireBrowserAfterPageCount: 10,
preLaunchHooks: [
async (pageId, launchContext) => {
const { request } = crawler.crawlingContexts.get(pageId);
if (request.userData.useHeadful === true) {
launchContext.launchOptions.headless = false;
}
}
]
}
})
BrowserController
is a class of browser-pool
that's responsible for browser management.
Its purpose is to provide a single API for working with both Puppeteer and Playwright browsers.
It works automatically in the background, but if you ever wanted to close a browser properly,
you should use a browserController
to do it. You can find it in the handle page arguments.
const handlePageFunction = async ({ page, browserController }) => {
// Wrong usage. Could backfire because it bypasses BrowserPool.
page.browser().close();
// Correct usage. Allows graceful shutdown.
browserController.close();
const cookies = [/* some cookie objects */];
// Wrong usage. Will only work in Puppeteer and not Playwright.
page.setCookies(...cookies);
// Correct usage. Will work in both.
browserController.setCookies(page, cookies);
}
The BrowserController
also includes important information about the browser, such as
the context it was launched with. This was difficult to do before SDK v1.
const handlePageFunction = async ({ browserController }) => {
// Information about the proxy used by the browser
browserController.launchContext.proxyInfo
// Session used by the browser
browserController.launchContext.session
}
Some functions were removed (in line with earlier deprecations), and some were changed a bit:
// OLD
puppeteerPool.recyclePage(page);
// NEW
page.close();
// OLD
puppeteerPool.retire(page.browser());
// NEW
browserPool.retireBrowserByPage(page);
// OLD
puppeteerPool.serveLiveViewSnapshot();
// NEW
// There's no LiveView in BrowserPool
To keep PuppeteerCrawler
and PlaywrightCrawler
consistent, we updated the options.
The concept of a configurable gotoFunction
is not ideal. Especially since we use a modified
gotoExtended
. Users have to know this when they override gotoFunction
if they want to
extend default behavior. We decided to replace gotoFunction
with preNavigationHooks
and
postNavigationHooks
.
The following example illustrates how gotoFunction
makes things complicated.
const gotoFunction = async ({ request, page }) => {
// pre-processing
await makePageStealthy(page);
// Have to remember how to do this:
const response = gotoExtended(page, request, {/* have to remember the defaults */});
// post-processing
await page.evaluate(() => {
window.foo = 'bar';
});
// Must not forget!
return response;
}
const crawler = new Apify.PuppeteerCrawler({
gotoFunction,
// ...
})
With preNavigationHooks
and postNavigationHooks
it's much easier. preNavigationHooks
are called with two arguments: crawlingContext
and gotoOptions
. postNavigationHooks
are called only with crawlingContext
.
const preNavigationHooks = [
async ({ page }) => makePageStealthy(page)
];
const postNavigationHooks = [
async ({ page }) => page.evaluate(() => {
window.foo = 'bar'
})
]
const crawler = new Apify.PuppeteerCrawler({
preNavigationHooks,
postNavigationHooks,
// ...
})
Those were always a point of confusion because they merged custom Apify options with
launchOptions
of Puppeteer.
const launchPuppeteerOptions = {
useChrome: true, // Apify option
headless: false, // Puppeteer option
}
Use the new launchContext
object, which explicitly defines launchOptions
.
launchPuppeteerOptions
were removed.
const crawler = new Apify.PuppeteerCrawler({
launchContext: {
useChrome: true, // Apify option
launchOptions: {
headless: false // Puppeteer option
}
}
})
LaunchContext is also a type of
browser-pool
and the structure is exactly the same there. SDK only adds extra options.
browser-pool
introduces the idea of lifecycle hooks,
which are functions that are executed when a certain event in the browser lifecycle happens.
const launchPuppeteerFunction = async (launchPuppeteerOptions) => {
if (someVariable === 'chrome') {
launchPuppeteerOptions.useChrome = true;
}
return Apify.launchPuppeteer(launchPuppeteerOptions);
}
const crawler = new Apify.PuppeteerCrawler({
launchPuppeteerFunction,
// ...
})
Now you can recreate the same functionality with a preLaunchHook
:
const maybeLaunchChrome = (pageId, launchContext) => {
if (someVariable === 'chrome') {
launchContext.useChrome = true;
}
}
const crawler = new Apify.PuppeteerCrawler({
browserPoolOptions: {
preLaunchHooks: [maybeLaunchChrome]
},
// ...
})
This is better in multiple ways. It is consistent across both Puppeteer and Playwright. It allows you to easily construct your browsers with pre-defined behavior:
const preLaunchHooks = [
maybeLaunchChrome,
useHeadfulIfNeeded,
injectNewFingerprint,
]
And thanks to the addition of crawler.crawlingContexts
the functions also have access to the crawlingContext
of the request
that triggered the launch.
const preLaunchHooks = [
async function maybeLaunchChrome(pageId, launchContext) {
const { request } = crawler.crawlingContexts.get(pageId);
if (request.userData.useHeadful === true) {
launchContext.launchOptions.headless = false;
}
}
]
In addition to Apify.launchPuppeteer()
we now also have Apify.launchPlaywright()
.
We updated the launch options object because it was a frequent source of confusion.
// OLD
await Apify.launchPuppeteer({
useChrome: true,
headless: true,
})
// NEW
await Apify.launchPuppeteer({
useChrome: true,
launchOptions: {
headless: true,
}
})
Apify.launchPuppeteer
already supported the puppeteerModule
option. With Playwright,
we normalized the name to launcher
because the playwright
module itself does not
launch browsers.
const puppeteer = require('puppeteer');
const playwright = require('playwright');
await Apify.launchPuppeteer();
// Is the same as:
await Apify.launchPuppeteer({
launcher: puppeteer
})
await Apify.launchPlaywright();
// Is the same as:
await Apify.launchPlaywright({
launcher: playwright.chromium
})