Feat: cheerio experimental queue #188

Draft: wants to merge 2 commits into base: master
233 changes: 232 additions & 1 deletion package-lock.json

Some generated files are not rendered by default.

6 changes: 6 additions & 0 deletions packages/actor-scraper/cheerio-scraper/.actor/actor.json
@@ -0,0 +1,6 @@
{
"actorSpecification": 1,
"name": "cheerio-scraper",
"version": "0.1",
"buildTag": "latest"
}
3 changes: 2 additions & 1 deletion packages/actor-scraper/cheerio-scraper/package.json
@@ -7,7 +7,8 @@
"dependencies": {
"@apify/scraper-tools": "^1.1.1",
"@crawlee/cheerio": "^3.1.0",
- "apify": "^3.1.1"
+ "apify": "^3.1.1",
+ "got": "^12.6.0"
},
"devDependencies": {
"@apify/tsconfig": "^0.1.0",
@@ -35,6 +35,7 @@ export interface Input {
datasetName?: string;
keyValueStoreName?: string;
requestQueueName?: string;
experimentalQueue?: boolean;
}

export const enum ProxyRotation {
@@ -27,6 +27,7 @@ import { IncomingMessage } from 'node:http';
import { dirname } from 'node:path';
import { fileURLToPath, URL } from 'node:url';
import { Input, ProxyRotation } from './consts.js';
import { createRequestQueue } from './request_queue_experimental.js';

const { SESSION_MAX_USAGE_COUNTS, META_KEY } = scraperToolsConstants;
const SCHEMA = JSON.parse(await readFile(new URL('../../INPUT_SCHEMA.json', import.meta.url), 'utf8'));
@@ -135,6 +136,7 @@ export class CrawlerSetup implements CrawlerSetupOptions {
this.requestList = await RequestList.open('CHEERIO_SCRAPER', startUrls);

// RequestQueue

this.requestQueue = await RequestQueue.open(this.requestQueueName);

// Dataset
@@ -204,6 +206,12 @@ export class CrawlerSetup implements CrawlerSetupOptions {

this.crawler = new CheerioCrawler(options);

if (this.input.experimentalQueue) {
// We have to override the queue after crawler creation.
// We could possibly extend the original request queue too.
this.crawler.requestQueue = await createRequestQueue() as any as RequestQueue;
}

return this.crawler;
}
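
Note on the hunk above: the flag-gated override can be sketched in isolation as follows. This is only an illustration under stated assumptions. `QueueLike`, `CrawlerLike`, `createExperimentalQueue` and `buildCrawler` are hypothetical stand-ins, not the real Crawlee types or the `createRequestQueue` helper, whose implementation is not shown in this diff.

```typescript
// Hypothetical minimal shapes; the real CheerioCrawler and RequestQueue are much larger.
interface QueueLike {
    name: string;
}

interface CrawlerLike {
    requestQueue?: QueueLike;
}

// Stand-in for createRequestQueue() from request_queue_experimental.js (assumption).
async function createExperimentalQueue(): Promise<QueueLike> {
    return { name: 'experimental' };
}

// Build the crawler with its default queue first, then swap the queue
// only when the experimentalQueue input flag is set.
async function buildCrawler(experimentalQueue: boolean): Promise<CrawlerLike> {
    const crawler: CrawlerLike = { requestQueue: { name: 'default' } };

    if (experimentalQueue) {
        crawler.requestQueue = await createExperimentalQueue();
    }

    return crawler;
}
```

In this sketch both queues satisfy one shared interface, so no cast is needed. Whether the experimental queue in this PR could do the same is an open question.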

27 changes: 27 additions & 0 deletions packages/actor-scraper/cheerio-scraper/src/internals/queue.ts
@@ -0,0 +1,27 @@
export default class Queue<T> {
Review comment (Member): Add a comment explaining where this is used; it wasn't obvious at first.

private items: T[] = [];

enqueue(item: T): void {
this.items.push(item);
}

dequeue(): T | undefined {
return this.items.shift();
}

isEmpty(): boolean {
return this.items.length === 0;
}

size(): number {
return this.items.length;
}

toArray(): T[] {
return this.items;
}

fromArray(items: T[]) {
this.items = items;
}
}
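
For reviewers, a minimal usage sketch of the `Queue` class above may help. The class is copied from this diff so the snippet runs on its own. The `PendingRequest` shape and the example URLs are hypothetical, and the snapshot round trip through JSON is an assumed use case, not something this PR confirms.

```typescript
// Copy of the Queue class from this diff, so the snippet is self-contained.
class Queue<T> {
    private items: T[] = [];
    enqueue(item: T): void { this.items.push(item); }
    dequeue(): T | undefined { return this.items.shift(); }
    isEmpty(): boolean { return this.items.length === 0; }
    size(): number { return this.items.length; }
    toArray(): T[] { return this.items; }
    fromArray(items: T[]) { this.items = items; }
}

// Hypothetical request shape, for illustration only.
interface PendingRequest {
    url: string;
}

const pending = new Queue<PendingRequest>();
pending.enqueue({ url: 'https://example.com/a' });
pending.enqueue({ url: 'https://example.com/b' });

// FIFO order: the first enqueued item is the first one dequeued.
const first = pending.dequeue();

// Snapshot the remaining items and restore them into a fresh queue.
const snapshot = JSON.stringify(pending.toArray());
const restored = new Queue<PendingRequest>();
restored.fromArray(JSON.parse(snapshot) as PendingRequest[]);

// toArray() returns the internal array, not a copy,
// so pushing to the returned array also grows the queue.
const live = pending.toArray();
live.push({ url: 'https://example.com/c' });
const sizeAfterExternalPush = pending.size();
```

Two things stand out in the sketch: `dequeue()` relies on `Array.prototype.shift`, which is linear in the queue length, and `toArray()` / `fromArray()` share the array with the caller. Returning a copy from `toArray()` would avoid accidental mutation, at the cost of an extra allocation.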