
docs: scrapy-vs-crawlee blog #2431

Merged · 15 commits · May 15, 2024

Conversation

@souravjain540 (Collaborator) commented Apr 23, 2024

@B4nan please don't merge it yet.

@mnmkng will review it first.

@mnmkng (Member) left a comment

Thanks for the draft, Saurav! 🎉 I left comments where appropriate, plus some general comments below.

  • I like the interlinking with Apify. I think it makes a lot of sense for SEO. We just have to make sure we're not overly pushy and only use links to Apify where they make sense.
  • Code blocks should have language info, so that they get syntax highlighting.
  • There are two repeating themes in the content:
    • Objective comparison - when we mention a feature or capability of X, we should also mention it for Y. We should always strive to be maximally objective and fair to both parties.
    • Trying it out - from many parts of the text, it's clear that you haven't actually used the features you're talking about. Having hands-on experience with what you're writing about is an absolutely necessary step of writing good developer content.

Now, I understand that you wanted to have the result fast, but don't worry about spending longer to learn about both Crawlee and Scrapy and then produce high-quality content. It's an investment in the future, because each subsequent article will be easier to write once you actually understand the libraries and can code with them.

Comment on lines 13 to 19
[Web scraping](https://blog.apify.com/what-is-web-scraping/) is the process of extracting and collecting data automatically from websites. Companies use web scraping for various use cases ranging from making data-driven decisions to [feeding LLMs efficient data](https://blog.apify.com/webscraping-ai-data-for-llms/).

Sometimes, extracting data from complex websites becomes hard, and we have to use various tools and libraries to overcome problems like queue management, error handling, etc.

Two such tools that make the lives of thousands of web scraping developers easy are [Scrapy](https://blog.apify.com/web-scraping-with-scrapy/) and [Crawlee](https://crawlee.dev/). Scrapy can extract data from static websites and can work with large-scale projects, on the other hand, Crawlee has a single interface for HTTP and headless browser crawling.

We believe there are a lot of things that we can compare between Scrapy and Crawlee. This article will be the first part of a series comparing Scrapy and Crawlee on various parameters. In this article, we will go over all the features that both libraries provide.
Member:

While this could be a good intro to a generic blog, I think it feels out of place for the Crawlee blog. Let's make it much more personalized. Something along the lines of:

Hey Crawlee community members, we're back with another blog post and this time, we will take a look at comparing Crawlee to Scrapy, one of the oldest and most popular web scraping libraries in the world. When does it make sense to use Crawlee? And when should you consider using Scrapy instead? Let's dive in.

...

I don't expect you to use this verbatim, I haven't given it tons of thought. I just wanted to accentuate three things:

  • don't make the Crawlee blog posts sound like generic SEO blog posts; make them feel at home on the Crawlee blog, and make the readers feel like they're part of a community
  • it's ok to be opinionated, but also to give credit to competitors
  • no need to spend time on fluff; we can move to the main message faster. For example, see https://docusaurus.io/blog/releases/3.2 - they literally use one sentence as the intro in most of their blogs. I think we can be a bit more friendly and conversational, but we should still strive to be concise and to the point.


Sometimes, extracting data from complex websites becomes hard, and we have to use various tools and libraries to overcome problems like queue management, error handling, etc.

Two such tools that make the lives of thousands of web scraping developers easy are [Scrapy](https://blog.apify.com/web-scraping-with-scrapy/) and [Crawlee](https://crawlee.dev/). Scrapy can extract data from static websites and can work with large-scale projects, on the other hand, Crawlee has a single interface for HTTP and headless browser crawling.
Member:

Two things here.

Links:

  • crawlee does not need a link to itself IMO
  • linking to an Apify article about Scrapy feels dishonest. The link should go to Scrapy directly, if we don't want to look like cheesy marketers.

This comparison is weird because it compares apples and oranges:

Scrapy can extract data from static websites and can work with large-scale projects, on the other hand, Crawlee has a single interface for HTTP and headless browser crawling.

Plus Crawlee can just as easily work with large-scale projects. Let's keep it factual and logical.


## Introduction:

Scrapy is an open-source Python-based web scraping framework that extracts data from websites. It supports efficient scraping from large-scale websites. In Scrapy, spiders are created, which are nothing but autonomous scripts to download and process web content. Limitations include not working well with JavaScript heavy websites.
Member:

Again, let's not reuse marketing claims like "supports efficient scraping from large-scale websites", and stay focused on facts.

In Scrapy, spiders are created,

Let's not use passive voice unless needed.

Limitations include not working well with JavaScript heavy websites.

If we're making a claim like that, we should provide evidence, or at least say that we will explain it later in the text.


## Language and development environments:

Regarding languages and development environments, Scrapy is written in Python, making it easier for the data science community to integrate it with various tools with Python. While Scrapy offers very detailed documentation, for first-timers, sometimes it's a little difficult to start with Scrapy.
Member:

for first-timers, sometimes it's a little difficult to start with Scrapy.

/evidence


On the other hand, Crawlee is one of the few web scraping and automation libraries that supports [JavaScript](https://blog.apify.com/tag/javascript/) and [TypeScript](https://blog.apify.com/tag/typescript/). Crawlee also offers Crawlee CLI, which makes it [easy to start](https://crawlee.dev/docs/quick-start#installation-with-crawlee-cli) with Crawlee for the Node.js developers.

## Feature Comparison
Member:

Would be good to introduce some methodology for how we selected and compared the features.

Comment on lines 120 to 121
In Scrapy, handling anti-blocking strategies like IP rotation and user-agent rotation requires custom solutions via middleware and plugins.
Crawlee provides HTTP crawling and [browser fingerprints](https://crawlee.dev/docs/guides/avoid-blocking) with zero configuration necessary, fingerprints are enabled by default and available in `PlaywrightCrawler` and `PuppeteerCrawler`.
Member:

If we're mentioning specific classes and guides in crawlee, it would be fair to give a bit more detail about Scrapy as well.

And as I mentioned before, make it clear that fingerprinting works even in CheerioCrawler and the other HTTP crawlers.
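
For context, a minimal sketch of where fingerprinting is configured in the browser crawlers (`useFingerprints` is the option documented in the avoid-blocking guide linked above; the rest is boilerplate):

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Fingerprints are enabled by default; set this to false to opt out.
        useFingerprints: true,
    },
    requestHandler: async ({ page }) => {
        // ...
    },
});
```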

Comment on lines 131 to 141
```js
const crawler = new PuppeteerCrawler({
    // ...
    errorHandler: async ({ page, log }, error) => {
        // ...
    },
    requestHandler: async ({ session, page }) => {
        // ...
    },
});
```
Member:

The example does not really show an actual example of how to do error handling. Just displays the interface.
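
For comparison, a sketch of what an actual error-handling example could look like instead (the 403 check and `session.retire()` call are illustrative assumptions, not from the draft):

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Runs whenever a request fails, before a retry is scheduled.
    errorHandler: async ({ request, session, log }, error) => {
        // If the failure looks like blocking, retire the session so the
        // retry runs with a fresh fingerprint and proxy session.
        if (error.message.includes('403')) {
            session?.retire();
        }
        log.warning(`Request ${request.url} failed: ${error.message}`);
    },
    requestHandler: async ({ page }) => {
        // ... scraping logic ...
    },
});
```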


In Scrapy, you can handle errors using middleware as well as [signals](https://docs.scrapy.org/en/latest/topics/signals.html). There are also [exceptions](https://docs.scrapy.org/en/latest/topics/exceptions.html) like `IgnoreRequest`, which can be raised by the Scheduler or any downloader middleware to indicate that the request should be ignored. Similarly, `CloseSpider` can be raised by a spider callback to close the spider.

In Crawlee, you can set up your own `ErrorHandler` like this:
Member:

It's important to say that you don't need to, though. Most projects don't use a custom error handler. Plus, we also have some custom errors that can be used to control the flow of the program.
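
For example, a sketch of how Crawlee's control-flow errors can be used (`NonRetryableError` and `CriticalError` are exported by Crawlee; the conditions here are made up):

```js
import { CheerioCrawler, CriticalError, NonRetryableError } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // Give up on this request without any retries.
        if ($('title').text().includes('Not Found')) {
            throw new NonRetryableError('Page does not exist.');
        }
        // Abort the whole crawler run.
        if ($('body').text().includes('Access denied')) {
            throw new CriticalError('We got blocked, stopping the crawl.');
        }
    },
});
```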

@souravjain540 (Collaborator, Author):

@mnmkng I updated the draft with the new changes :)

@souravjain540 requested a review from mnmkng May 7, 2024 05:40
@mnmkng (Member) left a comment

Hey Saurav, thanks for the changes. They are big steps in the right direction. Good job! But we'll have to iron out a few wrinkles before we can publish this.

I see two main issues with it:

Always strive to be objective

Some sections are better, some are worse, but I see that you're still trying to promote Crawlee in the blog. Don't do it. This is the Crawlee blog, so we have to be extremely careful not to antagonize Python devs or Scrapy devs. Maybe they're long-time Scrapy users and they're checking this to see if they could use Crawlee in some JS project. We have to be fair and objective in our comparisons, focus on facts, and refrain from using adjectives that glorify Crawlee or make it sound like it's better than Scrapy in some way. If Crawlee can do something and Scrapy can't, say exactly that - not that Crawlee is better because of it, or simpler, or whatever else. The readers can figure that out for themselves if you present them with all the facts. They're devs.

Compare apples to apples

When you're making comparisons and you choose some feature(s) to talk about, you should compare them with the exact same feature(s) of the other library. If you show how to do FIFO in Scrapy, you should show how to do FIFO in Crawlee. When you show how to set request retries in one, show it for the other as well. Basically, whenever you show an example for one library, show how to do the exact same thing with the other one, if possible. If not, say that the feature isn't available. When the examples show different actions, it makes them impossible to compare and reduces the usefulness of the comparison.
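
To make this concrete: if the draft shows Scrapy's `RETRY_TIMES` setting, the matching Crawlee snippet would configure the same thing (a sketch; `maxRequestRetries` is the corresponding Crawlee option, the value is arbitrary):

```js
import { CheerioCrawler } from 'crawlee';

// Equivalent of setting RETRY_TIMES = 5 in Scrapy's settings.py:
// retry each failed request up to five times.
const crawler = new CheerioCrawler({
    maxRequestRetries: 5,
    requestHandler: async ({ request, $ }) => {
        // ...
    },
});
```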


Both frameworks can handle a wide range of scraping tasks, and the best choice will depend on specific technical needs like language preference, project requirements, ease of use, etc.

If you are comfortable with Python and want to work only with it, go with Scrapy. It has very detailed documentation, and it is one of the oldest and most stable libraries in the space, but if you want to explore or are comfortable working with TypeScript or JavaScript, our recommendation is Crawlee. With all the valuable features like a single interface for HTTP requests and headless browsing, making it work well with JavaScript-heavy websites and autoscaling and fingerprint support, it is the best choice for scraping anything and everything from the internet.
Member:

it is the best choice for scraping anything and everything from the internet

That's too bold. And in general, let's make objective recommendations in this section, based on the analysis we did and the features.

@B4nan (Member) left a comment

left a few comments about the code example

@souravjain540 changed the title from "docs: adding first draft of the blog" to "docs: scrapy-vs-crawlee blog" May 13, 2024
@souravjain540 requested a review from mnmkng May 13, 2024 05:17
@mnmkng (Member) left a comment

Just a few minor nitpicks. We're getting there 👏

I noticed some code style issues. Are you lint:fixing the examples?

After you make the changes, please send it to Dave or Theo for editing, and we can release it after that.

@davidjohnbarton (Member) left a comment

Various changes.

@davidjohnbarton (Member) left a comment

Some more changes.

@davidjohnbarton (Member) left a comment

I think that should be it.

@souravjain540 (Collaborator, Author):

@B4nan, all good to go! :)

@B4nan (Member) left a comment

left a few code style notes, let's resolve them before we merge

@souravjain540 (Collaborator, Author):

@B4nan done!

@B4nan (Member) left a comment:

(image)

@B4nan merged commit 38c0942 into apify:master May 15, 2024
9 checks passed
gitworkflows pushed a commit to threatcode/crawlee that referenced this pull request May 25, 2024
* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* ci: test on node 22 (apify#2438)

* chore: use node 20 in templates

* chore(deps): update yarn to v4.2.1

* chore(deps): lock file maintenance

* fix: return true when robots.isAllowed returns undefined (apify#2439)

`undefined` means that there is no explicit rule for the requested
route. No rule means no disallow; therefore, it's allowed.

Fixes apify#2437
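
A sketch of the rule described above (a hypothetical helper, not the actual diff):

```js
// An undefined verdict from isAllowed() means there was no explicit rule
// for the route, and no rule means no disallow, so we treat it as allowed.
function isUrlAllowed(robotsTxtFile, url) {
    return robotsTxtFile.isAllowed(url) ?? true;
}
```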

---------

Co-authored-by: Jan Buchar <[email protected]>

* chore(deps): update patch/minor dependencies to v3.3.0

* chore(deps): update patch/minor dependencies to v3.3.2

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* docs: Should be "Same Domain" not "Same Subdomain" (apify#2445)

The docs appear to be a bit misleading. If people want "Same Subdomain"
they should actually use "Same Hostname".

![image](https://github.com/apify/crawlee/assets/10026538/2b5452c5-e313-404b-812d-811e0764bd2d)

* chore(docker): update docker state [skip ci]

* docs: fix two typos (array or requests -> array of requests, no much -> not much) (apify#2451)

* fix: sitemap `content-type` check breaks on `content-type` parameters (apify#2442)

According to the
[RFC1341](https://www.w3.org/Protocols/rfc1341/4_Content-Type.html), the
Content-type header can contain additional string parameters.

* chore(docker): update docker state [skip ci]

* chore(deps): lock file maintenance

* fix(core): fire local `SystemInfo` events every second (apify#2454)

During local development, we are firing events for the AutoscaledPool
about current system resources like memory or CPU. We were firing them
once a minute by default, but we remove those snapshots older than 30s,
so we never had anything to compare and always used only the very last
piece of information.

This PR changes the interval to 1s, aligning this with how the Apify
platform fires events.

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): update dependency linkedom to ^0.18.0 (apify#2457)

* chore(docker): update docker state [skip ci]

* perf: optimize adding large amount of requests via `crawler.addRequests()` (apify#2456)

This PR resolves three main issues with adding large amounts of requests
into the queue:
- Every request added to the queue was automatically added to the LRU
requests cache, which has a size of 1 million items. This makes sense
for enqueuing a few items, but if we try to add more than the limit, we
end up overloading the LRU cache for no reason. Now we only add the
first 1000 requests to the cache (plus any requests added via separate
calls, e.g. when doing `enqueueLinks` from inside a request handler,
again with a limit of the first 1000 links).
- We used to validate the whole requests array via `ow`, and since the
shape can vary, it was very slow (e.g. 20s just for the `ow`
validation). Now we use a tailored validation for the array that does
the same but resolves within 100ms or so.
- We always created the `Request` objects out of everything, which had a
significant impact on memory usage. Now we skip this completely and let
the objects be created later when needed (when calling
`RQ.addRequests()` which only receives the actual batch and not the
whole array)

Related: https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079
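
For reference, the optimization targets calls like this one, where a large array is enqueued at once (a usage sketch; the URLs are placeholders):

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        // ...
    },
});

// 100k URLs in one call - previously this overloaded the LRU cache
// and spent ~20s in `ow` validation before any crawling started.
const urls = Array.from({ length: 100_000 }, (_, i) => `https://example.com/page/${i}`);
await crawler.addRequests(urls);
await crawler.run();
```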

* perf: improve scaling based on memory (apify#2459)

We previously allowed using only 70% of the available memory; this PR
changes the limit to 90%. Tested with low-memory options, it did not
have any effect, while it allows using more memory on large-memory
setups - where the 30% could mean 2 GB or so, we don't need such a huge
buffer.

Also increases the scaling step to 10% instead of 5% to speed up the
scaling.

Related:
[apify.slack.com/archives/C0L33UM7Z/p1715109984834079](https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079)

* feat: make `RequestQueue` v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) (apify#2390)

Closes apify#2388

---------

Co-authored-by: drobnikj <[email protected]>
Co-authored-by: Martin Adámek <[email protected]>

* fix: do not drop statistics on migration/resurrection/resume (apify#2462)

This fixes a bug that was introduced with
apify#1844 and
apify#2083 - we reset the persisted
state for statistics and session pool each time a crawler is started,
which prevents their restoration.

---------

Co-authored-by: Martin Adámek <[email protected]>

* chore(deps): update patch/minor dependencies (apify#2450)

* chore(docker): update docker state [skip ci]

* fix: double tier decrement in tiered proxy (apify#2468)

* docs: scrapy-vs-crawlee blog (apify#2431)

Co-authored-by: Saurav Jain <[email protected]>
Co-authored-by: davidjohnbarton <[email protected]>

* perf: optimize `RequestList` memory footprint (apify#2466)

The request list now delays the conversion of the source items into the
`Request` objects, resulting in a significantly smaller memory footprint.

Related: https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079

* fix: `EnqueueStrategy.All` erroring with links using unsupported protocols (apify#2389)

This changes `EnqueueStrategy.All` to filter out non-http and non-https
URLs (`mailto:` links were causing the crawler to error).

Let me know if there's a better fix or if you want me to change
something.

Thanks!


```text
Request failed and reached maximum retries. Error: Received one or more errors
    at _ArrayValidator.handle (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/ArrayValidator.ts:102:17)
    at _ArrayValidator.parse (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/BaseValidator.ts:103:2)
    at RequestQueueClient.batchAddRequests (/path/to/project/node_modules/@crawlee/src/resource-clients/request-queue.ts:340:36)
    at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:238:46)
    at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:304:22)
    at attemptToAddToQueueAndAddAnyUnprocessed (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:302:42)
    at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:319:37)
    at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:309:22)
    at enqueueLinks (/path/to/project/node_modules/@crawlee/src/enqueue_links/enqueue_links.ts:384:2)
    at browserCrawlerEnqueueLinks (/path/to/project/node_modules/@crawlee/src/internals/browser-crawler.ts:777:21)
```

* fix(core): use createSessionFunction when loading Session from persisted state (apify#2444)

Changes SessionPool's new Session loading behavior in the core module to
utilize the configured createSessionFunction if specified. This ensures
that new Sessions are instantiated using the custom session creation
logic provided by the user, improving flexibility and adherence to user
configurations.

* fix(core): conversion between tough cookies and browser pool cookies (apify#2443)

Fixes the conversion from tough cookies to browser pool cookies and vice
versa, by correctly handling cookies where the domain has a leading dot
versus when it doesn't.

* test: fix e2e tests for zero concurrency

* chore(deps): update dependency puppeteer to v22.8.2

* chore(docker): update docker state [skip ci]

* docs: fixes (apify#2469)

@B4nan  minor fixes

* chore(deps): update dependency puppeteer to v22.9.0

* feat: implement ErrorSnapshotter for error context capture (apify#2332)

This commit introduces the ErrorSnapshotter class to the crawlee
package, providing functionality to capture screenshots and HTML
snapshots when an error occurs during web crawling.

This functionality is opt-in, and can be enabled via the crawler
options:

```ts
const crawler = new BasicCrawler({
  // ...
  statisticsOptions: {
    saveErrorSnapshots: true,
  },
});
```

Closes apify#2280

---------

Co-authored-by: Martin Adámek <[email protected]>

* test: fix e2e tests for error snapshotter

* feat: add `FileDownload` "crawler" (apify#2435)

Adds a new package `@crawlee/file-download`, which overrides the
`HttpCrawler`'s MIME type limitations and allows users to download
arbitrary files.

Aside from the regular `requestHandler`, this crawler introduces
`streamHandler`, which passes a `ReadableStream` with the downloaded
data to the user handler.

---------

Co-authored-by: Martin Adámek <[email protected]>
Co-authored-by: Jan Buchar <[email protected]>

* chore(release): v3.10.0

* chore(release): update internal dependencies [skip ci]

* chore(docker): update docker state [skip ci]

* docs: add v3.10 snapshot

* docs: fix broken link for a moved content

* chore(deps): lock file maintenance

* docs: improve crawlee seo ranking (apify#2472)

* chore(deps): lock file maintenance

* refactor: Remove redundant fields from `StatisticsPersistedState` (apify#2475)

Those fields are duplicated in the base class anyway.

* chore(deps): lock file maintenance

* fix: provide URLs to the error snapshot (apify#2482)

This will respect the Actor SDK override automatically since importing
the SDK will fire this side effect:

https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts#L25

* docs: update keywords (apify#2481)

Co-authored-by: Saurav Jain <[email protected]>

* docs: add feedback from community. (apify#2478)

Co-authored-by: Saurav Jain <[email protected]>
Co-authored-by: Martin Adámek <[email protected]>
Co-authored-by: davidjohnbarton <[email protected]>

* chore: use biome for code formatting (apify#2301)

This takes ~50ms on my machine 🤯 

- closes apify#2366 
- Replacing spaces with tabs won't be done right here, right now.
- eslint and biome are reconciled
- ~biome check fails because of typescript errors - we can either fix
those or find a way to ignore it~

* chore(docker): update docker state [skip ci]

* test: Check if the proxy tier drops after an amount of successful requests (apify#2490)

* chore: ignore docker state when checking formatting (apify#2491)

* chore: remove unused eslint ignore directives

* chore: fix formatting

* chore: run biome as a pre-commit hook (apify#2493)

* fix: adjust `URL_NO_COMMAS_REGEX` regexp to allow single character hostnames (apify#2492)

Closes apify#2487

* fix: investigate and temp fix for possible 0-concurrency bug in RQv2 (apify#2494)

* test: add e2e test for zero concurrency with RQ v2

* chore: update biome

* chore(docker): update docker state [skip ci]

* chore(deps): lock file maintenance (apify#2495)

* chore(release): v3.10.1

* chore(release): update internal dependencies [skip ci]

* chore(docker): update docker state [skip ci]

* chore: add undeclared dependency

* chore(deps): update patch/minor dependencies to v1.44.1

* chore(deps): lock file maintenance

* chore(docker): update docker state [skip ci]

* feat: Loading sitemaps from string (apify#2496)

- closes apify#2460

* docs: fix homepage gradients (apify#2500)

* fix: Autodetect sitemap filetype from content (apify#2497)

- closes apify#2461

* chore(deps): update dependency puppeteer to v22.10.0

* chore(deps): lock file maintenance

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Martin Adámek <[email protected]>
Co-authored-by: Gigino Chianese <[email protected]>
Co-authored-by: Jan Buchar <[email protected]>
Co-authored-by: Connor Adams <[email protected]>
Co-authored-by: Apify Release Bot <[email protected]>
Co-authored-by: Jiří Spilka <[email protected]>
Co-authored-by: Jindřich Bär <[email protected]>
Co-authored-by: Vlad Frangu <[email protected]>
Co-authored-by: drobnikj <[email protected]>
Co-authored-by: Jan Buchar <[email protected]>
Co-authored-by: Saurav Jain <[email protected]>
Co-authored-by: Saurav Jain <[email protected]>
Co-authored-by: davidjohnbarton <[email protected]>
Co-authored-by: Stefan Sundin <[email protected]>
Co-authored-by: Gustavo Silva <[email protected]>
Co-authored-by: Hamza Alwan <[email protected]>