docs: blog How to Scrape Crunchbase Using Python #2759

Merged: 15 commits, Jan 14, 2025

Conversation

@Mantisus (Contributor) commented Dec 2, 2024

New draft @souravjain540

website/blog/2024/12-02-scrape-crunchbase/index.md (four review threads, outdated and resolved)
authors: [MaxB]
---

If you're working on a project that requires data about various companies and you know Python, you're in the right place to learn how to effectively scrape [Crunchbase](https://www.crunchbase.com/). It's hard to imagine a better source for gathering essential company data such as location, main business areas, founders, investment round participation, and much more. However, like any site dealing with massive amounts of information, we need an effective tool to automate data extraction and transform it into a format suitable for further analysis.
Collaborator:

Maybe also add what the final outcome of the tutorial will be?

Collaborator:

Also, it would be good to list all the steps up front.

website/blog/2024/12-02-scrape-crunchbase/index.md (review thread, outdated and resolved)
Comment on lines 267 to 269
1. [Create a Crunchbase account](https://www.crunchbase.com/register)
2. Go to the [Integrations](https://www.crunchbase.com/integrations/crunchbase-api) section
3. Create a Crunchbase Basic API key
Collaborator:

One link is enough.

Comment on lines 482 to 486
Each method has its advantages, but for most projects, I recommend using the official API despite its limitations in the free version.

The complete source code for all described solutions is available in my [repository](https://github.com/Mantisus/crunchbase-crawlee).

I'd be happy to discuss in the comments which approach seems optimal for your needs and why.
Collaborator:

I loved this blog. I just think it's initially missing what the reader should expect: it wasn't made clear enough that we're going to explore all three approaches, or what exactly we're doing.

Also, breaking down the steps at the start would make the blog more guided.

Good work!

@Mantisus Mantisus requested a review from souravjain540 January 3, 2025 05:07
@Mantisus Mantisus marked this pull request as ready for review January 10, 2025 10:15
@souravjain540 souravjain540 requested a review from vdusek January 10, 2025 10:16
@souravjain540 (Collaborator):

@vdusek can you please have a final look before it goes live? Otherwise, everything is checked and approved!

@vdusek (Contributor) left a comment:

Nice, just a few things.

  • There is no mention of the `__main__.py` content.
  • There is no mention of executing the project.

Plus the other notes in the code.

website/blog/2025/01-03-scrape-crunchbase/index.md (review thread, outdated and resolved)
Comment on lines 41 to 73
### Project setup

Before we start scraping, we need to set up our project. In this guide, we won't be using crawler templates (`Playwright` and `BeautifulSoup`), so we'll set up the project manually.

1. Install [`Poetry`](https://python-poetry.org/)

```bash
pipx install poetry
```

2. Create and navigate to the project folder.

```bash
mkdir crunchbase-crawlee && cd crunchbase-crawlee
```

3. Initialize the project using Poetry, leaving all fields empty.

```bash
poetry init
```

4. Add and install Crawlee with the necessary dependencies to your project using `Poetry`.

```bash
poetry add crawlee[parsel,curl-impersonate]
```

5. Complete the project setup by creating the standard file structure for `Crawlee for Python` projects.

```bash
mkdir crunchbase-crawlee && touch crunchbase-crawlee/{__init__.py,__main__.py,main.py,routes.py}
```
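One of the files created above, `__main__.py`, is what makes the package runnable with `python -m`. Its content isn't shown in the thread; a minimal sketch, assuming the `main()` coroutine is defined in `main.py` (the fallback stub exists only to keep this snippet self-contained):

```python
# Hypothetical crunchbase-crawlee/__main__.py sketch: entry point for
# `python -m crunchbase-crawlee`, assuming main() lives in main.py.
import asyncio

try:
    from .main import main  # package-relative import used inside the real project
except ImportError:
    # Illustrative stub so this sketch runs on its own.
    async def main() -> None:
        print('crawler entry point')

if __name__ == '__main__':
    asyncio.run(main())
```

With a file like this in place, `poetry run python -m crunchbase-crawlee` would start the crawler.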
Contributor:

Have you tried it with Poetry 2.0? For example, it requires a README.md file to exist.

@Mantisus (Contributor, Author) commented Jan 13, 2025:

Updated to Poetry 2.0. But README.md is not required to run the project.

Comment on lines 126 to 150
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import CurlImpersonateHttpClient
from crawlee import ConcurrencySettings, HttpHeaders


async def main() -> None:
    """The crawler entry point."""
    concurrency_settings = ConcurrencySettings(max_concurrency=1, max_tasks_per_minute=50)

    http_client = CurlImpersonateHttpClient(
        impersonate="safari17_0",
        headers=HttpHeaders(
            {
                "accept-language": "en",
                "accept-encoding": "gzip, deflate, br, zstd",
            }
        ),
    )

    crawler = ParselCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=30,
    )

    await crawler.run(["https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"])

    await crawler.export_data_json("crunchbase_data.json")
Contributor:

Could you please use consistent formatting? You can use the configs from Crawlee for Python. This applies to all files.

        "accept-encoding": "gzip, deflate, br, zstd",
    }))
crawler = ParselCrawler(
    request_handler=router,
Contributor:

`router` is an undefined symbol.

@Mantisus Mantisus requested a review from vdusek January 13, 2025 14:05
@vdusek (Contributor) left a comment:

LGTM

@souravjain540 souravjain540 merged commit 14a75c7 into apify:master Jan 14, 2025
9 checks passed