docs: blog How to Scrape Crunchbase Using Python #2759

Merged: 15 commits, Jan 14, 2025

Conversation

@Mantisus (Contributor) commented Dec 2, 2024

New draft @souravjain540

website/blog/2024/12-02-scrape-crunchbase/index.md (four review threads, outdated and resolved)
authors: [MaxB]
---

If you're working on a project that requires data about various companies and you know Python, you're in the right place to learn how to effectively scrape [Crunchbase](https://www.crunchbase.com/). It's hard to imagine a better source for gathering essential company data such as location, main business areas, founders, investment round participation, and much more. However, like any site dealing with massive amounts of information, we need an effective tool to automate data extraction and transform it into a format suitable for further analysis.
Collaborator:

Maybe also add what the final outcome of the tutorial will be?

Collaborator:

Also, it would be good to list all the steps up front.

website/blog/2024/12-02-scrape-crunchbase/index.md (review thread, outdated and resolved)
Comment on lines 267 to 269
1. [Create a Crunchbase account](https://www.crunchbase.com/register)
2. Go to the [Integrations](https://www.crunchbase.com/integrations/crunchbase-api) section
3. Create a Crunchbase Basic API key
Collaborator:

One link is enough.

Comment on lines 482 to 486
Each method has its advantages, but for most projects, I recommend using the official API despite its limitations in the free version.

The complete source code for all described solutions is available in my [repository](https://github.com/Mantisus/crunchbase-crawlee).

I'd be happy to discuss in the comments which approach seems optimal for your needs and why.
Collaborator:

I loved this blog. I just think it's initially missing what the reader should expect: it wasn't made clear enough that we're going to explore all three approaches, or what exactly we're doing.

Also, breaking down the steps at the start would make the blog more guided.

Good work!

@Mantisus Mantisus requested a review from souravjain540 January 3, 2025 05:07
@Mantisus Mantisus marked this pull request as ready for review January 10, 2025 10:15
@souravjain540 souravjain540 requested a review from vdusek January 10, 2025 10:16
@souravjain540 (Collaborator):

@vdusek can you please have a final look before it goes live? Otherwise, everything is checked and approved!

@vdusek (Contributor) left a comment:

Nice, just a few things.

  • There is no mention of the `__main__.py` content.
  • There is no mention of executing the project.

Plus the other notes in the code.

website/blog/2025/01-03-scrape-crunchbase/index.md (review thread, outdated and resolved)
Comment on lines 41 to 73
### Project setup

Before we start scraping, we need to set up our project. In this guide, we won't be using crawler templates (`Playwright` and `BeautifulSoup`), so we'll set up the project manually.

1. Install [`Poetry`](https://python-poetry.org/)

```bash
pipx install poetry
```

2. Create and navigate to the project folder.

```bash
mkdir crunchbase-crawlee && cd crunchbase-crawlee
```

3. Initialize the project using Poetry, leaving all fields empty.

```bash
poetry init
```

4. Add and install Crawlee with the necessary dependencies to your project using `Poetry`.

```bash
poetry add crawlee[parsel,curl-impersonate]
```

5. Complete the project setup by creating the standard file structure for `Crawlee for Python` projects.

```bash
mkdir crunchbase-crawlee && touch crunchbase-crawlee/{__init__.py,__main__.py,main.py,routes.py}
```
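One of the files created above, `__main__.py`, is what makes the package runnable with `python -m`. Its content isn't shown in the thread; a minimal sketch, assuming the `main()` coroutine is defined in `main.py` (the fallback stub exists only to keep this snippet self-contained):

```python
# Hypothetical crunchbase-crawlee/__main__.py sketch: entry point for
# `python -m crunchbase-crawlee`, assuming main() lives in main.py.
import asyncio

try:
    from .main import main  # package-relative import used inside the real project
except ImportError:
    # Illustrative stub so this sketch runs on its own.
    async def main() -> None:
        print('crawler entry point')

if __name__ == '__main__':
    asyncio.run(main())
```

With a file like this in place, `poetry run python -m crunchbase-crawlee` would start the crawler.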
Contributor:

Have you tried it with Poetry 2.0? For example, it requires a README.md file to exist.

@Mantisus (Contributor, Author) commented Jan 13, 2025:

Updated to Poetry 2.0. But README.md is not required to run the project.

Comment on lines 126 to 150
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import CurlImpersonateHttpClient
from crawlee import ConcurrencySettings, HttpHeaders


async def main() -> None:
    """The crawler entry point."""
    concurrency_settings = ConcurrencySettings(max_concurrency=1, max_tasks_per_minute=50)

    http_client = CurlImpersonateHttpClient(
        impersonate="safari17_0",
        headers=HttpHeaders(
            {
                "accept-language": "en",
                "accept-encoding": "gzip, deflate, br, zstd",
            }
        ),
    )

    crawler = ParselCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=30,
    )

    await crawler.run(["https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"])

    await crawler.export_data_json("crunchbase_data.json")
Contributor:

Could you please use consistent formatting? You can use the configs from Crawlee for Python. This applies to all files.

        "accept-encoding": "gzip, deflate, br, zstd",
    }))
crawler = ParselCrawler(
    request_handler=router,
Contributor:

`router` is an undefined symbol.

@Mantisus Mantisus requested a review from vdusek January 13, 2025 14:05
@vdusek (Contributor) left a comment:

LGTM

@souravjain540 souravjain540 merged commit 14a75c7 into apify:master Jan 14, 2025
9 checks passed