docs: blog How to Scrape Crunchbase Using Python #2759
authors: [MaxB]
---

If you're working on a project that requires data about various companies and you know Python, you're in the right place to learn how to effectively scrape [Crunchbase](https://www.crunchbase.com/). It's hard to imagine a better source for gathering essential company data such as location, main business areas, founders, investment round participation, and much more. However, as with any site that holds massive amounts of information, we need an effective tool to automate data extraction and transform the data into a format suitable for further analysis.
Maybe also add what the final outcome of the tutorial will be?
Also, it would be good to list all the steps up front.
1. [Create a Crunchbase account](https://www.crunchbase.com/register)
2. Go to the [Integrations](https://www.crunchbase.com/integrations/crunchbase-api) section
3. Create a Crunchbase Basic API key
One link is enough.
Each method has its advantages, but for most projects, I recommend using the official API despite its limitations in the free version.

The complete source code for all described solutions is available in my [repository](https://github.com/Mantisus/crunchbase-crawlee).

I'd be happy to discuss in the comments which approach seems optimal for your needs and why.
I loved this blog. I just think it is missing, early on, what the user should expect: it was not made clear enough that we are going to explore all three approaches, or what exactly we are doing.
Also, if we can break down the steps at the start, it will make the blog more guided.
Good work!
Co-authored-by: Saurav Jain <[email protected]>
@vdusek can you please have a final look before it goes live? Otherwise, everything is checked and approved!
Nice, just a few things.
- There is no mention of the `__main__.py` content.
- There is no mention of how to execute the project.
Plus the other notes in the code.
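A minimal sketch of what the missing `__main__.py` could look like, assuming the package exposes an async `main()` coroutine in `main.py` as in the excerpt under review. The placeholder `main()` below is hypothetical and only stands in for the real crawler entry point so that the snippet runs on its own; in the project it would be replaced by `from .main import main`.

```python
import asyncio


async def main() -> str:
    """Hypothetical stand-in for the crawler entry point defined in main.py."""
    # In the real package this function would come from: from .main import main
    await asyncio.sleep(0)  # yield to the event loop once, as real async work would
    return "crawl finished"


def run() -> str:
    """Run the async entry point to completion and return its result."""
    return asyncio.run(main())


if __name__ == "__main__":
    print(run())
```

With a file like this in place, a package is typically started with `python -m <package_name>`, or with `poetry run python -m <package_name>` inside the Poetry environment.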
### Project setup

Before we start scraping, we need to set up our project. In this guide, we won't be using crawler templates (`Playwright` and `BeautifulSoup`), so we'll set up the project manually.

1. Install [`Poetry`](https://python-poetry.org/).

```bash
pipx install poetry
```

2. Create and navigate to the project folder.

```bash
mkdir crunchbase-crawlee && cd crunchbase-crawlee
```

3. Initialize the project using Poetry, leaving all fields empty.

```bash
poetry init
```

4. Add and install Crawlee with the necessary dependencies using `Poetry`.

```bash
poetry add "crawlee[parsel,curl-impersonate]"
```

5. Complete the project setup by creating the standard file structure for `Crawlee for Python` projects.

```bash
mkdir crunchbase-crawlee && touch crunchbase-crawlee/{__init__.py,__main__.py,main.py,routes.py}
```
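The last step relies on shell brace expansion, which bash and zsh support but not every shell does. A rough Python equivalent of that step is sketched below; the helper name `create_package` is made up for illustration, and the demo writes to a temporary directory rather than the real project folder.

```python
import tempfile
from pathlib import Path

# The four files named in the setup step above.
PACKAGE_FILES = ("__init__.py", "__main__.py", "main.py", "routes.py")


def create_package(root: Path, package_name: str = "crunchbase-crawlee") -> Path:
    """Create the package folder and its empty module files under root."""
    package_dir = root / package_name
    package_dir.mkdir(parents=True, exist_ok=True)
    for file_name in PACKAGE_FILES:
        (package_dir / file_name).touch()  # like `touch`: create the file if missing
    return package_dir


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing in the project is touched.
    with tempfile.TemporaryDirectory() as tmp:
        created = create_package(Path(tmp))
        print(sorted(p.name for p in created.iterdir()))
```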
Have you tried it with Poetry 2.0? For example, it requires a `README.md` file to exist.
Updated for Poetry 2.0. However, `README.md` is not required to run the project.
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import CurlImpersonateHttpClient
from crawlee import ConcurrencySettings, HttpHeaders

async def main() -> None:
    """The crawler entry point."""

    concurrency_settings = ConcurrencySettings(max_concurrency=1, max_tasks_per_minute=50)

    http_client = CurlImpersonateHttpClient(impersonate="safari17_0",
                                            headers=HttpHeaders({
                                                "accept-language": "en",
                                                "accept-encoding": "gzip, deflate, br, zstd",
                                            }))
    crawler = ParselCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=30,
    )

    await crawler.run(['https://www.crunchbase.com/www-sitemaps/sitemap-index.xml'])

    await crawler.export_data_json("crunchbase_data.json")
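The crawl in this excerpt starts from a sitemap index, so whatever handler `router` dispatches to has to pull the nested sitemap URLs out of that XML. Below is a stdlib-only sketch of that step, assuming the index follows the standard sitemaps.org schema. The sample document and its URLs are made up for illustration, and `extract_sitemap_urls` is a hypothetical helper, not the handler from this PR.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample; the real Crunchbase index may differ in size and naming.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemaps/organizations-1.xml.gz</loc></sitemap>
  <sitemap><loc>https://example.com/sitemaps/people-1.xml.gz</loc></sitemap>
</sitemapindex>
"""


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> URL of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc> elements
    ]


if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE_INDEX):
        print(url)
```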
Could you please use consistent formatting? You can use the configs from Crawlee for Python. This applies to all files.
                                                "accept-encoding": "gzip, deflate, br, zstd",
                                            }))
    crawler = ParselCrawler(
        request_handler=router,
`router` is an undefined symbol.
Co-authored-by: Vlada Dusek <[email protected]>
LGTM
new draft @souravjain540