Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor!: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance #746

Merged
merged 52 commits into from
Dec 6, 2024

Conversation

Pijukatel
Copy link
Contributor

@Pijukatel Pijukatel commented Nov 26, 2024

Reworked http based crawlers inheritance.
StaticContentCrawler is parent of BeautifulSoupCrawler, ParselCrawler and HttpCrawler.

StaticContentCrawler is generic. Specific versions depend on the type of parser used for parsing http response.

Breaking change:
Renamed BeautifulSoupParser to BeautifulSoupParserType (it is just string literal to properly set BeautiflSoup)
BeautifulSoupParser is used for new class that is the parser used by BeautifulSoupCrawler

@Pijukatel Pijukatel added t-tooling Issues with this label are in the ownership of the tooling team. debt Code quality improvement or decrease of technical debt. labels Nov 26, 2024
@github-actions github-actions bot added this to the 103rd sprint - Tooling team milestone Nov 26, 2024
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Nov 26, 2024
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@Pijukatel Pijukatel requested a review from janbuchar November 26, 2024 11:43
@Pijukatel Pijukatel marked this pull request as ready for review November 26, 2024 11:43
@Pijukatel Pijukatel requested a review from vdusek November 26, 2024 11:45
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Good job for going through with this. Since this is a pretty critical overhaul though, I was extra pedantic... So please understand this is not bullying 😄

src/crawlee/basic_crawler/_basic_crawler.py Outdated Show resolved Hide resolved
src/crawlee/playwright_crawler/_playwright_crawler.py Outdated Show resolved Hide resolved
src/crawlee/parsel_crawler/_parsel_crawling_context.py Outdated Show resolved Hide resolved
src/crawlee/http_crawler/_http_crawler.py Outdated Show resolved Hide resolved
src/crawlee/http_crawler/_http_crawler.py Outdated Show resolved Hide resolved
src/crawlee/http_crawler/_http_crawler.py Outdated Show resolved Hide resolved
src/crawlee/http_crawler/_http_crawling_context.py Outdated Show resolved Hide resolved
src/crawlee/http_crawler/_http_parser.py Outdated Show resolved Hide resolved
@Pijukatel Pijukatel requested review from vdusek and janbuchar December 3, 2024 09:03
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but

  1. let's agree on the naming of StaticContentParser
  2. wait for @vdusek to also approve this

docs/guides/static_content_crawlers.mdx Outdated Show resolved Hide resolved
docs/guides/static_content_crawlers.mdx Outdated Show resolved Hide resolved
docs/guides/static_content_crawlers.mdx Outdated Show resolved Hide resolved
docs/guides/static_content_crawlers.mdx Outdated Show resolved Hide resolved
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is breaking because of the BeautifulSoupParser. Please add the exclamation mark to the PR title.

@Pijukatel Pijukatel changed the title refactor: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance refactor!: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance Dec 3, 2024
@Pijukatel
Copy link
Contributor Author

This is breaking because of the BeautifulSoupParser. Please add the exclamation mark to the PR title.

Ahaaa, so thats the reason for exclamation marks in other PRs :-)

@Pijukatel Pijukatel requested a review from vdusek December 3, 2024 12:46
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are getting there 🙂. Mostly just docs related changes.

docs/guides/http_crawlers.mdx Outdated Show resolved Hide resolved
Comment on lines 1 to 24
---
id: http-crawlers
title: HTTP crawlers
description: Crawlee supports multiple http crawlers that can be used to extract data from server-rendered webpages.
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

Generic class <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> is parent to <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> and it could be used as parent for your crawler with custom content parsing requirements.

It already includes almost all the functionality to crawl webpages and the only missing part is the parser that should be used to parse HTTP responses, and a context dataclass that defines what context helpers will be available to user handler functions.

## `BeautifulSoupCrawler`
<ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> uses <ApiLink to="class/BeautifulSoupParser">`BeautifulSoupParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/BeautifulSoupCrawlingContext">`BeautifulSoupCrawlingContext`</ApiLink> in the `.soup` or `.parsed_content` attribute.

## `ParselCrawler`
<ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> uses <ApiLink to="class/ParselParser">`ParselParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink> in the `.selector` or `.parsed_content` attribute.

## `HttpCrawler`
<ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> uses <ApiLink to="class/NoParser">`NoParser`</ApiLink> that does not parse the HTTP response at all and is to be used if no parsing is required.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, this is great and definitely better than nothing. However, it is quite short and might not look good when rendered on the page. For comparison, take a look at guides like HTTP Clients or Result Storages. It should aim for similar depth and verbosity, including usage examples.

This should not be a blocker for the merging, as we have been improving the docs all the time. If you decide not-to-update it now, please open a new issue for it. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was scratching my head trying to come up with something for those docs. The problem is, that the only example I can think of, is implementing your own HTTP based Crawler (other examples in other files already show how to crawlee). But such example exists already in our code base and it is BSCrawler and ParselCrawler, so I can just point to those two.
If you think something specific is missing, please let me know and I can do add that.

src/crawlee/abstract_http_crawler/_abstract_http_parser.py Outdated Show resolved Hide resolved
src/crawlee/parsel_crawler/_parsel_crawler.py Outdated Show resolved Hide resolved
src/crawlee/parsel_crawler/_parsel_crawling_context.py Outdated Show resolved Hide resolved
src/crawlee/parsel_crawler/_parsel_parser.py Outdated Show resolved Hide resolved
docs/guides/http_crawlers.mdx Outdated Show resolved Hide resolved
docs/guides/http_crawlers.mdx Outdated Show resolved Hide resolved
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks. Just please merge it once we are sure v0.4.5 is alright.

@Pijukatel Pijukatel merged commit 9d3c269 into master Dec 6, 2024
23 checks passed
@Pijukatel Pijukatel deleted the new-class-hier-current-middleware branch December 6, 2024 12:10
Pijukatel added a commit that referenced this pull request Dec 10, 2024
Mantisus pushed a commit to Mantisus/crawlee-python that referenced this pull request Dec 10, 2024
…inheritance (apify#746)

Reworked http based crawlers inheritance.
StaticContentCrawler is parent of BeautifulSoupCrawler, ParselCrawler
and HttpCrawler.

StaticContentCrawler is generic. Specific versions depend on the type of
parser used for parsing http response.

**Breaking change:**
Renamed BeautifulSoupParser to BeautifulSoupParserType (it is just
string literal to properly set BeautiflSoup)
BeautifulSoupParser is used for new class that is the parser used by
BeautifulSoupCrawler

- Closes: [ Reconsider crawler inheritance apify#350
](apify#350)

---------

Co-authored-by: Jan Buchar <[email protected]>
Co-authored-by: Vlada Dusek <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
debt Code quality improvement or decrease of technical debt. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reconsider crawler inheritance
3 participants