refactor!: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance #746

Pijukatel · 2024-11-26T10:43:42Z

Reworked the inheritance of the HTTP-based crawlers.
StaticContentCrawler is the parent of BeautifulSoupCrawler, ParselCrawler and HttpCrawler.

StaticContentCrawler is generic. Each specific version depends on the type of parser used to parse the HTTP response.
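A minimal sketch of the idea, not the code from this PR: a generic base crawler is parameterized by the type its parser produces, and each concrete crawler only chooses a parser. All names below (`Parser`, `NoOpParser`, `TitleParser`, `StaticCrawlerSketch`, `TitleCrawler`, `RawCrawler`) are hypothetical and rely only on the standard library.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Generic, TypeVar

TParseResult = TypeVar('TParseResult')


class Parser(ABC, Generic[TParseResult]):
    """Hypothetical parser interface: turns a raw HTTP body into a parsed result."""

    @abstractmethod
    def parse(self, body: bytes) -> TParseResult: ...


class NoOpParser(Parser[bytes]):
    """Returns the body untouched, for crawlers that need no parsing."""

    def parse(self, body: bytes) -> bytes:
        return body


class TitleParser(Parser[str]):
    """Toy parser that extracts the text between <title> tags."""

    def parse(self, body: bytes) -> str:
        text = body.decode('utf-8')
        start = text.find('<title>')
        end = text.find('</title>')
        if start == -1 or end == -1:
            return ''
        return text[start + len('<title>'):end]


class StaticCrawlerSketch(Generic[TParseResult]):
    """Generic base: shared logic lives here, the parser is the only varying part."""

    def __init__(self, parser: Parser[TParseResult]) -> None:
        self._parser = parser

    def handle_response(self, body: bytes) -> TParseResult:
        # A real crawler would fetch the body over HTTP; here it is passed in.
        return self._parser.parse(body)


class TitleCrawler(StaticCrawlerSketch[str]):
    """Concrete crawler whose handlers receive a string."""

    def __init__(self) -> None:
        super().__init__(TitleParser())


class RawCrawler(StaticCrawlerSketch[bytes]):
    """Concrete crawler whose handlers receive the unparsed bytes."""

    def __init__(self) -> None:
        super().__init__(NoOpParser())
```

With this shape, a type checker can infer that `TitleCrawler().handle_response(...)` returns `str` and `RawCrawler().handle_response(...)` returns `bytes`, which is the kind of benefit the generic parent is meant to provide.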

Breaking change:
Renamed BeautifulSoupParser to BeautifulSoupParserType (it is just a string literal used to configure BeautifulSoup).
The name BeautifulSoupParser is now used for a new class: the parser used by BeautifulSoupCrawler.
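To illustrate the rename, here is a hedged, standalone sketch: the string-literal alias is called `BeautifulSoupParserType`, while `BeautifulSoupParser` names a class. The literal members, the `'lxml'` default, and the stand-in class body are assumptions for illustration and are not copied from the crawlee source.

```python
from typing import Literal, get_args

# Assumed members: parser names that BeautifulSoup itself accepts.
BeautifulSoupParserType = Literal['html.parser', 'lxml', 'xml', 'html5lib']


class BeautifulSoupParser:
    """Stand-in for the new parser class; the real one wraps BeautifulSoup."""

    def __init__(self, parser: BeautifulSoupParserType = 'lxml') -> None:
        # Reject values outside the literal at runtime as well as at type-check time.
        if parser not in get_args(BeautifulSoupParserType):
            raise ValueError(f'Unsupported parser: {parser!r}')
        self.parser = parser
```

Code that previously used `BeautifulSoupParser` as a type annotation for the parser name would switch that annotation to `BeautifulSoupParserType`.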

Closes: Reconsider crawler inheritance #350

UTs working. Generics stretched to limits, probably not worth it to keep BScrawlingcontext

Solved middleware issues.

HttpCrawler made generic. BeautifulSoup and Parsel crawlers inherit from this new generic.

…ent-middleware

github-actions

⚠️ Pull Request Toolkit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!


janbuchar

Nice! Good job for going through with this. Since this is a pretty critical overhaul though, I was extra pedantic... So please understand this is not bullying 😄

src/crawlee/basic_crawler/_basic_crawler.py

tests/unit/beautifulsoup_crawler/test_beautifulsoup_crawler.py

src/crawlee/playwright_crawler/_playwright_crawler.py

src/crawlee/parsel_crawler/_parsel_crawling_context.py

src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawler.py

src/crawlee/http_crawler/_http_crawler.py

src/crawlee/http_crawler/_http_crawling_context.py

src/crawlee/http_crawler/_http_parser.py

src/crawlee/static_content_crawler/_static_content_crawler.py

janbuchar

LGTM, but:

- let's agree on the naming of StaticContentParser
- wait for @vdusek to also approve this

docs/guides/static_content_crawlers.mdx

vdusek

This is breaking because of the BeautifulSoupParser. Please add the exclamation mark to the PR title.

Co-authored-by: Jan Buchar <[email protected]>

…e name changes.

Pijukatel · 2024-12-03T12:41:57Z

> This is breaking because of the BeautifulSoupParser. Please add the exclamation mark to the PR title.

Ahaaa, so that's the reason for the exclamation marks in other PRs :-)

vdusek

We are getting there 🙂. Mostly just docs related changes.

docs/guides/http_crawlers.mdx

vdusek · 2024-12-03T13:22:49Z

docs/guides/http_crawlers.mdx

+---
+id: http-crawlers
+title: HTTP crawlers
+description: Crawlee supports multiple http crawlers that can be used to extract data from server-rendered webpages.
+---
+
+import ApiLink from '@site/src/components/ApiLink';
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+import CodeBlock from '@theme/CodeBlock';
+
+Generic class <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> is parent to <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> and it could be used as parent for your crawler with custom content parsing requirements.
+
+It already includes almost all the functionality to crawl webpages and the only missing part is the parser that should be used to parse HTTP responses, and a context dataclass that defines what context helpers will be available to user handler functions.
+
+## `BeautifulSoupCrawler`
+<ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> uses <ApiLink to="class/BeautifulSoupParser">`BeautifulSoupParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/BeautifulSoupCrawlingContext">`BeautifulSoupCrawlingContext`</ApiLink> in the `.soup` or `.parsed_content` attribute.
+
+## `ParselCrawler`
+<ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> uses <ApiLink to="class/ParselParser">`ParselParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink> in the `.selector` or `.parsed_content` attribute.
+
+## `HttpCrawler`
+<ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> uses <ApiLink to="class/NoParser">`NoParser`</ApiLink> that does not parse the HTTP response at all and is to be used if no parsing is required.
+


I mean, this is great and definitely better than nothing. However, it is quite short and might not look good when rendered on the page. For comparison, take a look at guides like HTTP Clients or Result Storages. It should aim for similar depth and verbosity, including usage examples.

This should not be a blocker for merging, as we have been improving the docs all the time. If you decide not to update it now, please open a new issue for it. Thanks.

I was scratching my head trying to come up with something for those docs. The problem is that the only example I can think of is implementing your own HTTP-based crawler (examples in other files already show how to use Crawlee). But such an example already exists in our code base: BeautifulSoupCrawler and ParselCrawler, so I can just point to those two.
If you think something specific is missing, please let me know and I can add it.
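One possible shape for such an example, sketched with the standard library only: a context dataclass that is generic over the parse result and exposes it as `parsed_content`, plus a subclass that adds a convenience alias, in the way the guide describes `.soup` and `.selector`. The class names, the `url` field, the `words` alias, and `parse_words` are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Generic, TypeVar

TParseResult = TypeVar('TParseResult')


@dataclass(frozen=True)
class ParsedContextSketch(Generic[TParseResult]):
    """Hypothetical base context: carries the URL and whatever the parser produced."""

    url: str
    parsed_content: TParseResult


@dataclass(frozen=True)
class WordListContext(ParsedContextSketch[list[str]]):
    """Specialized context that adds a parser-specific alias for the parsed content."""

    @property
    def words(self) -> list[str]:
        # Alias for parsed_content, analogous to a parser-specific attribute name.
        return self.parsed_content


def parse_words(body: bytes) -> list[str]:
    """Toy parser: splits the decoded body on whitespace."""
    return body.decode('utf-8').split()


context = WordListContext(
    url='https://example.com',
    parsed_content=parse_words(b'hello crawlee world'),
)
```

A user handler written against `WordListContext` can read either `context.words` or `context.parsed_content`; both refer to the same object.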

src/crawlee/abstract_http_crawler/_abstract_http_parser.py

src/crawlee/abstract_http_crawler/_abstract_http_crawler.py

src/crawlee/parsel_crawler/_parsel_crawler.py

src/crawlee/parsel_crawler/_parsel_crawling_context.py

src/crawlee/parsel_crawler/_parsel_parser.py

src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawling_context.py

Co-authored-by: Vlada Dusek <[email protected]>

…rawler.

Co-authored-by: Vlada Dusek <[email protected]>

docs/guides/http_crawlers.mdx

src/crawlee/abstract_http_crawler/_abstract_http_crawler.py

src/crawlee/parsel_crawler/_parsel_crawler.py

src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawler.py

Co-authored-by: Vlada Dusek <[email protected]>

src/crawlee/basic_crawler/_basic_crawler.py

vdusek

LGTM, thanks. Just please merge it once we are sure v0.4.5 is alright.


Pijukatel added 12 commits November 21, 2024 09:41

WIP

8c8dd24

Draft proposal for discussion.

48812b1

Remove redundant type

853ee85

BeautifulSoupParser

17e08a1

UTs working. Generics stretched to limits, probably not worth it to keep BScrawlingcontext

Being stuck on mypy and generics

188afdb

Almost there. Figure out the reason for casts in middleware

96356d6

Solved BScrawler. Next ParselCrawler.

def0e72

Solved middleware issues.

Reworked ParselCrawler

54ce154

Ready for review.

4692fe9

HttpCrawler made generic. BeautifulSoup and Parsel crawlers inherit from this new generic.

Merge remote-tracking branch 'origin/master' into new-class-hier-curr…

e2e3cd9

…ent-middleware

Edit forgotten comment.

bb8cd12

Remove mistaken edits in docs

f869be6

Pijukatel added the t-tooling and debt labels Nov 26, 2024

Merge branch 'master' into new-class-hier-current-middleware

81e46cd

github-actions bot assigned Pijukatel Nov 26, 2024

github-actions bot added this to the 103rd sprint - Tooling team milestone Nov 26, 2024

github-actions bot added the tested label Nov 26, 2024

Reformat after merge.

f994e32

github-actions bot reviewed Nov 26, 2024


Pijukatel added 2 commits November 26, 2024 12:25

Fix CI reported issues on previous Python versions

bbc27af

Update docstrings in child crawlers to not repeat text after parent.

7567164

Pijukatel requested a review from janbuchar November 26, 2024 11:43

Pijukatel marked this pull request as ready for review November 26, 2024 11:43

Pijukatel requested a review from vdusek November 26, 2024 11:45

Revert incorrect docstring update.

9335967

janbuchar requested changes Nov 26, 2024


Pijukatel added 2 commits November 26, 2024 16:03

Review comments

b4877cb

Reverted back name change in doc strings.

2929be1

Pijukatel requested review from vdusek and janbuchar December 3, 2024 09:03

vdusek reviewed Dec 3, 2024


src/crawlee/static_content_crawler/_static_content_crawler.py (outdated)

janbuchar approved these changes Dec 3, 2024


vdusek requested changes Dec 3, 2024


Pijukatel changed the title ~~refactor: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance~~ refactor!: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance Dec 3, 2024

Pijukatel and others added 4 commits December 3, 2024 13:22

Apply suggestions from code review

e7c7817

Co-authored-by: Jan Buchar <[email protected]>

Rename StaticContentCrawler to AbstractContentCrawler and related fil…

05cec1a

…e name changes.

Renaming to AbstractHttpCrawler 2

bed215e

Renaming to AbstractHttpCrawler 2

c43b564

Pijukatel requested a review from vdusek December 3, 2024 12:46

vdusek requested changes Dec 3, 2024


Pijukatel and others added 4 commits December 3, 2024 15:28

Apply suggestions from code review

a1db9e2

Co-authored-by: Vlada Dusek <[email protected]>

Review comments

fae917e

Expand docs by short description of how to create your own HTTPbase c…

b563bf9

…rawler.

Update src/crawlee/abstract_http_crawler/_abstract_http_crawler.py

89a8e83

Co-authored-by: Vlada Dusek <[email protected]>

vdusek reviewed Dec 4, 2024


docs/guides/http_crawlers.mdx (outdated)

docs/guides/http_crawlers.mdx (outdated)

src/crawlee/abstract_http_crawler/_abstract_http_crawler.py

vdusek reviewed Dec 4, 2024


src/crawlee/parsel_crawler/_parsel_crawler.py (outdated)

src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawler.py (outdated)

Pijukatel and others added 3 commits December 4, 2024 14:00

Update src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawler.py

139b21b

Co-authored-by: Vlada Dusek <[email protected]>

Apply suggestions from code review

bd7846f

Co-authored-by: Vlada Dusek <[email protected]>

Review comments

454f9ec

vdusek reviewed Dec 4, 2024


src/crawlee/basic_crawler/_basic_crawler.py (outdated)

Move BlockedInfo to its own file.

6bba552

vdusek approved these changes Dec 5, 2024


Pijukatel merged commit 9d3c269 into master Dec 6, 2024
23 checks passed

Pijukatel deleted the new-class-hier-current-middleware branch December 6, 2024 12:10

Pijukatel mentioned this pull request Dec 10, 2024

docs: Update upgrading guide for renamed BeautifulSoupParser #799

Merged

Pijukatel added a commit that referenced this pull request Dec 10, 2024

docs: Update upgrading guide for renamed BeautifulSoupParser (#799)

13bb400

This should have been part of #746
