-
Notifications
You must be signed in to change notification settings - Fork 334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor!: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance #746
Conversation
UTs working. Generics stretched to limits, probably not worth it to keep BScrawlingcontext
Solved middleware issues.
HttpCrawler made generic. BeautifulSoup and Parsel crwalers inherit from this new generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request is neither linked to an issue or epic nor labeled as adhoc!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request is neither linked to an issue or epic nor labeled as adhoc!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Good job for going through with this. Since this is a pretty critical overhaul though, I was extra pedantic... So please understand this is not bullying 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but
- let's agree on the naming of
StaticContentParser
- wait for @vdusek to also approve this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is breaking because of the BeautifulSoupParser
. Please add the exclamation mark to the PR title.
Co-authored-by: Jan Buchar <[email protected]>
Ahaaa, so thats the reason for exclamation marks in other PRs :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are getting there 🙂. Mostly just docs related changes.
docs/guides/http_crawlers.mdx
Outdated
--- | ||
id: http-crawlers | ||
title: HTTP crawlers | ||
description: Crawlee supports multiple http crawlers that can be used to extract data from server-rendered webpages. | ||
--- | ||
|
||
import ApiLink from '@site/src/components/ApiLink'; | ||
import Tabs from '@theme/Tabs'; | ||
import TabItem from '@theme/TabItem'; | ||
import CodeBlock from '@theme/CodeBlock'; | ||
|
||
Generic class <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> is parent to <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> and it could be used as parent for your crawler with custom content parsing requirements. | ||
|
||
It already includes almost all the functionality to crawl webpages and the only missing part is the parser that should be used to parse HTTP responses, and a context dataclass that defines what context helpers will be available to user handler functions. | ||
|
||
## `BeautifulSoupCrawler` | ||
<ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> uses <ApiLink to="class/BeautifulSoupParser">`BeautifulSoupParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/BeautifulSoupCrawlingContext">`BeautifulSoupCrawlingContext`</ApiLink> in the `.soup` or `.parsed_content` attribute. | ||
|
||
## `ParselCrawler` | ||
<ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> uses <ApiLink to="class/ParselParser">`ParselParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink> in the `.selector` or `.parsed_content` attribute. | ||
|
||
## `HttpCrawler` | ||
<ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> uses <ApiLink to="class/NoParser">`NoParser`</ApiLink> that does not parse the HTTP response at all and is to be used if no parsing is required. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, this is great and definitely better than nothing. However, it is quite short and might not look good when rendered on the page. For comparison, take a look at guides like HTTP Clients or Result Storages. It should aim for similar depth and verbosity, including usage examples.
This should not be a blocker for the merging, as we have been improving the docs all the time. If you decide not-to-update it now, please open a new issue for it. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was scratching my head trying to come up with something for those docs. The problem is, that the only example I can think of, is implementing your own HTTP based Crawler (other examples in other files already show how to crawlee). But such example exists already in our code base and it is BSCrawler and ParselCrawler, so I can just point to those two.
If you think something specific is missing, please let me know and I can do add that.
Co-authored-by: Vlada Dusek <[email protected]>
Co-authored-by: Vlada Dusek <[email protected]>
Co-authored-by: Vlada Dusek <[email protected]>
Co-authored-by: Vlada Dusek <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks. Just please merge it once we are sure v0.4.5 is alright.
This should have been part of #746
…inheritance (apify#746) Reworked http based crawlers inheritance. StaticContentCrawler is parent of BeautifulSoupCrawler, ParselCrawler and HttpCrawler. StaticContentCrawler is generic. Specific versions depend on the type of parser used for parsing http response. **Breaking change:** Renamed BeautifulSoupParser to BeautifulSoupParserType (it is just string literal to properly set BeautiflSoup) BeautifulSoupParser is used for new class that is the parser used by BeautifulSoupCrawler - Closes: [ Reconsider crawler inheritance apify#350 ](apify#350) --------- Co-authored-by: Jan Buchar <[email protected]> Co-authored-by: Vlada Dusek <[email protected]>
Reworked http based crawlers inheritance.
StaticContentCrawler is parent of BeautifulSoupCrawler, ParselCrawler and HttpCrawler.
StaticContentCrawler is generic. Specific versions depend on the type of parser used for parsing http response.
Breaking change:
Renamed BeautifulSoupParser to BeautifulSoupParserType (it is just string literal to properly set BeautiflSoup)
BeautifulSoupParser is used for new class that is the parser used by BeautifulSoupCrawler