-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new fallback for extracting json #144
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #144 +/- ##
==========================================
- Coverage 89.15% 88.60% -0.56%
==========================================
Files 12 12
Lines 535 553 +18
Branches 121 124 +3
==========================================
+ Hits 477 490 +13
- Misses 52 56 +4
- Partials 6 7 +1
Continue to review full report at Codecov.
|
Could you cover your changes with tests? |
@Gallaecio I believe all the tests are still passing? The new helper func |
The point of tests is to avoid regressions. So, ideally, the changes should include a test that “proves” that your changes work. It’s quite important for long-term maintenance, and I think specifically in this case adding tests makes a lot of sense. For instance, it makes sure that in the future we do not break support for the scenario that you are fixing here while trying to support a different scenario. And we are bond to see such change attempts, people can be very creative in the way they break JSON 🙂 |
lgtm |
I don't know if you are willing to merge this pull request as it seem to be here for around 4 month now, but this solve the issue that the current HTML parser cause with { I am pretty new to Python, so, I am not sure on how to implement the test on this language yet. If you can create and merge this pull-request it would be awesome. |
Thanks for feedback @kennylajara , although the PR still needs more work IMO, it's great to know that it resolves problems for you so that we know it's important. It looks like the problem you encountered is different from the original issue, could you please share the URL where you have seen such markup? |
Sure @lopuhin . You can see an example here: EDIT: I just noticed that the problem with the markup of the previous URL is not the tilde or the acute (the current code seems to work OK with other URLs that contain a similar markup). The real problem is that the JSON+LD markup of that URL is broken due to non-escaped quotes in the news article description. But, at least, this code allows the scraper to ignore the broken element and extract the valid markup instead of just crashing. The alternative solution is to wrap Extruct inside of a |
Currently, this can be partly achieved with |
Got it, you are right. Is there any other documentation? I don't see those options on the I just wanna add one more comment:
Yes, almost... |
Only here for now: Lines 17 to 43 in c1e4803
Right 👍 |
Fixes #143