-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore invalid jsonld elements on the page source. #189
base: master
Are you sure you want to change the base?
Ignore invalid jsonld elements on the page source. #189
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @naveen17797 , the feature makes sense, I left some minor comments regarding the code, however I also have a more general question - previously, if we had invalid JSON, we'd crash or log it (depending on the error setting). Now, it would be silently ignored. What do you think about still having some logging if our last attempt at parsing the JSON fails, similar to this?
Line 111 in 547c8a0
logger.exception('Failed to extract {}, raises {}' |
extruct/jsonld.py
Outdated
try: | ||
return json.loads(script, strict=False) | ||
except Exception: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor, but to me it would make more sense to return None
here, what do you think? Both None
and False
can be valid results of JSON parsing, but None
looks like a more "neutral" value to use for error case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
# TODO: `strict=False` can be configurable if needed | ||
data = json.loads(script, strict=False) | ||
except ValueError: | ||
# sometimes JSON-decoding errors are due to leading HTML or JavaScript comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not remove this comment, I think it's useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the comment back to the code.
extruct/jsonld.py
Outdated
data = self._may_be_get_json(script) | ||
# check if valid json. | ||
if not data: | ||
script = jstyleson.dispose( HTML_OR_JS_COMMENTLINE.sub('', script)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick - let's remove a an extra space here
script = jstyleson.dispose( HTML_OR_JS_COMMENTLINE.sub('', script)) | |
script = jstyleson.dispose(HTML_OR_JS_COMMENTLINE.sub('', script)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
extruct/jsonld.py
Outdated
data = self._may_be_get_json(script) | ||
# After processing check if json is still valid. | ||
if not data: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick - we don't use return value, as this is a generator, so return
would make more sense
return False | |
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
|
||
<head> | ||
<script type="application/ld+json"> | ||
{ foo: bar} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the test! From what I understand, it would fail previously, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it should ignore this jsonld element since it is not valid.
i have added the log statement
|
i understand if we merge this PR this would remove the ability to stop parsing the page with invalid jsonld elements, previously extruct will raise an exception and fail for jsonld, i am not sure how could i retain this behaviour. may be i can pass errors argument from extract() function to jsonld extractor and determine if we need to raise an exception or just return all valid elements @lopuhin ? |
Is there a timeline on this issue. It would be good to get a fix for this issue. |
This PR alters the behaviour such that If there are invalid jsonld elements with valid elements on the page source, it returns only the valid jsonld elements.