Crawling pages where json-ld is inserted via Javascript #7
Comments
Can you give an example of a JS-rendered page? Maybe I can try Selenium first, or state-of-the-art headless Chrome may be a good idea.
If these pages get their JSON-LD from an Ajax call, we can probably figure out the Ajax API and save ourselves the trouble by calling the API directly :)
Hi @HaoPatrick. I don't know of any examples of people doing this yet, though I might find out more at a meeting in the middle of this month if this is going to be an important near-term consideration. This is more of a placeholder issue. However, regarding calling an API directly, I would not want to do this, since you would then need to adapt your logic to each website's own way of doing it. The important thing about Bioschemas is that it is a generic mechanism for getting structured information.
Oh, ok. I'll investigate crawling packages over the next few days and may come up with a new crawler that handles most of these cases.
I refactored the extractors a little so that they can crawl JS-rendered websites now; here is my branch.
Maybe you could open this as a PR that we won't merge, so we can look at it? I'm too lazy to do this manually. Thanks!
We can't do this with a simple HTTP fetch. This is another case where perhaps we really have to use Nutch, though I haven't looked yet to see whether it has that kind of facility (I expect it does). Need to research.
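To illustrate why a plain fetch is not enough, here is a minimal sketch using only the Python standard library. It is not the project's extractor: the names `JsonLdExtractor`, `extract_jsonld`, `STATIC_HTML` and `RENDERED_HTML` are hypothetical, and both HTML strings are made-up stand-ins. The first represents what a simple HTTP fetch would return for a page that builds its JSON-LD in JavaScript; the second represents the DOM after a browser has run that script (for example, what a headless browser would hand back once the page has rendered).

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Only script tags explicitly typed as JSON-LD are of interest.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks rather than abort the crawl.


def extract_jsonld(html):
    """Return a list of JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page as returned by a plain HTTP fetch: the JSON-LD only
# exists as JavaScript source that has not been executed yet.
STATIC_HTML = """<html><head><script>
var s = document.createElement('script');
s.type = 'application/ld+json';
s.text = JSON.stringify({"@type": "Dataset", "name": "Example"});
document.head.appendChild(s);
</script></head><body></body></html>"""

# The same hypothetical page after a browser has run the script.
RENDERED_HTML = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Dataset", "name": "Example"}'
    "</script></head><body></body></html>"
)

if __name__ == "__main__":
    print(extract_jsonld(STATIC_HTML))
    print(extract_jsonld(RENDERED_HTML))
```

The extraction step stays generic across sites, which fits the point made about Bioschemas in the comments; only the step that produces the HTML string needs a JavaScript-capable renderer.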