Crawling pages where json-ld is inserted via Javascript #7

justinccdev · 2018-02-14T18:20:51Z

We can't do this with a simple HTTP fetch. This may be another case where we really have to use Nutch, though I haven't looked yet to see if it has that kind of facility (I expect it has). Need to research.

XiangpengHao · 2018-02-23T21:52:09Z

Can you give an example of a JS-rendered page? Maybe I can try Selenium first... or state-of-the-art headless Chrome may be a good idea.

XiangpengHao · 2018-02-25T20:00:28Z

If these pages get their JSON-LD from an Ajax call, we can probably figure out the Ajax API and save ourselves a lot of trouble by calling the API directly :)
The worst case is that they simply insert everything from JavaScript; then we need to render the website on the local machine, which is messy and slow.
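
For illustration, here is a minimal sketch of the render-locally approach and of why a plain HTTP fetch misses this kind of JSON-LD. It assumes Selenium 4 with a local Chrome and chromedriver; `extract_jsonld`, `render_with_headless_chrome`, and the two sample HTML strings are hypothetical names made up for this example, not code from this repository.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._chunks))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return the parsed JSON-LD documents found in an HTML string."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    results = []
    for block in parser.blocks:
        try:
            results.append(json.loads(block))
        except ValueError:
            continue  # skip blocks that are not valid JSON
    return results


def render_with_headless_chrome(url, timeout=10):
    """Load a page in headless Chrome and return the DOM after scripts have run.

    Assumption: Selenium 4 and chromedriver are installed. Raises Selenium's
    TimeoutException if no JSON-LD script element appears within `timeout` seconds.
    """
    # Imported lazily so the extraction code above works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, 'script[type="application/ld+json"]')
            )
        )
        return driver.page_source
    finally:
        driver.quit()


# What a plain HTTP fetch sees: only the script that will insert the JSON-LD later.
RAW_HTML = """<html><head><script>
var s = document.createElement("script");
s.type = "application/ld+json";
s.text = JSON.stringify({"@type": "DataCatalog"});
document.head.appendChild(s);
</script></head><body></body></html>"""

# What the DOM could look like after a browser has executed that script.
RENDERED_HTML = """<html><head>
<script type="application/ld+json">{"@context": "http://schema.org", "@type": "DataCatalog", "name": "Example"}</script>
</head><body></body></html>"""
```

With these samples, `extract_jsonld(RAW_HTML)` finds nothing, while `extract_jsonld(RENDERED_HTML)` returns the one embedded document. In a real crawl the rendered string would come from something like `render_with_headless_chrome(url)`, which is the slow part.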

justinccdev · 2018-02-26T15:28:31Z

Hi @HaoPatrick. I don't know of any examples of people doing this yet, though I might find out more at a meeting in the middle of this month if this is going to be an important near-term consideration. This is more a placeholder issue.

However, regarding calling an API directly, I would not want to do this since then you need to adapt your logic to each website's own way of doing this. The important thing about Bioschemas is that it is a generic mechanism for getting structured information.

XiangpengHao · 2018-02-26T17:09:08Z

Oh, OK. I'll look into crawling packages over the next few days and may come up with a new crawler that handles most of these cases.

XiangpengHao · 2018-03-04T09:38:14Z

I refactored the extractors a little so that they can crawl JS-rendered websites now; here is my branch.
The next step is to move the crawling logic to async to improve performance.
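
One possible shape for the async step, as a rough sketch only: it assumes the render-and-extract work stays blocking and is pushed onto a thread pool, with a semaphore capping how many pages are processed at once. `fetch_and_extract` is a hypothetical stand-in for that blocking step and `crawl` is an illustrative name; neither is taken from the branch mentioned above.

```python
import asyncio


def fetch_and_extract(url):
    """Hypothetical stand-in for the blocking render-and-extract step."""
    return {"url": url, "jsonld": []}


async def crawl(urls, max_concurrency=4, worker=fetch_and_extract):
    """Run `worker` for every URL, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)
    loop = asyncio.get_running_loop()

    async def crawl_one(url):
        async with semaphore:
            # Run the blocking worker in the default thread pool executor.
            return await loop.run_in_executor(None, worker, url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(crawl_one(u) for u in urls))
```

It could be driven with `asyncio.run(crawl(list_of_urls))`. Keeping the worker as a parameter means the same loop works whether a page needs a headless browser or only a plain fetch.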

justinccdev · 2018-03-08T17:16:47Z

Maybe you could create this as a PR that we won't merge, so we can look at it? I'm too lazy to do this manually. Thanks!

XiangpengHao mentioned this issue Feb 25, 2018

Look at replacing most of crawler with an external crawling package #5

Open

XiangpengHao mentioned this issue Mar 8, 2018

Crawling JavaScript inserted json-ld #17

Open
