At the moment, bsbang-crawl does a very hokey top-level crawl of the captured JSON-LD. This captures only a small amount of information, mainly because it was written as a proof of concept, and even crawling a small amount is still useful.
However, this will need to become much more sophisticated in the long term, crawling nested JSON-LD structures to some arbitrary depth. We probably don't want to write this code ourselves (unless it's very easy) but rather use a library such as https://github.com/digitalbazaar/pyld, if it has appropriate facilities.
We also need to check that this isn't obviated by Apache Nutch if we switch to that for crawling.
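To give a feel for what crawling to an arbitrary depth might involve, here is a minimal sketch in plain Python. It is not the existing bsbang-crawl code and not pyld's API: the function name `walk_jsonld`, the `max_depth` parameter, and the sample document are all made up for illustration. It only walks the parsed JSON and does no JSON-LD processing (no `@context` resolution or expansion), which is the part a library would likely be needed for.

```python
from typing import Any, Iterator, Tuple

Path = Tuple[str, ...]


def walk_jsonld(node: Any, path: Path = (), max_depth: int = 10) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, value) for every scalar found in a nested JSON structure.

    Hypothetical helper: 'path' is the tuple of dict keys leading to the value.
    Anything nested deeper than max_depth keys is skipped.
    """
    if len(path) > max_depth:
        return
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk_jsonld(value, path + (key,), max_depth)
    elif isinstance(node, list):
        # List items share the path of the list itself.
        for item in node:
            yield from walk_jsonld(item, path, max_depth)
    else:
        yield path, node


# Made-up sample document, only for demonstration.
sample = {
    "@type": "DataCatalog",
    "name": "Example catalog",
    "provider": {"@type": "Organization", "name": "Example Org"},
    "keywords": ["genomics", "proteomics"],
}

for path, value in walk_jsonld(sample):
    print(".".join(path), "=", value)
```

A top-level-only scan corresponds roughly to calling this with `max_depth=1`, which would skip `provider.name`. How nested paths should map onto index fields is left open here.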
For reference, SolrIndexer_process_configured_properties() is the method I'm talking about. As you can see, it simply looks at basic properties at the top level of the JSON-LD, which was okay for a proof of concept but is not enough for the long term.
Yeah, the primitive nature of the scan is really quite embarrassing now, so I intend to work on this soon, probably before the possible Scrapy/Frontera port of this crawler.