
Process crawled JSON-LD to multiple levels, possibly using another library #4

Open
justinccdev opened this issue Feb 14, 2018 · 3 comments


@justinccdev
Member

justinccdev commented Feb 14, 2018

At the moment, bsbang-crawl does a very hokey top-level scan of the captured JSON-LD. It extracts only a small amount of information: this was written as a proof of concept, and even crawling a small amount is still useful.

However, this will need to become much more sophisticated in the long term, crawling to some arbitrary depth of nested JSON-LD structures. We probably don't want to write this code ourselves (unless it's very easy), but instead use a library such as https://github.com/digitalbazaar/pyld if it has appropriate facilities.

We also need to check that this isn't obviated by Apache Nutch if we switch to that for crawling.
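For what it's worth, an arbitrary-depth crawl could also be done with a plain recursive walk if no library fits. A minimal sketch, assuming plain-dict JSON-LD input; the function and property names here are hypothetical, not bsbang-crawl's actual code:

```python
# Hypothetical sketch: recursively walk a parsed JSON-LD document to an
# arbitrary (bounded) depth, collecting values of a configured set of
# properties wherever they appear. Not bsbang-crawl's actual API.

def walk_jsonld(node, wanted, depth=0, max_depth=10):
    """Collect scalar values of `wanted` keys from arbitrarily nested JSON-LD."""
    found = {}
    if depth > max_depth or not isinstance(node, (dict, list)):
        return found
    if isinstance(node, list):
        for item in node:
            for key, vals in walk_jsonld(item, wanted, depth + 1, max_depth).items():
                found.setdefault(key, []).extend(vals)
        return found
    for key, value in node.items():
        if key in wanted and isinstance(value, (str, int, float)):
            found.setdefault(key, []).append(value)
        # Descend into nested objects and lists regardless of the key.
        for k, vals in walk_jsonld(value, wanted, depth + 1, max_depth).items():
            found.setdefault(k, []).extend(vals)
    return found

doc = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Example dataset",
    "creator": {"@type": "Person", "name": "A. Researcher"},
}
print(walk_jsonld(doc, {"name"}))
# {'name': ['Example dataset', 'A. Researcher']}
```

A library like pyld would additionally handle @context expansion and @id references properly, which this naive walk does not, so it's probably only worth hand-rolling if those features turn out not to matter for our documents.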

@justinccdev
Member Author

Also, https://github.com/RDFLib/rdflib-jsonld may be worth a look

@justinccdev
Member Author

For reference, SolrIndexer_process_configured_properties() is the method I'm talking about. As you can see, it simply looks at basic properties at the top level of the JSON-LD, which was okay for a proof of concept but isn't enough for the long term.
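To illustrate the limitation, top-level-only extraction looks roughly like the following (a hypothetical sketch, not the actual method): any value nested inside a sub-object is silently dropped.

```python
# Hypothetical sketch of top-level-only property extraction, illustrating
# the limitation described above; not bsbang-crawl's actual code.

def extract_top_level(jsonld_doc, wanted):
    """Collect string values for `wanted` keys, looking only at the top level."""
    return {k: v for k, v in jsonld_doc.items()
            if k in wanted and isinstance(v, str)}

doc = {
    "@type": "Dataset",
    "name": "Example dataset",
    # Nested objects are never descended into, so this Person's name is lost.
    "creator": {"@type": "Person", "name": "A. Researcher"},
}
print(extract_top_level(doc, {"name", "creator"}))  # {'name': 'Example dataset'}
```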

@justinccdev
Member Author

Yeah, the primitive nature of the scan is really quite embarrassing now, so I intend to work on this soon, probably before the possible Scrapy/Frontera port of this crawler.
