Named graph organization
As different people have used csv2rdf4lod-automation, two complementary approaches to naming and populating named graphs have surfaced:
- Source-based named graphs
- Content-based named graphs
Employing either strategy will affect how quickly and easily we can find the data we want.
csv2rdf4lod-automation provides a source-based organization for the named graphs it creates, meaning that the graphs are named according to the same 3-attribute (source, dataset, version) scheme that it uses to name datasets collected from source organizations such as the White House or the EPA. Because these three aspects are ubiquitous, they naturally identify and distinguish the data we collect, making it easy to name and find. Naming datasets using the 3-attribute scheme is important because it requires the least amount of background, experience, or personal interpretation. The questions "Who provided the data?", "What do/did they call the dataset?", and "When did you get their dataset? (or, did they" have relatively direct answers that lead to a natural and consistent name for the data that we have or want.
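As a rough illustration of how the three attributes can determine a graph name, here is a minimal sketch. The `/source/.../dataset/.../version/...` layout, the function name `source_graph_name`, and the example base URI and identifiers are assumptions made for illustration, not a statement of what csv2rdf4lod-automation itself emits.

```python
from urllib.parse import quote


def source_graph_name(base_uri: str, source: str, dataset: str, version: str) -> str:
    """Build a source-based graph name from the three attributes.

    The path layout used here is an assumed convention for illustration.
    """
    attributes = {"source": source, "dataset": dataset, "version": version}
    for label, value in attributes.items():
        # Each attribute must be a single, non-empty path segment.
        if not value or "/" in value:
            raise ValueError(f"{label} identifier must be non-empty and contain no '/'")
    return "{}/source/{}/dataset/{}/version/{}".format(
        base_uri.rstrip("/"),
        quote(source, safe=""),
        quote(dataset, safe=""),
        quote(version, safe=""),
    )


# Hypothetical identifiers: who provided it, what they call it, when we got it.
name = source_graph_name("http://example.org", "epa-gov", "air-quality", "2011-Jan-01")
print(name)
```

Because the name is a pure function of the three answers, two people who agree on the answers arrive at the same graph name without further coordination.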
We do this to minimize uncertainty about where a dataset is: answering three rigid questions leads to its name and location. However, many consumers "do not care" where the data came from; for most of them, this is a secondary consideration. Content-based organization is better suited to specific applications and use cases. One concern with content-based organization is the multitude of domain-specific and individual perspectives that can be applied to how the content "should" be organized. Instead of asking and answering just three questions, content-based organization could require an inordinate number of questions to determine a graph's name and where to find it.
Fortunately, starting with a source-based organization can provide a solid foundation for increasing (and changing) content-based organization needs. Arbitrary graph naming schemes can be used, and their graphs populated, in either of two ways:
- pvloading entire dump files from source-based datasets
- pvloading queries that draw from source-based datasets into the content-based graphs
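The second approach can be sketched as a SPARQL `CONSTRUCT` query whose `FROM` clauses name the source-based graphs to draw from; the resulting triples would then be loaded into a content-based graph. The helper below only builds the query text. The function name `construct_from_sources`, the graph URIs, and the vocabulary term are hypothetical, and how the query is handed to pvload is not shown here.

```python
from typing import Iterable


def construct_from_sources(source_graphs: Iterable[str], pattern: str) -> str:
    """Return a SPARQL CONSTRUCT query that copies triples matching
    `pattern` out of the given source-based named graphs."""
    graphs = list(source_graphs)
    if not graphs:
        raise ValueError("at least one source graph is required")
    for graph in graphs:
        # Guard against URIs that would break the <...> IRI syntax.
        if any(ch in graph for ch in "<> \n"):
            raise ValueError(f"invalid graph URI: {graph!r}")
    from_clauses = "\n".join(f"FROM <{graph}>" for graph in graphs)
    # The same pattern serves as both the template and the WHERE clause.
    return f"CONSTRUCT {{ {pattern} }}\n{from_clauses}\nWHERE {{ {pattern} }}"


# Hypothetical source-based graphs and a hypothetical class of interest.
query = construct_from_sources(
    [
        "http://example.org/source/epa-gov/dataset/air-quality/version/2011-Jan-01",
        "http://example.org/source/epa-gov/dataset/facilities/version/2011-Feb-15",
    ],
    "?s a <http://example.org/vocab/Facility> ; ?p ?o .",
)
print(query)
```

Keeping the list of source graphs explicit in the query is one simple way to retain a record of which source-based datasets a content-based graph was drawn from.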
The advantage of starting with source-based organization for the graphs in a triple store is that it provides consistency. The advantage of creating content-based graphs within the same triple store is that data consumers can access their interesting data faster and with less distraction. The advantage of constructing content-based graphs drawn from source-based graphs is that we maintain the provenance required to trace all the way back to the original source.