Conversion process phase: publish
[up](Conversion process phases)
- Installing csv2rdf4lod automation
- Conversion process phase: name
- Conversion process phase: retrieve
- Conversion process phase: csv-ify
- Conversion process phase: create conversion trigger
- Conversion process phase: pull conversion trigger
- ... (rinse and repeat; flavor to taste) ...
- Conversion process phase: tweak enhancement parameters
- Conversion process phase: pull conversion trigger
Pulling the conversion trigger will convert the tabular data in `source/` (or `manual/`) and place the RDF results in the `automatic/` directory. The `automatic/` directory contains output files whose names correspond to the input filenames, so that you can easily find the output file that was derived from a given input file. For example, `automatic/HQI_HOSP.csv.e1.ttl` is derived by converting `source/HQI_HOSP.csv`.
The `automatic/` directory contains all of the converted RDF results. This page discusses what csv2rdf4lod-automation can do to help publish the conversion results in a consistent, self-described form. Publishing with csv2rdf4lod-automation ensures that the converted results end up in the same locations (i.e., dump files and named graphs in a SPARQL endpoint) that were asserted within the dataset's metadata when it was converted.
While it makes sense to choose output filenames so that they correspond to their input filenames (e.g., `HQI_FTNT.csv`, `HQI_HOSP.csv`, and `HQI_HOSP_AHRQ.csv`), it does not make sense to preserve this physical organization when we present our final converted datasets. If we did our job correctly during enhancement, the data from each input file is appropriately connected to the data from the other input files, and this integrated view is the organization that we should present to anyone exploring our collection of results. (For what it's worth, the RDF graphs derived from each input file can be traced back to the data files from which they came by looking at the RDF structure itself.)
The `publish/` directory reorganizes the RDF data from `automatic/` according to the more consistent [source - dataset - version](Directory Conventions) scheme that is central to csv2rdf4lod's design.
When publishing, all files are aggregated into a single VoID dataset. The original file names are less important after they have been transformed to RDF, because the original file groupings are reflected in the VoID dataset descriptions created during conversion; we aren't losing structure when we aggregate. The aggregation file in `publish/` is created from the conversion files in `automatic/` and is named using its source, dataset, and version identifiers. The files in `publish/` are ready for publication, but are not necessarily published yet.
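For a concrete picture, here is a hypothetical sketch of that reorganization, reusing the hospital-compare identifiers and input filenames that appear elsewhere on this page (your directory contents will differ):

```bash
# Per-input-file conversion results (names mirror the input filenames):
ls automatic/
#   HQI_FTNT.csv.e1.ttl  HQI_HOSP.csv.e1.ttl  HQI_HOSP_AHRQ.csv.e1.ttl

# Aggregated results, named by source, dataset, and version identifiers:
ls publish/
#   hub-healthdata-gov-hospital-compare-2012-Jul-17.e1.ttl.gz
#   hub-healthdata-gov-hospital-compare-2012-Jul-17.void.ttl
#   ...
```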
When the converter transforms each tabular file into RDF, it includes metadata about the RDF dataset that it produces. Many existing vocabularies are reused to assert this metadata, including FOAF, DCTerms, VoID, PML, and VANN. Combining the metadata from each conversion provides a bigger picture of how the different parts of the RDF graph are organized. The principal organization is done with VoID, which creates a hierarchy of void:Datasets according to the [source - dataset - version](Directory Conventions) scheme. For example:
    <http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17>
       a void:Dataset, conversion:VersionedDataset;
       void:dataDump
          <http://purl.org/twc/health/source/hub-healthdata-gov/file/hospital-compare/version/2012-Jul-17/conversion/hub-healthdata-gov-hospital-compare-2012-Jul-17.ttl>;
       void:subset
          <http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17/conversion/enhancement/1>;
    .

    <http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17/conversion/enhancement/1>
       a void:Dataset, conversion:Dataset, conversion:LayerDataset;
       void:dataDump
          <http://purl.org/twc/health/source/hub-healthdata-gov/file/hospital-compare/version/2012-Jul-17/conversion/hub-healthdata-gov-hospital-compare-2012-Jul-17.e1.ttl> .
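As a sketch of how this metadata can be used once it is loaded, the generic SPARQL query below asks an endpoint for the dump file of each dataset; the endpoint URL is a placeholder, and this is ordinary curl/SPARQL usage rather than anything specific to csv2rdf4lod:

```bash
# Placeholder endpoint; substitute your own SPARQL endpoint URL.
ENDPOINT="http://example.org/sparql"

QUERY='
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?dump
WHERE { ?dataset void:dataDump ?dump }'

# Ask for CSV results so the answer is easy to read in a terminal.
curl --silent --get "$ENDPOINT" \
     --data-urlencode "query=$QUERY" \
     --header "Accept: text/csv"
```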
Remember to use [cr-vars.sh](Script: cr vars.sh) to see the environment variables that are used to control csv2rdf4lod-automation.
If `CSV2RDF4LOD_PUBLISH` is `"true"`, the conversion trigger will aggregate the output from `automatic/*` into `publish/*` and publish the aggregates in a variety of forms (dump files, endpoint, etc.) according to the current values of the [CSV2RDF4LOD environment variables](Controlling automation using CSV2RDF4LOD_ environment variables).
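A minimal sketch of turning publishing on for a single run; the trigger script name below is hypothetical, so use whatever conversion trigger sits in your conversion cockpit:

```bash
# Aggregate automatic/* into publish/* and publish it on this run.
export CSV2RDF4LOD_PUBLISH="true"

# Pull the conversion trigger; this script name is hypothetical.
./convert-hospital-compare.sh
```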
If you've already converted the data and just want to publish the aggregates in an additional way, the scripts in `publish/bin/*.sh` can be used to bypass the state of the environment variables and just do it. The naming of the scripts follows the pattern `action-source-dataset-version.sh`. `publish/bin/publish.sh` can be used to aggregate and publish according to the environment variables, just like the conversion trigger would do.
The following are the most frequently used (a usage sketch follows these lists):

- `publish/bin/publish.sh`
- `publish/bin/virtuoso-load-SOURCEID-DATASETID-VERSIONID.sh`
- `publish/bin/virtuoso-delete-SOURCEID-DATASETID-VERSIONID.sh`
- `publish/bin/ln-to-www-root-SOURCEID-DATASETID-VERSIONID.sh`

These are less used but still a primary focus:

- `publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID-void.sh`
- `publish/bin/lod-materialize-SOURCEID-DATASETID-VERSIONID.sh`

These haven't been used in a while (we use a Virtuoso endpoint):

- `publish/bin/tdbloader-SOURCEID-DATASETID-VERSIONID.sh`
- `publish/bin/joseki-config-anterior-SOURCEID-DATASETID-VERSIONID.ttl`
- `publish/bin/4store-SOURCEID-DATASETID-VERSIONID.sh`
- `publish/bin/lod-materialize-apache-SOURCEID-DATASETID-VERSIONID.sh`
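For example, reloading a single version into Virtuoso without re-running the conversion might look like the sketch below. The scripts generated in your `publish/bin/` will carry your own source, dataset, and version identifiers; the hospital-compare names here are just an illustration.

```bash
# Re-aggregate and publish according to the current environment variables:
publish/bin/publish.sh

# Or refresh just this version in Virtuoso: delete the previously loaded
# copy, then load the current aggregate.
publish/bin/virtuoso-delete-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh
publish/bin/virtuoso-load-hub-healthdata-gov-hospital-compare-2012-Jul-17.sh
```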
If `CSV2RDF4LOD_PUBLISH` is `"true"`, the conversion trigger will aggregate the output from `automatic/*` into `publish/*`, which results in files named in the form:
    publish/SOURCEID-DATASETID-VERSIONID.ttl.gz
    publish/SOURCEID-DATASETID-VERSIONID.raw.ttl.gz
    publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl
    publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz
    publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl
    publish/SOURCEID-DATASETID-VERSIONID.nt.gz
    publish/SOURCEID-DATASETID-VERSIONID.nt.graph
    publish/SOURCEID-DATASETID-VERSIONID.void.ttl
    publish/SOURCEID-DATASETID-VERSIONID.pml.ttl
(The code that aggregates from `automatic/` to `publish/` is here.)
- `publish/SOURCEID-DATASETID-VERSIONID.nt.graph` - This contains one line with the URI of the dataset version, which is useful when loading into a named graph (see the sketch after this list).
  - The same URI can be obtained by running cr-dataset-uri.sh from the conversion cockpit.
- `publish/SOURCEID-DATASETID-VERSIONID.ttl.gz` - This is all of the dataset in Turtle syntax, gzipped.
  - Dump files will be compressed if `CSV2RDF4LOD_PUBLISH_COMPRESS="true"`.
- `publish/SOURCEID-DATASETID-VERSIONID.raw.ttl.gz` - This is only the raw layer in Turtle syntax, gzipped.
- `publish/SOURCEID-DATASETID-VERSIONID.raw.sample.ttl` - This is only a sample of the raw layer in Turtle syntax.
- `publish/SOURCEID-DATASETID-VERSIONID.e1.ttl.gz` - This is only the enhancement 1 layer in Turtle syntax, gzipped.
- `publish/SOURCEID-DATASETID-VERSIONID.e1.sample.ttl` - This is only a sample of the enhancement 1 layer in Turtle syntax.
- `publish/SOURCEID-DATASETID-VERSIONID.nt.gz` - This is all of the dataset in N-TRIPLES syntax, gzipped.
  - Only produced if `CSV2RDF4LOD_PUBLISH_NT="true"`.
- `publish/SOURCEID-DATASETID-VERSIONID.void.ttl` - This is all metadata, including DC, VoID, and PML.
  - Would be more appropriately named `publish/SOURCEID-DATASETID-VERSIONID.meta.ttl`.
- `publish/SOURCEID-DATASETID-VERSIONID.pml.ttl` - This is all provenance-related metadata, including PML, OPM, Provenir, etc.
  - Would be more appropriately named `publish/SOURCEID-DATASETID-VERSIONID.provenance.ttl`.
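As a sketch (again using the hospital-compare identifiers from earlier on this page), the `.nt.graph` file can supply the graph name to whatever loader you use:

```bash
# The aggregate dump and the named graph it should be loaded into:
DUMP="publish/hub-healthdata-gov-hospital-compare-2012-Jul-17.nt.gz"
GRAPH=$(cat publish/hub-healthdata-gov-hospital-compare-2012-Jul-17.nt.graph)

echo "Load $DUMP into named graph <$GRAPH>"
```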
The `publish/` directory in the conversion cockpit contains files ready to be released into the wild. Some options for what to do with it:
- Generic: Publishing conversion results with a Virtuoso triplestore
- Use Case: Publishing LOGD's International Open Government Data Search data
Environment variables that affect publishing:
- `CSV2RDF4LOD_PUBLISH=true`
- `CSV2RDF4LOD_PUBLISH_DELAY_UNTIL_ENHANCED` will prevent publishing if the dataset has not been enhanced.
  - If you want to publish only enhanced datasets, set it to `true`.
  - If you want to publish un-enhanced datasets, set it to `false`.
- `CSV2RDF4LOD_PUBLISH_FULL_CONVERSIONS` will load only the sample files if `false` and will load the entire dataset if `true`.
- pvload.sh demands that the file it loads be a remote URL (for provenance reasons). So, `CSV2RDF4LOD_PUBLISH_VARWWW_ROOT` and `CSV2RDF4LOD_PUBLISH_VARWWW_DUMP_FILES` must be set to a path and `true`, respectively.
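A sketch of what these settings might look like in the shell environment; the values shown are examples, not defaults:

```bash
export CSV2RDF4LOD_PUBLISH="true"

# Hold off publishing until the dataset has been enhanced:
export CSV2RDF4LOD_PUBLISH_DELAY_UNTIL_ENHANCED="true"

# Whether to load entire datasets or only the sample files:
export CSV2RDF4LOD_PUBLISH_FULL_CONVERSIONS="true"

# pvload.sh needs the dump files exposed at a URL under the web root:
export CSV2RDF4LOD_PUBLISH_VARWWW_ROOT="/var/www"
export CSV2RDF4LOD_PUBLISH_VARWWW_DUMP_FILES="true"
```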
csv2rdf4lod-automation primarily uses Virtuoso. The environment variables that it needs to publish into a Virtuoso triple store are:
- `CSV2RDF4LOD_PUBLISH_VIRTUOSO` needs to be `true`.
- v-isql needs to be on your PATH (or `CSV2RDF4LOD_PUBLISH_VIRTUOSO_ISQL_PATH` needs to be set).
- The v-isql parameters `CSV2RDF4LOD_PUBLISH_VIRTUOSO_PORT`, `CSV2RDF4LOD_PUBLISH_VIRTUOSO_USERNAME`, `CSV2RDF4LOD_PUBLISH_VIRTUOSO_PASSWORD`, and `CSV2RDF4LOD_PUBLISH_VIRTUOSO_SPARQL_ENDPOINT` need to be set. See details at Publishing conversion results with a Virtuoso triplestore.
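A sketch of a Virtuoso configuration; the path, port, credentials, and endpoint URL below are placeholders to adapt to your own installation:

```bash
export CSV2RDF4LOD_PUBLISH_VIRTUOSO="true"

# Only needed if v-isql is not already on your PATH (placeholder path):
export CSV2RDF4LOD_PUBLISH_VIRTUOSO_ISQL_PATH="/usr/local/virtuoso/bin/isql-v"

# Connection parameters (placeholder values):
export CSV2RDF4LOD_PUBLISH_VIRTUOSO_PORT="1111"
export CSV2RDF4LOD_PUBLISH_VIRTUOSO_USERNAME="dba"
export CSV2RDF4LOD_PUBLISH_VIRTUOSO_PASSWORD="dba"
export CSV2RDF4LOD_PUBLISH_VIRTUOSO_SPARQL_ENDPOINT="http://localhost:8890/sparql"
```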
We've run into a few situations where some third parties are OK with us having the data and hosting it, but not having it listed in our data catalog (security through obscurity).
To prevent having the dataset's metadata included in the metadata named graph (from which the dataset catalog is created), invoke:
    mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST
Next time `$CSV2RDF4LOD_HOME/bin/convert-aggregate.sh` reproduces the VoID file, it will see that the `.DO_NOT_LIST` file is present and will rename the new file to `.DO_NOT_LIST`.
This works because `$CSV2RDF4LOD_HOME/bin/cr-publish-void-to-endpoint.sh` looks for files matching `*/version/*/publish -name "*void.ttl"`.
To let the metadata flow, just move it back:

    mv publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl.DO_NOT_LIST publish/nitrd-gov-fedRDIT-2011-Jan-27.void.ttl
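Since the VoID files are selected with that find pattern, a quick way to check which datasets will (and will not) be cataloged is a sketch like the following, run from wherever those relative paths match in your data root layout:

```bash
# VoID files that will be picked up for the metadata named graph:
find */version/*/publish -name "*void.ttl"

# VoID files currently hidden from the catalog:
find */version/*/publish -name "*void.ttl.DO_NOT_LIST"
```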
When `CSV2RDF4LOD_CONVERT_DUMP_FILE_EXTENSIONS` is `cr:auto`, csv2rdf4lod-automation determines the correct value to pass to the converter using dump-file-extensions.sh to embody the logic (which is based on `CSV2RDF4LOD_PUBLISH_COMPRESS`, `CSV2RDF4LOD_PUBLISH_RDFXML`, and `CSV2RDF4LOD_PUBLISH_NT`).
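A minimal, purely illustrative sketch of the kind of logic dump-file-extensions.sh embodies; this is not the actual script, and the exact mapping from variables to extensions is an assumption:

```bash
#!/bin/bash
# Illustration only: derive a dump-file extension list from the
# publishing environment variables.

extensions="ttl"
[ "$CSV2RDF4LOD_PUBLISH_RDFXML" == "true" ] && extensions="$extensions rdf"
[ "$CSV2RDF4LOD_PUBLISH_NT"     == "true" ] && extensions="$extensions nt"

# If compression is requested, the dump files carry a .gz suffix.
if [ "$CSV2RDF4LOD_PUBLISH_COMPRESS" == "true" ]; then
   gzipped=""
   for extension in $extensions; do
      gzipped="$gzipped $extension.gz"
   done
   extensions=$gzipped
fi

echo $extensions
```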
The csv2rdf4lod Java converter accepts the following arguments that are related to file extensions:

- `-VoIDDumpExtensions` / `-vde` gets its values from dump-file-extensions.sh.
  - The void:dataDump links to the void files themselves do not respond to this parameter. More to look at here...
- `-outputExtension` / `-oe` does not appear to be given to the converter - was it for the extension of the data dump?
- The aggregation in bin/convert-aggregate.sh is deprecated in favor of bin/aggregate-source-rdf.sh. While the "raw" and "enhancement" layer aggregation logic makes sense to keep, the creation of the full union should be replaced by bin/aggregate-source-rdf.sh.
- bin/util/cr-full-dump.sh does a quick link to /var/www, but should now be handled by an update to bin/aggregate-source-rdf.sh
- bin/cr-ln-to-www-root.sh will generalize and replace the cockpit-specific scripts.
- cr-publish.sh publishes any kind of file, either as aggregate RDF or linked into htdocs.
- cr-ln-to-www-root.sh
- aggregate-source-rdf.sh provides a consistent way to "concatenate" the given RDF files (no matter what format) and publish using the URL conventions.
- Aggregating subsets of converted datasets to publish the metadata before all of the data.
Use it!
- Follow linked data
- Grab a dump file off of the web
- Query your SPARQL endpoint
Review:
- Follow through A quick and easy conversion
- Remember the Conversion process phases
- Check out Real world examples