CSV2RDF4LOD_CONVERT_EXAMPLE_SUBSET_ONLY

timrdf edited this page Dec 13, 2012 · 38 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

[up](CSV2RDF4LOD environment variables)

This page describes a few shell environment variables that can be set to reduce conversion time while developing enhancement parameters. When development is finished, restore them to their original values so that complete conversion results are published.

  • CSV2RDF4LOD_CONVERT_SAMPLE_NUMBER_OF_ROWS specifies the number of rows to process for the sample conversion.
  • CSV2RDF4LOD_CONVERT_SAMPLE_SUBSET_ONLY can specify that only sample conversions (the first N rows) should be performed.
  • CSV2RDF4LOD_CONVERT_EXAMPLE_SUBSET_ONLY can specify that only the hand-selected rows (annotated with conversion:exampleResource) should be processed during conversion (see Samples versus Examples).
  • CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER can omit converting the raw layer and create the enhancement layers only.
  • CSV2RDF4LOD_PUBLISH can prevent the aggregation and publishing steps, which are unnecessary while developing enhancements.

Background (CSV2RDF4LOD_CONVERT_SAMPLE_NUMBER_OF_ROWS)

Sample subsets are created every time a conversion is performed. These samples are helpful when exploring or prototyping large datasets.

The conversion output files named with .sample contain a subset of their larger counterparts:

  • automatic/menu.csv.raw.sample.ttl
  • automatic/menu.csv.raw.ttl
  • automatic/menu.csv.e1.sample.ttl
  • automatic/menu.csv.e1.ttl

The files above get aggregated into files appropriate for publishing:

  • publish/dpdoughtroy-com-menu-2011-Apr-22.raw.sample.ttl
  • publish/dpdoughtroy-com-menu-2011-Apr-22.raw.ttl
  • publish/dpdoughtroy-com-menu-2011-Apr-22.e1.sample.ttl
  • publish/dpdoughtroy-com-menu-2011-Apr-22.e1.ttl

The size of the sample is controlled by specifying the number of data rows to process with the CSV2RDF4LOD_CONVERT_SAMPLE_NUMBER_OF_ROWS shell environment variable, whose value can be seen with cr-vars.sh:

bash-3.2$ cr-vars.sh 
--
CSV2RDF4LOD_HOME                                         /Users/timrdf/csv2rdf4lod
...
...
CSV2RDF4LOD_CONVERT_SAMPLE_NUMBER_OF_ROWS                   2
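
As a minimal sketch, the sample size can be changed in the current shell before rerunning a conversion. The value 10 below is only an illustration, not a recommended setting:

```shell
# Illustrative value (an assumption); choose a sample size that suits the dataset.
export CSV2RDF4LOD_CONVERT_SAMPLE_NUMBER_OF_ROWS=10

# Confirm the value that child processes (such as a conversion script) will inherit.
echo "sample rows: $CSV2RDF4LOD_CONVERT_SAMPLE_NUMBER_OF_ROWS"
```

Running cr-vars.sh afterwards should report the new value in the same way as the listing above.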

Converting only the sample subset (i.e., preventing conversion of the full dataset) with CSV2RDF4LOD_CONVERT_SAMPLE_SUBSET_ONLY

When developing enhancement parameters for a large dataset, it is helpful to avoid converting the full dataset, because only a portion of the output will be inspected before updating the parameters and rerunning the conversion. Since the sample subset is always converted anyway, we can simply specify NOT to convert the full dataset using the CSV2RDF4LOD_CONVERT_SAMPLE_SUBSET_ONLY shell environment variable.

bash-3.2$ export CSV2RDF4LOD_CONVERT_SAMPLE_SUBSET_ONLY=true

The effect can be seen with cr-vars.sh:

bash-3.2$ cr-vars.sh 
--
CSV2RDF4LOD_HOME                                         /Users/timrdf/csv2rdf4lod
...
...
CSV2RDF4LOD_CONVERT_SAMPLE_SUBSET_ONLY                   true

and in automatic/ when running the conversions (the full conversion output file automatic/menu.csv.e1.ttl is not created):

automatic/menu.csv.raw.params.ttl
automatic/menu.csv.raw.sample.ttl
automatic/menu.csv.raw.void.ttl
automatic/menu.csv.raw.ttl
automatic/menu.csv.e1.sample.ttl
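
To check quickly which full (non-sample) outputs a conversion left behind, the listing can be filtered by file name. The helper below is a hypothetical sketch, not part of csv2rdf4lod-automation. It assumes that the only non-full outputs are the .sample.ttl, .params.ttl, and .void.ttl files shown above:

```shell
# Hypothetical helper: list full conversion outputs in a directory
# (defaults to automatic/), skipping sample, params, and void files.
list_full_outputs() {
  dir="${1:-automatic}"
  for f in "$dir"/*.ttl; do
    [ -e "$f" ] || continue                      # the glob matched nothing
    case "$f" in
      *.sample.ttl|*.params.ttl|*.void.ttl) ;;   # not a full conversion output
      *) echo "$f" ;;
    esac
  done
}
```

For the listing above, list_full_outputs automatic would print only automatic/menu.csv.raw.ttl, confirming that no full e1 output was produced.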

Omitting the raw layer with CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER

If time and care have been spent creating useful enhancement parameters, the raw layer is likely to be of little use. In that case, it can be omitted using the CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER shell environment variable:

bash-3.2$ export CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER="true"

The effect can be seen with cr-vars.sh:

bash-3.2$ cr-vars.sh 
--
CSV2RDF4LOD_HOME                                         /Users/timrdf/csv2rdf4lod
...
...
--
CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER                       true

and in automatic/ when running the conversions (the raw conversion output file automatic/menu.csv.raw.ttl is not created):

menu.csv.raw.params.ttl
menu.csv.e1.sample.ttl
menu.csv.e1.void.ttl
menu.csv.e1.ttl

Preventing publishing while developing enhancement parameters (CSV2RDF4LOD_PUBLISH)

Output like:

convert-aggregate.sh publishing raw and enhancements.
publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.raw.ttl
publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.raw.sample.ttl
publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.e1.ttl
publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.e1.sample.ttl
publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.ttl
  (including publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.e1.ttl)
  (including publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.raw.ttl)
publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.nt - skipping;
publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.pml.ttl
publish/m-scott-marshall-biobanking-metadata-2011-Apr-26.void.ttl
  (including automatic/Biobank.xls.csv.e1.void.ttl)
  (including automatic/Biobank.xls.csv.raw.void.ttl)
  (including automatic/BiobankCategory.xls.csv.e1.void.ttl)
  (including automatic/BiobankCategory.xls.csv.raw.void.ttl)
  (including automatic/BiobankDataType.xls.csv.e1.void.ttl)
  (including automatic/BiobankDataType.xls.csv.raw.void.ttl)
  (including automatic/BiobankPanel.xls.csv.e1.void.ttl)
  (including automatic/BiobankPanel.xls.csv.raw.void.ttl)
  (including automatic/BiobankTopic.xls.csv.e1.void.ttl)
  (including automatic/BiobankTopic.xls.csv.raw.void.ttl)
  (including automatic/Institute.xls.csv.e1.void.ttl)
  (including automatic/Institute.xls.csv.raw.void.ttl)
  (including automatic/Person.xls.csv.e1.void.ttl)
  (including automatic/Person.xls.csv.raw.void.ttl)

indicates that publishing is enabled, which aggregates everything from automatic/ into publish/, links the files to the web server, and loads them into the SPARQL endpoint. None of that is needed if you're repeating conversions while developing enhancement parameters. Publishing can be avoided using the CSV2RDF4LOD_PUBLISH shell environment variable:

bash-3.2$ export CSV2RDF4LOD_PUBLISH=false
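
One way to avoid forgetting to restore these variables afterwards is to set them only inside a subshell, so that the parent shell keeps its original values. The wrapper below is a hypothetical sketch (with_dev_settings is not part of csv2rdf4lod-automation), and it assumes the conversion script reads these variables from its environment:

```shell
# Hypothetical wrapper: run a command with development-time settings only.
# The parentheses start a subshell, so the exports do not leak into the caller.
with_dev_settings() {
  (
    export CSV2RDF4LOD_CONVERT_SAMPLE_SUBSET_ONLY=true   # sample rows only
    export CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER=true       # skip the raw layer
    export CSV2RDF4LOD_PUBLISH=false                     # skip aggregation and publishing
    "$@"
  )
}

# Show the settings a wrapped command would see.
with_dev_settings env | grep '^CSV2RDF4LOD_' | sort
```

A conversion would then be run as with_dev_settings followed by the usual conversion command. Once the enhancement parameters are settled, running that command without the wrapper produces and publishes the complete results.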

See also
