cube package #202
New changeset, marking a major release for both packages:

```markdown
---
"barnard59-cube": major
"barnard59-rdf": major
---

Move cube operations from rdf package to the new cube package
```
# barnard59-cube

This package provides operations and commands for RDF cubes in Barnard59 Linked Data pipelines.
The `manifest.ttl` file contains a full list of all operations included in this package.

## Operations

### `cube/buildCubeShape`

TBD

### `cube/toObservation`

TBD

## Commands

## Cube validation

`cube-validation.ttl` contains pipelines to retrieve and validate cube observations and their constraints.
### fetch cube constraint

Pipeline `fetch-cube-constraint` queries a given SPARQL endpoint (default is https://lindas.admin.ch/query) to retrieve a [concise bounded description](https://docs.stardog.com/query-stardog/#describe-queries) of the `cube:Constraint` part of a given cube.

```bash
npx barnard59 run ./pipeline/cube-validation.ttl \
  --pipeline http://barnard59.zazuko.com/pipeline/cube-validation/fetch-cube-constraint \
  --variable cube=https://agriculture.ld.admin.ch/agroscope/PRIFm8t15/2 \
  --variable endpoint=https://int.lindas.admin.ch/query
```

Taking advantage of [package-specific commands](https://data-centric.zazuko.com/docs/workflows/reference/cli/#package-specific-commands), we can express the same as:
```bash
npx barnard59 cube fetch-constraint \
  --cube https://agriculture.ld.admin.ch/agroscope/PRIFm8t15/2 \
  --endpoint https://int.lindas.admin.ch/query
```

This pipeline is mainly useful for cubes published with [cube creator](https://github.com/zazuko/cube-creator); if the cube definition is manually crafted, it is likely already available as a local file.
### check cube constraint

Pipeline `check-cube-constraint` validates the input constraint against the shapes provided with the `profile` variable (the default profile is https://cube.link/latest/shape/standalone-constraint-constraint).

The pipeline reads the constraint from `stdin`, allowing input from a local file (as in the following example) as well as from the output of the `fetch-cube-constraint` pipeline. In most cases it is useful to keep the constraint in a local file, because it is also needed for the `check-cube-observations` pipeline.

```bash
cat myConstraint.ttl \
  | npx barnard59 cube check-constraint \
  --profile https://cube.link/v0.1.0/shape/standalone-constraint-constraint
```

SHACL reports for violations are written to `stdout`.
### fetch cube observations

Pipeline `fetch-cube-observations` queries a given SPARQL endpoint (default is https://lindas.admin.ch/query) to retrieve the observations of a given cube.

```bash
npx barnard59 cube fetch-observations \
  --cube https://agriculture.ld.admin.ch/agroscope/PRIFm8t15/2 \
  --endpoint https://int.lindas.admin.ch/query
```

Results are written to `stdout`.
### check cube observations

Pipeline `check-cube-observations` validates the input observations against the shapes provided with the `constraint` variable.

The pipeline reads the observations from `stdin`, allowing input from a local file (as in the following example) as well as from the output of the `fetch-cube-observations` pipeline.

```bash
cat myObservations.ttl \
  | npx barnard59 cube check-observations \
  --constraint myConstraint.ttl
```
To enable validation, the pipeline adds a `sh:targetClass` property with value `cube:Observation` to the constraint (assuming that each observation has an explicit `rdf:type` of `cube:Observation`).
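The effect of this injection can be sketched with plain objects standing in for RDF/JS quads; the helper name below is illustrative, not the actual pipeline code.

```javascript
// Sketch of the sh:targetClass injection described above.
// Plain objects stand in for RDF/JS quads; the real pipeline uses an RDF environment.
const SH_TARGET_CLASS = 'http://www.w3.org/ns/shacl#targetClass'
const CUBE_OBSERVATION = 'https://cube.link/Observation'

// Hypothetical helper: make a shape target all cube:Observation resources
function addObservationTarget(constraintQuads, shapeIri) {
  return constraintQuads.concat([
    { subject: shapeIri, predicate: SH_TARGET_CLASS, object: CUBE_OBSERVATION },
  ])
}
```

With the target class in place, a SHACL engine knows which nodes in the observation data each shape applies to.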
To leverage streaming, the input is split and validated in small batches of adjustable size (the default is 50, which should be appropriate in most cases). This allows the validation of very big cubes, because observations are not loaded into memory all at once. To ensure triples for the same observation are adjacent (and hence processed in the same batch), the input is sorted by subject (for large inputs, the sorting step relies on temporary local files).

SHACL reports for violations are written to `stdout`.

To limit the output size, there is also a `maxViolations` option to stop validation once the given number of violations is reached.
```js
import { Duplex } from 'stream'
import rdf from '@zazuko/env-node'

// Iterable<X> => Iterable<X[]>
export async function * chunkObjectsBySize(size, iterable) {
  let chunk = []
  for await (const item of iterable) {
    chunk.push(item)
    if (chunk.length === size) {
      yield chunk
      chunk = []
    }
  }
  if (chunk.length > 0) {
    yield chunk
  }
}

// Iterable<Dataset> => Iterable<Dataset>
export async function * chunkBySize(size, iterable) {
  for await (const array of chunkObjectsBySize(size, iterable)) {
    const batch = rdf.dataset()
    for (const dataset of array) {
      batch.addAll(dataset)
    }
    yield batch
  }
}

export const batch = size => Duplex.from(iterable => chunkBySize(Number(size), iterable))
```
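The chunking behaviour is easy to observe in isolation. The copy below restates `chunkObjectsBySize` without the RDF dependency, so it can run standalone:

```javascript
// Same chunking logic as chunkObjectsBySize above, restated standalone:
// groups items from an (async) iterable into arrays of at most `size` items,
// yielding a final shorter array for any remainder.
async function * chunkObjectsBySize(size, iterable) {
  let chunk = []
  for await (const item of iterable) {
    chunk.push(item)
    if (chunk.length === size) {
      yield chunk
      chunk = []
    }
  }
  if (chunk.length > 0) {
    yield chunk
  }
}

async function demo() {
  const chunks = []
  for await (const chunk of chunkObjectsBySize(3, [1, 2, 3, 4, 5, 6, 7])) {
    chunks.push(chunk)
  }
  return chunks // [[1, 2, 3], [4, 5, 6], [7]]
}
```

In the pipeline, each array of datasets is then merged into a single batch dataset before validation.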
```js
import $rdf from '@zazuko/env-node'

const cube = $rdf.namespace('https://cube.link/')
const { rdf, rdfs, sh, xsd, _void, dcat, schema, dcterms } = $rdf.ns

export { cube, rdf, rdfs, sh, xsd, _void, dcat, schema, dcterms }
```
```js
import { Readable, Duplex } from 'stream'
import { sort, compareOn, createStore } from 'external-merge-sort'
import rdf from '@zazuko/env-node'

async function write(chunk, filename) {
  await rdf.toFile(Readable.from(chunk), filename)
  return rdf.fromFile(filename)
}

export const sortRDF = key => {
  const comparer = compareOn(key)
  const store = createStore(write, '.nt')

  return Duplex.from(iterable => sort(iterable, { comparer, store, maxSize: 100000 }))
}
```
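As a rough in-memory analogue of what `sortRDF` accomplishes when keyed on the subject, the sketch below orders quads so that all triples sharing a subject become adjacent. It assumes quads expose a `subject.value` string (as RDF/JS terms do); the external-merge-sort store in the real code is only needed once the input no longer fits in memory.

```javascript
// In-memory analogue of sortRDF: order quads by subject so all triples
// for the same observation end up adjacent (and thus in the same batch).
// Plain objects stand in for RDF/JS quads.
function sortQuadsBySubject(quads) {
  return [...quads].sort((a, b) => a.subject.value.localeCompare(b.subject.value))
}
```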