One of the core strengths of WDL is its modularity. Tasks can be shared across workflows with little effort (as shown by the popularity of the BioWDL tasks repo). Furthermore, workflows can be called in exactly the same way as tasks, which makes them easy to share and to incorporate into other workflows.
When you use this level of modularity, versioned imports and package management are a requirement for easier development. Currently BioWDL solves this with versioned git submodules and by packaging the result with wdl-packager to create an imports zip. Not ideal.
@DavyCats has already suggested a versioned import syntax and package registry in #493. @cjllanwarne also discussed a version syntax in #226.
In this discussion I want to focus on the package format itself. I have already talked a bit with @mlin on the wdl-packager repository about possible file formats: biowdl/wdl-packager#6
Below I would like to highlight and elaborate on the choices I made:
Reproducibility
Workflow reproducibility was taken as a key feature of the package spec:
Imports should be fixed. Downloading the same package at the same version should always give the same result. Therefore http://-style imports are not allowed in WDL files in a package.
Package repositories should only allow one package with the same name and version. (Docker does not enforce this for its tags, which is extremely annoying.) There is an exception for SNAPSHOT packages.
Files are packaged in sorted order with all the system/user-specific metadata set to 0/NULL/empty string. This makes packages binary reproducible: a simple package rebuild and checksum check can verify that the packages in the repository come from the same source. This is not possible without binary reproducible packaging. See https://reproducible-builds.org/.
Tar was chosen over zip (see below).
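The packaging rules above (sorted order, zeroed metadata, ustar format) can be sketched with Python's standard-library tarfile module. This is a minimal illustration, not the wdl-packager implementation: the function name build_reproducible_tar, the in-memory file mapping, the example file names, and the fixed 0o644 mode are all assumptions made for the example.

```python
import hashlib
import io
import tarfile
from typing import Mapping


def build_reproducible_tar(files: Mapping[str, bytes]) -> bytes:
    """Pack files into a ustar archive whose bytes depend only on names and contents."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        # Sorted order: the layout must not depend on filesystem or dict order.
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            # Zero or blank every system/user-specific field.
            info.mtime = 0
            info.uid = 0
            info.gid = 0
            info.uname = ""
            info.gname = ""
            info.mode = 0o644  # assumed fixed mode for this sketch
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Hypothetical package contents, supplied in two different orders.
first = build_reproducible_tar(
    {"tasks/bwa.wdl": b"version 1.0\n", "MANIFEST.json": b"{}\n"}
)
second = build_reproducible_tar(
    {"MANIFEST.json": b"{}\n", "tasks/bwa.wdl": b"version 1.0\n"}
)

# A rebuild from the same source yields the same checksum.
same_checksum = hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest()
```

Because nothing in the header depends on who built the archive or when, a repository could rebuild a package from its source and compare checksums.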
Tar vs zip
While zip is better supported across platforms, it compresses each file individually. A compressed tar archive is compressed as a single stream, so redundancy shared between files can be exploited. For this reason compressed tar archives can achieve much better compression ratios. This is a significant thing to consider when setting up package repositories.
Additionally, since packages are reproducibly packaged, two WDL package tar files are always identical when built from the exact same source. md5sum mywdl.tar, zcat mywdl.tar.gz | md5sum and xz -cd mywdl.tar.xz | md5sum all give back the same checksum. This is not possible when compression is internal to the archive, as in zip.
Zip could also have been used to store the files uncompressed, with the whole archive compressed afterwards, but zip.xz is just weird. tar.xz is much more common, so tar was chosen.
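The checksum argument above can be illustrated in Python. This is a sketch under assumptions: payload is a stand-in for the bytes of an uncompressed mywdl.tar, and the helper names md5_hex and zip_archive are hypothetical. The second half shows that with zip, the checksum of the archive itself changes with the compression method even when the contents and timestamp are fixed.

```python
import gzip
import hashlib
import io
import lzma
import zipfile

# Stand-in for the bytes of an uncompressed mywdl.tar (hypothetical content).
payload = b"version 1.0\ntask example {}\n" * 200


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


# Equivalent of: md5sum mywdl.tar; zcat mywdl.tar.gz | md5sum; xz -cd mywdl.tar.xz | md5sum
tar_checksums = {
    "tar": md5_hex(payload),
    "tar.gz": md5_hex(gzip.decompress(gzip.compress(payload))),
    "tar.xz": md5_hex(lzma.decompress(lzma.compress(payload))),
}


def zip_archive(compression: int) -> bytes:
    """Zip the same payload with a fixed timestamp, varying only the compression method."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        info = zipfile.ZipInfo("workflow.wdl", date_time=(1980, 1, 1, 0, 0, 0))
        archive.writestr(info, payload, compress_type=compression)
    return buffer.getvalue()


zip_checksums = {
    "stored": md5_hex(zip_archive(zipfile.ZIP_STORED)),
    "deflated": md5_hex(zip_archive(zipfile.ZIP_DEFLATED)),
}

# External compression leaves one canonical checksum; internal compression does not.
tar_is_stable = len(set(tar_checksums.values())) == 1
zip_is_stable = len(set(zip_checksums.values())) == 1
```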
UStar tar archives.
This is a standard from 1988. It made sense to use something that is very widely supported across different platforms. The 255-character maximum path length should not cause much trouble. (In fact, it is kind of nice not having to type more than 255 characters in our import statements so let's embrace this technical limitation ;-) ).
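One detail worth knowing about the path limit: a ustar header stores the path in a 155-byte prefix field plus a 100-byte name field, split at a "/", so roughly 255 characters is an upper bound and some shorter paths do not fit. A quick way to check a path is to ask Python's tarfile module to build the header. The helper name fits_ustar is hypothetical; this is a sketch, not part of the specification.

```python
import tarfile


def fits_ustar(path: str) -> bool:
    """Return True if tarfile can store this member path in a ustar header."""
    try:
        # Building the header raises ValueError when the path cannot be split
        # into a prefix of at most 155 bytes and a name of at most 100 bytes.
        tarfile.TarInfo(path).tobuf(format=tarfile.USTAR_FORMAT)
    except ValueError:
        return False
    return True


ordinary = fits_ustar("tasks/bwa.wdl")
deep_but_splittable = fits_ustar("d" * 150 + "/" + "f" * 90)
long_single_component = fits_ustar("f" * 120)  # under 255, but no "/" to split at
```

A packager could run such a check on every file before building the archive and reject the package early.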
MANIFEST.json
This was taken from @mlin's implementation of miniwdl zip. A lot of packaging systems have a "manifest" file, so incorporating this made sense. The manifest is in JSON because it is an easily parsable format that many languages support in their standard library. The JSON fields are underscored_all_lowercase_variables because this is also the preferred way to name functions in WDL and there is no reason to deviate from that style.
license_file
A license file must be included in any code archive that is redistributed. Since different licenses require their license file to be named differently (COPYING, LICENSE etc.) the license_file key was chosen. license_id was added to make it easier for package repositories to parse the license format without having to read the entire license.
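To make the manifest conventions concrete, here is a small Python sketch that checks a MANIFEST.json for underscored_all_lowercase keys and for a license_file that actually exists in the archive. Only license_file and license_id come from the text above; the name and version fields, the example values, and the check_manifest helper are assumptions for illustration and may not match the specification.

```python
import json
import re
from typing import Iterable, List

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_manifest(manifest_text: str, archive_members: Iterable[str]) -> List[str]:
    """Return a list of problems found in a MANIFEST.json document (empty if none)."""
    manifest = json.loads(manifest_text)
    members = set(archive_members)
    problems = []
    for key in manifest:
        if not SNAKE_CASE.match(key):
            problems.append(f"key {key!r} is not underscored_all_lowercase")
    license_file = manifest.get("license_file")
    if license_file is None:
        problems.append("license_file is missing")
    elif license_file not in members:
        problems.append(f"license_file {license_file!r} is not in the archive")
    return problems


example_manifest = json.dumps({
    "name": "example-package",   # hypothetical field
    "version": "1.0.0",          # hypothetical field
    "license_file": "COPYING",   # the file name may differ per license
    "license_id": "MIT",         # assumed to be a short license identifier
})
problems = check_manifest(example_manifest, ["MANIFEST.json", "COPYING", "workflow.wdl"])
```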
user vs packager
Having packages that:
Are binary reproducible
Have a fixed version and checksum. There are no two packages with the same version and a different checksum.
Use semantic versioning
Do not have imports that can change over time.
Is great for users, but sometimes an inconvenience for the developers who package their WDL. I think the usability for the user should always take priority.
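The "fixed version and checksum" property could be enforced on the repository side roughly like this. It is a sketch under assumptions: the PackageRegistry class is hypothetical, SNAPSHOT packages are assumed to be recognizable by a "-SNAPSHOT" version suffix, and re-publishing identical content is treated as a harmless no-op.

```python
from typing import Dict, Tuple


class VersionConflict(Exception):
    """Raised when a released name/version pair is re-published with different content."""


class PackageRegistry:
    """In-memory stand-in for a package repository index."""

    def __init__(self) -> None:
        self._checksums: Dict[Tuple[str, str], str] = {}

    def publish(self, name: str, version: str, checksum: str) -> None:
        key = (name, version)
        existing = self._checksums.get(key)
        is_snapshot = version.endswith("-SNAPSHOT")  # assumed naming convention
        if existing is not None and existing != checksum and not is_snapshot:
            raise VersionConflict(f"{name} {version} already exists with another checksum")
        self._checksums[key] = checksum

    def checksum(self, name: str, version: str) -> str:
        return self._checksums[(name, version)]


registry = PackageRegistry()
registry.publish("tasks", "1.0.0", "aaa")
registry.publish("tasks", "1.0.0", "aaa")           # same content: accepted
registry.publish("tasks", "1.1.0-SNAPSHOT", "bbb")
registry.publish("tasks", "1.1.0-SNAPSHOT", "ccc")  # snapshots may be replaced
```

This is where the inconvenience for packagers shows up: fixing a released package means publishing a new version, never replacing the old one.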
I would like to hear your take on this initial specification. Thanks!
I wrote up an initial package specification here: https://github.com/rhpvorderman/wdl-package-specification